Generalization Bounds in the Presence of Outliers: a Median-of-Means Study / P. Laforgue, G. Staerman, S. Clemencon (PROCEEDINGS OF MACHINE LEARNING RESEARCH). - In: International Conference on Machine Learning / edited by M. Meila, T. Zhang. - [s.l] : JMLR-JOURNAL MACHINE LEARNING RESEARCH, 2021. - pp. 5937-5947 (conference: International Conference on Machine Learning, held online in 2021).

Generalization Bounds in the Presence of Outliers: a Median-of-Means Study

P. Laforgue
2021

Abstract

In contrast to the empirical mean, the Median-of-Means (MoM) is an estimator of the mean θ of a square-integrable random variable Z around which accurate nonasymptotic confidence bounds can be built, even when Z does not exhibit sub-Gaussian tail behavior. Thanks to the high confidence it achieves on heavy-tailed data, MoM has found various applications in machine learning, where it is used to design training procedures that are not sensitive to atypical observations. More recently, a new line of work has sought to characterize and leverage MoM's ability to deal with corrupted data. In this context, the present work proposes a general study of MoM's concentration properties under the contamination regime, which provides a clear understanding of the impact of the outlier proportion and of the number of blocks chosen. The analysis is extended to (multisample) U-statistics, i.e., averages over tuples of observations, which raise additional challenges due to the dependence induced. Finally, we show that the latter bounds can be used in a straightforward fashion to derive generalization guarantees for pairwise learning in a contaminated setting, and propose an algorithm to compute provably reliable decision functions.
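
To make the estimator named in the abstract concrete, here is a minimal Python sketch of a basic Median-of-Means computation: shuffle the sample, split it into equal-size blocks, average each block, and return the median of the block means. The function name `median_of_means`, the `seed` parameter, the choice to drop leftover points, and the toy data are all assumptions made for illustration; this is not the algorithm or code from the paper.

```python
import random
import statistics


def median_of_means(sample, n_blocks, seed=0):
    """Illustrative Median-of-Means estimate of the mean of `sample`.

    Hypothetical helper: points that do not fit into the equal-size
    blocks are simply dropped, which is one simple convention among several.
    """
    if not 1 <= n_blocks <= len(sample):
        raise ValueError("n_blocks must be between 1 and len(sample)")
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)  # random partition into blocks
    block_size = len(shuffled) // n_blocks
    block_means = [
        statistics.fmean(shuffled[k * block_size:(k + 1) * block_size])
        for k in range(n_blocks)
    ]
    return statistics.median(block_means)


# Toy contaminated sample: 95 standard Gaussian points plus 5 gross outliers.
data_rng = random.Random(1)
clean = [data_rng.gauss(0.0, 1.0) for _ in range(95)]
outliers = [1000.0] * 5
data = clean + outliers

mom_estimate = median_of_means(data, n_blocks=15)
mean_estimate = statistics.fmean(data)
print("MoM estimate:  ", mom_estimate)
print("Empirical mean:", mean_estimate)
```

In this toy setup the 5 outliers can contaminate at most 5 of the 15 blocks, so the median is taken over a majority of clean block means, while the empirical mean is pulled far from 0. This illustrates the interplay between the outlier proportion and the number of blocks that the abstract refers to.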
Sector INF/01 - Computer Science
http://proceedings.mlr.press/v139/laforgue21a/laforgue21a.pdf
Book Part (author)
Files in this item:
laforgue21a.pdf (open access)
Type: Publisher's version/PDF
Size: 5.55 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/922773