Accounting for outliers in optimal subsampling methods

Deldossi, L.; Pesce, E.; Tommasi, C.

doi:10.1007/s00362-023-01422-3

Nowadays, massive data are available in many different fields, and for several reasons it may be convenient to analyze just a subset of the data. The D-optimality criterion can help to select a subsample of observations optimally. However, it is well known that D-optimal support points lie on the boundary of the design space; if they go hand in hand with extreme response values, they can have a severe influence on the estimated linear model (leverage points with high influence). To overcome this problem, we first propose a non-informative "exchange" procedure that selects a "nearly" D-optimal subset of observations without high leverage values. We then provide an informative version of this exchange procedure, which avoids not only high leverage points but also outliers in the responses (which are not necessarily associated with high leverage points). This is possible because, unlike in other design situations, the response values may be available when subsampling from big datasets. Finally, both the non-informative and informative selection procedures are adapted to I-optimality, with the goal of obtaining accurate predictions.
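To make the idea of a non-informative exchange concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a linear model with design matrix `X`, excludes rows whose full-data leverage exceeds a quantile cap, and then swaps points in and out of the subsample as long as the log-determinant of the information matrix (the D-criterion) increases. The function names, the `leverage_quantile` parameter and the swap rule are all hypothetical choices made for illustration.

```python
import numpy as np


def log_det_information(X_sub):
    """D-criterion: log-determinant of the information matrix X_sub' X_sub."""
    sign, logdet = np.linalg.slogdet(X_sub.T @ X_sub)
    return logdet if sign > 0 else -np.inf


def quadratic_scores(X, X_ref):
    """Scores x_i' (X_ref' X_ref)^{-1} x_i for every row x_i of X.

    With X_ref = X these are the usual leverages (diagonal of the hat matrix).
    """
    M_inv = np.linalg.inv(X_ref.T @ X_ref)
    return np.einsum("ij,jk,ik->i", X, M_inv, X)


def exchange_subsample(X, n, leverage_quantile=0.95, max_iter=500, seed=0):
    """Illustrative exchange heuristic (an assumption, not the published method).

    1. Rows whose full-data leverage exceeds the chosen quantile are excluded.
    2. Starting from a random subsample of the remaining rows, the in-sample
       row with the smallest score is swapped with the eligible out-of-sample
       row with the largest score whenever the swap increases log det(X'X).
    """
    rng = np.random.default_rng(seed)
    full_leverage = quadratic_scores(X, X)
    allowed = full_leverage <= np.quantile(full_leverage, leverage_quantile)

    idx = rng.choice(np.flatnonzero(allowed), size=n, replace=False)
    for _ in range(max_iter):
        scores = quadratic_scores(X, X[idx])
        in_sample = np.zeros(X.shape[0], dtype=bool)
        in_sample[idx] = True
        eligible = allowed & ~in_sample
        if not eligible.any():
            break
        candidate_in = np.argmax(np.where(eligible, scores, -np.inf))
        position_out = np.argmin(scores[idx])
        trial = idx.copy()
        trial[position_out] = candidate_in
        if log_det_information(X[trial]) > log_det_information(X[idx]):
            idx = trial  # accept the swap only if the D-criterion improves
        else:
            break
    return idx


# Usage on simulated data (intercept plus two covariates).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.standard_normal((500, 2))])
selected = exchange_subsample(X, n=30)
```

Because a swap is accepted only when it increases the log-determinant, the criterion never decreases along the iterations. An informative variant in the spirit of the abstract would additionally screen candidates using the observed responses; that step is not sketched here.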

Accounting for outliers in optimal subsampling methods / L. Deldossi, E. Pesce, C. Tommasi. - In: STATISTICAL PAPERS. - ISSN 1613-9798. - 64:4(2023), pp. 1119-1135. [10.1007/s00362-023-01422-3]


Deldossi L.; C. Tommasi (co-last author; member of the Collaboration Group)

2023



Keywords: D-optimality; I-optimality; Active learning; Subsampling
Scientific-disciplinary sectors of the article (display only): Sector SECS-S/01 - Statistics
Publication date: 2023
Journal in ANCE: STATISTICAL PAPERS
DOI: https://dx.doi.org/10.1007/s00362-023-01422-3
Type: Article (author)
Appears in types: 01 - Journal article

Files in this item:

File	Access	Description	Type	Size	Format
StatisticalPaper_1.pdf	Open access	Regular Article	Publisher's version/PDF	668.52 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1015628


IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca