Accounting for outliers in optimal subsampling methods / L. Deldossi, E. Pesce, C. Tommasi. - In: STATISTICAL PAPERS. - ISSN 1613-9798. - 64:4(2023), pp. 1119-1135. [10.1007/s00362-023-01422-3]

Accounting for outliers in optimal subsampling methods

C. Tommasi
Co-last author
Member of the Collaboration Group
2023

Abstract

In many fields, massive datasets are now available, and for several reasons it may be convenient to analyze only a subset of the data. The D-optimality criterion can be used to select a subsample of observations optimally. However, it is well known that D-optimal support points lie on the boundary of the design space; if they coincide with extreme response values, they can have a severe influence on the estimated linear model (leverage points with high influence). To overcome this problem, we first propose a non-informative "exchange" procedure that selects a "nearly" D-optimal subset of observations without high leverage values. We then provide an informative version of this exchange procedure, which avoids not only high leverage points but also outliers in the responses (which are not necessarily associated with high leverage points). This is possible because, unlike in other design situations, the response values may be available when subsampling from big datasets. Finally, both the non-informative and informative selection procedures are adapted to I-optimality, with the goal of obtaining accurate predictions.
D-optimality; I-optimality; Active learning; Subsampling
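
The non-informative idea described in the abstract (seek a large information determinant while skipping high-leverage rows) can be illustrated with a minimal sketch. This is not the authors' procedure: it is a generic greedy exchange built on the standard Fedorov determinant update, and the function names, the quantile-based leverage cutoff, the simulated data, and all parameter values are assumptions made only for illustration. The informative variant, which also screens response outliers, is not sketched here.

```python
import numpy as np


def leverage_scores(X):
    """Diagonal of the hat matrix X (X^T X)^{-1} X^T for the full data."""
    Q, _ = np.linalg.qr(X)  # thin QR: leverages are the squared row norms of Q
    return np.sum(Q**2, axis=1)


def log_det_information(X, idx):
    """log det of X_s^T X_s for the rows listed in idx."""
    return np.linalg.slogdet(X[idx].T @ X[idx])[1]


def capped_d_exchange(X, n, leverage_quantile=0.95, max_iter=1000, seed=0):
    """Greedy exchange for a nearly D-optimal subsample without high-leverage rows.

    Hypothetical helper: the cutoff rule and the stopping rule are assumptions.
    """
    rng = np.random.default_rng(seed)
    h = leverage_scores(X)
    cutoff = np.quantile(h, leverage_quantile)
    admissible = np.flatnonzero(h <= cutoff)  # rows allowed into the subsample

    selected = rng.choice(admissible, size=n, replace=False)
    initial = selected.copy()

    for _ in range(max_iter):
        M_inv = np.linalg.inv(X[selected].T @ X[selected])
        outside = np.setdiff1d(admissible, selected)
        # Variance function d(x) = x^T M^{-1} x under the current subsample.
        d_in = np.einsum("ij,jk,ik->i", X[selected], M_inv, X[selected])
        d_out = np.einsum("ij,jk,ik->i", X[outside], M_inv, X[outside])
        i, j = np.argmin(d_in), np.argmax(d_out)
        d_ij = X[selected[i]] @ M_inv @ X[outside[j]]
        # Swapping row i for row j gives det(M_new) = det(M) * (1 + delta).
        delta = d_out[j] - d_in[i] - (d_in[i] * d_out[j] - d_ij**2)
        if delta <= 1e-10:
            break  # the tested swap no longer increases the determinant
        selected[i] = outside[j]

    return selected, initial, h, cutoff


# Simulated data: intercept plus two standard normal covariates (assumption).
data_rng = np.random.default_rng(1)
X = np.column_stack([np.ones(2000), data_rng.standard_normal((2000, 2))])
selected, initial, h, cutoff = capped_d_exchange(X, n=50)

print("log det, random start:", log_det_information(X, initial))
print("log det, after exchange:", log_det_information(X, selected))
print("max leverage in subsample:", h[selected].max(), "cutoff:", cutoff)
```

Each pass tests only one swap (the selected row with the smallest variance function against the admissible row with the largest), which keeps the sketch short; a full exchange algorithm would typically search over more pairs.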
Sector SECS-S/01 - Statistics
2023
Article (author)
Files in this item:
StatisticalPaper_1.pdf (open access)
Description: Regular Article
Type: Publisher's version/PDF
Size: 668.52 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1015628