The goal of simultaneous feature selection and outlier detection is to determine a sparse linear regression vector by fitting a dataset that may be affected by outliers. The problem is well known in the literature, and its basic version covers a wide range of data analysis tasks. Performing feature selection and outlier detection simultaneously strongly improves the application potential of regression models in more general settings where data governance is a concern. To unlock this potential, flexible training models are needed, with more parameters under the control of decision makers. The use of mathematical programming, although pertinent, is scarce in this context and mostly focuses on the least-squares setting. Instead, we consider the least absolute deviation criterion and propose two mixed-integer linear programs, one adapted from existing studies and the other obtained from a disjunctive programming argument. We show theoretically and computationally that the disjunctive-based formulation is better in terms of both continuous relaxation quality and convergence to integer optimality. We experimentally benchmark against existing methodologies from the literature. We identify the characteristics of contamination patterns in which mathematical programming is better than state-of-the-art algorithms at combining prediction quality, sparsity and robustness against outliers. Additionally, the mathematical programming approaches allow the decision maker to directly control parameters such as the number of features or the number of outliers to tolerate, with those based on least absolute deviations performing best. On real-world datasets where privacy is a concern, our approach compares well to state-of-the-art methods in terms of accuracy while being more flexible.
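For orientation, the least-absolute-deviation setting described in the abstract can be sketched as a generic big-M mixed-integer linear program. This is an illustrative textbook-style model and not necessarily either of the two formulations proposed in the paper. All symbols are assumptions introduced here: $x_i \in \mathbb{R}^p$ and $y_i$ are the $n$ observations, $\beta$ is the regression vector, $e_i$ bounds the absolute residual of observation $i$, $z_j = 1$ if feature $j$ is selected, $s_i = 1$ if observation $i$ is flagged as an outlier, $k_f$ and $k_o$ are the user-chosen limits on features and outliers, and $M$, $B$ are sufficiently large constants.

```latex
\begin{align*}
\min_{\beta,\, e,\, z,\, s} \quad & \sum_{i=1}^{n} e_i \\
\text{s.t.} \quad
  & e_i \ge y_i - x_i^\top \beta - M s_i, && i = 1,\dots,n, \\
  & e_i \ge -(y_i - x_i^\top \beta) - M s_i, && i = 1,\dots,n, \\
  & -B z_j \le \beta_j \le B z_j, && j = 1,\dots,p, \\
  & \sum_{j=1}^{p} z_j \le k_f, \qquad \sum_{i=1}^{n} s_i \le k_o, \\
  & e_i \ge 0, \quad s_i \in \{0,1\}, \quad z_j \in \{0,1\}.
\end{align*}
```

In this sketch, setting $s_i = 1$ relaxes both residual constraints for observation $i$, so it contributes nothing to the objective, while $z_j = 0$ forces $\beta_j = 0$. The limits $k_f$ and $k_o$ correspond to the kind of direct control over the number of features and tolerated outliers that the abstract mentions.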
Mathematical programming for simultaneous feature selection and outlier detection under l1 norm / M. Barbato, A. Ceselli. - In: EUROPEAN JOURNAL OF OPERATIONAL RESEARCH. - ISSN 0377-2217. - 316:3(2024), pp. 1070-1084. [10.1016/j.ejor.2024.03.035]
Mathematical programming for simultaneous feature selection and outlier detection under l1 norm
M. Barbato (first author); A. Ceselli (last author)
2024
File | Description | Type | Access | Size | Format
---|---|---|---|---|---
16_articolo_math_prog_for_sfsod_EJOR_Michele_Barbato.pdf | Final version | Publisher's version/PDF | Open access | 1.04 MB | Adobe PDF