Dealing with the biased effects issue when handling huge datasets: the case of INVALSI data / E. Raffinetti, I. Romeo. - In: JOURNAL OF APPLIED STATISTICS. - ISSN 0266-4763. - 42:12(2015), pp. 2554-2570. [10.1080/02664763.2015.1043867]

Dealing with the biased effects issue when handling huge datasets: the case of INVALSI data

E. Raffinetti, I. Romeo
2015

Abstract

The growing prevalence of huge datasets directs research toward statistical methods suited to the problems their complexity causes. On the one hand, the literature describes several techniques, especially for reducing computation time and the number of variables. On the other hand, statistical inference has received less attention. Indeed, when a very large number of statistical units is involved, variables with no actual impact on the phenomenon under study may wrongly be judged significant. This paper proposes a subsampling procedure that reduces the number of statistical units and introduces a novel index for assessing the significance of effects. The proposal is validated by comparing the results obtained on the original data with those obtained with the proposed subsampling approach. The illustrative application uses the educational dataset made available by the National Committee for the Evaluation of the Italian Education Systems (INVALSI). This dataset collects information on student characteristics and Maths achievement in the lower secondary schools of the Lombardy region (Italy). Because the data are hierarchical, a multilevel model is fitted to investigate the effects of both individual and school factors on students' Maths scores.
huge datasets; subsampling; significance effects; multilevel modeling; variable selection criterion
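
The inference issue raised in the abstract can be illustrated with a minimal sketch. It is not the paper's subsampling procedure or its proposed index, neither of which is specified on this page. It assumes simulated data, a single-level least-squares regression instead of a multilevel model, simple random subsampling, and a normal approximation for the p-value. All names (`slope_z_stat`, `two_sided_p`, `TRUE_SLOPE`, and so on) are hypothetical.

```python
import math
import random


def slope_z_stat(xs, ys):
    """Return (slope, z) for a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares gives the usual standard error of the slope.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err


def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(0)   # fixed seed so the sketch is reproducible
N_FULL = 200_000         # assumed size of the "huge" dataset
N_SUB = 1_000            # assumed subsample size
TRUE_SLOPE = 0.02        # assumed effect, negligible in practical terms

# Simulated predictor and outcome with unit-variance noise.
xs = [rng.gauss(0, 1) for _ in range(N_FULL)]
ys = [TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs]

slope_full, z_full = slope_z_stat(xs, ys)

# Simple random subsample of the statistical units, without replacement.
idx = rng.sample(range(N_FULL), N_SUB)
slope_sub, z_sub = slope_z_stat([xs[i] for i in idx], [ys[i] for i in idx])

print(f"full data: slope={slope_full:.4f}, z={z_full:.2f}, p={two_sided_p(z_full):.3g}")
print(f"subsample: slope={slope_sub:.4f}, z={z_sub:.2f}, p={two_sided_p(z_sub):.3g}")
```

Under these assumptions the z statistic grows roughly with the square root of the number of units, so a slope that explains almost none of the variance is strongly significant on the full data, while on a subsample of 1,000 units it will typically not be.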
Sector SECS-S/01 - Statistics
2015
Article (author)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/277357