IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Dealing with the biased effects issue when handling huge datasets: the case of INVALSI data / E. Raffinetti, I. Romeo. - In: JOURNAL OF APPLIED STATISTICS. - ISSN 0266-4763. - 42:12(2015), pp. 2554-2570. [10.1080/02664763.2015.1043867]

Dealing with the biased effects issue when handling huge datasets: the case of INVALSI data

E. Raffinetti; I. Romeo

2015

Abstract

The growing prevalence of huge datasets directs research toward statistical methods suited to the problems their complexity creates. On the one hand, the literature offers several techniques, especially for reducing computation time and the number of variables. On the other hand, the statistical inference issue has received less attention. Indeed, a large number of statistical units may lead one to wrongly deem significant variables that have no actual impact on the phenomenon under study. This paper proposes a subsampling procedure that reduces the number of statistical units and provides a novel index for assessing the significance of effects. The proposal is validated by comparing the results obtained on the original data with those obtained from the proposed subsampling approach. The illustrative application uses the educational dataset made available by the National Committee for the Evaluation of the Italian Education Systems (INVALSI). This dataset collects information on student characteristics and Maths achievement in the lower secondary schools of the Lombardy region (Italy). Because the data are hierarchical, a multilevel model is fitted to investigate the effects of both individual and school factors on students' Maths scores.
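
The inference issue raised in the abstract can be illustrated with a small worked computation. The sketch below is not the subsampling procedure or the index proposed in the paper; it only shows, under stated assumptions, how a fixed and practically negligible effect becomes "significant" as the number of units grows. The correlation value `r = 0.02`, the sample sizes, and the function name `correlation_p_value` are hypothetical choices made for illustration, and the p-value uses a normal approximation to the t distribution.

```python
import math


def correlation_p_value(r: float, n: int) -> float:
    """Two-sided p-value for H0: correlation = 0 (normal approximation)."""
    # Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2); for large n it is
    # approximately standard normal under H0.
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # P(|Z| > |t|) for a standard normal Z.
    return math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical effect: r = 0.02 explains only r^2 = 0.04% of the variance.
r = 0.02
for n in (500, 5_000, 50_000, 500_000):
    p = correlation_p_value(r, n)
    print(f"n = {n:>7}: p-value = {p:.3g}")
```

With these illustrative values the effect is not significant at the 5% level for n = 500 or n = 5,000, but it is for n = 50,000 and n = 500,000, although the effect size never changes.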

	Keywords
	
			huge datasets; subsampling; significance effects; multilevel modeling; variable selection criterion
		
	Scientific-disciplinary sectors of the article
	
			Sector SECS-S/01 - Statistics
		
	Publication date
	
			2015
		
	Journal in ANCE
	
			JOURNAL OF APPLIED STATISTICS
		
	DOI
	
			https://dx.doi.org/10.1080/02664763.2015.1043867
		
	Type
	
			Article (author)
		
	Appears in categories:
	
			01 - Journal article

Files in this item:

File	Size	Format
Dealing with the biased effects issue when handling huge datasets the case of INVALSI data.pdf (restricted access; type: Publisher's version/PDF)	593.57 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/277357
