Optimal Design of Experiments and Model-Based Survey Sampling in Big Data / L. Deldossi, C. Tommasi - In: Annual ENBIS Conference / edited by J. Bischoff. - Budapest : Mathematical Institute of Eotvos Lorand University, Budapest, 2019. - ISBN 9789634891468. - p. 37. Paper presented at the 19th Annual ENBIS Conference, held in Budapest in 2019.

Optimal Design of Experiments and Model-Based Survey Sampling in Big Data

C. Tommasi
2019

Abstract

Big Data are generally huge quantities of digital information, accrued automatically and/or merged from several sources, that rarely result from properly planned population surveys. A Big Dataset is herein conceived as a collection of information concerning a finite population. Since analysing an entire Big Dataset can require enormous computational effort, we suggest selecting a sample of observations and using this sample information to achieve the inferential goal. Instead of the design-based survey sampling approach (which concerns the estimation of summary finite-population measures such as means, totals and proportions), we consider the model-based sampling approach, which involves inference about the parameters of a super-population model. This model is assumed to have generated the finite population values, i.e. the Big Dataset. Given a super-population model, we can apply the theory of optimal design to draw from the Big Dataset a sample that contains most of the information about the unknown parameters of interest. In addition, since a Big Dataset might provide poor information despite its size, we use the definition of design efficiency to suggest a device for measuring the quality of the Big Data.
Finite-population sampling; Optimal design theory; Super-population model; Tall data
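
The two ideas in the abstract (drawing an informative sample and scoring the data by design efficiency) can be illustrated with a minimal sketch. It is not the authors' procedure. It assumes a straight-line super-population model y = b0 + b1*x + error with the covariate scaled to [-1, 1], for which the D-optimal design is known to put half of the observations at each endpoint and to have the identity as its per-observation information matrix. The function names, the simulated covariate and the "take the extremes" selection rule are all hypothetical choices made for illustration.

```python
import numpy as np


def information_matrix(x):
    """Per-observation information matrix F'F / N for the assumed
    straight-line model y = b0 + b1 * x + error."""
    x = np.asarray(x, dtype=float)
    F = np.column_stack([np.ones(len(x)), x])
    return F.T @ F / len(F)


def d_efficiency(x):
    """D-efficiency of the points x, assumed to lie in [-1, 1].

    The reference D-optimal design (half the mass at -1, half at +1) has
    determinant 1, so the efficiency reduces to det(M)^(1/p) with p = 2.
    """
    return float(np.sqrt(np.linalg.det(information_matrix(x))))


def extreme_subsample(x, n):
    """Heuristic subsample for the line model: indices of the n // 2
    smallest and the n - n // 2 largest covariate values."""
    order = np.argsort(x)
    half = n // 2
    return np.concatenate([order[:half], order[len(order) - (n - half):]])


# Simulated stand-in for a Big Dataset covariate (an assumption, not real data).
rng = np.random.default_rng(0)
x_big = rng.uniform(-1.0, 1.0, size=100_000)

idx = extreme_subsample(x_big, 200)
eff_big = d_efficiency(x_big)        # quality score of the whole dataset
eff_sub = d_efficiency(x_big[idx])   # efficiency of the selected sample

print(f"D-efficiency of the full data: {eff_big:.3f}")
print(f"D-efficiency of the subsample: {eff_sub:.3f}")
```

Under these assumptions, a low efficiency for the full data would signal that the dataset is poorly informative about the model parameters despite its size, while a per-observation efficiency close to 1 for the subsample indicates that the selected points sit near the optimal design. The efficiency is per observation, so a small efficient sample still carries less total information than the full dataset.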
Sector SECS-S/01 - Statistics
2019
Book Part (author)
Files in this item:
11314_0675125314.pdf (restricted access). Type: Publisher's version/PDF. Size: 7.57 MB. Format: Adobe PDF.
slides_Enbis_2019.pdf (open access). Description: slides presented at the ENBIS 2019 conference. Type: Other. Size: 1.17 MB. Format: Adobe PDF.
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/697601