The paper is a preliminary research report and presents a method for generating new records using an evolutionary algorithm (close to but different from a genetic algorithm). This method, called Pseudo-Inverse Function (in short P-I Function), was designed and implemented at Semeion Research Centre (Rome). P-I Function is a method to generate new (virtual) data from a small set of observed data. P-I Function can be of aid when budget constraints limit the number of interviewees, or in case of a population that shows some sociologically interesting trait, but whose small size can seriously affect the reliability of estimates, or in case of secondary analysis on small samples. The applicative ground is given by research design with one or more dependent and a set of independent variables. The estimation of new cases takes place according to the maximization of a fitness function and outcomes a number as large as needed of 'virtual' cases, which reproduce the statistical traits of the original population. The algorithm used by P-I Function is known as Genetic Doping Algorithm (GenD), designed and implemented by Semeion Research Centre; among its features there is an innovative crossover procedure, which tends to select individuals with average fitness values, rather than those who show best values at each 'generation'. A particularly thorough research design has been put on: (1) the observed sample is half-split to obtain a training and a testing set, which are analysed by means of a back propagation neural network; (2) testing is performed to find out how good the parameter estimates are; (3) a 10% sample is randomly extracted from the training set and used as a reduced training set; (4) on this narrow basis, GenD calculates the pseudo-inverse of the estimated parameter matrix; (5) 'virtual' data are tested against the testing data set (which has never been used for training). The algorithm has been proved on a particularly difficult ground, since the data set used as a basis for generating 'virtual' cases counts only 44 respondents, randomly sampled from a broader data set taken from the General Social Survey 2002. The major result is that networks trained on the 'virtual' resample show a model fit as good as the one of the observed data, though 'virtual' and observed data differ on some features. It can be seen that GenD 'refills' the joint distribution of the independent variables, conditioned by the dependent one.

GenD an evolutionary system for resampling in survey research / C. Meraviglia, G. Massini, D. Croce, M. Buscema. - In: QUALITY & QUANTITY. - ISSN 0033-5177. - 40:5(2006 Oct), pp. 825-859. ((Intervento presentato al 6. convegno RC33 Conference on Social Science Methodology tenutosi a Amsterdam nel 2004 [10.1007/s11135-005-3264-x].

GenD an evolutionary system for resampling in survey research

C. Meraviglia
;
2006

Abstract

The paper is a preliminary research report and presents a method for generating new records using an evolutionary algorithm (close to but different from a genetic algorithm). This method, called Pseudo-Inverse Function (in short P-I Function), was designed and implemented at Semeion Research Centre (Rome). P-I Function is a method to generate new (virtual) data from a small set of observed data. P-I Function can be of aid when budget constraints limit the number of interviewees, or in case of a population that shows some sociologically interesting trait, but whose small size can seriously affect the reliability of estimates, or in case of secondary analysis on small samples. The applicative ground is given by research design with one or more dependent and a set of independent variables. The estimation of new cases takes place according to the maximization of a fitness function and outcomes a number as large as needed of 'virtual' cases, which reproduce the statistical traits of the original population. The algorithm used by P-I Function is known as Genetic Doping Algorithm (GenD), designed and implemented by Semeion Research Centre; among its features there is an innovative crossover procedure, which tends to select individuals with average fitness values, rather than those who show best values at each 'generation'. A particularly thorough research design has been put on: (1) the observed sample is half-split to obtain a training and a testing set, which are analysed by means of a back propagation neural network; (2) testing is performed to find out how good the parameter estimates are; (3) a 10% sample is randomly extracted from the training set and used as a reduced training set; (4) on this narrow basis, GenD calculates the pseudo-inverse of the estimated parameter matrix; (5) 'virtual' data are tested against the testing data set (which has never been used for training). The algorithm has been proved on a particularly difficult ground, since the data set used as a basis for generating 'virtual' cases counts only 44 respondents, randomly sampled from a broader data set taken from the General Social Survey 2002. The major result is that networks trained on the 'virtual' resample show a model fit as good as the one of the observed data, though 'virtual' and observed data differ on some features. It can be seen that GenD 'refills' the joint distribution of the independent variables, conditioned by the dependent one.
evolutionary algorithm; resampling; neural networks
Settore SPS/07 - Sociologia Generale
Settore SECS-S/01 - Statistica
ott-2006
University of Amsterdam
International Sociological Association
ISA Research Committee 33 on Logic and Methodology
Article (author)
File in questo prodotto:
File Dimensione Formato  
GenD - final.pdf

accesso riservato

Descrizione: articolo principale
Tipologia: Publisher's version/PDF
Dimensione 557.84 kB
Formato Adobe PDF
557.84 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/296096
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 4
  • ???jsp.display-item.citation.isi??? 4
social impact