Automatic identification of variables in epidemiological datasets using logic regression

Lorenz, M.W.; Abdi, N.A.; Scheckenbach, F.; Pflug, A.; Bülbül, A.; Catapano, A.L.; Agewall, S.; Ezhov, M.; Bots, M.L.; Kiechl, S.; Orth, A.; Norata, G.D.; Empana, J.P.; Lin, H.; Mclachlan, S.; Bokemark, L.; Ronkainen, K.; Amato, M.; Schminke, U.; Srinivasan, S.R.; Lind, L.; Kato, A.; Dimitriadis, C.; Przewlocki, T.; Okazaki, S.; Stehouwer, C.D.A.; Lazarevic, T.; Willeit, P.; Yanez, D.N.; Steinmetz, H.; Sander, D.; Poppert, H.; Desvarieux, M.; Ikram, M.A.; Bevc, S.; Staub, D.; Sirtori, C.R.; Iglseder, B.; Engström, G.; Tripepi, G.; Beloqui, O.; Lee, M.; Friera, A.; Xie, W.; Grigore, L.; Plichart, M.; Su, T.; Robertson, C.; Schmidt, C.; Tuomainen, T.; Veglia, F.; Völzke, H.; Nijpels, G.; Jovanovic, A.; Willeit, J.; Sacco, R.L.; Franco, O.H.; Hojs, R.; Uthoff, H.; Hedblad, B.; Park, H.W.; Suarez, C.; Zhao, D.; Catapano, A.; Ducimetiere, P.; Chien, K.; Price, J.F.; Bergström, G.; Kauhanen, J.; Tremoli, E.; Dörr, M.; Berenson, G.; Papagianni, A.; Kablak Ziembicka, A.; Kitagawa, K.; Dekker, J.M.; Stolic, R.; Polak, J.F.; Sitzer, M.; Bickel, H.; Rundek, T.; Hofman, A.; Ekart, R.; Frauchiger, B.; Castelnuovo, S.; Rosvall, M.; Zoccali, C.; Landecho, M.F.; Bae, J.; Gabriel, R.; Liu, J.; Baldassarre, D.; Kavousi, M.

doi:10.1186/s12911-017-0429-1

Background: For an individual participant data (IPD) meta-analysis, multiple datasets must be transformed into a consistent format, e.g. using uniform variable names. When large numbers of datasets have to be processed, this can be a time-consuming and error-prone task. Automated or semi-automated identification of variables can help to reduce the workload and improve data quality. For semi-automation, high sensitivity in recognizing matching variables is particularly important: it allows software to present, for each target variable, a shortlist of candidate source variables from which a user picks the matching one, with only a low risk that a correct source variable has been missed.

Methods: For each variable in a set of target variables, a number of simple rules were manually created. With logic regression, an optimal Boolean combination of these rules was searched for every target variable, using a random subset of a large database of epidemiological and clinical cohort data (construction subset). In a second subset of this database (validation subset), these optimal rule combinations were validated.

Results: In the construction sample, 41 target variables were allocated on average with a positive predictive value (PPV) of 34% and a negative predictive value (NPV) of 95%. In the validation sample, PPV was 33% and NPV was 94%. PPV was 50% or less for 63% of all variables in the construction sample and for 71% of all variables in the validation sample.

Conclusions: We demonstrated that the application of logic regression in a complex data management task in large epidemiological IPD meta-analyses is feasible. However, the performance of the algorithm is poor, which may require backup strategies.
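The idea in the Methods and Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three rules, the target variable (systolic blood pressure), the source variable names, and the truth labels are all hypothetical, and the Boolean combination is hand-picked here rather than found by logic regression. The sketch only shows how simple rules can be combined with AND/OR/NOT and how PPV and NPV are then computed for that combination.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]

# Hypothetical hand-written rules for one target variable
# ("systolic blood pressure"); each rule inspects a source variable name.
rules: Dict[str, Rule] = {
    "has_sbp": lambda name: "sbp" in name.lower(),
    "has_sys": lambda name: "sys" in name.lower(),
    "has_dia": lambda name: "dia" in name.lower(),
}


def combined(name: str) -> bool:
    """Hand-picked Boolean combination: (has_sbp OR has_sys) AND NOT has_dia.

    In the paper such a combination is searched by logic regression;
    here it is simply assumed for illustration.
    """
    return (rules["has_sbp"](name) or rules["has_sys"](name)) and not rules["has_dia"](name)


def ppv_npv(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Return (PPV, NPV) from paired predicted and true match labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


# Hypothetical source variable names with made-up truth labels
# (True = this source variable really is systolic blood pressure).
names = ["SBP_mean", "sys_bp", "DIA_BP", "sysdate", "bp1", "age"]
actual = [True, True, False, False, True, False]

predicted = [combined(n) for n in names]
ppv, npv = ppv_npv(predicted, actual)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

In this toy data the rule combination wrongly accepts `sysdate` and misses `bp1`, which is the kind of error that lowers PPV and NPV respectively. A semi-automated tool would tolerate false positives more readily than misses, since a user can still reject a wrong candidate from the shortlist.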

Automatic identification of variables in epidemiological datasets using logic regression / M.W. Lorenz, N.A. Abdi, F. Scheckenbach, A. Pflug, A. Bülbül, A.L. Catapano, S. Agewall, M. Ezhov, M.L. Bots, S. Kiechl, A. Orth, G.D. Norata, J.P. Empana, H. Lin, S. Mclachlan, L. Bokemark, K. Ronkainen, M. Amato, U. Schminke, S.R. Srinivasan, L. Lind, A. Kato, C. Dimitriadis, T. Przewlocki, S. Okazaki, C.D.A. Stehouwer, T. Lazarevic, P. Willeit, D.N. Yanez, H. Steinmetz, D. Sander, H. Poppert, M. Desvarieux, M.A. Ikram, S. Bevc, D. Staub, C.R. Sirtori, B. Iglseder, G. Engström, G. Tripepi, O. Beloqui, M. Lee, A. Friera, W. Xie, L. Grigore, M. Plichart, T. Su, C. Robertson, C. Schmidt, T. Tuomainen, F. Veglia, H. Völzke, G. Nijpels, A. Jovanovic, J. Willeit, R.L. Sacco, O.H. Franco, R. Hojs, H. Uthoff, B. Hedblad, H.W. Park, C. Suarez, D. Zhao, A. Catapano, P. Ducimetiere, K. Chien, J.F. Price, G. Bergström, J. Kauhanen, E. Tremoli, M. Dörr, G. Berenson, A. Papagianni, A. Kablak Ziembicka, K. Kitagawa, J.M. Dekker, R. Stolic, J.F. Polak, M. Sitzer, H. Bickel, T. Rundek, A. Hofman, R. Ekart, B. Frauchiger, S. Castelnuovo, M. Rosvall, C. Zoccali, M.F. Landecho, J. Bae, R. Gabriel, J. Liu, D. Baldassarre, M. Kavousi. - In: BMC MEDICAL INFORMATICS AND DECISION MAKING. - ISSN 1472-6947. - 17:1(2017), pp. 40.1-40.11. [10.1186/s12911-017-0429-1]

Keywords: Data management; Epidemiology; Logic regression; Meta-analysis; Health Policy; Health Informatics
Scientific-disciplinary sectors of the article: Sector BIO/14 - Pharmacology
Publication date: 2017
Journal in ANCE: BMC MEDICAL INFORMATICS AND DECISION MAKING
DOI: https://dx.doi.org/10.1186/s12911-017-0429-1
Type: Article (author)
Appears in the types: 01 - Journal article

Files in this item:

File	Description	Type	Access	Size	Format
28407816.pdf	Main article	Publisher's version/PDF	Open access	995.33 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/517529

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca