Dealing with systematically missing confounders in individual participant data meta-analysis: an application to the relationship between socioeconomic status and gastric cancer risk in the Stomach cancer Pooling (StoP) project

Rota, M.; Rumi, F.; Pelucchi, C.; Lunet, N.; Boccia, S.; Negri, E.; La Vecchia, C.

Introduction. Although incidence and mortality have been falling in most areas of the world over the last several decades, there are still about one million new diagnoses of gastric cancer (GC) per year worldwide, and GC remains the third leading cause of cancer mortality. The earlier reduction of the burden of GC in high-income countries, largely explained by downward trends in the prevalence of Helicobacter pylori (H. pylori) infection and by improvements in diet and food conservation, led to the neglect of this neoplasm in terms of research and development efforts. However, the study of aetiological factors remains a global priority for the prevention and control of GC. The Stomach Cancer Pooling (StoP) Project, started in 2012, is an international consortium of case-control (CC) studies on GC, including nested CC studies [1]. The consortium brings together scientists from several areas of the world with the aim of studying the role of lifestyle, dietary habits and genetic determinants in GC aetiology through an individual participant data meta-analysis (IPD-MA). IPD-MA is considered a gold standard approach in evidence synthesis because it allows access to raw data from each participant, data checking, verification and centralized recoding of data according to common definitions, thus minimizing the publication and reporting bias typical of aggregated data meta-analysis [2]. At the time of writing, 31 case-control studies had agreed to participate in the StoP project, for a total of more than 14,000 GC cases and 33,700 healthy controls (StoP dataset release 2). These data represented a unique opportunity to better quantify the association of alcohol drinking and tobacco smoking with the risk of GC [3, 4]. Moreover, they are an opportunity for medical statisticians to explore and apply IPD-MA methods and to assess their drawbacks [5]. Among the hotly debated topics in the literature are the choice of the analysis method, i.e. one-stage or two-stage, and the management of systematically missing confounding factors, i.e. variables not collected in one or more studies of the IPD-MA.

Aims. We give a brief overview of the one-stage and two-stage analysis methods and apply some recently developed techniques that address the problem of systematically missing variables in an IPD-MA. Specifically, we considered i) a multivariate random-effects model in a two-stage framework (hereafter “Fully and Partially Adjusted Meta-Analysis”, FPAMA) [6] and ii) a one-stage analysis based on Multiple Imputation (MI) by Chained Equations (MICE), hereafter called multilevel multiple imputation (MLMI) [7]. These methods were applied within the StoP project to study the relationship between socioeconomic status and GC risk while adjusting for H. pylori infection, a variable that was not collected in some case-control studies of the StoP consortium.

Methods. Statistical methods for IPD-MA should take into account the hierarchical structure of the data, in which individuals are nested within studies. Two approaches have been described: a one-stage and a two-stage approach [8]. In the one-stage approach, data are analysed in a single step through a random-effects model, while in the two-stage approach each study is analysed separately and the results are combined through standard meta-analytic techniques. Several studies investigating differences between one-stage and two-stage IPD-MA, whether theoretically, empirically or via simulation, concluded that “largely the two approaches produce similar results” [5,8]. In the presence of one or more systematically missing confounders, researchers could choose to carry out either a complete data analysis, which includes only the studies without systematically missing variables, or a complete confounder analysis, which includes all the studies but adjusts only for the confounders available in all of them.
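
The second stage of the two-stage approach described above can be sketched as follows. This is only an illustration under stated assumptions: the study-specific log odds ratios and standard errors are invented, and the DerSimonian-Laird moment estimator is one common choice of random-effects estimator, assumed here because the abstract does not name one.

```python
import math

def dersimonian_laird(estimates, std_errors):
    """Pool study-specific estimates under a random-effects model."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in std_errors]            # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * b for wi, b in zip(w, estimates)) / sw
    # Cochran's Q: weighted dispersion of the estimates around the fixed-effect mean
    q = sum(wi * (b - fixed) ** 2 for wi, b in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                  # between-study variance, truncated at 0
    w_re = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * b for wi, b in zip(w_re, estimates)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re)), tau2

# Hypothetical first-stage output: log odds ratios and standard errors from four studies
log_or = [-0.30, -0.10, -0.45, -0.25]
se = [0.12, 0.15, 0.20, 0.10]

pooled, pooled_se, tau2 = dersimonian_laird(log_or, se)
print(f"pooled OR = {math.exp(pooled):.2f}, SE(log OR) = {pooled_se:.3f}, tau^2 = {tau2:.4f}")
```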
Both of these approaches are problematic: the complete data analysis discards data, with a loss of information and precision, and is also biased if the systematically missing variables are not missing completely at random (MCAR); the complete confounder analysis yields results that remain potentially confounded. The widely used two-stage approach, although flexible because it allows the entire IPD-MA dataset to be used, may also suffer from residual confounding in studies with systematically missing confounders.

Let us consider an IPD-MA of i = 1, …, N studies with j = 1, …, N_i subjects in the i-th study. We denote the observed dichotomous outcome (Y) for subject j in study i as y_ij and, without loss of generality, we consider a scenario with two binary covariates X1 and X2, where X1 is the exposure variable and X2 is a confounder that is systematically missing in M studies, with 0 < M < N. We anticipate some degree of heterogeneity between the studies and therefore consider random-effects models hereafter.

The FPAMA method [6] is a two-stage approach based on borrowing strength between studies that can provide both fully and partially adjusted estimates and studies that can provide only partially adjusted estimates. It is not an MI-based method, so missing values of X2 are not replaced by imputed values; instead, the relationship between the fully and partially adjusted estimates in the N-M complete studies is assumed to apply to the M studies where only partially adjusted estimates can be obtained. In the first step, for each of the N-M studies in which X2 was collected, the following fully and partially adjusted models can be fitted:

\[ y_{ij} \sim \mathrm{Bernoulli}(\pi_{ij}) \]
\[ \text{fully adjusted: } \eta_{ij} = \mathrm{logit}(\pi_{ij}) = \beta_{0i} + \beta_{1i}^{f} x_{1ij} + \beta_{2i} x_{2ij}, \quad j = 1, \dots, N_i \]
\[ \text{partially adjusted: } \eta_{ij} = \mathrm{logit}(\pi_{ij}) = \beta_{0i} + \beta_{1i}^{p} x_{1ij}, \quad j = 1, \dots, N_i \]

For the M studies where X2 is systematically missing, only the partially adjusted model can be fitted to the data. In the second step, the following bivariate within-study model is assumed for each study:

\[ \begin{pmatrix} \hat\beta_{1i}^{f} \\ \hat\beta_{1i}^{p} \end{pmatrix} \sim N\left( \begin{pmatrix} \beta_{1i}^{f} \\ \beta_{1i}^{p} \end{pmatrix}, \begin{pmatrix} \sigma_{1i}^{2} & \rho_i \sigma_{1i} \sigma_{2i} \\ \rho_i \sigma_{1i} \sigma_{2i} & \sigma_{2i}^{2} \end{pmatrix} \right) \]

where $\sigma_{1i}^{2}$, $\sigma_{2i}^{2}$ and $\rho_i$ are assumed to be known. In practice, the variances of the fully and partially adjusted regression coefficients $\hat\beta_{1i}^{f}$ and $\hat\beta_{1i}^{p}$ can be obtained from any statistical package, while the correlation coefficient $\rho_i = \mathrm{Corr}(\hat\beta_{1i}^{f}, \hat\beta_{1i}^{p})$ for the N-M complete studies can be obtained by bootstrapping [6]. Under the missing at random (MAR) assumption and for computational convenience, the M studies where X2 is systematically missing can be included in the multivariate model by assigning an arbitrary value to the fully adjusted coefficient, i.e. $\hat\beta_{1i}^{f} = 0$, with a very large within-study variance, i.e. $\hat\sigma_{1i}^{2} = 10^{6}$, and a null within-study correlation $\rho_i$ [9]. Because between-study heterogeneity is to be expected in IPD meta-analysis, the pair of true study-specific coefficients $(\beta_{1i}^{f}, \beta_{1i}^{p})$ can be modelled as:

\[ \begin{pmatrix} \beta_{1i}^{f} \\ \beta_{1i}^{p} \end{pmatrix} \sim N\left( \begin{pmatrix} \beta_{1}^{f} \\ \beta_{1}^{p} \end{pmatrix}, \begin{pmatrix} \tau_{1}^{2} & k \tau_{1} \tau_{2} \\ k \tau_{1} \tau_{2} & \tau_{2}^{2} \end{pmatrix} \right) \]

where $\tau_{1}^{2}$ and $\tau_{2}^{2}$ are the between-study variances and k is the between-study correlation. This leads to the marginal random-effects bivariate model:

\[ \begin{pmatrix} \hat\beta_{1i}^{f} \\ \hat\beta_{1i}^{p} \end{pmatrix} \sim N\left( \begin{pmatrix} \beta_{1}^{f} \\ \beta_{1}^{p} \end{pmatrix}, \begin{pmatrix} \sigma_{1i}^{2} + \tau_{1}^{2} & \rho_i \sigma_{1i} \sigma_{2i} + k \tau_{1} \tau_{2} \\ \rho_i \sigma_{1i} \sigma_{2i} + k \tau_{1} \tau_{2} & \sigma_{2i}^{2} + \tau_{2}^{2} \end{pmatrix} \right) \]

This model can be fitted by either maximum likelihood (ML) or restricted maximum likelihood (REML). Inference is based on the pooled fully adjusted regression coefficient $\beta_{1}^{f}$; unlike MI-based methods, the model does not provide a pooled estimate of the coefficient of the systematically missing confounder(s) X2.
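
As a rough illustration of the second-step inputs of FPAMA, the sketch below builds the within-study and marginal covariance matrices for one complete study and the placeholder entry for a study lacking X2. All numbers and function names are hypothetical, chosen only to show the structure; this is not the code of the package used for the analysis.

```python
def within_cov(se_full, se_part, rho):
    """2x2 within-study covariance of (fully adjusted, partially adjusted) estimates."""
    cov = rho * se_full * se_part
    return [[se_full ** 2, cov], [cov, se_part ** 2]]

def marginal_cov(within, tau_full, tau_part, k):
    """Add the between-study covariance (tau's and correlation k) to the within-study one."""
    between = k * tau_full * tau_part
    return [[within[0][0] + tau_full ** 2, within[0][1] + between],
            [within[1][0] + between, within[1][1] + tau_part ** 2]]

def placeholder_study(beta_part, se_part, big_var=1e6):
    """Study without X2: arbitrary fully adjusted estimate (0), huge variance, zero correlation."""
    return (0.0, beta_part), [[big_var, 0.0], [0.0, se_part ** 2]]

def det2(m):
    """Determinant of a 2x2 matrix, used to check positive definiteness."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical complete study: standard errors and bootstrap correlation
S = within_cov(se_full=0.12, se_part=0.11, rho=0.9)
# Hypothetical between-study standard deviations and correlation
V = marginal_cov(S, tau_full=0.10, tau_part=0.10, k=0.8)

# Hypothetical study where X2 was not collected
est_missing, S_missing = placeholder_study(beta_part=-0.20, se_part=0.15)

print("marginal covariance, complete study:", V)
print("placeholder estimates and covariance:", est_missing, S_missing)
```

With a variance of 10^6, the placeholder study carries essentially no direct information on the fully adjusted coefficient, so it contributes only through its partially adjusted estimate and the between-study correlation.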
To fit the FPAMA model, we used the R package “mvmeta” with maximum likelihood (ML) estimation.

The MLMI approach by Jolani et al. [7] is based on a generalized linear mixed model (GLMM) for specifying the conditional distributions (fully conditional specification, FCS) and uses the Wishart distribution to propagate uncertainty about the variance components of the model. The rationale behind MI is to create D copies of the IPD-MA dataset in which missing values are replaced by random draws from the posterior distribution of the missing values given the observed data, under the MAR assumption. For the N-M studies where X2 is observed, the following GLMM is fitted:

\[ x_{2ij} \sim \mathrm{Bernoulli}(\pi_{ij}), \qquad \xi_{ij} = \mathrm{logit}(\pi_{ij}) = \alpha_{0i} + \alpha_{1i} x_{1ij} + \alpha_{2i} y_{ij} \]
\[ \alpha_{ki} = \alpha_{k} + a_{ki}, \qquad a_{ki} \sim N(0, \psi_{k}^{2}), \qquad k = 0, 1, 2 \]

After obtaining estimates of $\alpha$ (fixed component) and of $\Psi$ and $a_i$ (random component), we take a random draw $\gamma_{k}^{*} \sim \mathrm{MVN}(\hat\alpha_{k}, \mathrm{var}(\hat\alpha_{k} \mid \hat\Psi_{k}))$ and compute $\Lambda_{k} = \sum_{i=1}^{N-M} a_{ki} a_{ki}^{T}$ for k = 0, 1, 2. For the M studies where X2 is systematically missing, we repeat the following D times: draw $\Psi_{k}^{*-1} \sim \mathrm{Wishart}(\Lambda_{k}^{-1}, N-M)$ (the Wishart distribution serving as the posterior for the inverse of the variance component), draw $a_{ki}^{*} \sim \mathrm{MVN}(0, \Psi_{k}^{*})$ for k = 0, 1, 2, and finally obtain the imputed values $x_{2ij}^{*}$ by drawing from a Bernoulli distribution with probability

\[ \pi_{ij}^{*} = \mathrm{logit}^{-1}\left[ (\gamma_{0}^{*} + a_{0i}^{*}) + (\gamma_{1}^{*} + a_{1i}^{*}) x_{1ij} + (\gamma_{2}^{*} + a_{2i}^{*}) y_{ij} \right] \]

After imputation, each of the D copies of the IPD-MA dataset can be analysed through the one-stage approach. This yields D sets of estimates of the fixed and random regression coefficients, one per imputed dataset. The final combined estimates of the fixed and random components are obtained through Rubin's rules [10]. We used the extension of the “mice” R package developed by Jolani et al. [7] to fit the model, adopting ML estimation through the Laplace approximation.

We used data from the first release of the StoP project dataset, including 23 case-control studies for a total of 10,290 GC cases and 26,153 controls [1]. Socioeconomic status was defined using educational level and categorized in each study as low, intermediate or high according to the International Standard Classification of Education (ISCED). Models included terms for age (<40, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, ≥75 years), sex, tobacco smoking (never, former, and current smokers of ≤10, >10 to ≤20, and >20 cigarettes/day), fruit and vegetable consumption (study-specific tertiles), study centre (for multicentric studies) and H. pylori infection. The latter variable was systematically missing in some studies of the consortium. To overcome this limitation, we applied the FPAMA and MLMI approaches described above to estimate the association between socioeconomic status and GC risk while adjusting for H. pylori infection.

Results. We found a strong inverse association between socioeconomic status and risk of GC. The two investigated methods, FPAMA and MLMI, gave similar results (Figure 1). Intermediate (MLMI: odds ratio (OR) 0.76, 95% confidence interval (CI) 0.64-0.90; FPAMA: OR 0.75, 95% CI 0.60-0.92) and high socioeconomic status (MLMI: OR 0.55, 95% CI 0.46-0.65; FPAMA: OR 0.52, 95% CI 0.42-0.64) were associated with a reduced risk of GC. Standard errors for the fixed effects were somewhat smaller for MLMI than for FPAMA (0.089 vs 0.109 for intermediate socioeconomic status and 0.165 vs 0.2 for high socioeconomic status).

Figure 1. Odds ratio estimates for the association between socioeconomic status and GC risk in the StoP project consortium, obtained with different approaches (FPAMA and MLMI) to deal with systematically missing confounders.

Conclusions. From a methodological point of view, both FPAMA and MLMI are feasible ways to account for systematically missing confounders, and they lead to very similar results. However, the MLMI method is more flexible than the FPAMA method, as it can also deal with sporadically missing values and handle more complex missingness patterns. The value of the methods presented and discussed here is greater for datasets smaller than the StoP dataset.
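
The pooling step of the MLMI approach described in the Methods (Rubin's rules) can be sketched as follows. The estimates and variances are invented for illustration, and the normal-approximation confidence interval is a simplifying assumption; Rubin's rules proper refer the pooled estimate to a t distribution.

```python
import math

def rubin_pool(estimates, variances):
    """Combine D per-imputation estimates and their squared standard errors."""
    d = len(estimates)
    q_bar = sum(estimates) / d                                   # pooled point estimate
    u_bar = sum(variances) / d                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (d - 1)       # between-imputation variance
    total = u_bar + (1.0 + 1.0 / d) * b                          # total variance
    return q_bar, math.sqrt(total)

# Hypothetical log odds ratios and variances from D = 5 imputed datasets
estimates = [-0.27, -0.29, -0.26, -0.30, -0.28]
variances = [0.0080, 0.0078, 0.0082, 0.0079, 0.0081]

pooled_est, pooled_se = rubin_pool(estimates, variances)
lower = math.exp(pooled_est - 1.96 * pooled_se)   # normal approximation (assumption)
upper = math.exp(pooled_est + 1.96 * pooled_se)
print(f"pooled OR = {math.exp(pooled_est):.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

The between-imputation term inflates the pooled standard error above the average within-imputation standard error, which is how the uncertainty due to the missing confounder enters the final estimate.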

Dealing with systematically missing confounders in individual participant data meta-analysis: an application to the relationship between socioeconomic status and gastric cancer risk in the Stomach cancer Pooling (StoP) project / M. Rota, F. Rumi, C. Pelucchi, N. Lunet, S. Boccia, E. Negri, C. La Vecchia. (Paper presented at the 11th conference "La statistica a supporto della salute: dalla prevenzione alla continuità delle cure", held in Gargnano in 2017.)

	Presentation date
	
			15-Sep-2017
		
	Keywords
	
			systematically missing data; individual participant data meta-analyses; socioeconomic status; gastric cancer risk; StoP project
		
	Scientific-disciplinary sectors of the contribution
	
			Sector MED/01 - Medical Statistics
		
	Type
	
			Conference Object
		
	Appears in types:
	
			14 - Unpublished conference contribution


Use this identifier to cite or link to this document: https://hdl.handle.net/2434/554026
