Introduction and objectives: Over recent decades, the exponential growth of data, especially in healthcare, has necessitated advanced analytical methods. Conventional machine learning algorithms often assume independence among data points, which limits their effectiveness with longitudinal and hierarchical data. This study introduces a novel algorithm called GMEXGBoost, a methodological extension of generalized mixed-effects models that leverages the boosting framework of XGBoost for estimating fixed effects while simultaneously accounting for random effects. The innovation lies in GMEXGBoost's ability to explicitly incorporate data correlations while retaining the predictive power of boosted trees.

Methods: The GMEXGBoost model was evaluated through extensive simulations and a real-world cohort study, benchmarking against GLMM, GLMMTree, GMERF, and XGBoost. Performance was assessed using predictive mean absolute deviation (PMAD), predictive misclassification rate (PMCR), sensitivity, specificity, accuracy, and AUC. Simulation analyses were conducted using multiple synthetic datasets, each comprising training and testing groups with varying effect structures, including random intercepts and slopes. All computations were performed in RStudio (version 2023.06.0).

Results: While XGBoost achieved the lowest average errors across most scenarios, GMEXGBoost consistently demonstrated superior stability and accuracy when random-effect variance was large or correlations were strong. On the real data, GMEXGBoost outperformed the other models on the performance metrics.

Conclusion: The GMEXGBoost algorithm, by combining the estimates of the GLMM and XGBoost models, leverages the capabilities of both and delivers improved performance in complex problems. Although it is not universally superior, it demonstrates clear advantages in the analysis of hierarchical and longitudinal datasets with strong correlations. These properties make it a valuable tool for decision-making in healthcare and other domains that involve complex, structured data.
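
The evaluation metrics named in the Methods can be illustrated with a minimal Python sketch. The abstract does not define them, so the definitions here are assumptions: PMAD is taken as the mean absolute difference between reference and predicted probabilities, PMCR as the share of misclassified observations at a 0.5 threshold, and AUC as the pairwise (Mann-Whitney) estimate. All function names and the toy numbers are hypothetical and are not taken from the paper, which reports its computations in R.

```python
from typing import Dict, Sequence


def pmad(ref_prob: Sequence[float], pred_prob: Sequence[float]) -> float:
    """Assumed PMAD: mean absolute deviation between reference and predicted probabilities."""
    return sum(abs(r - p) for r, p in zip(ref_prob, pred_prob)) / len(ref_prob)


def auc(y_true: Sequence[int], pred_prob: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count 0.5."""
    pos = [p for y, p in zip(y_true, pred_prob) if y == 1]
    neg = [p for y, p in zip(y_true, pred_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(
    y_true: Sequence[int], pred_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Threshold the predicted probabilities and summarize the confusion matrix."""
    y_pred = [1 if p >= threshold else 0 for p in pred_prob]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    tn = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 0)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    n = len(y_true)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / n,
        "pmcr": (fp + fn) / n,  # assumed PMCR: misclassified share
    }


# Toy data (hypothetical): six observations with known reference probabilities.
y_true = [1, 0, 1, 1, 0, 0]
ref_prob = [0.8, 0.3, 0.5, 0.6, 0.5, 0.2]
pred_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

metrics = classification_metrics(y_true, pred_prob)
metrics["pmad"] = pmad(ref_prob, pred_prob)
metrics["auc"] = auc(y_true, pred_prob)
print(metrics)
```

Under these assumed definitions, PMCR equals one minus accuracy, while PMAD needs reference probabilities, which are typically only available in simulation settings.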

Innovative statistical method for longitudinal and hierarchical data modeling: the GMEXGBoost method / F. Asadi, R. Homayounfar, Y. Mehrali, C. Masci, F. Zayeri. - In: BMC MEDICAL RESEARCH METHODOLOGY. - ISSN 1471-2288. - 26:1(2026 Jan), pp. 24.1-24.13. [10.1186/s12874-025-02751-7]

Innovative statistical method for longitudinal and hierarchical data modeling: the GMEXGBoost method

C. Masci;
2026

Boosted tree; Generalized linear mixed model; Longitudinal and hierarchical data
Sector STAT-01/A - Statistics
Jan-2026
Article (author)
Files in this item:
s12874-025-02751-7.pdf (open access)
Type: Publisher's version/PDF
License: Creative Commons
Size: 1.99 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1215356