Sample size and predictive performance of machine learning methods with survival data: A simulation study / G. Infante, R. Miceli, F. Ambrogi. - In: STATISTICS IN MEDICINE. - ISSN 0277-6715. - 42:30(2023 Dec 30), pp. 5657-5675. [10.1002/sim.9931]

Sample size and predictive performance of machine learning methods with survival data: A simulation study

G. Infante (first author); F. Ambrogi (last author)
2023

Abstract

Prediction models are increasingly developed and used in diagnostic and prognostic studies, where machine learning (ML) methods are becoming more popular than traditional regression techniques. For survival outcomes, the Cox proportional hazards model is generally used, and it has been shown to achieve good predictive performance with a few strong covariates. The possibility of improving model performance by including nonlinearities, covariate interactions, and time-varying effects, while controlling for overfitting, must be carefully considered during the model-building phase. ML techniques, on the other hand, can learn such complexities from the data, at the cost of hyper-parameter tuning and reduced interpretability. One aspect of special interest is the sample size needed to develop a survival prediction model. While guidance exists for traditional statistical models, the same is not true for ML techniques. This work develops a time-to-event simulation framework to evaluate the performance of Cox regression compared with, among others, tuned random survival forests, gradient boosting, and neural networks at varying sample sizes. Simulations were based on replications of subjects from publicly available databases, with event times simulated according to a Cox model with nonlinear effects of continuous variables and time-varying effects, and on the SEER registry data.
No
Language: English
Keywords: machine learning; prediction; sample size; simulation; time-to-event
Sector MED/01 - Medical Statistics
Article
Anonymous expert reviewers
Scientific publication
Project: Innovative statistical methods in biomedical research on biomarkers: from their identification to their use in clinical practice
Funder: MINISTERO DELL'ISTRUZIONE E DEL MERITO
Project code: 20178S4EK9_004
30 Dec 2023
10 Nov 2023
John Wiley and Sons Ltd
Volume 42, issue 30, pp. 5657-5675 (19 pages)
Published
Journal of international relevance
Open access
Research products::01 - Journal article
Article (author)
Journal with Impact Factor
G. Infante, R. Miceli, F. Ambrogi
Files in this item:
File: Statistics in Medicine - 2023 - Infante - Sample size and predictive performance of machine learning methods with survival.pdf
Access: open access
Description: Research Article
Type: Publisher's version/PDF
Size: 3.51 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1023543
Citations
  • PMC 0
  • Scopus 2
  • ISI 2