The wide availability of transcriptomics data, together with advances in experimental technologies, computational tools, and statistical methods such as machine learning (ML), has significantly improved the use of transcriptome data in cancer research, benefiting both diagnostic and prognostic clinical applications.

Gene expression can be used to predict the tissue of origin (TOO) of tumor samples, which is relevant in several scenarios: the diagnosis of cancer of unknown primary (CUP), early cancer detection by means of circulating tumor cells (CTCs) or cell-free RNA (cfRNA), and the identification of mislabelled cell lines. So far these problems have been addressed separately; we ask whether a single classifier is sufficient to solve them all, and we investigate what determines cancer origin prediction performance. To this end, we develop TOPOS (Tissue of Origin Predictor of Onco-Samples), an ML classifier that distinguishes 15 different tissues, and we test it on more than 36,000 samples spanning primary tumors, metastases, CTCs, patient-derived xenografts (PDXs), cancer cell lines and organoids. We demonstrate that, by adopting a specific training composition and data processing approach, we can achieve high performance across different onco-types, including CTCs, suggesting the applicability of gene expression data in liquid biopsy.

Cancer transcriptomic profiles also contain features that carry information about tumor prognosis. To identify prognostic biomarkers, statistical methods such as Cox regression have been adapted from survival analysis. However, these methods rely on assumptions that may not always hold, and they limit which feature-survival relationships can be charted. One example is non-monotonic relationships, which have been shown to exist for mutation load and chromosomal instability (CIN). We ask whether a similarly non-monotonic relationship also occurs at the gene expression level. We demonstrate that for some genes a poor or favorable outcome is found at an intermediate transcription level, and that the shape of the gene expression distribution plays a role in determining which expression range is more likely to be associated with good or bad patient prognosis.

In this study, we examine one diagnostic and one prognostic application of gene expression data and explore how computational techniques can enhance its use in cancer research.
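The idea of a non-monotonic relationship between expression and outcome can be illustrated with a small sketch. This is not the method used in the thesis: it is a deliberately simplified illustration that assumes uncensored survival times, splits patients into expression tertiles, and compares the median survival of each group. The function names (`tertile_groups`, `median_survival_by_group`, `is_non_monotonic`) and the toy values are hypothetical; a real analysis would need to handle censoring, for example with Kaplan-Meier estimates.

```python
from statistics import median


def tertile_groups(expression):
    """Label each sample 'low', 'medium' or 'high' by its expression rank."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    n = len(order)
    labels = [None] * n
    for rank, idx in enumerate(order):
        if rank < n / 3:
            labels[idx] = "low"
        elif rank < 2 * n / 3:
            labels[idx] = "medium"
        else:
            labels[idx] = "high"
    return labels


def median_survival_by_group(expression, survival_months):
    """Median survival per tertile (assumption: no censored observations)."""
    groups = {"low": [], "medium": [], "high": []}
    for label, months in zip(tertile_groups(expression), survival_months):
        groups[label].append(months)
    return {name: median(values) for name, values in groups.items()}


def is_non_monotonic(medians):
    """True when the medium group is the best or the worst of the three."""
    low, mid, high = medians["low"], medians["medium"], medians["high"]
    return mid < min(low, high) or mid > max(low, high)


# Toy, made-up values for one hypothetical gene (not data from the thesis):
# patients with intermediate expression have the shortest survival.
expression = [1.0, 1.5, 2.0, 4.0, 4.5, 5.0, 8.0, 8.5, 9.0]
survival_months = [50, 55, 60, 20, 25, 30, 52, 58, 62]

medians = median_survival_by_group(expression, survival_months)
print(medians, is_non_monotonic(medians))
```

In this toy case a single linear term, as in a standard Cox model, would see little difference between the low and high groups and could miss the poor outcome at intermediate expression, which is the kind of pattern the abstract refers to.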
INFERRING TUMOR PROPERTIES FROM GENE EXPRESSION DATA / D. Cagnina ; tutor: M. Schaefer ; co-tutor: S. Santaguida. Dipartimento di Oncologia ed Emato-Oncologia, 2025 Jan 21. 36th cycle, academic year 2023/2024.
File | Description | Type | Access | Size | Format
---|---|---|---|---|---
phd_unimi_R13134.pdf | Doctoral thesis | Other | Open access | 6.28 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.