Expression levels measured in biological samples are affected by the intrinsic heterogeneity of cells and by tissue composition. Nevertheless, in bulk transcriptional profiling, each sample is evaluated without considering the presence of multiple subpopulations. This limitation can be critical when analyzing bulk gene expression profiles of cancer samples, where dissecting the mix of cell populations could shed light on intratumoral heterogeneity and on the molecular mechanisms shaping different cancer behaviors. Since changes in tumor composition can affect the prediction of both patient survival and therapeutic response, reaching high confidence about the real content of these bulk tissues is important. For this reason, several deconvolution tools have been developed to infer (deconvolve) the signals of each constituent cell type from bulk gene expression data. Historically, these tools have been developed mainly to define leukocyte proportions, and their performance has mostly been validated on profiles of purified cells. In this project, we first surveyed the state of the art of existing methods for transcriptional deconvolution. After evaluating them, we retained four tools (CIBERSORT, EPIC, ssGSEA and xCell) to define a bioinformatics framework for deconvolution analysis. Next, using independent, selected studies, we investigated how these tools perform on different cell types and data formats. First, we assessed the presence of potential biases in deconvolution methods using profiles of purified cells from different public datasets. Then, based on three public single-cell RNA-seq datasets from different tumors (breast cancer, lung cancer and melanoma), we evaluated the tools' capability to estimate different cell types at variable abundances, and finally wrapped these results in an interactive web application named ARDESIA (https://bicciatolab.shinyapps.io/ardesia/).
The second part of this work investigated the adaptability of the deconvolution analysis pipeline and its application in different conditions. To this end, we exploited a mouse database containing expression data generated under rigorously standardized conditions (ImmGen) to create a novel gene signature able to discriminate a wide range of immune cell populations, in particular from the myeloid lineage. When tested on purified murine samples, this new signature was able to discriminate closely related cells with similar transcriptional profiles, such as the same cell type from different tissues (e.g. macrophages from alveolar or peritoneal tissue). Based on this validation, we applied the same approach to further investigate subtype heterogeneity in breast cancer (BC). We started from a dataset of breast cancer subtypes defined by immunohistochemistry (IHC) to create a custom signature of 230 genes. We then applied this signature to deconvolve two cohorts of clinically defined triple negative breast cancer (TNBC) samples. Although both datasets were clinically uniform, deconvolution analysis highlighted a variable degree of heterogeneity in tumor subtypes for about 40% of samples. Testing the TNBC fraction identified through deconvolution against either clinical response or survival refined a subgroup of patients characterized by poorer response and survival due to the heterogeneous composition of the tumor. In conclusion, we created a general bioinformatics framework to identify cell subpopulations from bulk transcriptional data by deconvolution analysis. Furthermore, we generated two molecular signatures to address bulk heterogeneity, either for immune populations in mouse or for tumor subtypes in breast tumors.
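As an illustration of what inferring cell-type signals from bulk data means, the following is a minimal sketch of reference-based deconvolution. It assumes that a bulk profile is a non-negative mixture of reference cell-type profiles and estimates the mixing fractions by non-negative least squares, solved here with plain projected gradient descent. This is a generic textbook formulation and is not presented as the algorithm of any of the four tools named above. The function name `deconvolve`, the toy signature matrix and the fractions are all hypothetical.

```python
def deconvolve(signature, bulk, n_iter=5000):
    """Estimate non-negative cell-type fractions from one bulk profile.

    signature: list of rows, one per gene; each row holds the reference
               expression of that gene in every cell type (hypothetical data).
    bulk:      list of bulk expression values, one per gene.
    Minimizes ||S f - b||^2 subject to f >= 0 by projected gradient descent,
    then rescales the solution so the fractions sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # The squared Frobenius norm of S bounds the largest eigenvalue of S^T S,
    # so its inverse is a safe (if conservative) step size.
    step = 1.0 / sum(v * v for row in signature for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        # Residual S f - b, one value per gene.
        residual = [
            sum(row[k] * weights[k] for k in range(n_types)) - bulk[g]
            for g, row in enumerate(signature)
        ]
        # Gradient S^T (S f - b), one value per cell type.
        gradient = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        weights = [max(0.0, w - step * d) for w, d in zip(weights, gradient)]
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Toy example: 4 genes x 3 cell types, all values invented for illustration.
signature = [
    [10.0, 1.0, 0.0],
    [1.0, 8.0, 1.0],
    [0.0, 2.0, 9.0],
    [3.0, 3.0, 3.0],
]
true_fractions = [0.6, 0.3, 0.1]
# Simulate a noise-free bulk sample as the weighted sum of the references.
bulk = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]

estimated = deconvolve(signature, bulk)
print([round(x, 4) for x in estimated])
```

On this noise-free toy mixture the estimate should closely match the simulated fractions. Real bulk data add noise, platform effects and cell types missing from the reference, which is where the biases assessed in this work come from.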

DEVELOPMENT OF A BIOINFORMATICS FRAMEWORK TO IDENTIFY CELL SUBPOPULATIONS FROM BULK TRANSCRIPTIONAL DATA / A. Grilli ; tutor: C. Battaglia ; co-tutor: S. Bicciato ; coordinator: M. Samaja. Università degli Studi di Milano, 2020 Jan 17. 32. ciclo, Anno Accademico 2019. [10.13130/grilli-andrea_phd2020-01-17].

DEVELOPMENT OF A BIOINFORMATICS FRAMEWORK TO IDENTIFY CELL SUBPOPULATIONS FROM BULK TRANSCRIPTIONAL DATA

A. Grilli
2020

17 Jan 2020
Sector BIO/10 - Biochemistry
Deconvolution; immune infiltration; bulk cancer; gene signature; TNBC; scRNA-seq; deconvolution tools; CIBERSORT; EPIC; ssGSEA; xCell
BATTAGLIA, CRISTINA
SAMAJA, MICHELE
Doctoral Thesis
Files in this record:
phd_unimi_R11684.pdf
Open Access since 4 July 2021
Description: Complete doctoral thesis manuscript
Type: Complete doctoral thesis
Size: 8.06 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/699966