"Motivation" - Recent advances in high-throughput technologies, data transmission, and data storage have allowed the generation, acquisition, and storage of huge amounts of data from multiple sources (multimodal datasets). In bioinformatics, exploiting such datasets requires data integration techniques that discard redundant information while enhancing the information characterizing each source. Among the techniques proposed in the literature, early integration methods [doi: 10.1089/10665270252935539] assume that samples lie in a latent space from which multiple source-views are generated by unknown projections. This results in multiple data-views expressed as separate source-specific spaces that are characterized by both an individual and a shared structure (variance), where the latter causes collinearities between data-blocks. Therefore, early integration methods estimate the embedding into a shared latent space by minimizing the redundancy between the input data-blocks while maximizing their individual variability. The resulting integrated representation may improve the results of subsequent analyses, such as supervised learning (e.g. patient outcome prediction) or unsupervised clustering (e.g. patient subtype identification).

"Methods" - We analyzed data from the TCGA dataset through several experimental pipelines in which Hierarchical PCA (HPCA, doi: 10.1002/cem.811) and Integrative Non-negative Matrix Factorization (iNMF, doi: 10.1093/bioinformatics/btv544) are compared and integrated. HPCA applies two consecutive PCA steps. The first PCA step is applied to each data-block separately to obtain a lower-dimensional, normalized data-block representation in which within-source redundancies are minimized. These lower-dimensional data-blocks are then concatenated and used as input to the second PCA step, which aims to remove between-source redundancies while extracting salient information.
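The two-step HPCA procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`pca_scores`, `hierarchical_pca`), the synthetic data, the numbers of components, and the choice of Frobenius-norm scaling as the block normalization are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return the PCA scores of X (samples x features) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def hierarchical_pca(blocks, block_components, final_components):
    """Two-step PCA: per-block reduction, concatenation, then a global PCA."""
    reduced = []
    for X, k in zip(blocks, block_components):
        scores = pca_scores(X, k)  # step 1: reduce within-source redundancy
        # Assumed normalization: scale each block to unit Frobenius norm so that
        # no single source dominates the second step.
        reduced.append(scores / np.linalg.norm(scores))
    concatenated = np.hstack(reduced)
    # Step 2: reduce between-source redundancy on the concatenated blocks.
    return pca_scores(concatenated, final_components)

# Synthetic example: three views generated from one shared latent space.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
blocks = [
    latent @ rng.normal(size=(3, d)) + 0.1 * rng.normal(size=(50, d))
    for d in (40, 25, 60)
]
embedding = hierarchical_pca(blocks, block_components=(5, 5, 5), final_components=3)
```

The rows of `embedding` correspond to samples and could then be passed to a downstream classifier or clustering method.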
iNMF is an extension of joint Non-negative Matrix Factorization (jNMF, doi: 10.1093/nar/gks725). jNMF addresses the data integration problem by solving multiple NMF problems subject to a shared factor matrix, thereby neglecting the individual information characterizing each source. To overcome this limitation, iNMF estimates factor matrices composed of both a shared and an individual structure. Our experimental pipelines are shown in Fig. 1 (top). Besides comparing HPCA and iNMF, we also experimented with a combination of PCA, for dimensionality reduction, and iNMF, for integrating the resulting lower-dimensional data (PCA-iNMF pipeline). To remove redundancies between the individual and shared spaces identified by iNMF, we added a final PCA step to the PCA-iNMF pipeline, thereby essentially integrating HPCA and iNMF (PCA-iNMF-PCA pipeline).

"Results" - In our experiments we used multi-omics data from the TCGA breast cancer dataset. Each patient is described by high-dimensional methylation, miRNA, and mRNA profiles, CNV scores, and a binary outcome variable (progression-free interval event). The results were compared by visual analysis of the clusters obtained by UMAP [doi: 10.48550/arXiv.1802.03426], a dimensionality reduction technique based on manifold learning that is designed to preserve both the local and global manifold structures, and by applying Random Forest classifiers (RF, doi: 10.1023/A:1010933404324) to the integrated dataset. Next, considering the class imbalance, we also tested data resampling techniques (undersampling, oversampling, SMOTE - doi: 10.1613/jair.953) and cost-sensitive learning during the training phase. In our preliminary results (Fig. 1, bottom), the best performance was achieved by the PCA-iNMF-PCA pipeline.
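Two of the imbalance-handling strategies mentioned above, random oversampling and cost-sensitive learning, can be illustrated with a short sketch. It is only an illustration on synthetic data: the helper names (`random_oversample`, `balanced_class_weights`), the class proportions, and the inverse-frequency weighting rule are assumptions, as the abstract does not state which settings were used. Both operate on the training split only.

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    selected = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the missing samples with replacement (none for the majority class).
        extra = rng.choice(idx, size=target - n, replace=True)
        selected.append(np.concatenate([idx, extra]))
    keep = np.concatenate(selected)
    return X[keep], y[keep]

def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Synthetic, imbalanced training set: 15 events (label 1) versus 85 non-events.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = np.array([1] * 15 + [0] * 85)

X_bal, y_bal = random_oversample(X_train, y_train, rng)
weights = balanced_class_weights(y_train)
```

The weights follow the same formula that scikit-learn uses for `class_weight="balanced"`, so a dictionary like `weights` could be passed to a classifier that accepts per-class weights instead of resampling the data.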

Comparison of early integration approaches for cancer survival prediction / M. Gnuva, J. Gliozzo, A. Paccanaro, G. Valentini, E. Casiraghi. (Paper presented at the 18th Annual Meeting of the Bioinformatics Italian Society, held in Verona in 2022.)

Comparison of early integration approaches for cancer survival prediction

J. Gliozzo; G. Valentini; E. Casiraghi
2022

Sector INF/01 - Computer Science
Conference Object
File: 503-BITS2022-submitted_abstract.pdf (open access; pre-print, manuscript submitted to the publisher; Adobe PDF, 197.85 kB)
Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1022768