PATIENT SIMILARITY NETWORKS-BASED METHODS FOR MULTIMODAL DATA INTEGRATION AND CLINICAL OUTCOME PREDICTION

Gliozzo, J.

Artificial Intelligence is significantly contributing to the development of Precision Medicine, a revolutionary approach in healthcare. This new medical paradigm leverages heterogeneous patients' descriptions to enhance disease prevention, diagnosis, prognosis, and treatment. The progress of Precision Medicine significantly relies on the ability to collect comprehensive data from each patient, covering various aspects of their disease. This includes gathering information at different levels to form a complete picture of the pathology, incorporating genomic, environmental, and lifestyle factors. Recent technological advancements have facilitated the ongoing collection of a vast array of diverse health data. This includes clinical data, radiological and digital pathology images, molecular data, and information from wearable devices and smartphones. Such developments have ushered the field of biomedicine into the era of "Big Data", presenting unique challenges in data analysis. Among these various data types, multi-omics data has become the gold standard for the in-depth molecular-level characterization of individuals. Multi-omics data involves measuring a comprehensive set of biomolecules that describe a specific biological layer. Examples include DNA sequencing, gene expression levels, DNA methylation, and protein expression. As a result, this data is both multimodal and high-dimensional, offering a more nuanced understanding of biological processes. Network Medicine is an innovative and interdisciplinary field of bioinformatics research that applies the principles of network science to understand and treat diseases. It revolves around the concept that most diseases are not caused by individual genes, but rather by complex interactions within biological networks. In Network Medicine, diseases are viewed as the result of perturbations in these networks. 
By understanding these network interactions and their degradation, researchers can better understand the multifactorial nature of diseases, therefore providing crucial insights that enable Precision Medicine to develop more targeted and effective prevention, diagnosis, prognosis, and treatment strategies. In the current landscape, Patient Similarity Networks (PSNs) are emerging as a powerful tool for representing patients' clinical and molecular profiles. These networks adapt the principles of Network Medicine, originally applied to biomolecular networks, to patient data. In a PSN, each node represents a patient, and the edges express the degree of similarity between patients' clinical or biomolecular profiles. PSNs have proven particularly effective in key Precision Medicine tasks such as patient subtyping and predicting clinical outcomes through clustering and classification techniques. They offer the benefits of being interpretable and privacy-preserving, and are capable of integrating multimodal data. The growing availability of multi-omics data, combined with the impressive performance of PSN-based models, underscores the need for developing sophisticated methods to integrate multimodal data in constructing PSNs. This approach promises to enhance the effectiveness of PSNs in capturing the nuances of patient profiles, further advancing the capabilities of Precision Medicine. The first contribution of this thesis is an extensive literature review that collects and critically analyses multimodal data fusion methods for the computation of PSNs. Firstly, recognizing that the development of PSNs is highly dependent on the selection of patient similarity measures, we conduct a comprehensive review of the most commonly used and effective similarity metrics, encompassing multiple data types, including categorical, discrete, continuous, and binary. 
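As an illustration of the PSN definition given above (nodes are patients, edges encode profile similarity), the following is a minimal sketch for continuous features only. The function name `build_psn`, the Gaussian kernel, the median-based bandwidth, and the k-nearest-neighbour sparsification are assumptions chosen for illustration; they are not the similarity measures or construction procedure adopted in the thesis.

```python
import numpy as np

def build_psn(X, k=2):
    """Hypothetical helper: build a symmetric k-NN patient similarity matrix.

    X is a (patients x features) array of continuous values.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Standardise features so that no single feature dominates the distance.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # Pairwise squared Euclidean distances between patients.
    sq = np.sum(Z ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    # Gaussian kernel; the bandwidth is the median off-diagonal distance
    # (an illustrative choice, not a recommendation).
    med = np.median(D2[~np.eye(n, dtype=bool)])
    sigma2 = med if med > 0 else 1.0
    W = np.exp(-D2 / (2.0 * sigma2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    # Keep the k strongest edges of each patient, then symmetrise the mask.
    idx = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], idx] = True
    mask = mask | mask.T
    return np.where(mask, W, 0.0)
```

The returned matrix can be read as a weighted adjacency matrix: entry (i, j) is non-zero only if patient j is among the k most similar patients of i, or vice versa.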
Next, because the literature on multimodal data fusion is wide and varied, with numerous taxonomies proposed over the years, we compiled and organized these works to offer a clear starting point for exploring this field. The acquired knowledge allowed us to propose a novel taxonomy of multimodal data integration approaches for building and analyzing PSNs. In more detail, we grouped all the surveyed approaches into three categories (PSN-fusion methods, Input data-fusion methods, and Output-fusion methods) and identified their advantages and drawbacks. To provide readers with comprehensive knowledge, we considered not only bioinformatics methods but also approaches proposed in the machine learning literature that could potentially be leveraged to build PSNs. For the same reason, we did not limit the discussion to methods for the integration of multi-omics data, as done by other reviews in this context, but also considered approaches developed to fuse clinical data and medical images.

Besides the faceted semantics of multi-omics data, which requires proper integration approaches, -omics data is characterized by high dimensionality combined with a limited sample size (the "small-sample-size" problem). The high level of sparsity in the resulting datasets often leads to high computational costs and to biased (supervised and unsupervised) analyses, mainly due to redundant and noisy features. These problems can be mitigated by applying unsupervised feature selection and feature extraction approaches to reduce the dimensionality of individual views. However, dimensionality reduction is a challenging task: besides the choice of the reduction algorithm, the dimension of the reduced space is a critical parameter that must be chosen carefully. 
Several bioinformatics works neglect the importance of this pre-processing step and set the dimensionality of each view to a heuristic value that is the same for every view, thus ignoring that different views can carry different amounts of information and that some of them might not need reduction at all. To deal with this issue, the second contribution of this thesis is a novel dimensionality reduction (DR) approach that exploits block-analysis to obtain an unbiased estimate of the intrinsic dimensionality (id) of individual data sources. Besides providing an unbiased id estimate that can be used to set the dimension of the reduced space onto which the view under analysis is projected, the analysis of the distribution of the id estimates for increasing block sizes provides hints about the amount of noise and redundancy affecting the view. Leveraging this analysis, we automatically tailored the DR method so that views affected by higher amounts of noise and redundancy undergo two consecutive DR steps (feature selection followed by feature extraction), while views with negligible amounts of noise and redundancy undergo a single feature extraction step. We extensively validated our proposal on a supervised task (outcome prediction via Random Forest classifiers) using nine multi-omics cancer datasets. The experiments allowed an unbiased comparative evaluation of different state-of-the-art feature selection/extraction approaches and data fusion methods. 
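The one-step versus two-step reduction described above can be sketched as follows. This is only an illustration under stated assumptions: the id estimate and the noise/redundancy flag are taken as inputs (their computation via block-analysis is not reproduced here), and the variance filter, the PCA projection, the names `reduce_view` and `pca_project`, and the default `n_selected` are hypothetical stand-ins rather than the feature selection and extraction methods evaluated in the thesis.

```python
import numpy as np

def pca_project(X, d):
    """Project the rows of X onto the first d principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    d = min(d, Vt.shape[0])
    return Xc @ Vt[:d].T

def reduce_view(X, id_estimate, noisy, n_selected=None):
    """Reduce one view to a dimension given by an externally supplied id estimate.

    noisy=True  -> feature selection followed by feature extraction.
    noisy=False -> feature extraction only.
    """
    X = np.asarray(X, dtype=float)
    d = max(1, int(round(id_estimate)))
    if noisy:
        if n_selected is None:
            # Hypothetical default: keep ten times as many features as the target dimension.
            n_selected = min(X.shape[1], 10 * d)
        # Illustrative unsupervised filter: keep the highest-variance features.
        top = np.argsort(-X.var(axis=0))[:n_selected]
        X = X[:, np.sort(top)]
    return pca_project(X, d)
```

For example, `reduce_view(X, id_estimate=3.2, noisy=True)` returns a matrix with one row per patient and three columns; each view would be reduced independently with its own id estimate before any data fusion step.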
The experiments showed that: (I) using the id estimate outperforms common heuristics for setting the dimension of the lower-dimensional space; (II) the two-step DR approach we proposed is an effective solution that computes a robust representation of all the available -omics sources, avoiding the time and costs needed to assess and compare different -omics combinations; (III) in our problem, the integration of demographic and multi-omics data improves classification performance.

The last important contribution of this thesis is a data integration method for "partially missing data". More precisely, one of the main problems hampering the application of many effective state-of-the-art multi-omics data fusion techniques is the frequent incompleteness of the -omics data sources, meaning that some patients have one or more completely missing data sources. Most data integration approaches cannot handle these partial datasets, which are nevertheless frequent in medical research and clinical practice. Moreover, they are often designed and/or evaluated for specific learning tasks, of which unsupervised clustering is the most common. To address these points we propose miss-SNF, a novel task-agnostic data fusion approach that, leveraging a nonlinear message-passing learning strategy, integrates incomplete datasets and partially reconstructs missing pairwise similarities by exploiting patients’ neighbourhoods. The computed integrated PSN can be used to perform unsupervised clustering and (semi-)supervised classification tasks. We evaluated miss-SNF on the integration of nine multi-omics datasets with increasing percentages of completely missing data sources in each view, considering the classification of two clinical outcomes and patient clustering. We show that miss-SNF has a superior ability to recover missing pairwise similarities while maintaining network topology with respect to state-of-the-art data integration approaches (i.e. NEMO, MOFA+). It leads to clusters that are enriched for clinical variables and capture differences in survival. The integrated PSN can be effectively exploited to perform classification with both supervised and semi-supervised learning algorithms (i.e. Random Forests, label propagation and guilt-by-association), and can thus potentially be used to integrate datasets containing unlabeled patients, which are common in multi-omics studies. To the best of our knowledge, this is the first time that a multimodal integration method for incomplete datasets has been extensively evaluated on such different learning tasks.

The thesis is structured as follows. Chapter 1 introduces the context of the work, focusing on the use of Patient Similarity Network (PSN)-based methods for integrating multimodal data in Precision Medicine. Chapter 2 provides a thorough review and analysis of data integration approaches for constructing PSNs, highlighting existing challenges in the field, for which we provide our solutions in the subsequent chapters. Chapter 3 presents the author's dimensionality reduction approach, including the experimental setup and results demonstrating its effectiveness. Chapter 4 details the miss-SNF algorithm and its evaluation in the reconstruction and integration of partial datasets. Experimental results in unsupervised and supervised analysis of multimodal PSNs show that miss-SNF is competitive with state-of-the-art methods. Chapter 5 summarizes the main contributions of the thesis and outlines potential future work in this area.
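To make concrete how an integrated PSN can serve semi-supervised classification when some patients are unlabeled, as mentioned above for label propagation, the following is a minimal sketch of a generic iterative label propagation with clamping of the known labels. It is not the miss-SNF algorithm nor the exact classifier configuration used in the thesis; the function name `label_propagation`, the use of `-1` to mark unlabeled patients, and the fixed iteration count are assumptions made for illustration.

```python
import numpy as np

def label_propagation(W, labels, n_iter=200):
    """Propagate known labels over a similarity matrix W.

    labels holds non-negative integer class labels, with -1 for unlabeled patients.
    Returns one predicted class per patient.
    """
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    known = labels >= 0
    classes = np.unique(labels[known])
    # Row-normalise similarities into transition probabilities.
    deg = W.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    P = W / deg
    # One-hot encoding of the known labels; unlabeled rows start at zero.
    Y0 = (labels[:, None] == classes[None, :]).astype(float)
    F = Y0.copy()
    for _ in range(n_iter):
        F = P @ F             # each patient averages its neighbours' scores
        F[known] = Y0[known]  # clamp the labeled patients to their true labels
    return classes[np.argmax(F, axis=1)]
```

In this sketch the similarity matrix would be the fused network produced by the integration step; unlabeled patients take the class that accumulates the highest score from their neighbourhood.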

PATIENT SIMILARITY NETWORKS-BASED METHODS FOR MULTIMODAL DATA INTEGRATION AND CLINICAL OUTCOME PREDICTION / J. Gliozzo ; supervisor: G. Valentini ; co-supervisor: E. Casiraghi, A. Patak ; coordinator: R. Sassi. Dipartimento di Informatica Giovanni Degli Antoni, 2024 Apr. 36. ciclo, Anno Accademico 2022/2023.