In the era of big data, advanced computational techniques are needed to process, analyze and visualize increasing amounts of data generated by high-throughput technologies. In this context, analyzing biomedical Knowledge Graphs that embrace bio- logical and medical concepts structured in ontologies and data generated from high- throughput bio-technologies represents a central Machine Learning and Computational Biology challenge. Indeed several compelling problems in Network Medicine, ranging from gene-disease prioritization to drug-target prediction or drug repurposing can be modelled as node label or edge prediction problems in graphs, where nodes represent bio-medical entities such as genes, drugs or diseases and edges interactions or relationships between them. Recently Graph Representation Learning (GRL) methods opened new possibilities for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are be- yond the capabilities of current methods and software implementations. To deal with this open problem, the first contribution of this thesis is the design and development of the GRAPE (Graph Processing and Embedding) resource for GRL, able to scale with big graphs thanks to specialized and innovative data structures and algorithms efficiently implemented through parallel computation. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as a substantial and statistically significant improvement in edge prediction and node label prediction performance. GRAPE provides over 80, 000 graphs from the literature and other sources, standardized interfaces allowing a straightforward integration of third-party libraries, 61 node embedding methods, 25 inference models, and three modular pipelines to allow a findable, accessible, interoperable, and reusable (FAIR) and reproducible comparison of methods and libraries for graph processing and embedding. GRAPE can quickly generate billions of sampled random walks for random-walk-based GRL algorithms, such as DeepWalk of Node2vec, thus leading to very accurate embedded graphs and boosting machine learning methods' performance that learns from the embedded vector representation of nodes and edges. The scaling properties of GRAPE concerning state-of-the- art resources have been analyzed through extensive experiments with real-world big graphs, including, e.g. Wikipedia, the Comparative Toxicogenomic Database (CTD) and a big biomedical Knowledge Graph generated by PheKnowLator. GRAPE significantly outperforms state-of-the-art libraries in terms of empirical time and space complexity, edge prediction performance and can process big graphs even when the other competing state-of-the-art resources fail. As a second contribution for efficiently processing and analyzing big graphs, this the- sis proposes a novel algorithmic framework, efficiently implemented in GRAPE, that we named ALPINE: Abstract Landmark Properties-Inferred Node Embedding. The breakthrough characteristics of this algorithmic framework allow us to deal with several issues affecting SOTA GRL methods: 1. The embedding features are independently computed, each from the others, thus overcoming the space and time complexity limitations due to their dependent computation. 2. Feature computation is based on the "landmarks", i.e. sets of nodes representing meaningful concepts about the structure or the semantics of the underlying graph, thus assuring the interpretability of the embeddings. 3. Small integers are used for embedded features values: this assures a small memory footprint, hardware acceleration, and a good compression ratio because of the scale-free distribution of node degrees that often characterize biomedical networks 4. "Democratic" feature representation, thus avoiding the bias towards a high degree nodes characteristic of embedding methods based on topological sampling. We present two algorithms based on the ALPINE framework: a) SPINE (Shortest Paths Inferred Node Embedding), based on the efficient computation of the shortest path distance from the landmarks; b) WINE (Windows Inferred Node Embedding) based on the efficient computation of the co-occurrences of each node with the land- marks within windows of a given size during a breadth-first search. The breakthrough scaling properties of the proposed algorithms are shown in experiments with real-world big graphs. GRAPE and ALPINE implementations are available from https://github.com/AnacletoLAB/ grape. The thesis is organized as follows. In chapter 2, the architecture of GRAPE, the graph- processing and graph-representation learning algorithms it provides are described. In chapter 3, we sketch both the open libraries for graph processing and the datasets that have been used to perform the experiments for comparing GRAPE to state-of-the-art works. Chapter 4 describes all the experimental settings and the achieved comparative evaluation results in detail. In chapter 5, we focus on the novel ALPINE algorithmic framework and compare its performance with those of state-of-the-art models. The conclusions summarize the main contributions of the thesis, as well as the drawbacks and limitations of the proposed methods and depict ongoing and future work on scalable GRL methods and their application to Network Medicine.

SCALABLE GRAPH REPRESENTATIONAL LEARNING ALGORITHMS FOR NETWORK MEDICINE / L. Cappelletti ; tutor: G. VALENTINI ; co-supervisore: E. CASIRAGHI ; coordinator: R. SASSI. Dipartimento di Informatica Giovanni Degli Antoni, 2023 Jan 30. 35. ciclo, Anno Accademico 2022.

SCALABLE GRAPH REPRESENTATIONAL LEARNING ALGORITHMS FOR NETWORK MEDICINE

L. Cappelletti
2023

Abstract

In the era of big data, advanced computational techniques are needed to process, analyze and visualize increasing amounts of data generated by high-throughput technologies. In this context, analyzing biomedical Knowledge Graphs that embrace bio- logical and medical concepts structured in ontologies and data generated from high- throughput bio-technologies represents a central Machine Learning and Computational Biology challenge. Indeed several compelling problems in Network Medicine, ranging from gene-disease prioritization to drug-target prediction or drug repurposing can be modelled as node label or edge prediction problems in graphs, where nodes represent bio-medical entities such as genes, drugs or diseases and edges interactions or relationships between them. Recently Graph Representation Learning (GRL) methods opened new possibilities for addressing complex, real-world problems represented by graphs. However, many graphs used in these applications comprise millions of nodes and billions of edges and are be- yond the capabilities of current methods and software implementations. To deal with this open problem, the first contribution of this thesis is the design and development of the GRAPE (Graph Processing and Embedding) resource for GRL, able to scale with big graphs thanks to specialized and innovative data structures and algorithms efficiently implemented through parallel computation. Compared with state-of-the-art software resources, GRAPE shows an improvement of orders of magnitude in empirical space and time complexity, as well as a substantial and statistically significant improvement in edge prediction and node label prediction performance. GRAPE provides over 80, 000 graphs from the literature and other sources, standardized interfaces allowing a straightforward integration of third-party libraries, 61 node embedding methods, 25 inference models, and three modular pipelines to allow a findable, accessible, interoperable, and reusable (FAIR) and reproducible comparison of methods and libraries for graph processing and embedding. GRAPE can quickly generate billions of sampled random walks for random-walk-based GRL algorithms, such as DeepWalk of Node2vec, thus leading to very accurate embedded graphs and boosting machine learning methods' performance that learns from the embedded vector representation of nodes and edges. The scaling properties of GRAPE concerning state-of-the- art resources have been analyzed through extensive experiments with real-world big graphs, including, e.g. Wikipedia, the Comparative Toxicogenomic Database (CTD) and a big biomedical Knowledge Graph generated by PheKnowLator. GRAPE significantly outperforms state-of-the-art libraries in terms of empirical time and space complexity, edge prediction performance and can process big graphs even when the other competing state-of-the-art resources fail. As a second contribution for efficiently processing and analyzing big graphs, this the- sis proposes a novel algorithmic framework, efficiently implemented in GRAPE, that we named ALPINE: Abstract Landmark Properties-Inferred Node Embedding. The breakthrough characteristics of this algorithmic framework allow us to deal with several issues affecting SOTA GRL methods: 1. The embedding features are independently computed, each from the others, thus overcoming the space and time complexity limitations due to their dependent computation. 2. Feature computation is based on the "landmarks", i.e. sets of nodes representing meaningful concepts about the structure or the semantics of the underlying graph, thus assuring the interpretability of the embeddings. 3. Small integers are used for embedded features values: this assures a small memory footprint, hardware acceleration, and a good compression ratio because of the scale-free distribution of node degrees that often characterize biomedical networks 4. "Democratic" feature representation, thus avoiding the bias towards a high degree nodes characteristic of embedding methods based on topological sampling. We present two algorithms based on the ALPINE framework: a) SPINE (Shortest Paths Inferred Node Embedding), based on the efficient computation of the shortest path distance from the landmarks; b) WINE (Windows Inferred Node Embedding) based on the efficient computation of the co-occurrences of each node with the land- marks within windows of a given size during a breadth-first search. The breakthrough scaling properties of the proposed algorithms are shown in experiments with real-world big graphs. GRAPE and ALPINE implementations are available from https://github.com/AnacletoLAB/ grape. The thesis is organized as follows. In chapter 2, the architecture of GRAPE, the graph- processing and graph-representation learning algorithms it provides are described. In chapter 3, we sketch both the open libraries for graph processing and the datasets that have been used to perform the experiments for comparing GRAPE to state-of-the-art works. Chapter 4 describes all the experimental settings and the achieved comparative evaluation results in detail. In chapter 5, we focus on the novel ALPINE algorithmic framework and compare its performance with those of state-of-the-art models. The conclusions summarize the main contributions of the thesis, as well as the drawbacks and limitations of the proposed methods and depict ongoing and future work on scalable GRL methods and their application to Network Medicine.
30-gen-2023
Settore INF/01 - Informatica
VALENTINI, GIORGIO
CASIRAGHI, ELENA
SASSI, ROBERTO
Doctoral Thesis
SCALABLE GRAPH REPRESENTATIONAL LEARNING ALGORITHMS FOR NETWORK MEDICINE / L. Cappelletti ; tutor: G. VALENTINI ; co-supervisore: E. CASIRAGHI ; coordinator: R. SASSI. Dipartimento di Informatica Giovanni Degli Antoni, 2023 Jan 30. 35. ciclo, Anno Accademico 2022.
File in questo prodotto:
File Dimensione Formato  
phd_unimi_R12725.pdf

accesso aperto

Descrizione: Thesis
Tipologia: Altro
Dimensione 15.97 MB
Formato Adobe PDF
15.97 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/949930
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact