Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning

Cappelletti, L.; Rekerle, L.; Fontana, T.; Hansen, P.; Casiraghi, E.; Ravanmehr, V.; Mungall, C.J.; Yang, J.; Spranger, L.; Karlebach, G.; Caufield, J.H.; Carmody, L.; Coleman, B.; Oprea, T.; Reese, J.; Valentini, G.; Robinson, P.N.

doi:10.1093/bioadv/vbae036

Graph representation learning is a family of related approaches that learn low-dimensional vector representations, called embeddings, of nodes and other graph elements. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using representative heterogeneous biomedical knowledge graphs and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement.
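The contrast between uniform and degree-aware negative sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy graph, the function names (`build_toy_graph`, `node_degrees`, `sample_negatives`, `mean_endpoint_degree`) and all parameters are hypothetical, and "degree-aware" is interpreted here, as an assumption, as drawing each endpoint with probability proportional to its degree in the positive graph.

```python
import random
from collections import Counter


def build_toy_graph(n_nodes=200, n_hubs=5, edges_per_node=2, seed=0):
    """Hypothetical toy undirected graph in which a few hubs attract most edges."""
    rng = random.Random(seed)
    edges = set()
    for node in range(n_hubs, n_nodes):
        for _ in range(edges_per_node):
            # About 80% of edge attempts attach to a hub, the rest to any node.
            if rng.random() < 0.8:
                other = rng.randrange(n_hubs)
            else:
                other = rng.randrange(n_nodes)
            if other != node:
                edges.add((min(node, other), max(node, other)))
    return sorted(edges)


def node_degrees(edges):
    """Count how many positive edges touch each node."""
    return Counter(node for edge in edges for node in edge)


def sample_negatives(edges, n_nodes, n_samples, degree_aware, seed=0):
    """Sample node pairs that are not positive edges.

    degree_aware=False: both endpoints are drawn uniformly over the nodes.
    degree_aware=True: both endpoints are drawn from the list of positive-edge
    endpoints, i.e. with probability proportional to node degree (assumption).
    """
    rng = random.Random(seed)
    positive = set(edges)
    endpoints = [node for edge in edges for node in edge]
    negatives = set()
    # Bounded number of attempts so the loop always terminates.
    for _ in range(100 * n_samples):
        if len(negatives) == n_samples:
            break
        if degree_aware:
            u, v = rng.choice(endpoints), rng.choice(endpoints)
        else:
            u, v = rng.randrange(n_nodes), rng.randrange(n_nodes)
        pair = (min(u, v), max(u, v))
        # Reject self-loops and pairs that are known positive edges.
        if u != v and pair not in positive:
            negatives.add(pair)
    return sorted(negatives)


def mean_endpoint_degree(pairs, degree):
    """Average degree of the endpoints of a set of node pairs."""
    return sum(degree[u] + degree[v] for u, v in pairs) / (2 * len(pairs))


if __name__ == "__main__":
    n_nodes = 200
    edges = build_toy_graph(n_nodes=n_nodes)
    degree = node_degrees(edges)
    uniform = sample_negatives(edges, n_nodes, 200, degree_aware=False)
    aware = sample_negatives(edges, n_nodes, 200, degree_aware=True)
    print("positive edges:        ", mean_endpoint_degree(edges, degree))
    print("uniform negatives:     ", mean_endpoint_degree(uniform, degree))
    print("degree-aware negatives:", mean_endpoint_degree(aware, degree))
```

On a hub-dominated graph such as this toy example, uniformly sampled negatives tend to involve low-degree nodes, so a classifier could separate them from positives using degree alone. Sampling endpoints in proportion to degree narrows that gap.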

Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning / L. Cappelletti, L. Rekerle, T. Fontana, P. Hansen, E. Casiraghi, V. Ravanmehr, C.J. Mungall, J. Yang, L. Spranger, G. Karlebach, J.H. Caufield, L. Carmody, B. Coleman, T. Oprea, J. Reese, G. Valentini, P.N. Robinson. - In: BIOINFORMATICS ADVANCES. - ISSN 2635-0041. - (2024), pp. vbae036.1-vbae036.11. [Epub ahead of print] [10.1093/bioadv/vbae036]

Authors: Cappelletti, L.; Rekerle, Lauren; Fontana, Tommaso; Hansen, Peter; Casiraghi, E.; Ravanmehr, Vida; Mungall, Christopher J.; Yang, Jeremy; Spranger, Leonard; Karlebach, Guy; Caufield, J. Harry; Carmody, Leigh; Coleman, Ben; Oprea, Tudor; Reese, Justin; Valentini, G.; Robinson, Peter N.


Scientific-disciplinary sectors of the article: Sector INF/01 - Computer Science; Sector MED/01 - Medical Statistics
Publication date: 2024
Journal in ANCE: BIOINFORMATICS ADVANCES
DOI: https://dx.doi.org/10.1093/bioadv/vbae036
Type: Article (author)
Appears in types: 01 - Journal article

Files in this item:

File	Access	Type	Size	Format
BioInformaticAdvances_nodeDegree.pdf	Open access	Post-print, accepted manuscript, etc. (version accepted by the publisher)	1.07 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1035768
