Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants

Schubach, M.; Re, M.; Robinson, P.N.; Valentini, G.

doi:10.1038/s41598-017-03011-5

Disease- and trait-associated variants represent a tiny minority of all known genetic variation, so there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variants, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants face a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
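To make the idea of resampling combined with an ensemble more concrete, the following is a minimal sketch and not the authors' implementation or software tool. It assumes a simplified scheme: the majority (negative) class is split into partitions, each partition is undersampled to a fixed ratio relative to the positives, one base learner is trained per partition, and the scores are averaged. The nearest-centroid base learner, the function names (`train_partition_ensemble`, `predict`), the parameters (`n_partitions`, `neg_per_pos`), and the synthetic data are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def train_centroid_model(pos, neg):
    """Toy base learner (an assumption): store the per-class feature centroids."""
    dim = len(pos[0])
    c_pos = [fmean(x[i] for x in pos) for i in range(dim)]
    c_neg = [fmean(x[i] for x in neg) for i in range(dim)]
    return c_pos, c_neg


def score(model, x):
    """Score in [0, 1]; higher means x is closer to the positive centroid."""
    c_pos, c_neg = model
    d_pos = sum((a - b) ** 2 for a, b in zip(x, c_pos)) ** 0.5
    d_neg = sum((a - b) ** 2 for a, b in zip(x, c_neg)) ** 0.5
    return d_neg / (d_pos + d_neg + 1e-12)


def train_partition_ensemble(pos, neg, n_partitions=5, neg_per_pos=2, seed=0):
    """Train one base learner per partition of the majority class."""
    rng = random.Random(seed)
    shuffled = neg[:]
    rng.shuffle(shuffled)
    # Split the negatives into n_partitions disjoint parts.
    parts = [shuffled[i::n_partitions] for i in range(n_partitions)]
    models = []
    for part in parts:
        # Undersample the negatives so each learner sees a balanced-ish set.
        k = min(len(part), neg_per_pos * len(pos))
        models.append(train_centroid_model(pos, rng.sample(part, k)))
    return models


def predict(models, x):
    """Average the scores of all base learners."""
    return fmean(score(m, x) for m in models)


# Synthetic, heavily imbalanced data (10 positives vs. 1000 negatives).
rng = random.Random(42)
positives = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)]
negatives = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]

models = train_partition_ensemble(positives, negatives)
print(predict(models, [3.0, 3.0]), predict(models, [0.0, 0.0]))
```

In this sketch every positive example is reused by every base learner, while each negative example is seen by at most one learner, so no single learner is dominated by the majority class. Oversampling of the minority class, which resampling strategies often include, is deliberately left out to keep the example short.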

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants / M. Schubach, M. Re, P.N. Robinson, G. Valentini. - In: SCIENTIFIC REPORTS. - ISSN 2045-2322. - 7:1(2017 Jun 07), pp. 2959.1-2959.12. [10.1038/s41598-017-03011-5]

	Keywords

			Genome Informatics; Machine Learning; Predictive Medicine; Personalized Medicine; Deleterious genetic variant prediction

	Scientific-disciplinary sectors of the article

			Sector INF/01 - Computer Science
			Sector ING-INF/05 - Information Processing Systems
			Sector BIO/18 - Genetics
			Sector MED/03 - Medical Genetics

	Publication date

			7 June 2017

	Journal in ANCE

			SCIENTIFIC REPORTS

	DOI

			https://dx.doi.org/10.1038/s41598-017-03011-5

	Type

			Article (author)

	Appears in the following types:

			01 - Journal article

Files in this item:

File	Size	Format
Imbalance-aware-SREP.pdf (open access; type: Publisher's version/PDF)	2.42 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/504053
