Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken and egg problem - such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most of state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants / M. Schubach, M. Re, P.N. Robinson, G. Valentini. - In: SCIENTIFIC REPORTS. - ISSN 2045-2322. - 7:1(2017 Jun 07), pp. 2959.1-2959.12. [10.1038/s41598-017-03011-5]

Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants

M. Re
Secondo
;
G. Valentini
Ultimo
2017

Abstract

Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken and egg problem - such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most of state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with imbalanced data that naturally arise in several genome-wide variant scoring problems, thus resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
Genome Informatics; Machine Learning; Predictive Medicine; Personalized Medicine; Deleterious genetic variant prediction
Settore INF/01 - Informatica
Settore ING-INF/05 - Sistemi di Elaborazione delle Informazioni
Settore BIO/18 - Genetica
Settore MED/03 - Genetica Medica
7-giu-2017
Article (author)
File in questo prodotto:
File Dimensione Formato  
Imbalance-aware-SREP.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Dimensione 2.42 MB
Formato Adobe PDF
2.42 MB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/504053
Citazioni
  • ???jsp.display-item.citation.pmc??? 18
  • Scopus 66
  • ???jsp.display-item.citation.isi??? 49
social impact