Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques

Cappelletti, L.; Petrini, A.; Gliozzo, J.; Casiraghi, E.; Schubach, M.; Kircher, M.; Valentini, G.

doi:10.1186/s12859-022-04582-5

Background: Cis-regulatory regions (CRRs) are non-coding regions of the DNA that finely control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes, such as the development of specific cell lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through machine learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have already been reported by (deep) machine-learning-based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences.
Results: We present the experiments we performed to compare two deep neural networks, a feed-forward neural network model working on epigenomic features and a convolutional neural network model working only on genomic sequence, both targeted at identifying enhancer and promoter activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups to reduce the negative effects of class imbalance.
Conclusions: Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results and should therefore be applied cautiously; (3) despite working only on sequence data, convolutional models obtain performance close to that of feed-forward models working on epigenomic information, which suggests that sequence data also carries informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future work.
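
The caution about test set rebalancing can be illustrated with a small arithmetic sketch. It is not taken from the paper: the true-positive rate, false-positive rate, and class counts below are hypothetical values chosen only for illustration, and `precision` is a helper defined here. For a classifier whose per-class error rates are fixed, subsampling the negatives in the test set mechanically raises the measured precision, even though the model itself has not changed.

```python
def precision(tpr: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Expected precision of a classifier with fixed TPR/FPR on a given test set."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp)


# Hypothetical classifier behaviour (assumed values, not results from the paper).
TPR, FPR = 0.80, 0.10

# Imbalanced test set: one active region for every ten inactive ones.
imbalanced = precision(TPR, FPR, n_pos=1_000, n_neg=10_000)

# Rebalanced test set: negatives subsampled down to a 1:1 ratio.
balanced = precision(TPR, FPR, n_pos=1_000, n_neg=1_000)

print(f"precision on imbalanced test set: {imbalanced:.3f}")
print(f"precision on rebalanced test set: {balanced:.3f}")
```

Under these assumed rates, the same classifier scores a much higher precision on the rebalanced test set than on the imbalanced one, which is one way a rebalanced test set can overstate the performance to expect on data with the original class proportions.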

Boosting tissue-specific prediction of active cis-regulatory regions through deep learning and Bayesian optimization techniques / L. Cappelletti, A. Petrini, J. Gliozzo, E. Casiraghi, M. Schubach, M. Kircher, G. Valentini. - In: BMC BIOINFORMATICS. - ISSN 1471-2105. - 23:2(2022 Dec 12), pp. 154.1-154.32. [10.1186/s12859-022-04582-5]



	Keywords
	
				Neural networks; Deep learning; Prediction of cis-regulatory region; Bayesian optimization
			
	Scientific-disciplinary sectors of the article (display only)
	
				Sector INF/01 - Informatica (Computer Science)
			
	Scientific-disciplinary sectors of the article (valid from 9 May 2024)
	
				Sector INFO-01/A - Informatica (Computer Science)
			
	Publication date
	
				12-Dec-2022
			
	Journal in ANCE
	
				BMC BIOINFORMATICS
			
	DOI
	
				https://dx.doi.org/10.1186/s12859-022-04582-5
			
	URL
	
				https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04582-5
			
	Type
	
				Article (author)
			
	Appears in types:
	
				01 - Journal article

Files in this item:

File	Access	Type	Size	Format
BMC_CRR_prediction.pdf	Open access	Publisher's version/PDF	5.18 MB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/948254


IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca