The c-MYC oncogene encodes the transcription factor Myc, which regulates many biological processes and is overexpressed in many cancers. When overexpressed, Myc binds to almost all open promoters but regulates only specific subsets of genes. We investigated this discrepancy in three systems where Myc is overexpressed (3T9MycER fibroblasts, Eµ-myc B cells and tet-MYC liver cells) through an approach that integrates several types of next-generation sequencing data (DNase-seq footprinting, ChIP-seq and RNA-seq) with motif analysis and machine learning (random forest). In particular, DNase-seq detects open chromatin regions genome-wide (DNase hypersensitive sites, or DHSs) as well as sites where a transcription factor (TF) is bound (footprints). To analyse the DNase-seq footprinting data in our systems, we developed a novel pipeline that processes the raw DNase-seq data step by step and outputs DHSs and TF footprints. To select the best footprint caller for the pipeline, we carried out a benchmarking study comparing two footprint-calling algorithms, DNaseR and Wellington, on ENCODE data. Wellington consistently scored best in both specificity and sensitivity and was therefore chosen for our pipeline. We overlapped, genome-wide, the footprints identified by the pipeline with matches of a position weight matrix (PWM) library, obtaining a list of footprinted PWMs. We then used this list as a set of features to carry out pairwise classifications of the upregulated, downregulated and not-deregulated subsets of genes in the three systems. A PWM that classifies the data with a sufficiently large Area Under the Curve (AUC) points to a TF that may bind selectively with Myc in only a subset of genes. We first applied a single-feature classifier, assessing the performance of each PWM individually, and found that single PWMs provided only a limited classification of the gene subsets.
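The footprint/PWM overlap and the single-feature scoring described above can be illustrated with a minimal sketch. This is not the thesis pipeline: the function names, the per-gene coordinate layout, the half-open interval convention and the toy data are all assumptions made for illustration. Each PWM becomes a binary feature (1 if a match of that PWM overlaps a footprint in the gene's region), and its AUC is computed as the probability that a gene from one subset scores higher than a gene from the other, counting ties as one half.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) overlap."""
    return a_start < b_end and b_start < a_end


def footprinted_pwms(footprints, pwm_hits):
    """Map each gene to the set of PWMs with a match overlapping a footprint.

    footprints: iterable of (gene, start, end)            -- hypothetical layout
    pwm_hits:   iterable of (gene, start, end, pwm_name)  -- hypothetical layout
    """
    by_gene = defaultdict(list)
    for gene, start, end in footprints:
        by_gene[gene].append((start, end))
    result = defaultdict(set)
    for gene, start, end, pwm in pwm_hits:
        if any(overlaps(start, end, fs, fe) for fs, fe in by_gene[gene]):
            result[gene].add(pwm)
    return result


def single_feature_auc(scores, labels):
    """AUC of one feature: P(pos > neg) + 0.5 * P(pos == neg)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pwm_aucs(footprints, pwm_hits, labels):
    """Score every PWM on its own; labels maps gene -> 1 (e.g. up) or 0."""
    features = footprinted_pwms(footprints, pwm_hits)
    genes = sorted(labels)
    y = [labels[g] for g in genes]
    pwms = sorted({pwm for _, _, _, pwm in pwm_hits})
    return {
        pwm: single_feature_auc([int(pwm in features[g]) for g in genes], y)
        for pwm in pwms
    }
```

An AUC near 0.5 means the PWM does not separate the two gene subsets on its own, which is the situation the single-feature analysis reports for most PWMs.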
We then turned to a random forest classifier, which considers combinations of all the features. This strategy provided a good separation of the data sets (AUC > 0.7) and identified candidates such as Nrf1/Nrf2 (Eµ-myc T up), Tead factors (Eµ-myc T and tet-MYC up), E2f4 (Eµ-myc T up) and E2f1 (Eµ-myc T and tet-MYC up) that could potentially act with Myc in regulating specific subsets of genes.
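Why a classifier over feature combinations can succeed where single PWMs do not can be shown on synthetic data. The sketch below assumes scikit-learn and NumPy are available. The data, the labelling rule, the feature count and the hyperparameters are invented for illustration and are not taken from the thesis. A gene is labelled positive when either of two hypothetical PWM pairs is co-footprinted, so no single column is decisive but combinations are.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def make_toy_data(n_genes=600, seed=0):
    """Synthetic binary 'footprinted PWM' matrix (assumption, not real data).

    Columns 0-3 carry signal only in pairs; columns 4-7 are noise.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_genes, 8))
    y = ((X[:, 0] & X[:, 1]) | (X[:, 2] & X[:, 3])).astype(int)
    return X, y


def single_aucs(X, y):
    """AUC of each column used alone as a score."""
    return [roc_auc_score(y, X[:, j]) for j in range(X.shape[1])]


def forest_auc(X, y, seed=0):
    """AUC of a random forest, from out-of-fold predicted probabilities."""
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)


if __name__ == "__main__":
    X, y = make_toy_data()
    print("best single-feature AUC:", max(single_aucs(X, y)))
    print("random forest AUC:", forest_auc(X, y))
```

Scoring on out-of-fold predictions rather than on the training data keeps the forest's AUC comparable with the single-feature AUCs. In a real analysis, feature importances from the fitted forest would be one way to rank candidate PWMs, though the abstract does not state how candidates were selected.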

AN INTEGRATIVE APPROACH TO IDENTIFY BINDING PARTNERS OF MYC USING (EPI)GENOMICS DATA IN THE 3T9MYCER, EU-MYC AND TET-MYC SYSTEMS / P. Bora ; added supervisor: M. Morelli, P. Di Fiore ; supervisor: B. Amati. UNIVERSITA' DEGLI STUDI DI MILANO, 2017 Mar 02. 28th cycle, Academic Year 2016. [10.13130/bora-pranami_phd2017-03-02].

AN INTEGRATIVE APPROACH TO IDENTIFY BINDING PARTNERS OF MYC USING (EPI)GENOMICS DATA IN THE 3T9MYCER, EU-MYC AND TET-MYC SYSTEMS

P. Bora
2017

2 March 2017
Sector BIO/11 - Molecular Biology
Myc ; Epigenomics ; Transcription factors ; DNase I hypersensitivity ; footprints
AMATI, BRUNO
Doctoral Thesis
Files in this record:
phd_unimi_R10337.pdf (open access)
Description: Thesis
Type: Complete doctoral thesis
Size: 9.55 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/471632