CSCAN: FINDING COMMON REGULATORS IN A SET OF GENES USING GENOME-WIDE CHIP-SEQ DATA

Zambelli, F.; Pavesi, G.

ChIP-Seq has rapidly become the method of choice for the genome-wide identification of protein-DNA interactions, especially for proteins involved in the regulation of gene transcription, such as transcription factors (TFs) and histones with their post-translational modifications. As a consequence, we now have at our disposal genome-wide maps of the binding of dozens of different TFs and of the chromosomal localization of histone modifications in several cell lines, such as those produced within the ENCODE project. Accessing, retrieving and using these data for further analyses, however, is far from straightforward. Usually, what a ChIP-Seq experiment makes available to researchers is a long list of genomic coordinates of the regions found to be enriched for binding of the protein studied. Genome browsers such as the UCSC Genome Browser make it possible to retrieve the data and obtain further information, for example by intersecting the regions bound by the protein studied with gene annotations or with other ChIP-Seq-derived region lists, and thus finding all the putative target genes of a given TF. More involved analyses, such as computing significant correlations or anti-correlations among different TFs, histone modifications and/or gene expression data, remain difficult. With these considerations in mind, we retrieved data for about 500 ChIP-Seq experiments on TFs, about 250 on histone modifications and about 100 on other molecules such as PolII or CTCF, in several different cell lines. We then developed a web-based application called Cscan, which makes it possible to study the correlation of these data with gene expression data, as well as their correlations with one another. Cscan accepts as input a list of gene identifiers and outputs, for each of the available experiments, the enrichment of the TF or histone modification in the promoter regions of the input genes.
Enrichment is computed with Fisher's exact test, taking into account: a) the overall number of genes annotated in the genome; b) the number of genes in the genome whose promoter contains a region enriched in the experiment; c) the number of input genes; d) the number of input genes whose promoter is enriched in the experiment. A tool of this kind has several possible applications, for example: 1. Given a set of co-expressed genes, finding their common regulators by looking for TFs with over-represented binding regions in their promoters; 2. Given a ChIP-Seq experiment for a TF and its target genes, finding which other TFs target the same set of genes, and to what extent; that is, discovering correlations or anti-correlations between TFs or between the TF and histone modifications; 3. Given a ChIP-Seq experiment for a TF (or a histone modification) in a given cell line, and the list of genes transcribed in the same cell line (often available from genome-wide maps of PolII binding), determining whether the TF can be hypothesized to be a transcriptional activator or repressor. Cscan can be accessed through a user-friendly Web interface. A beta version is available at http://www.beaconlab.it/cscan. At the moment the available data come from human experiments, but we plan to add other species in the near future.
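
The enrichment test described above, based on the four counts a) to d), can be illustrated with a minimal Python sketch. This is not the Cscan implementation: the function name, the parameter names and the example counts are hypothetical, and the choice of a one-sided test for over-representation (the upper tail of the hypergeometric distribution) is an assumption, since the abstract does not say which alternative is used.

```python
from math import comb


def enrichment_p_value(total_genes, genome_hits, input_genes, input_hits):
    """One-sided Fisher exact p-value for over-representation.

    total_genes : (a) genes annotated in the genome
    genome_hits : (b) genes whose promoter is enriched in the experiment
    input_genes : (c) genes in the input list
    input_hits  : (d) input genes whose promoter is enriched
    The names are illustrative, not taken from Cscan.
    """
    if not (0 <= genome_hits <= total_genes and 0 <= input_genes <= total_genes):
        raise ValueError("counts (b) and (c) must lie between 0 and (a)")
    if not (0 <= input_hits <= min(genome_hits, input_genes)):
        raise ValueError("count (d) cannot exceed (b) or (c)")

    # Probability of drawing at least `input_hits` enriched promoters when
    # `input_genes` genes are sampled without replacement from the genome.
    max_hits = min(genome_hits, input_genes)
    favourable = sum(
        comb(genome_hits, k) * comb(total_genes - genome_hits, input_genes - k)
        for k in range(input_hits, max_hits + 1)
    )
    return favourable / comb(total_genes, input_genes)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = enrichment_p_value(total_genes=20000, genome_hits=3000,
                           input_genes=200, input_hits=60)
    print(f"enrichment p-value: {p:.3g}")
```

A small p-value indicates that the input genes contain more enriched promoters than expected by chance. Swapping the sum to run from 0 to `input_hits` would give the corresponding test for under-representation, which is one way anti-correlation could be assessed under the same assumptions.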

CSCAN: FINDING COMMON REGULATORS IN A SET OF GENES USING GENOME-WIDE CHIP-SEQ DATA / F. Zambelli, G. Pavesi. (Paper presented at the 3rd Next Generation Sequencing Workshop, held in Bari in 2011.)