Development of machine learning methods for the discrimination between coding and non-coding conserved sequences

Re', M.

In the last ten years, numerous complete and almost complete genome sequences have been made available to the research community, but completing the inventory of coding genes, at least for eukaryotic genomes, has proved an elusive goal. Classical ab-initio gene prediction methods have been invaluable in the annotation of genome sequences, but show notable weaknesses with respect to genes with unusual structural features, while annotation on the basis of similarity to known genes does not allow the detection of genuinely novel genes. The identification of sequences under evolutionary constraint by means of comparison of genome sequences is a powerful technique for inferring the locations of functional elements in a genome. As whole-genome sequencing efforts extend beyond traditional model organisms to include a wide diversity of species, comparative genomics analyses will be further empowered to reveal insights into genomes and their evolution. The discovery and annotation of functional genomic elements is a necessary step toward a detailed understanding of genome biology, and sequence comparisons have been demonstrated to be an integral tool for this task. In recent years, an ever-increasing amount of evidence has suggested that, despite initial assumptions, a large proportion of the sequences conserved between related genomes do not represent coding regions. Other experiments also demonstrate that the classical view that long stretches of conserved genomic sequence are predominantly protein-coding regions has to be revised, owing to the presence of long conserved non-coding functional elements such as modular clusters of well-conserved transcription factor binding sites. Thus, the discrimination between conserved coding and non-coding sequences is an important objective for comparative genomics.
Single statistics such as the synonymous versus non-synonymous substitution ratio can be used, in isolation, to establish whether conserved regions are likely to be protein-coding or non-coding. However, such approaches tend to exhibit either low sensitivity (with high specificity) or high sensitivity (at the cost of low specificity). This may be in part because the extent and nature of the selective pressures acting during evolution on protein-coding sequences are inhomogeneous not only between different organisms but also between genes belonging to the same organism. More than six years after the completion of the human genome sequence, our inability to correctly classify as coding or non-coding the entire set of sequences conserved between human and many other organisms (ranging from closely related mammalian species such as mouse, rat and dog to fish and birds) clearly indicates a lack of understanding of the mechanisms underlying the molecular evolution of many classes of genomic elements. This is particularly true for non-coding functional elements, likely because they play a more diverse range of functional roles (and thus evolve under more diverse and complex constraints) than initially appreciated. Given this lack of knowledge of molecular evolution, the development of reliable methods for the discrimination between conserved coding and non-coding sequences is complicated by the absence of tests designed to detect the evolutionary dynamics associated with conserved non-coding regions.
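As an illustration of the kind of single statistic mentioned above, the following minimal sketch counts synonymous and non-synonymous codon differences in a pairwise, in-frame, gap-free alignment. It is not the measure implemented in this work: the function name `count_substitutions` is hypothetical, the standard genetic code is assumed, and the raw counts are not normalized by the number of synonymous and non-synonymous sites as a proper substitution-ratio estimate would require.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# sequences use the standard nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a, seq_b):
    """Return (synonymous, non_synonymous) counts of differing codon pairs.

    Both sequences are assumed to be aligned, equally long and in frame.
    Codon pairs containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    syn = nonsyn = 0
    usable = len(seq_a) - len(seq_a) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # no change at this codon
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # gap or ambiguous base
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# CTT -> CTC keeps leucine (synonymous); GGA -> GAA changes Gly to Glu.
print(count_substitutions("ATGCTTGGA", "ATGCTCGAA"))
```

A strong excess of synonymous over non-synonymous changes is the pattern expected under purifying selection on a protein-coding region, which is why a statistic of this kind carries information about coding potential.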
One possible solution, in the absence of novel insights into the evolution of non-coding sequences, is a more general and effective use of the well-known evolutionary patterns characterizing protein-coding sequences, in combination with methods based on the concept of ‘learning by induction’. This concept underlies tools such as Neural Networks and Support Vector Machines (SVMs): classifiers able to ‘learn’ to discriminate between instances belonging to two or more classes from the study of well-annotated training sets, and to produce a model suitable for the classification of previously unseen instances. The work presented here shows analyses performed on sets of genomic elements conserved between human and mouse to critically investigate the effectiveness of many types of information in functional classification. We first investigated whether the local density of conserved sequences presenting evolutionary dynamics compatible with a protein-coding genomic region can be exploited to detect the presence of regions potentially containing unannotated genes. The results we obtained from the analysis of our set of Conserved Sequence Tags (CSTs) show that, even though the clustering of conserved sequences with high coding potential is a good indicator of the presence of unannotated genes, sensitivity is compromised both by potential errors in the classification of CSTs and by the precision required in the definition of genomic regions to be targeted by classical molecular biology experiments (given the associated costs), which forces the choice of very stringent cut-off values for the designation of clusters. We then moved our focus to the improvement of current methods for discriminating between coding and non-coding conserved regions, with the aim of reaching an optimal trade-off between sensitivity and false positive rate in the classification of locally conserved genomic regions.
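The idea of using the local density of CSTs with high coding potential to flag candidate gene regions can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the clustering procedure, so the function name `cluster_coding_csts`, the representation of a CST as a `(start, end, score)` tuple, and the parameters `score_cutoff`, `max_gap` and `min_members` (with their default values) are all hypothetical.

```python
def cluster_coding_csts(csts, score_cutoff=0.9, max_gap=5000, min_members=3):
    """Group high-scoring CSTs that lie close together on one chromosome.

    csts: iterable of (start, end, coding_score) tuples (hypothetical format).
    Returns a list of (cluster_start, cluster_end, n_members) tuples.
    """
    # Keep only CSTs whose coding score passes the (stringent) cut-off.
    selected = sorted((start, end) for start, end, score in csts
                      if score >= score_cutoff)
    clusters = []
    current = []  # CSTs in the cluster being built

    def flush():
        # Report the current cluster only if it has enough members.
        if len(current) >= min_members:
            cluster_end = max(end for _, end in current)
            clusters.append((current[0][0], cluster_end, len(current)))

    for start, end in selected:
        if current and start - max(e for _, e in current) > max_gap:
            flush()
            current = []
        current.append((start, end))
    flush()
    return clusters


example = [
    (100, 200, 0.95), (1000, 1100, 0.97), (3000, 3200, 0.99),
    (4000, 4100, 0.20),      # low coding score: ignored
    (50000, 50100, 0.95),    # isolated: too few members to form a cluster
]
print(cluster_coding_csts(example))
```

Raising `score_cutoff` or `min_members` makes the reported regions more reliable but discards more true genes, which mirrors the loss of sensitivity under stringent cut-offs described above.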
Methods were developed using carefully controlled datasets of sequences to minimize the effect of errors in sequence annotation. To achieve this goal, we cross-compared different databases to select fully consistent conserved sequences with and without supporting evidence of translation, removing sequences affected by annotation errors. We have implemented several novel measures of coding potential and demonstrate that they are capable of reasonable accuracy in the discrimination of coding and non-coding alignments. We show that impressive levels of both sensitivity and accuracy can be obtained by a more effective use of many statistics, each modeling a different aspect of the evolutionary dynamics of protein-coding sequences. Accordingly, we have implemented a probabilistic system based on the assessment of the statistical relevance of the predictions produced by an SVM-based classifier which uses multiple measures of the coding potential of aligned sequences. We demonstrate that this system outperforms currently available methods for the classification of aligned genomic sequences and transcripts. There is a lack of genomic functional annotation tools that are based on a modular set of features which can optionally be expanded by the user, and that exploit statistical and information-theoretic approaches more flexible than Hidden Markov Models (HMMs), covering experimental conditions in which HMMs are not effective. The work presented here will hopefully fill this gap through the production of general and flexible tools specifically designed for the functional assessment of conserved genomic regions. These tools offer the opportunity to expand the set of known genes and exons in well-studied organisms, to refine current annotations and to explore newly sequenced genomes.

Development of machine learning methods for the discrimination between coding and non-coding conserved sequences / M. Re' ; relatore: Carmela Gissi ; coordinatore: Giuliana Zanetti. DIPARTIMENTO DI SCIENZE BIOMOLECOLARI E BIOTECNOLOGIE, 2007. 20. ciclo, Anno Accademico 2006/2007.