Motivation: MitoZoa (MZ) is a specialized database that collects complete and nearly complete mitochondrial genomes (mtDNA) of Metazoa and focuses on correcting the numerous annotation inaccuracies affecting mt entries. These inaccuracies can prevent reliable analyses of some under-investigated mt features, such as gene order (GO) and non-coding regions (NCR), and the correct retrieval of gene sequences such as those for tRNAs and rRNAs. MZ is coupled to a reannotation pipeline that identifies and corrects entry errors, and to an automatic protocol for standardizing gene and NCR names, since standardization is a prerequisite for fast and easy retrieval of data and sequences on GO, NCR, congeneric species, and other curated mt features. MZ can be queried at www.caspur.it/mitozoa through a user-friendly interface and, since its publication (Lupi et al 2010), it has been accessed more than 6,000 times in 14 months. To guarantee regular updates and improve the quality of the stored data, we have refined or added several steps in the reannotation pipeline embedded in MZ. Among the novelties, we have used GO comparisons both to identify annotation errors, especially in congeneric species, and to build a database of “reference GO” that can help to investigate the dynamics and evolution of GO at both short and long phylogenetic distances.

Methods: The MZ reannotation pipeline v.2.0 consists of several Python scripts: 14 scripts for the validation/rectification of entry annotation elements and for the MZ update, and 5 scripts for MZ management, statistics calculation, NCR annotation and GO string generation/manipulation.
The bi-monthly MZ update adds newly published mt entries, selected automatically through a specific query to EMBL, and, as a novelty, checks for possible changes in the RefSeq/EMBL entries corresponding to pre-existing MZ entries, especially in the sequence, feature table and organism classification fields. Other novelties of the MZ reannotation pipeline v.2.0 are: 1) the identification and standardized annotation of mt pseudogenes and of editing/translational frameshift events in protein genes; 2) the creation of “simulation files” (sim-files), i.e. tab-delimited files with a coherent structure and standard syntax, in which a record of all identified errors and related reannotation events is saved before the entries are actually modified. A sim-file can also be easily edited by hand, which makes it possible to manually modify the most troublesome annotations and to keep the whole flow of data under control; 3) the introduction of GO comparisons for reannotation purposes: all entries of congeneric species, belonging to both the updating dataset and the current MZ release, are inspected for differences in gene order/content, and inconsistencies are verified against the literature if necessary. Finally, the full set of GOs has been clustered on the basis of string identity: the resulting “non-redundant” dataset makes it possible to measure the diversity of GO among and within the major metazoan lineages. The implementation of a GO similarity search method is currently in progress.

Results: MitoZoa has been updated six times since its construction; the current Rel.7 (Jan 2011) contains 2,259 complete and 374 partial mtDNAs, with a total of 755 new entries compared to Rel.1 (Jul 2009). Overall, 62% of the Rel.7 entries have been reannotated, and 76% of all reannotation events involve tRNAs.
Thanks to the integration of the GO comparison procedure into the reannotation pipeline, we have identified several cases of different GO at the congeneric level. Most of them have been confirmed by literature checks and/or manual analyses, but cases of erroneous annotation have also been identified and corrected, including two entries where the GO difference observed between congeneric species was due to incorrect species affiliation, i.e., an inaccurate organism name. The GO survey in congeneric species also indicates Tunicata, Enoplea and Gastropoda as the metazoan lineages with the highest number of different congeneric GO. Finally, the GO clustering procedure identifies 404 different GO in a total of 2,259 complete mtDNAs, classified into 47 “shared GO”, i.e., present in more than one species, and 357 “singletons”, i.e., present in only one species. In the poster, we will discuss the distribution of shared and singleton GO in the major metazoan lineages: for example, in Craniata the most widespread GO, the “Craniata standard”, is common to 78.4% of the species, but there are also 53 different singletons. Each of the four groups Metatheria, Aves, Crocodylidae and Hyperoartia has its own GO, slightly different from the “Craniata standard”, while Amphibia Neobatrachia, Serpentes and Anguilliformes fishes show high GO variability and many singletons, even at the congeneric level.
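
The clustering of GO by string identity, and the flagging of congeneric species with different GO, can be illustrated with a minimal Python sketch. It is not the MZ pipeline code. The function names, the (species, GO string) input format, the toy species names and the toy gene orders are all assumptions made for illustration. The sketch also assumes that each GO string has already been normalized (same starting gene, same strand convention, standardized gene names), and that the genus is the first word of the organism name.

```python
from collections import defaultdict


def cluster_gene_orders(entries):
    """Group species by identical gene-order (GO) string.

    entries: iterable of (species_name, go_string) pairs (hypothetical format).
    Returns (shared, singletons): dicts mapping a GO string to the set of
    species carrying it, split by whether more than one species has it.
    """
    clusters = defaultdict(set)
    for species, go in entries:
        clusters[go].add(species)
    shared = {go: sp for go, sp in clusters.items() if len(sp) > 1}
    singletons = {go: sp for go, sp in clusters.items() if len(sp) == 1}
    return shared, singletons


def congeneric_differences(entries):
    """Return the genera whose entries show more than one distinct GO string.

    Assumption: the genus is the first word of the species name.
    """
    by_genus = defaultdict(set)
    for species, go in entries:
        genus = species.split()[0]
        by_genus[genus].add(go)
    return {genus: gos for genus, gos in by_genus.items() if len(gos) > 1}


# Toy data: invented names and gene orders, not real MitoZoa entries.
entries = [
    ("Alpha one", "cox1 cox2 atp8 atp6 cox3"),
    ("Alpha two", "cox1 cox2 atp8 atp6 cox3"),
    ("Beta one", "cox1 cox3 cox2 atp6 atp8"),
    ("Beta two", "cox1 atp6 cox2 cox3 atp8"),
]

shared, singletons = cluster_gene_orders(entries)
flagged = congeneric_differences(entries)
```

In this toy set the two Alpha species fall into one shared GO, while the two Beta species each carry a singleton GO, so the genus Beta is flagged. In the workflow described above, such a case would then be checked against the literature or by manual analysis.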

More is better, different is helpful : MitoZoa database improvements and the usage of mitochondrial gene order diversity as reannotation criterion / R. Lupi, P. D’Onorio De Meo, M. D’Antonio, G. Pavesi, F. Griggio, G. Pesole, T. Castrignanò, C. Gissi - In: BITS 2011, VIII Annual Meeting of the Bioinformatics Italian Society / edited by F. Geraci, R. Marangoni, M. Pellegrini, M.E. Renda. - Pisa : Edizioni ETS, 2011 Jun 20. - ISBN 978-884673069-5. - pp. 117-118 (Paper presented at the 8th BITS conference, Annual Meeting of the Bioinformatics Italian Society, held in Pisa in 2011).

More is better, different is helpful : MitoZoa database improvements and the usage of mitochondrial gene order diversity as reannotation criterion

R. Lupi (first author); G. Pavesi; F. Griggio; G. Pesole; C. Gissi (last author)
2011

Sector BIO/11 - Molecular Biology
20 Jun 2011
Book Part (author)

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/169549