A generalised degenerate string (GD string) Š is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote the sum of these lengths k0, k1, . . . , kn-1 by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W, n2}N)-time algorithm for computing all palindromes in Š. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in Š. Finally, proof-of-concept experimental results are presented using real protein datasets.

Degenerate String Comparison and Applications / M. Alzamel, L.A.K. Ayad, G. Bernardini, R. Grossi, C.S. Iliopoulos, N. Pisanti, S.P. Pissis, G. Rosone (LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS). - In: 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)[s.l] : Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 2018. - ISBN 9783959770828. - pp. 1-14 (( Intervento presentato al 18. convegno International Workshop on Algorithms in Bioinformatics tenutosi a Helsinki nel 2018 [10.4230/lipics.wabi.2018.21].

Degenerate String Comparison and Applications

G. Bernardini
;
2018

Abstract

A generalised degenerate string (GD string) Š is a sequence of n sets of strings of total size N, where the ith set contains strings of the same length ki but this length can vary between different sets. We denote the sum of these lengths k0, k1, . . . , kn-1 by W. This type of uncertain sequence can represent, for example, a gapless multiple sequence alignment of width W in a compact form. Our first result in this paper is an O(N+M)-time algorithm for deciding whether the intersection of two GD strings of total sizes N and M, respectively, over an integer alphabet, is non-empty. This result is based on a combinatorial result of independent interest: although the intersection of two GD strings can be exponential in the total size of the two strings, it can be represented in only linear space. A similar result can be obtained by employing an automata-based approach but its cost is alphabet-dependent. We then apply our string comparison algorithm to compute palindromes in GD strings. We present an O(min{W, n2}N)-time algorithm for computing all palindromes in Š. Furthermore, we show a similar conditional lower bound for computing maximal palindromes in Š. Finally, proof-of-concept experimental results are presented using real protein datasets.
Degenerate strings; Elastic-degenerate strings; Generalised degenerate strings; Palindromes; String comparison
Settore INFO-01/A - Informatica
2018
Book Part (author)
File in questo prodotto:
File Dimensione Formato  
LIPIcs.WABI.2018.21.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Dimensione 544.68 kB
Formato Adobe PDF
544.68 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1131920
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 21
  • ???jsp.display-item.citation.isi??? ND
  • OpenAlex ND
social impact