An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an O~ (nmω-1) + O(N) -time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the O~ (· ) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k2mG+ kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k= 1, the existing algorithm runs in Ω(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm2+ N) log m) or O(nm3+ N) time under edit distance. For the decision version of the problem, we present a faster O(nm2logm+Nloglogm) -time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently.
Elastic-Degenerate String Matching with 1 Error / G. Bernardini, E. Gabory, S.P. Pissis, L. Stougie, M. Sweering, W. Zuba (LECTURE NOTES IN COMPUTER SCIENCE). - In: LATIN 2022: Theoretical Informatics / [a cura di] A. Castañeda, F. Rodríguez-Henríquez. - [s.l] : Springer Science and Business Media Deutschland GmbH, 2022. - ISBN 9783031206238. - pp. 20-37 (( Intervento presentato al 15. convegno Latin American Symposium on Theoretical Informatics tenutosi a Guanajuato nel 2022 [10.1007/978-3-031-20624-5_2].
Elastic-Degenerate String Matching with 1 Error
G. Bernardini
Primo
;
2022
Abstract
An elastic-degenerate (ED) string is a sequence of n finite sets of strings of total length N, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length m in an ED text. The EDSM problem has recently received some attention by the combinatorial pattern matching community, culminating in an O~ (nmω-1) + O(N) -time algorithm [Bernardini et al., SIAM J. Comput. 2022], where ω denotes the matrix multiplication exponent and the O~ (· ) notation suppresses polylog factors. In the k-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most k errors. k-EDSM can be solved in O(k2mG+ kN) time under edit distance, where G denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately, G is only bounded by N, and so even for k= 1, the existing algorithm runs in Ω(mN) time in the worst case. Here we make progress in this direction. We show that 1-EDSM can be solved in O((nm2+ N) log m) or O(nm3+ N) time under edit distance. For the decision version of the problem, we present a faster O(nm2logm+Nloglogm) -time algorithm. Our algorithms rely on non-trivial reductions from 1-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or range emptiness), which we show how to solve efficiently.File | Dimensione | Formato | |
---|---|---|---|
978-3-031-20624-5_2.pdf
accesso riservato
Tipologia:
Publisher's version/PDF
Dimensione
527.25 kB
Formato
Adobe PDF
|
527.25 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.