The Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference-based and mutable, content-based). Analogously, available text-search methods on string classes are implemented either as trivial, brute-force double loops, or as very sophisticated and resource-consuming regular-expression search methods. Motivated by our experience in data-intensive text applications, we propose a new string class, MutableString, which tries to get the right balance between extremes in both cases. Mutable strings can be in one of two states, compact and loose, in which they behave more like String and StringBuffer, respectively. Moreover, they support a wide range sophisticated text-search algorithms with a very low resource usage and setup time, using a new, very simple randomised data structure (a generalisation of Bloom filters) that stores an approximation from above of a lattice-valued function. Computing the function value requires a constant number of steps, and the error probability can be balanced with space usage. As a result, we obtain practical implementations of Boyer-Moore type algorithms that can be used with very large alphabets, such as Unicode collation elements. The techniques we develop are very general and amenable to a wide range of applications.

Mutable strings in Java: design, implementation and lightweight text-search algorithms / P. Boldi, S. Vigna. - In: SCIENCE OF COMPUTER PROGRAMMING. - ISSN 0167-6423. - 54:1(2005), pp. 3-23. [10.1016/j.scico.2004.05.003]

Mutable strings in Java: design, implementation and lightweight text-search algorithms

P. Boldi
Primo
;
S. Vigna
Ultimo
2005

Abstract

The Java string classes, String and StringBuffer, lie at the extremes of a spectrum (immutable, reference-based and mutable, content-based). Analogously, available text-search methods on string classes are implemented either as trivial, brute-force double loops, or as very sophisticated and resource-consuming regular-expression search methods. Motivated by our experience in data-intensive text applications, we propose a new string class, MutableString, which tries to get the right balance between extremes in both cases. Mutable strings can be in one of two states, compact and loose, in which they behave more like String and StringBuffer, respectively. Moreover, they support a wide range sophisticated text-search algorithms with a very low resource usage and setup time, using a new, very simple randomised data structure (a generalisation of Bloom filters) that stores an approximation from above of a lattice-valued function. Computing the function value requires a constant number of steps, and the error probability can be balanced with space usage. As a result, we obtain practical implementations of Boyer-Moore type algorithms that can be used with very large alphabets, such as Unicode collation elements. The techniques we develop are very general and amenable to a wide range of applications.
Settore INF/01 - Informatica
2005
Article (author)
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0167642304001005-main.pdf

accesso solo dalla rete interna

Tipologia: Publisher's version/PDF
Dimensione 665.63 kB
Formato Adobe PDF
665.63 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/4688
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 7
  • ???jsp.display-item.citation.isi??? 7
social impact