Large inverted indices are by now common in the construction of web-scale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. The classical approach to skipping dictates that a skip should be positioned every f1/2 document pointers, where f is the overall number of documents where the term appears. We argue that due to the growing size of the web more refined techniques are necessary, and describe how to embed a compressed perfect skip list in an inverted list. We provide statistical models that explain the empirical distribution of the skip data we observe in our experiments, and use them to devise good compression techniques that allow us to limit the waste in space, so that the resulting data structure increases the overall index size by just a few percents, still making it possible to index pointers with a rather fine granularity.

Compressed perfect embedded skip lists for quick inverted-index lookups / Paolo Boldi, Sebastiano Vigna - In: String Processing and Information Retrieval : 12th International Conference, SPIRE 2005, Buenos Aires, Argentina, November 2-4, 2005 : Proceedings / Mariano Consens, Gonzalo Navarro. - Berlin : Springer Verlag, 2005. - ISBN 3540297405. - pp. 25-28 (( Intervento presentato al 12. convegno SPIRE 2005 tenutosi a Buenos Aires nel 2005 [10.1007/11575832_3].

Compressed perfect embedded skip lists for quick inverted-index lookups

P. Boldi;S. Vigna
2005

Abstract

Large inverted indices are by now common in the construction of web-scale search engines. For faster access, inverted indices are indexed internally so that it is possible to skip quickly over unnecessary documents. The classical approach to skipping dictates that a skip should be positioned every f1/2 document pointers, where f is the overall number of documents where the term appears. We argue that due to the growing size of the web more refined techniques are necessary, and describe how to embed a compressed perfect skip list in an inverted list. We provide statistical models that explain the empirical distribution of the skip data we observe in our experiments, and use them to devise good compression techniques that allow us to limit the waste in space, so that the resulting data structure increases the overall index size by just a few percents, still making it possible to index pointers with a rather fine granularity.
Settore INF/01 - Informatica
2005
Book Part (author)
File in questo prodotto:
Non ci sono file associati a questo prodotto.
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/4698
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 20
  • ???jsp.display-item.citation.isi??? 13
social impact