Collections are a fundamental tool for reproducible evaluation of information retrieval techniques. We describe a new method for distributing the document lengths and term counts (a.k.a. within-document frequencies) of a web snapshot in a highly compressed and nonetheless quickly accessible form. Our main application is reproducibility of the behaviour of focused crawlers: by coupling our collection with the corresponding web graph compressed with WebGraph, we make it possible to apply text-based machine learning tools to the collection while keeping the data set footprint small. Finally, we describe a collection based on a crawl of 100 million pages of the .uk domain, publicly available and bundled with an open-source Java implementation of our techniques.
|Title:||Compressed collections for simulated crawling|
|Internal authors:||VIGNA, SEBASTIANO (Last)|
|Scientific Disciplinary Sector:||Sector INF/01 - Computer Science|
|Publication date:||Dec-2008|
|Digital Object Identifier (DOI):||10.1145/1480506.1480512|
|Appears in types:||01 - Journal article|