In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries --- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ~0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answers 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal --- as a first example, our preliminary experimental study confirms that our algorithm has extremely good performances on other (social network based) MaxCover instances.
Essential Web Pages Are Easy to Find / R.A. Baeza Yates, P. Boldi, F. Chierichetti - In: WWW '15: Proceedings / [a cura di] A. Gangemi, S. Leonardi, A. Panconesi. - [s.l] : ACM, 2015 May 18. - ISBN 9781450334693. - pp. 97-107 (( Intervento presentato al 24. convegno International Conference on World Wide Web tenutosi a Firenze nel 2015 [10.1145/2736277.2741100].
Essential Web Pages Are Easy to Find
P. Boldi
;
2015
Abstract
In this paper we address the problem of estimating the index size needed by web search engines to answer as many queries as possible by exploiting the marked difference between query and click frequencies. We provide a possible formal definition for the notion of essential web pages as those that cover a large fraction of distinct queries --- i.e., we look at the problem as a version of MaxCover. Although in general MaxCover is approximable to within a factor of 1-1/e ~0.632 from the optimum, we provide a condition under which the greedy algorithm does find the actual best cover (or remains at a known bounded factor from it). The extra check for optimality (or for bounding the ratio from the optimum) comes at a negligible algorithmic cost. Moreover, in most practical instances of this problem, the algorithm is able to provide solutions that are provably optimal, or close to optimal. We relate this observed phenomenon to some properties of the queries' click graph. Our experimental results confirm that a small number of web pages can respond to a large fraction of the queries (e.g., 0.4% of the pages answers 20% of the queries). Our approach can be used in several related search applications, and has in fact an even more general appeal --- as a first example, our preliminary experimental study confirms that our algorithm has extremely good performances on other (social network based) MaxCover instances.| File | Dimensione | Formato | |
|---|---|---|---|
|
p97-baeza-yates.pdf
accesso riservato
Tipologia:
Publisher's version/PDF
Dimensione
713.45 kB
Formato
Adobe PDF
|
713.45 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.




