Knowledge about the general graph structure of theWorldWideWeb is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3:5 billion web pages and 128:7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al., 2000 are very dependent on artifacts of the crawling process, whereas other appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree and sizes of strongly connected components of the page and host graph are not power laws, contrarily to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at pay-level domain, however, a power law emerges. We also provide for the first time accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).

The Graph Structure in the Web : Analyzed on Different Aggregation Levels / R. Meusel, S. Vigna, O. Lehmberg, C. Bizer. - In: THE JOURNAL OF WEB SCIENCE. - ISSN 2332-4031. - 1:1(2015), pp. 33-47. [10.1561/106.00000003]

The Graph Structure in the Web : Analyzed on Different Aggregation Levels

S. Vigna
Primo
;
2015

Abstract

Knowledge about the general graph structure of theWorldWideWeb is important for understanding the social mechanisms that govern its growth, for designing ranking methods, for devising better crawling algorithms, and for creating accurate models of its structure. In this paper, we analyze a large web graph. The graph was extracted from a large publicly accessible web crawl that was gathered by the Common Crawl Foundation in 2012. The graph covers over 3:5 billion web pages and 128:7 billion hyperlinks. We analyze and compare, among other features, degree distributions, connectivity, average distances, and the structure of weakly/strongly connected components. We conduct our analysis on three different levels of aggregation: page, host, and pay-level domain (PLD) (one “dot level” above public suffixes). Our analysis shows that, as evidenced by previous research (Serrano et al., 2007), some of the features previously observed by Broder et al., 2000 are very dependent on artifacts of the crawling process, whereas other appear to be more structural. We confirm the existence of a giant strongly connected component; we however find, as observed by other researchers (Donato et al., 2005; Boldi et al., 2002; Baeza-Yates and Poblete, 2003), very different proportions of nodes that can reach or that can be reached from the giant component, suggesting that the “bow-tie structure” as described by Broder et al. is strongly dependent on the crawling process, and to the best of our current knowledge is not a structural property of the Web. More importantly, statistical testing and visual inspection of size-rank plots show that the distributions of indegree, outdegree and sizes of strongly connected components of the page and host graph are not power laws, contrarily to what was previously reported for much smaller crawls, although they might be heavy tailed. If we aggregate at pay-level domain, however, a power law emerges. We also provide for the first time accurate measurement of distance-based features, using recently introduced algorithms that scale to the size of our crawl (Boldi and Vigna, 2013).
English
world wide web; network analysis; web mining; web graph; power law; bow-tie structure
Settore INF/01 - Informatica
Articolo
Esperti anonimi
Ricerca di base
Pubblicazione scientifica
   New tools and Algorithms for Direct NEtwork analysis
   NADINE
   EUROPEAN COMMISSION
   FP7
   288956
2015
1
1
33
47
15
Pubblicato
Periodico con rilevanza internazionale
crossref
Aderisco
info:eu-repo/semantics/article
The Graph Structure in the Web : Analyzed on Different Aggregation Levels / R. Meusel, S. Vigna, O. Lehmberg, C. Bizer. - In: THE JOURNAL OF WEB SCIENCE. - ISSN 2332-4031. - 1:1(2015), pp. 33-47. [10.1561/106.00000003]
open
Prodotti della ricerca::01 - Articolo su periodico
4
262
Article (author)
Periodico senza Impact Factor
R. Meusel, S. Vigna, O. Lehmberg, C. Bizer
File in questo prodotto:
File Dimensione Formato  
Vigna_JWebScience_2015.pdf

accesso aperto

Tipologia: Publisher's version/PDF
Dimensione 835.25 kB
Formato Adobe PDF
835.25 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/372411
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact