IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

BUbiNG: Massive Crawling for the Masses / P. Boldi, A. Marino, M. Santini, S. Vigna. - In: ACM TRANSACTIONS ON THE WEB. - ISSN 1559-1131. - 12:2(2018 Jun 02), pp. 12.12:1-12.12:26. [10.1145/3160017]

BUbiNG: Massive Crawling for the Masses

P. Boldi; Andrea Marino; M. Santini; S. Vigna

2018

Abstract

Although web crawlers have been around for twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and at the same time scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler, built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG distributes jobs using modern high-speed protocols to achieve very high throughput.

Keywords: web crawling; centrality measures; distributed systems

Scientific-disciplinary sectors of the article (display only): Sector INF/01 - Computer Science

Project title: New tools and Algorithms for Direct NEtwork analysis

Acronym: NADINE

Funder name: EUROPEAN COMMISSION

Funding: FP7

Contract No.: 288956

Publication date: 2 Jun 2018

Journal in ANCE: ACM TRANSACTIONS ON THE WEB

DOI: https://dx.doi.org/10.1145/3160017

Type: Article (author)

Appears in types: 01 - Journal article

Files in this item:

File	Size	Format
BubiING_MassiveCrawlings_ACMTransactions_2018.pdf (restricted access; type: Publisher's version/PDF)	2.6 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/579265
