Although web crawlers have been around for twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems, and, at the same time, scales linearly with the amount of resources available. This article aims to fill this gap by describing BUbiNG, our next-generation web crawler built upon the authors’ experience with UbiCrawler [9] and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG bases its job distribution on modern high-speed protocols to achieve very high throughput.
BUbiNG: Massive Crawling for the Masses / P. Boldi, A. Marino, M. Santini, S. Vigna. - In: ACM TRANSACTIONS ON THE WEB. - ISSN 1559-1131. - 12:2(2018 Jun 02), pp. 12.12:1-12.12:26.
Title: | BUbiNG: Massive Crawling for the Masses |
Authors: | VIGNA, SEBASTIANO (Corresponding) |
Keywords: | web crawling; centrality measures; distributed systems |
Scientific Disciplinary Sector: | Sector INF/01 - Computer Science |
Project: | New tools and Algorithms for Direct NEtwork analysis |
Publication date: | 2 June 2018 |
Journal: | ACM Transactions on the Web |
Type: | Article (author) |
Digital Object Identifier (DOI): | http://dx.doi.org/10.1145/3160017 |
Appears in categories: | 01 - Journal article |
Files in this item:
File | Description | Type | License | |
---|---|---|---|---|
BubiING_MassiveCrawlings_ACMTransactions_2018.pdf | | Publisher's version/PDF | Administrator | |