Ultra-Large-Scale Repository Analysis via Graph Compression

Boldi, Paolo; Pietri, Antoine; Vigna, Sebastiano; Zacchiroli, Stefano

doi:10.1109/SANER48275.2020.9054827

We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine corpora of that size. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects and encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to those of random memory access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
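The abstract does not say which compression scheme is used, so the following is only a generic illustration of why adjacency lists tend to compress well, not the authors' method. It is a minimal sketch assuming each node's successors are stored as a sorted list of integer identifiers: storing the gaps between consecutive identifiers and writing each gap as a variable-length integer makes nearby identifiers cost a single byte each. The function names `encode_varint`, `compress_successors`, and `decompress_successors` are hypothetical.

```python
from typing import Iterable, List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, low-order group first."""
    if n < 0:
        raise ValueError("varint encoding requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte of this value
            return bytes(out)


def compress_successors(successors: Iterable[int]) -> bytes:
    """Gap-encode a node's successor list (sorted and deduplicated first)."""
    out = bytearray()
    prev = 0
    for node in sorted(set(successors)):
        out += encode_varint(node - prev)  # store the gap, not the identifier
        prev = node
    return bytes(out)


def decompress_successors(data: bytes) -> List[int]:
    """Invert compress_successors by summing the decoded gaps."""
    result: List[int] = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # value continues in the next byte
        else:
            prev += gap
            result.append(prev)
            gap = 0
            shift = 0
    return result


successors = [1000000, 1000003, 1000010, 1000042]
packed = compress_successors(successors)
assert decompress_successors(packed) == successors
```

In this toy case the four identifiers take 6 bytes (3 for the first, 1 for each following gap) instead of 16 bytes as 32-bit integers. Production graph compression frameworks layer further techniques on top of this idea.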

Ultra-Large-Scale Repository Analysis via Graph Compression / P. Boldi, A. Pietri, S. Vigna, S. Zacchiroli. - In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). - First edition. - [s.l.] : IEEE, 2020. - ISBN 9781728151434. - pp. 184-194. Paper presented at the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), held in London in 2020 [10.1109/SANER48275.2020.9054827].



Keywords: mining software repositories; source code; version control systems; development history; software evolution; graph compression
Scientific-disciplinary sector: INF/01 - Informatica (Computer Science)
Publication date: 2020
DOI: https://dx.doi.org/10.1109/SANER48275.2020.9054827
Type: Book Part (author)
Appears in types: 03 - Book contribution

Files in this record:

File	Size	Format
09054827.pdf (restricted access; type: Publisher's version/PDF)	951.2 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/773875

