Ultra-Large-Scale Repository Analysis via Graph Compression / P. Boldi, A. Pietri, S. Vigna, S. Zacchiroli. - In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). - First edition. - [s.l.] : IEEE, 2020. - ISBN 9781728151434. - pp. 184-194. (Paper presented at the 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), held in London in 2020) [10.1109/SANER48275.2020.9054827].

Ultra-Large-Scale Repository Analysis via Graph Compression

P. Boldi; S. Vigna
2020

Abstract

We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly sized corpora. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits harvested from more than 80 million software projects, encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
mining software repositories; source code; version control systems; development history; software evolution; graph compression
Sector INF/01 - Computer Science
2020
Book Part (author)
Use this identifier to cite or link to this document: https://hdl.handle.net/2434/773875