Ultra-Large-Scale Repository Analysis via Graph Compression / P. Boldi, A. Pietri, S. Vigna, S. Zacchiroli. - In: 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). - First edition. - [s.l.] : IEEE, 2020. - ISBN 9781728151434. - pp. 184-194. (Paper presented at the 27th IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), held in London in 2020) [10.1109/SANER48275.2020.9054827].
Ultra-Large-Scale Repository Analysis via Graph Compression
P. Boldi; S. Vigna
2020
Abstract
We consider the problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine a corpus of this size. As a concrete use case, we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits harvested from more than 80 million software projects, including a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performance, with edge lookup times close to those of random memory access. As a sample exploitation experiment, we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main-memory access speed.
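The abstract does not say which compression scheme is used, so the following is only a minimal sketch of one basic idea behind compressed adjacency lists: storing each node's sorted successors as gaps, written with a variable-length byte code. All function names here are hypothetical and chosen for illustration; this is not the authors' implementation.

```python
from typing import List


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups, least significant first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data: bytes) -> List[int]:
    """Decode a concatenation of varints back into a list of integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def compress_successors(successors: List[int]) -> bytes:
    """Sort the successor IDs and store the differences between neighbours."""
    out = bytearray()
    previous = 0
    for node in sorted(set(successors)):
        out += encode_varint(node - previous)  # small gaps need few bytes
        previous = node
    return bytes(out)


def decompress_successors(data: bytes) -> List[int]:
    """Rebuild the sorted successor list by summing the stored gaps."""
    result, previous = [], 0
    for gap in decode_varints(data):
        previous += gap
        result.append(previous)
    return result
```

When successor IDs are close together, most gaps fit in a single byte, which is the intuition for why an in-memory graph can shrink well below a plain array of fixed-width integers while remaining directly traversable.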