On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets

Koohi Esfahani, M.; Boldi, P.; Vandierendonck, H.; Kilpatrick, P.; Vigna, S.

doi:10.1109/BigData59044.2023.10386309

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic datasets. To ensure continuation of this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world edge-weighted graph datasets with up to 2.5 trillion edges, that is, 6.6 times larger than the largest graph published recently. The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types. We describe the two main challenges we faced in generating large graph datasets and our solutions, namely (i) optimizing data structures and algorithms for this multi-step process and (ii) the WebGraph parallel compression technique. The datasets are available online at https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs.

On Overcoming HPC Challenges of Trillion-Scale Real-World Graph Datasets / M. Koohi Esfahani, P. Boldi, H. Vandierendonck, P. Kilpatrick, S. Vigna. - In: 2023 IEEE International Conference on Big Data (BigData). - [s.l] : IEEE, 2023. - ISBN 979-8-3503-2445-7. - pp. 215-220. (Conference: International Conference on Big Data (BigData), held in Sorrento in 2023.) [10.1109/BigData59044.2023.10386309]

Koohi Esfahani, Mohsen; Boldi, P. (second author); Vandierendonck, Hans; Kilpatrick, Peter; Vigna, S. (last author)

2023

Keywords
	Big Data Management and Processing; Graph Datasets; High-Performance Computing; Biological Networks; Sequence Similarity Graph; Graph Algorithms

Scientific-disciplinary sectors of the contribution (display only)
	Sector INF/01 - Computer Science

Publication date
	2023

DOI
	https://dx.doi.org/10.1109/BigData59044.2023.10386309

Type
	Book Part (author)

Appears in types:
	03 - Contribution in volume

Files in this item:

File	Access	Type	Size	Format
On_Overcoming_HPC_Challenges_of_Trillion-Scale_Real-World_Graph_Datasets.pdf	Restricted access	Publisher's version/PDF	251.96 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1026109

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca