Scalable Distributed Data Anonymization for Large Datasets

De Capitani di Vimercati, S.; Facchinetti, D.; Foresti, S.; Livraga, G.; Oldani, G.; Paraboschi, S.; Rossi, M.; Samarati, P.

doi:10.1109/TBDATA.2022.3207521

k-Anonymity and l-diversity are two well-known privacy metrics that guarantee protection of the respondents of a dataset by obfuscating information that can disclose their identities and sensitive information. Existing solutions for enforcing them implicitly assume to operate in a centralized scenario, since they require complete visibility over the dataset to be anonymized, and can therefore have limited applicability in anonymizing large datasets. In this paper, we propose a solution that extends Mondrian (an efficient and effective approach designed for achieving k-anonymity) for enforcing both k-anonymity and l-diversity over large datasets in a distributed manner, leveraging the parallel computation of multiple workers. Our approach efficiently distributes the computation among the workers, without requiring visibility over the dataset in its entirety. Our data partitioning limits the need for workers to exchange data, so that each worker can independently anonymize a portion of the dataset. We implemented our approach providing parallel execution on a dynamically chosen number of workers. The experimental evaluation shows that our solution provides scalability, while not affecting the quality of the resulting anonymization.
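The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the record's keywords mention Apache Spark, which is not used here). The function names `is_valid`, `mondrian`, `generalize`, and `anonymize_distributed` are hypothetical. The pre-partitioning strategy, which sorts on the first quasi-identifier and cuts the data into contiguous fragments, is an assumption, since the abstract does not specify one. The sketch also assumes numeric quasi-identifiers, a relaxed median cut, `k >= 1`, and that every fragment satisfies k and l on its own.

```python
from concurrent.futures import ThreadPoolExecutor

def is_valid(part, k, l, sensitive):
    # k-anonymity: at least k records; l-diversity: at least l distinct sensitive values
    return len(part) >= k and len({r[sensitive] for r in part}) >= l

def mondrian(records, qi, sensitive, k, l):
    # Try the quasi-identifiers from the widest to the narrowest value range
    def span(attr):
        values = [r[attr] for r in records]
        return max(values) - min(values)

    for attr in sorted(qi, key=span, reverse=True):
        ordered = sorted(records, key=lambda r: r[attr])
        mid = len(ordered) // 2  # relaxed median cut (ties may be separated)
        left, right = ordered[:mid], ordered[mid:]
        if is_valid(left, k, l, sensitive) and is_valid(right, k, l, sensitive):
            return (mondrian(left, qi, sensitive, k, l)
                    + mondrian(right, qi, sensitive, k, l))
    return [records]  # no allowable cut: this is a final partition

def generalize(part, qi, sensitive):
    # Replace each quasi-identifier value with the (min, max) range of its partition
    ranges = {a: (min(r[a] for r in part), max(r[a] for r in part)) for a in qi}
    return [{**ranges, sensitive: r[sensitive]} for r in part]

def anonymize_distributed(records, qi, sensitive, k, l, workers=2):
    # Assumed pre-partitioning: sort on the first quasi-identifier and cut into
    # at most `workers` contiguous fragments, so no fragment needs another's data
    ordered = sorted(records, key=lambda r: r[qi[0]])
    size = max(1, -(-len(ordered) // workers))  # ceiling division
    fragments = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    # Each fragment is anonymized independently; threads stand in for workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda f: mondrian(f, qi, sensitive, k, l), fragments)
    return [part for parts in results for part in parts]

if __name__ == "__main__":
    diseases = ["flu", "asthma", "diabetes"]
    data = [{"age": 20 + i, "zip": 10000 + (i * 7) % 50, "disease": diseases[i % 3]}
            for i in range(40)]
    parts = anonymize_distributed(data, ["age", "zip"], "disease", k=3, l=2, workers=4)
    for part in parts:
        print(generalize(part, ["age", "zip"], "disease")[0], "x", len(part))
```

Because a partition is cut only when both halves remain valid, every final partition keeps at least k records and l distinct sensitive values, provided the fragment given to each worker was valid to begin with.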

Scalable Distributed Data Anonymization for Large Datasets / S. De Capitani di Vimercati, D. Facchinetti, S. Foresti, G. Livraga, G. Oldani, S. Paraboschi, M. Rossi, P. Samarati. - In: IEEE TRANSACTIONS ON BIG DATA. - ISSN 2332-7790. - 9:3(2023 Jun 01), pp. 818-831. [10.1109/TBDATA.2022.3207521]

International co-authors: No
Article language: English
Keywords: Distributed data anonymization; Mondrian; k-Anonymity; l-Diversity; Apache Spark
Scientific-disciplinary sectors of the article: Sector INF/01 - Informatica (Computer Science)
Type: Article
Peer review: Anonymous experts
Publication classification: Scientific publication
Project title: Multi-Owner data Sharing for Analytics and Integration respecting Confidentiality and Owner control (MOSAICrOWN)
	Acronym: MOSAICrOWN
	Funder: EUROPEAN COMMISSION
	Funding: H2020
	Contract no.: 825333
Project title: Green responsibLe privACy preservIng dAta operaTIONs
	Acronym: GLACIATION
	Funder: EUROPEAN COMMISSION
Project title: High quality Open data Publishing and Enrichment (HOPE)
	Acronym: HOPE
	Funder: MINISTERO DELL'ISTRUZIONE E DEL MERITO
	Contract no.: 2017MMJJRE_003
Project title: Machine Learning-based, Networking and Computing Infrastructure Resource Management of 5G and beyond Intelligent Networks (MARSAL)
	Acronym: MARSAL
	Funder: EUROPEAN COMMISSION
	Funding: H2020
	Contract no.: 101017171
Publication date: 1 June 2023
Ahead-of-print or print date: May 2023
Journal in ANCE: IEEE TRANSACTIONS ON BIG DATA
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Volume or year: 9
Issue: 3
First page: 818
Last page: 831
Number of pages: 14
Publication status: Published
Journal relevance: Journal of international relevance
DOI: https://dx.doi.org/10.1109/TBDATA.2022.3207521
Source database: manual
ISI identifier: WOS:000988277900004
Scopus identifier: 2-s2.0-85139441816
Adherence to the University Open Access policy: I adhere
Type: info:eu-repo/semantics/article
Full text: open
Type: Research products::01 - Journal article
Number of authors: 8
Faculty site type: 262
Type: Article (author)
Impact factor: Journal with impact factor
All authors: S. De Capitani di Vimercati, D. Facchinetti, S. Foresti, G. Livraga, G. Oldani, S. Paraboschi, M. Rossi, P. Samarati
Appears in types: 01 - Journal article

Files in this item:

File	Access	Type	Size	Format
Scalable_Distributed_Data_Anonymization_for_Large_Datasets.pdf	open access	Publisher's version/PDF	1.53 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/940404

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca