Hide and mine in strings: hardness, algorithms, and experiments

Bernardini, G.; Conte, A.; Gourdel, G.; Grossi, R.; Loukides, G.; Pisanti, N.; Pissis, S.; Punzi, G.; Stougie, L.; Sweering, M.

doi:10.1109/tkde.2022.3158063

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.

Hide and mine in strings: hardness, algorithms, and experiments / G. Bernardini, A. Conte, G. Gourdel, R. Grossi, G. Loukides, N. Pisanti, S. Pissis, G. Punzi, L. Stougie, M. Sweering. - In: IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. - ISSN 1041-4347. - 35:6(2023 Jun), pp. 5948-5963. [10.1109/tkde.2022.3158063]

Hide and mine in strings: hardness, algorithms, and experiments

G. Bernardini^Primo;Conte, Alessio;Gourdel, Garance;Grossi, Roberto;Loukides, Grigorios;Pisanti, Nadia;Pissis, Solon;Punzi, Giulia;Stougie, Leen;Sweering, Michelle^Ultimo

2023

Abstract

Data sanitization and frequent pattern mining are two well-studied topics in data mining. Data sanitization is the process of disguising (hiding) confidential information in a given dataset. Typically, this process incurs some utility loss that should be minimized. Frequent pattern mining is the process of obtaining all patterns occurring frequently enough in a given dataset. Our work initiates a study on the fundamental relation between data sanitization and frequent pattern mining in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns. This, however, may lead to spurious patterns that harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is as follows. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under realistic assumptions on the input parameters. We also complement the integer linear programming algorithms with a greedy heuristic. Third, we present an extensive experimental study, using both synthetic and real-world datasets, that demonstrates the effectiveness and efficiency of our methods. Beyond sanitization, the process of missing value replacement may also lead to spurious patterns. Interestingly, our results apply in this context as well. We show that, unlike popular approaches, our methods can fill missing values in genomic sequences, while preserving the accuracy of frequent pattern mining.

Scheda breve

Scheda completa

Scheda completa (DC)

	Parole chiave
	
				data mining; bioinformatics; genomics; DNA; data integrity; privacy; resists; data privacy; data sanitization; knowledge hiding; frequent pattern mining; string algorithms
			
	Settori scientifico-disciplinari dell'articolo (validi dal 09/05/2024)
	
				Settore INFO-01/A - Informatica
			
	Titolo del progetto
	
	Titolo Progetto
	
									ALgorithms for PAngenome Computational Analysis
								
	Acronimo
	
									ALPACA
								
	Nome finanziatore
	
										European Commission
									
	Finanziamento
	
									Horizon 2020 Framework Programme
								
	N. Contratto
	
									956229
								
	Titolo Progetto
	
									Pan-genome Graph Algorithms and Data Integration
								
	Acronimo
	
									PANGAIA
								
	Nome finanziatore
	
										European Commission
									
	Finanziamento
	
									Horizon 2020 Framework Programme
								
	N. Contratto
	
									872539
								
	Data di pubblicazione
	
				giu-2023
			
	Data ahead of print o data di stampa
	
				10-mar-2022
			
	Rivista in ANCE
	
				IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
			
	DOI
	
				https://dx.doi.org/10.1109/tkde.2022.3158063
			
	Tipologia
	
				Article (author)
			
	Appare nelle tipologie:
	
				01 - Articolo su periodico

File in questo prodotto:

File	Dimensione	Formato
Hide_and_Mine_in_Strings_Hardness_Algorithms_and_Experiments.pdf accesso riservato Tipologia: Publisher's version/PDF Dimensione 1.08 MB Formato Adobe PDF Visualizza/Apri Richiedi una copia	1.08 MB	Adobe PDF	Visualizza/Apri Richiedi una copia

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1131528

Citazioni

ND

9

6

ND

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca