Missing value replacement in strings and applications

Bernardini, G.; Liu, C.; Loukides, G.; Marchetti-Spaccamela, A.; Pissis, S.P.; Stougie, L.; Sweering, M.

doi:10.1007/s10618-024-01074-3

Missing values arise routinely in real-world sequential (string) datasets due to: (1)imprecise data measurements; (2) flexible sequence modeling, such as binding pro-files of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length for-bidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.

Missing value replacement in strings and applications / G. Bernardini, C. Liu, G. Loukides, A. Marchetti-Spaccamela, S.P. Pissis, L. Stougie, M. Sweering. - In: DATA MINING AND KNOWLEDGE DISCOVERY. - ISSN 1384-5810. - 39:2(2025 Mar), pp. 12.1-12.50. [10.1007/s10618-024-01074-3]

Missing value replacement in strings and applications

G. Bernardini^Primo;Liu, Chang;Loukides, Grigorios;Marchetti-Spaccamela, Alberto;Pissis, Solon P.;Stougie, Leen;Sweering, Michelle^Ultimo

2025

Abstract

Missing values arise routinely in real-world sequential (string) datasets due to: (1)imprecise data measurements; (2) flexible sequence modeling, such as binding pro-files of molecular sequences; or (3) the existence of confidential information in a dataset which has been deleted deliberately for privacy protection. In order to analyze such datasets, it is often important to replace each missing value, with one or more valid letters, in an efficient and effective way. Here we formalize this task as a combinatorial optimization problem: the set of constraints includes the context of the missing value (i.e., its vicinity) as well as a finite set of user-defined forbidden patterns, modeling, for instance, implausible or confidential patterns; and the objective function seeks to minimize the number of new letters we introduce. Algorithmically, our problem translates to finding shortest paths in special graphs that contain forbidden edges representing the forbidden patterns. Our work makes the following contributions: (1) we design a linear-time algorithm to solve this problem for strings over constant-sized alphabets; (2) we show how our algorithm can be effortlessly applied to fully sanitize a private string in the presence of a set of fixed-length for-bidden patterns [Bernardini et al. 2021a]; (3) we propose a methodology for sanitizing and clustering a collection of private strings that utilizes our algorithm and an effective and efficiently computable distance measure; and (4) we present extensive experimental results showing that our methodology can efficiently sanitize a collection of private strings while preserving clustering quality, outperforming the state of the art and baselines. To arrive at our theoretical results, we employ techniques from formal languages and combinatorial pattern matching.

Scheda breve

Scheda completa

Scheda completa (DC)

	Parole chiave
	
				string algorithms; forbidden patterns; missing value replacement; string sanitization
			
	Settori scientifico-disciplinari dell'articolo (validi dal 09/05/2024)
	
				Settore INFO-01/A - Informatica
			
	Titolo del progetto
	
	Titolo Progetto
	
									Algorithmic and Mechanism Design Research in Online MArkets
								
	Acronimo
	
									AMDROMA
								
	Nome finanziatore
	
										European Commission
									
	Finanziamento
	
									Horizon 2020 Framework Programme
								
	N. Contratto
	
									788893
								
	Data di pubblicazione
	
				mar-2025
			
	Rivista in ANCE
	
				DATA MINING AND KNOWLEDGE DISCOVERY
			
	DOI
	
				https://dx.doi.org/10.1007/s10618-024-01074-3
			
	Tipologia
	
				Article (author)
			
	Appare nelle tipologie:
	
				01 - Articolo su periodico

File in questo prodotto:

File	Dimensione	Formato
Bernardini_et_al-2025-Data_Mining_and_Knowledge_Discovery.pdf accesso aperto Tipologia: Publisher's version/PDF Licenza: Creative commons Dimensione 3.28 MB Formato Adobe PDF Visualizza/Apri	3.28 MB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1138695

Citazioni

ND

4

3

1

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca