A Differentially Private Data Structure for Substring and Document Counting

Bernardini, G.; Bille, P.; Gørtz, I.L.; Steiner, T.A.

doi:10.1145/3810900.3810908

For databases consisting of many text documents, one of the most fundamental data analysis tasks is counting (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). If such a database contains sensitive data, it is crucial to protect the privacy of individuals in the database. Differential privacy is the gold standard for privacy in data analysis. It gives rigorous privacy guarantees, but comes at the cost of yielding less accurate results. In this paper, we study the problem of substring and document counting under differential privacy. We give the first differentially private data structures for these problems and provide bounds on their additive error. For ε-differential privacy, we show that the error of our data structure is optimal up to a poly-logarithmic factor in the number of documents and length of the longest document. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and q-grams.

A Differentially Private Data Structure for Substring and Document Counting / G. Bernardini, P. Bille, I.L. Gørtz, T.A. Steiner. - In: SIGMOD RECORD. - ISSN 0163-5808. - 55:1(2026 Apr 26), pp. 41-49. [10.1145/3810900.3810908]

A Differentially Private Data Structure for Substring and Document Counting

G. Bernardini^Primo;Bille, Philip;Gørtz, Inge Li;Steiner, Teresa Anna

2026

Abstract

For databases consisting of many text documents, one of the most fundamental data analysis tasks is counting (i) how often a pattern appears as a substring in the database (substring counting) and (ii) how many documents in the collection contain the pattern as a substring (document counting). If such a database contains sensitive data, it is crucial to protect the privacy of individuals in the database. Differential privacy is the gold standard for privacy in data analysis. It gives rigorous privacy guarantees, but comes at the cost of yielding less accurate results. In this paper, we study the problem of substring and document counting under differential privacy. We give the first differentially private data structures for these problems and provide bounds on their additive error. For ε-differential privacy, we show that the error of our data structure is optimal up to a poly-logarithmic factor in the number of documents and length of the longest document. Our data structures immediately lead to improved algorithms for related problems, such as privately mining frequent substrings and q-grams.

Scheda breve

Scheda completa

Scheda completa (DC)

	Parole chiave
	
				Differential privacy; Document counting; Substring counting; String mining; Frequent pattern mining
			
	Settori scientifico-disciplinari dell'articolo (validi dal 09/05/2024)
	
				Settore INFO-01/A - Informatica
			
	Data di pubblicazione
	
				26-apr-2026
			
	Rivista in ANCE
	
				SIGMOD RECORD
			
	DOI
	
				https://dx.doi.org/10.1145/3810900.3810908
			
	Tipologia
	
				Article (author)
			
	Appare nelle tipologie:
	
				01 - Articolo su periodico

File in questo prodotto:

File	Dimensione	Formato
A Differentially Private Data Structure for Substring and Document Counting.pdf accesso aperto Tipologia: Publisher's version/PDF Licenza: Non specificato Dimensione 9.64 MB Formato Adobe PDF Visualizza/Apri	9.64 MB	Adobe PDF	Visualizza/Apri

Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/2434/1240735

Citazioni

ND

ND

ND

ND

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca