Maximizing data quality while ensuring data protection in service-based data pipelines / A. Polimeno, C. Braghin, M. Anisetti, C.A. Ardagna. - In: JOURNAL OF BIG DATA. - ISSN 2196-1115. - 12:1(2025 Dec), pp. 62.1-62.34. [10.1186/s40537-025-01118-5]

Maximizing data quality while ensuring data protection in service-based data pipelines

A. Polimeno (first author); C. Braghin (second author); M. Anisetti (penultimate author); C.A. Ardagna (last author)
2025

Abstract

The growing capacity to handle vast amounts of data, combined with a shift in service delivery models, has improved scalability and efficiency in data analytics, particularly in multi-tenant environments. Data are treated as digital products and processed through orchestrated service-based data pipelines. However, advancements in data analytics do not find a counterpart in data governance techniques, leaving a gap in the effective management of data throughout the pipeline lifecycle. This gap highlights the need for innovative service-based data pipeline management solutions that prioritize balancing data quality and data protection. The framework proposed in this paper optimizes service selection and composition within service-based data pipelines to maximize data quality while ensuring compliance with data protection requirements, expressed as access control policies. Given the NP-hard nature of the problem, a sliding-window heuristic is defined and evaluated against the exhaustive approach and a baseline modeling the state of the art. Our results demonstrate a significant reduction in computational overhead while maintaining high data quality.
Access control; Big data; Data protection; Data quality; Privacy; Service-based data pipelines
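
The abstract names a sliding-window heuristic for service selection but this record does not define it. The following is a minimal illustrative sketch of one way such a heuristic could work, not the authors' implementation. All names are hypothetical. It assumes each candidate service is modeled only by the set of record identifiers it may keep after its access control policy is enforced, and that quality is the fraction of the original records surviving the whole pipeline.

```python
from itertools import product


def quality(dataset, services):
    """Fraction of records left after every selected service filters the data.

    Assumption: a service is represented by the frozenset of record ids that
    its access control policy allows it to keep.
    """
    data = set(dataset)
    for kept in services:
        data &= kept
    return len(data) / len(dataset)


def sliding_window_select(dataset, candidates, window):
    """Fix one service per stage, looking `window` stages ahead each time."""
    chosen = []
    for i in range(len(candidates)):
        lookahead = candidates[i:i + window]
        # Enumerate every combination inside the window only.
        best = max(
            product(*lookahead),
            key=lambda combo: quality(dataset, chosen + list(combo)),
        )
        chosen.append(best[0])  # commit the first stage, then slide by one
    return chosen


def exhaustive_select(dataset, candidates):
    """Reference approach: enumerate every combination over all stages."""
    return list(max(product(*candidates), key=lambda c: quality(dataset, c)))


# Hypothetical two-stage pipeline with two candidate services per stage.
DATASET = set(range(6))
CANDIDATES = [
    [frozenset({0, 1, 2, 3}), frozenset({2, 3, 4})],
    [frozenset({2, 3, 4, 5}), frozenset({4, 5})],
]

if __name__ == "__main__":
    for w in (1, 2):
        picked = sliding_window_select(DATASET, CANDIDATES, w)
        print(w, quality(DATASET, picked))
```

In this toy instance a window of 1 behaves greedily and commits to the first-stage service that keeps the most records locally, while a window of 2 covers the whole pipeline and matches the exhaustive result. The window size therefore trades search cost against the risk of a locally good but globally poor choice.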
Sector INFO-01/A - Computer Science
   MUSA - Multilayered Urban Sustainability Action
   MUSA
   MINISTERO DELL'UNIVERSITA' E DELLA RICERCA

   BA-PHERD: Big Data Analytics Pipeline for the Identification of Heterogeneous Extracellular non-coding RNAs as Disease Biomarkers
   BA-PHERD
   MINISTERO DELL'UNIVERSITA' E DELLA RICERCA
   2022XABBMA_002

   SEcurity and RIghts in the CyberSpace (SERICS)
   SERICS
   MINISTERO DELL'UNIVERSITA' E DELLA RICERCA
   identification code PE00000014
Dec-2025
10-Mar-2025
Article (author)
Files in this item:
s40537-025-01118-5.pdf (open access)
Description: Research
Type: Publisher's version/PDF
Size: 2.93 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1154615