A Novel Assurance Procedure for Fair Data Augmentation in Machine Learning / S. Maghool, P. Ceravolo, F. Berto. - In: AIEB 2024 : Workshop on Implementing AI Ethics through a Behavioural Lens 2024 / edited by L. Nannini, A. Gillard, C. Friedman Levy, A. Ozkes, M. Slavkovik. - [s.l.] : CEUR-WS, 2025 Apr 08. - pp. 25-36. (Paper presented at the 26th European Conference on Artificial Intelligence, held in Santiago de Compostela in 2024.)
A Novel Assurance Procedure for Fair Data Augmentation in Machine Learning
S. Maghool (first author); P. Ceravolo (second author); F. Berto (last author)
2025
Abstract
Machine learning models are often trained on limited data, and augmenting a dataset to compensate can introduce bias. Advanced algorithms can generate synthetic data that preserves the original data distribution, but challenges remain, including the risk of perpetuating social biases. Our approach uses a similarity network representation that treats each data point as a node and strategically generates synthetic points near it. A vector label propagation algorithm, complemented by an exponential kernel for adjusting link weights, accurately labels these synthetic points. The primary goal is to reduce the system's dependence on sensitive features without excluding them, thereby avoiding the risk of exacerbating biases or reducing data variation. Implemented in a big data ecosystem, our methodology enables continuous evaluation in an evolving domain and addresses data scarcity with a fairness-aware approach.

| File | Access | Type | Size | Format |
|---|---|---|---|---|
| paper3.pdf | Open access | Publisher's version/PDF | 1.42 MB | Adobe PDF |
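
The labeling step described in the abstract (synthetic points generated near each node, then labeled by propagating label vectors over exponentially weighted links) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian jitter used to create synthetic points, the kernel form `exp(-d / sigma)`, the clamped iterative update, and all function and parameter names (`make_synthetic`, `exponential_weights`, `propagate_labels`, `sigma`, `scale`) are assumptions chosen for illustration. The fairness assessment on sensitive features is not covered here.

```python
import numpy as np


def make_synthetic(X, per_point=2, scale=0.05, seed=0):
    """Draw `per_point` jittered copies around every original point.

    Gaussian jitter is an assumed generator, used only for illustration.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, scale, size=(len(X), per_point, X.shape[1]))
    return (X[:, None, :] + noise).reshape(-1, X.shape[1])


def exponential_weights(Z, sigma=1.0):
    """Link weights w_ij = exp(-||z_i - z_j|| / sigma), with no self-links."""
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    W = np.exp(-dist / sigma)
    np.fill_diagonal(W, 0.0)
    return W


def propagate_labels(X, y, X_syn, n_classes, sigma=1.0, n_iter=100):
    """Spread one-hot label vectors from original nodes to synthetic nodes."""
    Z = np.vstack([X, X_syn])
    W = exponential_weights(Z, sigma)
    P = W / W.sum(axis=1, keepdims=True)  # row-normalised link weights
    n = len(X)
    Y = np.eye(n_classes)[y]              # one-hot label vectors
    F = np.zeros((len(Z), n_classes))
    F[:n] = Y
    for _ in range(n_iter):
        F = P @ F                         # each node averages its neighbours
        F[:n] = Y                         # clamp the known labels
    return F[n:]                          # label vectors of synthetic nodes


# Toy data: two well-separated groups with known labels.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 1, 1])
X_syn = make_synthetic(X)
label_vectors = propagate_labels(X, y, X_syn, n_classes=2, sigma=0.5)
hard_labels = label_vectors.argmax(axis=1)
```

Each row of `label_vectors` is a soft label for one synthetic point; taking the `argmax` gives a hard class. A smaller `sigma` makes distant nodes contribute less, so labels are driven mostly by the nearest original points.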




