Synthetic Data for Identifying Inclusive Language (Case Study: Job Descriptions in Italian)

Romano, T.; Mohammadi, F.; Ceravolo, P.

doi:10.1109/csr61664.2024.10679398

Using a comprehensive list of job titles, we propose a framework to automatically generate job descriptions in Italian. This synthetic data is then used in a Large Language Model to detect inclusive language in job postings. Finally, we compare the results of this synthetic dataset with real data. Our study demonstrates that the data format and prompting method significantly impact performance. Additionally, we identify limitations and key considerations for unifying synthetic data with real data for fine-tuning purposes. We also propose improvements to the framework and provide guidelines for effectively integrating these two types of data. The novelty of our work is generating and integrating synthetic data due to the scarcity of annotated Italian job descriptions, thereby improving the training of Large Language Models (LLMs) tailored specifically for Italian.

Synthetic Data for Identifying Inclusive Language (Case Study: Job Descriptions in Italian) / T. Romano, F. Mohammadi, P. Ceravolo (PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND RESILIENCE (CSR)). - In: Proceedings of the 2024 IEEE International Conference on Cyber Security and Resilience (CSR) / edited by S. Shiaeles, N. Kolokotronis, E. Bellini. - [s.l] : IEEE, 2024 Sep. - ISBN 979-8-3503-7536-7. - pp. 737-742 (IEEE International Conference on Cyber Security and Resilience, CSR, held in London in 2024) [10.1109/csr61664.2024.10679398].



Scientific-disciplinary sectors of the contribution (valid from 9 May 2024): INFO-01/A - Informatica (Computer Science)
Project title: MUSA - Multilayered Urban Sustainability Action
Acronym: MUSA
Funder: MINISTERO DELL'UNIVERSITA' E DELLA RICERCA (Italian Ministry of University and Research)
Publication date: Sep 2024
Organizations associated with the conference: IEEE Systems, Man, and Cybernetics Society (SMC)
DOI: https://dx.doi.org/10.1109/csr61664.2024.10679398
Type: Book Part (author)
Appears in types: 03 - Contributo in volume (Contribution in volume)

Files in this record:

File	Size	Format
Synthetic_Data_for_Identifying_Inclusive_Language_Case_Study_Job_Descriptions_in_Italian.pdf (restricted access; description: Conference Paper; type: Publisher's version/PDF)	871.5 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1119051

