What's Taboo for You? : An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Ferrara, A.; Picascia, S.; Pinnavaia, L.; Ranitovic, V.; Rocchetti, E.; Tuveri, A.

doi:10.48550/arXiv.2507.23319

Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.
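
As a rough illustration of what measuring a "sensitivity shift" could look like, the sketch below tallies label transitions between original sentences and their paraphrases and reports the share that moved to a less sensitive class. The label set, its ordering, the function name `shift_summary`, and the toy data are all assumptions made for illustration; they are not the taxonomy, code, or results of the paper.

```python
from collections import Counter

# Hypothetical labels ordered from least to most sensitive.
# This ordering is an assumption, not the paper's taxonomy.
LEVELS = ["polite", "neutral", "derogatory", "taboo"]
RANK = {label: i for i, label in enumerate(LEVELS)}


def shift_summary(before, after):
    """Count label transitions and the share of items whose label went down or up."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    transitions = Counter(zip(before, after))
    total = len(before)
    lowered = sum(n for (b, a), n in transitions.items() if RANK[a] < RANK[b])
    raised = sum(n for (b, a), n in transitions.items() if RANK[a] > RANK[b])
    return {
        "transitions": dict(transitions),
        "lowered_share": lowered / total if total else 0.0,
        "raised_share": raised / total if total else 0.0,
    }


# Toy labels invented for this example: sensitivity class of each sentence
# before and after paraphrasing.
before = ["taboo", "derogatory", "neutral", "taboo"]
after = ["derogatory", "neutral", "neutral", "taboo"]
summary = shift_summary(before, after)
print(summary["lowered_share"], summary["raised_share"])
```

A transition table of this kind separates paraphrases that were softened from those left in the same class, which is the distinction the abstract draws when it speaks of content being moderated toward less sensitive classes.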

What's Taboo for You? : An Empirical Evaluation of LLMs Behavior Toward Sensitive Content / A. Ferrara, S. Picascia, L. Pinnavaia, V. Ranitovic, E. Rocchetti, A. Tuveri. - (2025 Jul 31). [10.48550/arXiv.2507.23319]




	Scientific-disciplinary sectors of the preprint (valid from 9 May 2024)
	
				Sector INFO-01/A - Computer Science
			
	Preprint deposit date
	
				31 Jul 2025
			
	DOI
	
				https://dx.doi.org/10.48550/arXiv.2507.23319
			
	Preprint URL
	
				https://arxiv.org/abs/2507.23319
			
	Appears in types:
	
				24 - Pre-print

Files in this item:

File	Size	Format
2507.23319v1.pdf (open access; type: pre-print, manuscript submitted to the publisher; license: Creative Commons)	762.03 kB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1189256


IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca