IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP / E. Rocchetti, A. Ferrara. - (2026 Jun 11). [10.48550/arXiv.2606.13720]

Refusal Beyond a Single Direction: A Preliminary Comparison of Diff-in-Means and INLP

E. Rocchetti (first author); A. Ferrara (last author)

2026

Abstract

Arditi et al. (2024) have shown that refusal in safety fine-tuned chat models is mediated by a single linear direction in the residual stream, recoverable by a difference-in-means (DiM) of harmful and harmless activations. We compare DiM-based interventions (activation addition and directional ablation) with two interventions derived from Iterative Nullspace Projection (INLP) -- nullspace projection and counterfactual flipping -- on five open-weight chat models, asking whether INLP can match DiM at steering refusal and whether its richer parameterisation yields more tweakable interventions. INLP counterfactual flipping is competitive with DiM directional ablation on refusal suppression, while nullspace projection is consistently weaker. Restricting INLP to the leading directions of the extracted subspace preserves most of the suppression effect at near-baseline perplexity, giving a tunable capability. Geometrically, the two INLP interventions land in qualitatively different regions of activation space: nullspace projection collapses transformed activations between the harmful and harmless clusters, while counterfactual flipping moves them into the opposite cluster, suggesting that the model encodes the absence of a concept differently from its opposite -- an intriguing distinction that warrants further investigation in future work.
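The interventions named in the abstract can be illustrated on toy vectors. The following is a minimal sketch, not the authors' implementation: it uses plain Python lists instead of real residual-stream activations, and all function names (`diff_in_means`, `ablate`, `add_direction`, `reflect`) are hypothetical. In particular, `reflect` is only an assumed stand-in for counterfactual flipping (a mirror reflection along one direction through the origin), and removing a single direction is used to stand in for nullspace projection; in INLP the directions would come from iteratively trained linear classifiers.

```python
from statistics import fmean


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Rescale v to unit Euclidean norm."""
    norm = dot(v, v) ** 0.5
    return [a / norm for a in v]


def diff_in_means(harmful, harmless):
    """DiM direction: mean harmful activation minus mean harmless activation."""
    mu_harmful = [fmean(col) for col in zip(*harmful)]
    mu_harmless = [fmean(col) for col in zip(*harmless)]
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def ablate(x, direction):
    """Directional ablation: remove the component of x along the direction."""
    r_hat = unit(direction)
    coeff = dot(x, r_hat)
    return [a - coeff * b for a, b in zip(x, r_hat)]


def add_direction(x, direction, alpha=1.0):
    """Activation addition: shift x along the direction by a factor alpha."""
    return [a + alpha * b for a, b in zip(x, direction)]


def reflect(x, direction):
    """Assumed stand-in for flipping: mirror x along the direction (through the origin)."""
    r_hat = unit(direction)
    coeff = dot(x, r_hat)
    return [a - 2 * coeff * b for a, b in zip(x, r_hat)]


# Toy 2-D "activations": harmful cluster around x = +2, harmless around x = -2.
harmful = [[2.0, 1.0], [2.0, -1.0]]
harmless = [[-2.0, 1.0], [-2.0, -1.0]]

r = diff_in_means(harmful, harmless)       # [4.0, 0.0]
ablated = ablate(harmful[0], r)            # [0.0, 1.0]: between the clusters
flipped = reflect(harmful[0], r)           # [-2.0, 1.0]: inside the harmless cluster
steered = add_direction(harmless[0], r)    # [2.0, 1.0]: inside the harmful cluster
```

On this symmetric toy data, removing the direction leaves the point midway between the two clusters, while the reflection places it inside the opposite cluster, which mirrors the geometric contrast the abstract describes. Real activations are not centred at the origin, so this should be read as an illustration of the geometry only.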

Keywords: Computer Science - Artificial Intelligence
Scientific-disciplinary sectors of the preprint (valid from 9 May 2024): Settore INFO-01/A - Informatica (Computer Science)
Preprint deposit date: 11 Jun 2026
DOI: https://dx.doi.org/10.48550/arXiv.2606.13720
Preprint URL: http://arxiv.org/abs/2606.13720v1
Appears in types: 24 - Pre-print

Files in this item:

File	Size	Format
2606.13720v1.pdf (open access; description: paper; type: Pre-print (manuscript submitted to the publisher); licence: Creative Commons)	3.73 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1255895
