Causal Mediation Analysis for Interpreting Large Language Models / E. Rocchetti, A. Ferrara (CEUR Workshop Proceedings). - In: SEBD 2024: Symposium on Advanced Database Systems 2024 / edited by M. Atzori, P. Ciaccia, M. Ceci, F. Mandreoli, D. Malerba, M. Sanguinetti, A. Pellicani, F. Motta. - [s.l.]: CEUR-WS, 2024. - pp. 585-594. Presented at SEBD 2024, Symposium on Advanced Database Systems 2024, held in Villasimius in 2024.
Causal Mediation Analysis for Interpreting Large Language Models
E. Rocchetti (first author); A. Ferrara (second author)
2024
Abstract
Understanding the inner workings of Large Language Models (LLMs) is crucial for ensuring safer development practices and fostering trust in their predictions, particularly in sensitive applications. Causal Mediation Analysis (CMA) is a causality framework well suited to this scenario: it provides a mechanistic interpretation of the behaviour of LLM components and assesses a specific type of knowledge in the model (e.g., the presence of gender bias). This study discusses the challenges and potential pathways in applying CMA to open up LLMs' black boxes. Through three exemplary case studies from the literature, we show the unique insights CMA can provide, and we elaborate on the inherent challenges and opportunities this approach presents. The challenges range from the influence of model architecture on prompt viability to the complexities of ensuring metric comparability across studies; the opportunities lie in dissecting LLMs' knowledge by extracting the specific domains of knowledge activated during processing. Our discussion aims to provide a comprehensive view of CMA, focusing on the essential aspects that equip researchers to craft effective CMA experiments tailored to interpretability objectives.
File | Type | Access | Size | Format
---|---|---|---|---
CEUR Vol 3741 Paper 39.pdf | Publisher's version/PDF | Open access | 561.81 kB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.