Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art

Abdul Khaliq, A.; Montanelli, S.

doi:10.1007/978-3-032-10192-1_30

The integration of large language models (LLMs) with computer vision (CV) has significantly advanced artificial intelligence (AI), enabling machines to better understand and analyze visual data. In particular, large vision-language models (LVLMs) have strengthened cybersecurity efforts, especially deepfake detection. This survey reviews recent developments in the use of LVLMs for deepfake detection, highlighting their emerging role in addressing the limitations of traditional unimodal techniques. We provide an overview of state-of-the-art LVLM architectures such as CLIP, BLIP-2, and LLaVA, and analyze how they enable cross-modal reasoning to identify inconsistencies between visual and textual content. The paper also summarizes key benchmark datasets, outlines ongoing challenges, and identifies open research directions, making it a valuable entry point for researchers exploring multimodal approaches to deepfake detection.

Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art / A. Abdul Khaliq, S. Montanelli (Lecture Notes in Computer Science). - In: ICIAP / edited by E. Rodolà, F. Galasso, I. Masi. - [s.l.] : Springer, 2026 Sep 15. - ISBN 978-3-032-10192-1. - pp. 364-375 (23rd International Conference on Image Analysis and Processing, Proceedings Part II: Rome, September 15-19, 2025) [10.1007/978-3-032-10192-1_30].

A. Abdul Khaliq (first author; Writing – Original Draft Preparation); S. Montanelli (last author; Writing – Review & Editing)

2026

Keywords: Computer Vision (CV); DeepFake detection; Large Vision Language Models (LVLMs)

Scientific-disciplinary sectors of the contribution (valid from 09/05/2024): Sector INFO-01/A - Informatica (Computer Science)

Publication date: 15 Sep 2026

DOI: https://dx.doi.org/10.1007/978-3-032-10192-1_30

Type: Book Part (author)

Appears in types: 03 - Contribution in volume

Files in this item:

File	Size	Format
978-3-032-10192-1 (1).pdf (restricted access; Type: Publisher's version/PDF; License: No license)	616.14 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1230676

IRIS Institutional Research Information System - AIR Archivio Istituzionale della Ricerca