Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art / A. Abdul Khaliq, S. Montanelli (Lecture Notes in Computer Science). - In: ICIAP / [edited by] E. Rodolà, F. Galasso, I. Masi. - [s.l.] : Springer, 2026 Sep 15. - ISBN 978-3-032-10192-1. - pp. 364-375 (23rd International Conference on Image Analysis and Processing, Proceedings Part II: September 15-19, 2025, Rome) [10.1007/978-3-032-10192-1_30].
Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art
A. Abdul Khaliq (first author; Writing – Original Draft Preparation); S. Montanelli (last author; Writing – Review & Editing)
2026
Abstract
The integration of large language models (LLMs) with computer vision (CV) has significantly advanced artificial intelligence (AI), enabling machines to better understand and analyze visual data. In particular, large vision-language models (LVLMs) have strengthened cybersecurity efforts, especially in deepfake detection. This survey reviews recent developments in the use of LVLMs for deepfake detection, highlighting their emerging role in addressing the limitations of traditional unimodal techniques. We provide an overview of state-of-the-art LVLM architectures such as CLIP, BLIP-2, and LLaVA, and analyze how they enable cross-modal reasoning to identify inconsistencies between visual and textual content. The paper also summarizes key benchmark datasets, outlines ongoing challenges, and identifies open research directions, making it a valuable entry point for researchers exploring multimodal approaches to deepfake detection.




