Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art / A. Abdul Khaliq, S. Montanelli (LECTURE NOTES IN COMPUTER SCIENCE). - In: ICIAP / [edited by] E. Rodolà, F. Galasso, I. Masi. - [s.l.] : Springer, 2026 Sep 15. - ISBN 978-3-032-10192-1. - pp. 364-375 (23rd International Conference on Image Analysis and Processing, Proceedings Part II: Roma, September 15–19, 2025) [10.1007/978-3-032-10192-1_30].

Multimodal Deepfake Detection with Large Vision-Language Models: The State of the Art

A. Abdul Khaliq
First author
Writing – Original Draft Preparation
S. Montanelli
Last author
Writing – Review & Editing
2026

Abstract

The integration of large language models (LLMs) with computer vision (CV) has significantly advanced artificial intelligence (AI), enabling machines to better understand and analyze visual data. In particular, large vision-language models (LVLMs) have strengthened cybersecurity efforts, especially in deepfake detection. This survey reviews recent developments in the use of LVLMs for deepfake detection, highlighting their emerging role in addressing the limitations of traditional unimodal techniques. We provide an overview of state-of-the-art LVLM architectures such as CLIP, BLIP-2, and LLaVA, and analyze how they enable cross-modal reasoning to identify inconsistencies between visual and textual content. The paper also summarizes key benchmark datasets, outlines ongoing challenges, and identifies open research directions, offering an entry point for researchers exploring multimodal approaches to deepfake detection.
Computer Vision (CV); DeepFake detection; Large Vision Language Models (LVLMs)
Sector INFO-01/A - Computer Science
15-Sep-2026
Book Part (author)
Files in this item:
File: 978-3-032-10192-1 (1).pdf (restricted access)
Type: Publisher's version/PDF
License: No license
Size: 616.14 kB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1230676