Investigating Vision-Language Models Biometric Capabilities Via Sequence-Based Predictions / R. Donida Labati, A. Ferrara, S. Picascia, V. Piuri, E. Rocchetti, F. Scotti. - (2025 Aug 06). [10.2139/ssrn.5381090]

Investigating Vision-Language Models Biometric Capabilities Via Sequence-Based Predictions

R. Donida Labati (first author); A. Ferrara (second author); S. Picascia; V. Piuri; E. Rocchetti (penultimate author); F. Scotti (last author)
2025

Abstract

Vision-Language Models (VLMs) have emerged as powerful tools capable of jointly processing visual and textual information, creating opportunities to replace specialized models in domains such as biometrics. However, as this application remains largely underexplored, prevailing evaluation methods rely on closed-answer Multiple-Choice Questions (MCQ) and parse generated text to extract predictions. To provide an evaluation that better aligns with real-world biometric needs, we introduce an evaluation protocol that bypasses text generation entirely, producing sequence-based predictions directly from output log-probabilities. This approach enables probability-based scoring, which is essential for computing standard biometric metrics. We apply this method to assess the zero-shot capabilities of the Gemma 3 family on face verification, age/gender estimation, and attribute classification, benchmarking them against specialized systems. Furthermore, we demonstrate that traditional MCQ-based evaluations consistently underestimate VLM performance, whereas our log-probability scoring approach better captures the identity-specific capabilities of VLMs. Our results show that Gemma 3 models achieve strong performance on classification tasks but struggle with regression, highlighting that a robust methodology is critical to accurately assess the true capabilities and limitations of VLMs in biometrics.
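The abstract does not give the scoring procedure in detail, so the following is only a minimal sketch of what scoring candidate answers from log-probabilities could look like, not the authors' implementation. It assumes that each candidate answer is scored by the sum of its per-token log-probabilities and that the candidates are then normalized with a softmax. The function names and all numeric values are hypothetical, and no real model is called.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a whole answer: the sum of its per-token log-probabilities."""
    return sum(token_logprobs)


def candidate_probabilities(candidates):
    """Normalize candidate answers into a probability distribution (softmax).

    `candidates` maps each answer string to the per-token log-probabilities
    a model assigned to it (hypothetical input, not a real model output).
    """
    scores = {answer: sequence_logprob(lps) for answer, lps in candidates.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {answer: math.exp(s - top) for answer, s in scores.items()}
    total = sum(weights.values())
    return {answer: w / total for answer, w in weights.items()}


# Toy log-probabilities for "Do these two images show the same person?"
verification = candidate_probabilities({"yes": [-0.2], "no": [-1.8]})
verification_score = verification["yes"]  # continuous score, usable with a threshold

# Multi-token answers are scored over the whole sequence, not only the first token.
attribute = candidate_probabilities({
    "wearing glasses": [-0.9, -0.4],
    "no glasses": [-0.3, -0.2],
})
predicted_attribute = max(attribute, key=attribute.get)
```

Under these assumptions, `verification_score` is a continuous value rather than a parsed "yes" or "no" string, so it can be compared against a sweep of decision thresholds, which is what threshold-based biometric metrics require. Whether to normalize by answer length is a design choice this sketch leaves out.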
vision-language models; biometrics; zero-shot evaluation
Sector INFO-01/A - Informatica (Computer Science)
6 Aug 2025
https://ssrn.com/abstract=5381090
Files in this item:
File: ssrn-5381090.pdf
Access: open access
Type: Pre-print (manuscript submitted to the publisher)
License: Publisher
Size: 306.85 kB
Format: Adobe PDF
Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1189257