
VATE: A Large Scale Multimodal Spontaneous Dataset for Affective Evaluation / F. Agnelli, G. Grossi, A. D'Amelio, M. De Paoli, R. Lanzarotti (Lecture Notes in Computer Science). In: Computer Vision – ECCV 2024 Workshops / edited by A. Del Bue, C. Canton, J. Pont-Tuset, T. Tommasi. Springer Science and Business Media Deutschland GmbH, 2025. ISBN 9783031915741. pp. 204-222. Presented at the workshops held in conjunction with the 18th European Conference on Computer Vision (ECCV 2024), Milan, 2024. DOI: 10.1007/978-3-031-91575-8_13.

VATE: A Large Scale Multimodal Spontaneous Dataset for Affective Evaluation

F. Agnelli (first author); G. Grossi (second author); A. D'Amelio; R. Lanzarotti (last author)
2025

Abstract

In this paper, we present VATE, the Video-Audio-Text for affective Evaluation dataset. VATE collects a wide variety of multimodal data exhibiting a multitude of spontaneous human affective states. It contains 21,871 raw videos together with voice recordings and text transcriptions from numerous emotion-evoking interviews. VATE is specifically designed for contrastive self-supervised representation learning of human affective states; it prioritises the quantity and quality of data over human labelling of emotions, which constitutes a highly subjective, often inconsistent and controversial aspect of modern affective computing. To highlight the usefulness of our proposal, we release a multimodal encoder employing a contrastive video-language-audio pre-training procedure carried out on the VATE dataset. Experimental results show that this model exhibits noticeably better few-shot generalization abilities than fully supervised baselines on several downstream tasks. Data and code are available at: https://github.com/FrancescoAgnelli3/VATE.
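As a rough illustration of the contrastive video-language-audio pre-training the abstract mentions, the sketch below combines pairwise symmetric InfoNCE losses over video, audio and text embeddings so that the three modalities of the same clip are pulled together and other clips in the batch act as negatives. This is a minimal sketch under stated assumptions, not the authors' released code: the PyTorch framing, the embedding dimension and the temperature value are all illustrative choices.

# Minimal sketch (illustrative, not the released VATE code) of a
# tri-modal contrastive objective: pairwise symmetric InfoNCE losses
# over video, audio and text embeddings of the same clips.
import torch
import torch.nn.functional as F

def infonce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two (batch, dim) embedding batches.

    Matching rows are positive pairs; every other row in the batch
    serves as a negative. Temperature 0.07 is an assumed value.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature                 # (batch, batch) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Cross-entropy in both directions (a -> b and b -> a), averaged.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(z_video: torch.Tensor, z_audio: torch.Tensor, z_text: torch.Tensor) -> torch.Tensor:
    """Sum the pairwise contrastive losses over the three modality pairs."""
    return (infonce(z_video, z_text)
            + infonce(z_video, z_audio)
            + infonce(z_audio, z_text))

if __name__ == "__main__":
    # Toy usage with random "encoder outputs" for a batch of 8 clips.
    b, d = 8, 256
    loss = trimodal_loss(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
    print(f"tri-modal contrastive loss: {loss.item():.4f}")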
affective evaluation; contrastive learning; dataset; unsupervised learning
Academic field: INFO-01/A - Computer Science
Book Part (author)
Files in this record:
File: 978-3-031-91575-8_13.pdf
Access: restricted
Type: Publisher's version/PDF
License: No license
Size: 2.45 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1171997
Citations
  • PMC: not available
  • Scopus: 1
  • Web of Science: 0
  • OpenAlex: 0