F. Agnelli, G. Grossi, A. D'Amelio, M. De Paoli, R. Lanzarotti: VATE: A Large Scale Multimodal Spontaneous Dataset for Affective Evaluation. In: A. Del Bue, C. Canton, J. Pont-Tuset, T. Tommasi (eds.): Computer Vision – ECCV 2024 Workshops (Lecture Notes in Computer Science). Springer Science and Business Media Deutschland GmbH, 2025, pp. 204-222. ISBN 9783031915741. Presented at the workshops held in conjunction with the 18th European Conference on Computer Vision (ECCV 2024), Milan, 2024. DOI: 10.1007/978-3-031-91575-8_13.
VATE: A Large Scale Multimodal Spontaneous Dataset for Affective Evaluation
F. Agnelli (first author); G. Grossi (second author); A. D'Amelio; R. Lanzarotti (last author)
2025
Abstract
In this paper, we present VATE, the Video-Audio-Text for affective Evaluation dataset. VATE collects a wide variety of multimodal data exhibiting a multitude of spontaneous human affective states. It contains 21,871 raw videos, together with voice recordings and text transcriptions, from numerous emotion-evoking interviews. VATE is specifically designed for contrastive self-supervised representation learning of human affective states; it prioritises the quantity and quality of data over human labelling of emotions, which constitutes a highly subjective, often inconsistent, and controversial aspect of modern affective computing. To highlight the usefulness of our proposal, we release a multimodal encoder trained with a contrastive video-language-audio pre-training procedure carried out on the VATE dataset. Experimental results show that this model exhibits considerably better few-shot generalization abilities than fully supervised baselines on different downstream tasks. Data and code are available at: https://github.com/FrancescoAgnelli3/VATE.
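To make the training objective concrete, here is a minimal PyTorch sketch of the kind of symmetric contrastive objective commonly used for video-language-audio pre-training of this sort. The pairwise InfoNCE formulation, the temperature value, and the embedding dimensions below are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    a and b have shape (batch, dim); a[i] and b[i] embed the same clip in
    two different modalities, and all other rows in the batch act as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_contrastive_loss(video_emb, text_emb, audio_emb):
    """Average the symmetric InfoNCE loss over the three modality pairs."""
    return (info_nce(video_emb, text_emb)
            + info_nce(video_emb, audio_emb)
            + info_nce(text_emb, audio_emb)) / 3.0

# Toy usage: random tensors stand in for the outputs of three hypothetical
# modality encoders projected into a shared embedding space.
batch, dim = 8, 256
video, text, audio = (torch.randn(batch, dim) for _ in range(3))
print(f"toy pre-training loss: {trimodal_contrastive_loss(video, text, audio).item():.4f}")
```

In this formulation, each modality pair shares in-batch negatives; the actual VATE pre-training recipe may differ in its encoder architectures, pairing scheme, and temperature.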
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| 978-3-031-91575-8_13.pdf | Publisher's version/PDF | No license | 2.45 MB | Adobe PDF | Restricted access (request a copy to view/open) |