
A Study on Multimodal Foundation Models for Affective Video Prediction / F. Agnelli, M. Ditroia, G. Blandano, A. D'Amelio, O. Ghezzi, M. De Paoli, R. Lanzarotti (Lecture Notes in Computer Science). - In: Image Analysis and Processing – ICIAP 2025 / edited by E. Rodolà, F. Galasso, I. Masi. - [s.l.]: Springer Nature Switzerland, 15 Sep 2025. - ISBN 9783032101914. - pp. 224-235 (23rd International Conference, Rome 2025) [10.1007/978-3-032-10192-1_19].

A Study on Multimodal Foundation Models for Affective Video Prediction

F. Agnelli; G. Blandano; A. D'Amelio; R. Lanzarotti
2025

Abstract

This study investigates the impact of incorporating multimodal affective signals into foundation models, focusing on whether such integration enhances their ability to represent emotional content. We compare task-agnostic foundation models with those pretrained specifically on affective data to assess the benefits of affective pretraining. Additionally, we examine how multimodal pretraining influences the robustness and generalizability of affective representations, emphasizing the role of cross-modal learning. In particular, we explore how pretraining on different combinations of modalities—audio-text, audio-video, video-text, and all three—affects the quality of representations learned for each modality, especially when certain modalities are absent during downstream emotion recognition tasks. Our experiments across multiple benchmarks demonstrate that incorporating additional modalities during pretraining consistently improves performance, even when those modalities are unavailable at inference time. Notably, models pretrained on all three modalities achieve the highest generalizability, showing significant gains in both low-resource settings and related tasks such as humor detection. These findings highlight the value of cross-modal knowledge transfer and underscore the importance of modality-rich pretraining for building robust, adaptable emotion-aware systems.
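The keywords below list contrastive learning among the techniques involved. As a rough illustration of how cross-modal pretraining of this kind typically aligns embeddings from two modalities (a generic sketch, not the authors' actual training objective; the function name, temperature value, and NumPy formulation are assumptions for exposition), a symmetric InfoNCE loss over paired audio and text embeddings could look like:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of modality embeddings.

    z_a, z_b: (batch, dim) arrays of paired embeddings (e.g. audio and text
    for the same clip). Matching rows are positives; every other pairing in
    the batch serves as a negative.
    """
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(z_a))            # positive pairs sit on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the a->b and b->a retrieval directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing such a loss pulls representations of the same clip together across modalities, which is one mechanism by which knowledge from a modality seen only during pretraining can still shape the representations used when that modality is absent at inference time.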
Affective computing; Multimodal learning; Cross-modal learning; Foundation models; Emotion recognition; Contrastive learning
Settore INFO-01/A - Informatica
15 Sep 2025
Book Part (author)
Files in this record:
File: 978-3-032-10192-1_19.pdf
Type: Publisher's version/PDF
License: none
Size: 1.62 MB
Format: Adobe PDF
Access: restricted (copy available on request)

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2434/1210236
Citations
  • Scopus 0
  • OpenAlex 0