A Study on Multimodal Foundation Models for Affective Video Prediction / F. Agnelli, M. Ditroia, G. Blandano, A. D'Amelio, O. Ghezzi, M. De Paoli, R. Lanzarotti (Lecture Notes in Computer Science). - In: Image Analysis and Processing – ICIAP 2025 / [edited by] E. Rodolà, F. Galasso, I. Masi. - [s.l.]: Springer Nature Switzerland, 15 Sep 2025. - ISBN 9783032101914. - pp. 224-235 (23rd International Conference, Rome, 2025) [DOI: 10.1007/978-3-032-10192-1_19].
A Study on Multimodal Foundation Models for Affective Video Prediction
F. Agnelli; G. Blandano; A. D'Amelio; R. Lanzarotti
2025
Abstract
This study investigates the impact of incorporating multimodal affective signals into foundation models, focusing on whether such integration enhances their ability to represent emotional content. We compare task-agnostic foundation models with those pretrained specifically on affective data to assess the benefits of affective pretraining. Additionally, we examine how multimodal pretraining influences the robustness and generalizability of affective representations, emphasizing the role of cross-modal learning. In particular, we explore how pretraining on different combinations of modalities—audio-text, audio-video, video-text, and all three—affects the quality of representations learned for each modality, especially when certain modalities are absent during downstream emotion recognition tasks. Our experiments across multiple benchmarks demonstrate that incorporating additional modalities during pretraining consistently improves performance, even when those modalities are unavailable at inference time. Notably, models pretrained on all three modalities achieve the highest generalizability, showing significant gains in both low-resource settings and related tasks such as humor detection. These findings highlight the value of cross-modal knowledge transfer and underscore the importance of modality-rich pretraining for building robust, adaptable emotion-aware systems.
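To give a concrete picture of the missing-modality setting mentioned in the abstract, the following is a minimal, hypothetical sketch (not the authors' implementation): a late-fusion classifier that projects each modality into a shared space and averages whichever modalities are supplied, so a modality such as video can simply be omitted at inference time. All class names, feature dimensions, and the fusion strategy are illustrative assumptions.

```python
# Minimal sketch (illustrative only): evaluating an emotion classifier
# when one modality is unavailable at inference time.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Project each modality into a shared space and average whatever
    modalities are present, so any subset can be dropped at inference."""

    def __init__(self, dims=None, hidden=64, num_classes=7):
        super().__init__()
        if dims is None:
            # Hypothetical per-modality feature dimensions.
            dims = {"audio": 128, "video": 256, "text": 768}
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) tensor;
        # absent modalities are simply omitted from the dict.
        present = [torch.relu(self.proj[m](x)) for m, x in feats.items()]
        fused = torch.stack(present, dim=0).mean(dim=0)
        return self.head(fused)


model = LateFusionClassifier()
audio = torch.randn(4, 128)
text = torch.randn(4, 768)
# Simulate a downstream setting where video is missing at inference time.
logits = model({"audio": audio, "text": text})
print(logits.shape)  # torch.Size([4, 7])
```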
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| 978-3-032-10192-1_19.pdf | Publisher's version/PDF | No license | 1.62 MB | Adobe PDF | Restricted (copy available on request) |