A Study on Multimodal Foundation Models for Affective Video Prediction / F. Agnelli, M. Ditroia, G. Blandano, A. D'Amelio, O. Ghezzi, M. De Paoli, R. Lanzarotti (Lecture Notes in Computer Science). - In: Image Analysis and Processing – ICIAP 2025 / [edited by] E. Rodolà, F. Galasso, I. Masi. - [s.l.]: Springer Nature Switzerland, 15 Sep 2025. - ISBN 9783032101914. - pp. 224-235 (23rd International Conference, Rome, 2025) [DOI: 10.1007/978-3-032-10192-1_19].
A Study on Multimodal Foundation Models for Affective Video Prediction
F. Agnelli; G. Blandano; A. D'Amelio; R. Lanzarotti
2025
Abstract
This study investigates the impact of incorporating multimodal affective signals into foundation models, focusing on whether such integration enhances their ability to represent emotional content. We compare task-agnostic foundation models with those pretrained specifically on affective data to assess the benefits of affective pretraining. Additionally, we examine how multimodal pretraining influences the robustness and generalizability of affective representations, emphasizing the role of cross-modal learning. In particular, we explore how pretraining on different combinations of modalities—audio-text, audio-video, video-text, and all three—affects the quality of representations learned for each modality, especially when certain modalities are absent during downstream emotion recognition tasks. Our experiments across multiple benchmarks demonstrate that incorporating additional modalities during pretraining consistently improves performance, even when those modalities are unavailable at inference time. Notably, models pretrained on all three modalities achieve the highest generalizability, showing significant gains in both low-resource settings and related tasks such as humor detection. These findings highlight the value of cross-modal knowledge transfer and underscore the importance of modality-rich pretraining for building robust, adaptable emotion-aware systems.
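To give a concrete picture of the missing-modality setting mentioned in the abstract, the following is a minimal, hypothetical sketch (not the authors' implementation): a late-fusion classifier that projects each modality into a shared space and averages whichever modalities are supplied, so a modality such as video can simply be omitted at inference time. All class names, feature dimensions, and the fusion strategy are illustrative assumptions.

```python
# Minimal sketch (illustrative only): evaluating an emotion classifier
# when one modality is unavailable at inference time.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    """Project each modality into a shared space and average whatever
    modalities are present, so any subset can be dropped at inference."""

    def __init__(self, dims=None, hidden=64, num_classes=7):
        super().__init__()
        if dims is None:
            # Hypothetical per-modality feature dimensions.
            dims = {"audio": 128, "video": 256, "text": 768}
        self.proj = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in dims.items()})
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, feats):
        # feats: dict mapping modality name -> (batch, dim) tensor;
        # absent modalities are simply omitted from the dict.
        present = [torch.relu(self.proj[m](x)) for m, x in feats.items()]
        fused = torch.stack(present, dim=0).mean(dim=0)
        return self.head(fused)


model = LateFusionClassifier()
audio = torch.randn(4, 128)
text = torch.randn(4, 768)
# Simulate a downstream setting where video is missing at inference time.
logits = model({"audio": audio, "text": text})
print(logits.shape)  # torch.Size([4, 7])
```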
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| 978-3-032-10192-1_19.pdf | Publisher's version/PDF | No license | 1.62 MB | Adobe PDF | Restricted (copy available on request) |