SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos / N. Osman, G. Camporese, P. Coscia, L. Ballan. - In: 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). [S.l.]: IEEE, 2021. - ISBN 978-1-6654-0191-3. - pp. 3430-3438. (Paper presented at the 18th IEEE/CVF International Conference on Computer Vision Workshops, held in Montreal in 2021.) [DOI: 10.1109/ICCVW54120.2021.00383]
SlowFast Rolling-Unrolling LSTMs for Action Anticipation in Egocentric Videos
P. Coscia (penultimate author)
2021
Abstract
Action anticipation in egocentric videos is a difficult task due to the inherently multi-modal nature of human actions. Additionally, some actions happen faster or slower than others depending on the actor or the surrounding context, which may vary from one occurrence to the next and lead to different predictions. Based on this idea, we build upon the RULSTM architecture, which is specifically designed for anticipating human actions, and propose a novel attention-based technique that simultaneously evaluates slow and fast features extracted from three different modalities, namely RGB, optical flow, and extracted objects. Two branches process information at different time scales, i.e., frame rates, and several fusion schemes are considered to improve prediction accuracy. We perform extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ datasets, and demonstrate that our technique consistently improves the results of the RULSTM architecture in terms of the Top-5 accuracy metric at different anticipation times.
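To make the two-branch idea concrete, below is a minimal PyTorch sketch of a slow/fast feature aggregator with attention-based fusion, written from the abstract alone rather than from the authors' code. It covers a single modality, uses naive temporal subsampling for the slow branch in place of RULSTM's rolling-unrolling mechanism, and all names and dimensions (SlowFastFusion, feat_dim, hidden_dim, the concatenation-based attention) are illustrative assumptions; only the number of action classes (2513) reflects EPIC-Kitchens-55.

```python
# Hypothetical sketch (not the authors' implementation) of a two-branch
# slow/fast feature aggregator with attention-based fusion, assuming
# pre-extracted per-frame features for one modality (e.g., RGB).
import torch
import torch.nn as nn

class SlowFastFusion(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=2513):
        super().__init__()
        self.slow_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fast_lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Attention weights over the two branches, computed from their
        # concatenated summaries (one plausible fusion scheme).
        self.attn = nn.Linear(2 * hidden_dim, 2)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats):
        # feats: (B, T, feat_dim) frame features at the fast frame rate.
        fast_out, _ = self.fast_lstm(feats)
        # Slow branch sees every other frame, i.e., half the frame rate.
        slow_out, _ = self.slow_lstm(feats[:, ::2, :])
        h_fast, h_slow = fast_out[:, -1], slow_out[:, -1]
        w = torch.softmax(self.attn(torch.cat([h_fast, h_slow], dim=-1)), dim=-1)
        # Attention-weighted blend of the two branch summaries.
        h = w[:, :1] * h_fast + w[:, 1:] * h_slow
        return self.classifier(h)

# Example: a batch of 4 clips, 14 timesteps of 1024-d features.
logits = SlowFastFusion()(torch.randn(4, 14, 1024))  # shape (4, 2513)
```

In the paper, analogous branch outputs would exist per modality (RGB, optical flow, objects), with the fusion scheme combining them; the single-modality version above only illustrates the slow/fast split and the attention blend.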