TRecViT: A Recurrent Video Transformer

[Submitted on 18 Dec 2024]

View a PDF of the paper titled TRecViT: A Recurrent Video Transformer, by Viorica Pătrăucean and 12 other authors

Abstract: We propose a novel block for video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, performs well on sparse and dense tasks, trained in supervised or self-supervised regimes. Notably, our model is causal and outperforms or is on par with a pure attention model, ViViT-L, on large-scale video datasets (SSv2, Kinetics400), while having $3\times$ fewer parameters, a $12\times$ smaller memory footprint, and a $5\times$ lower FLOPs count. Code and checkpoints will be made available online at this https URL.
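The abstract describes the factorised block only at a high level. Below is a minimal, illustrative sketch of such a time-space-channel factorised block, assuming tokens shaped (batch, time, space, channels). The gated recurrence is a simplified diagonal linear RNN standing in for the paper's gated LRU, and all class and parameter names are hypothetical; this is not the authors' released implementation.

```python
# Sketch of a time-space-channel factorised block in the spirit of TRecViT.
# Assumptions: simplified gated diagonal recurrence instead of the exact LRU;
# illustrative names (GatedLinearRecurrence, FactorisedBlock), not from the paper's code.
import torch
import torch.nn as nn


class GatedLinearRecurrence(nn.Module):
    """Causal gated diagonal linear recurrence over the time axis (LRU stand-in)."""

    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.log_decay = nn.Parameter(torch.zeros(dim))  # per-channel decay via sigmoid
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -- sequential scan keeps the model causal in time.
        decay = torch.sigmoid(self.log_decay)
        u = self.in_proj(x) * torch.sigmoid(self.gate(x))
        h = torch.zeros_like(x[:, 0])
        outs = []
        for t in range(x.shape[1]):
            h = decay * h + (1.0 - decay) * u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class FactorisedBlock(nn.Module):
    """Time mixing (recurrence), space mixing (self-attention), channel mixing (MLP)."""

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.time_mix = GatedLinearRecurrence(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.space_mix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_c = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Linear(mlp_ratio * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim)
        b, t, s, d = x.shape
        # 1) Recurrence over time, independently at each spatial location.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.time_mix(self.norm_t(xt))
        x = xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        # 2) Self-attention over space, independently at each time step.
        xs = x.reshape(b * t, s, d)
        y = self.norm_s(xs)
        xs = xs + self.space_mix(y, y, y, need_weights=False)[0]
        # 3) MLP over channels at every token.
        xs = xs + self.mlp(self.norm_c(xs))
        return xs.reshape(b, t, s, d)
```

Under these assumptions, only the time-mixing step is recurrent; attention stays within a frame, which is one way to read the abstract's claim of lower memory and FLOPs relative to a pure space-time attention model such as ViViT-L.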

Submission history

From: Viorica Pătrăucean
[v1]
Wed, 18 Dec 2024 19:44:30 UTC (15,102 KB)
