Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Scaling 4D Representations

Created by
  • Haebom

Author

João Carreira, Dilara Gokay, Michael King, Chuhan Zhang, Ignacio Rocco, Aravindh Mahendran, Thomas Albert Keck, Joseph Heyward, Skanda Koppula, Etienne Pot, Goker Erdogan, Yana Hasson, Yi Yang, Klaus Greff, Guillaume Le Moing, Sjoerd van Steenkiste, Daniel Zoran, Drew A. Hudson, Pedro Vélez, Luisa Polania, Luke Friedman, Chris Duvarney, Ross Goroshin, Kelsey Allen, Jacob Walker, Rishabh Kabra, Eric Aboussouan, Jennifer Sun, Thomas Kipf, Carl Doersch, Viorica Pătrăucean, Dima Damen, Pauline Luc, Mehdi S. M. Sajjadi, Andrew Zisserman

Outline

In this paper, we evaluate how purely self-supervised learning from video data scales, focusing on non-semantic spatial (3D) and temporal (+1D = 4D) vision tasks such as camera pose estimation, point and object tracking, and depth estimation. Unlike previous studies that mainly target semantic tasks (e.g., action classification, ImageNet classification), this work shows that performance on these 4D tasks improves consistently as a Transformer-based masked autoencoding (MAE) model trained on a large video dataset is scaled from 20M to 22B parameters. Comparative analysis against various state-of-the-art image and video models demonstrates the scaling advantage of the 4D representations, and the pretrained models are released in an open repository.
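To make the masked autoencoding idea concrete, here is a minimal, hypothetical sketch of video MAE in PyTorch: a clip is split into spatio-temporal tubelets, a large fraction is masked, only the visible tubelets are encoded, and a decoder reconstructs the masked pixels. All class names, layer choices, and hyperparameters below are illustrative assumptions, not the paper's actual architecture (which scales to 22B parameters).

```python
import torch
import torch.nn as nn

class VideoMAESketch(nn.Module):
    """Illustrative video masked autoencoder (not the paper's model)."""

    def __init__(self, tubelet_dim=2 * 16 * 16 * 3, embed_dim=512,
                 depth=4, num_heads=8, mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(tubelet_dim, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Linear(embed_dim, tubelet_dim)  # stand-in for a real decoder
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, tubelets):
        # tubelets: (batch, num_tubelets, tubelet_dim) flattened video patches
        B, N, _ = tubelets.shape
        x = self.patch_embed(tubelets)

        # Randomly keep only (1 - mask_ratio) of the tubelets for the encoder.
        num_keep = max(1, int(N * (1 - self.mask_ratio)))
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx = perm[:, :num_keep]
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        x_visible = torch.gather(x, 1, idx)

        # Encode visible tokens only; scatter them back among mask tokens for decoding.
        encoded = self.encoder(x_visible)
        full = self.mask_token.expand(B, N, -1)
        full = full.scatter(1, idx, encoded)

        # Reconstruct all tubelets; the loss is computed against the original pixels
        # (in practice only on the masked positions).
        return self.decoder(full)


# Usage sketch: regress reconstructed tubelets toward the original pixels.
model = VideoMAESketch()
video_tubelets = torch.randn(2, 196, 2 * 16 * 16 * 3)
loss = nn.functional.mse_loss(model(video_tubelets), video_tubelets)
```

The key design point the paper exploits is that this recipe needs no labels at all, so it can be trained on large raw video corpora and scaled in model size; the 4D evaluation tasks are handled by probing or fine-tuning the learned representations.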

Takeaways, Limitations

Takeaways:
We demonstrate the scalability of self-supervised learning from large-scale video datasets on spatio-temporal (4D) vision tasks.
We experimentally confirm the effectiveness of 4D representation learning with Transformer-based MAE models.
We present and release large-scale self-supervised video models with up to 22B parameters.
We show improved performance across a variety of video-based tasks.
Limitations:
The scalability of self-supervised learning to semantic tasks has not yet been sufficiently verified.
The proposed model may be computationally expensive.
Because the evaluation focuses on specific types of video data, further research on generalization performance is needed.