
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

OmniVec2 -- A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning

Created by
  • Haebom

Author

Siddharth Srivastava, Gaurav Sharma

Outline

In this paper, we present a novel multi-modality, multi-task network and an associated learning algorithm capable of processing data from around 12 modalities: images, video, audio, text, depth, point clouds, time series, tabular data, graphs, X-ray, infrared, IMU, and hyperspectral data. The proposed method projects data from the different modalities into a unified embedding space using modality-specific tokenizers, a shared transformer architecture, and cross-attention mechanisms. Multi-modality, multi-task scenarios are handled by attaching modality-specific task heads for the different tasks in each modality. We propose a novel pre-training strategy with iterative modality switching to initialize the network, as well as a learning algorithm that trades off fully joint training over all modalities against training on two modalities at a time. Comprehensive evaluations on 25 datasets spanning 12 modalities demonstrate state-of-the-art performance, validating the effectiveness of the proposed architecture, pre-training strategy, and adaptive multi-task learning.
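The overall data flow described above can be sketched structurally. This is a minimal illustration, not the paper's implementation: the embedding dimension, the single-attention-pass stand-in for the shared transformer, and the pairwise modality-switching schedule are all assumptions made here for clarity.

```python
import numpy as np
from itertools import combinations, cycle

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (assumed value, not from the paper)

# Modality-specific tokenizers: each projects its raw feature
# dimension into the shared embedding space (input dims are illustrative).
tokenizers = {
    "image": rng.standard_normal((2048, D)) * 0.01,
    "text":  rng.standard_normal((300, D)) * 0.01,
    "audio": rng.standard_normal((128, D)) * 0.01,
}

def tokenize(modality, x):
    """Project modality features of shape (n_tokens, d_in) into the shared space."""
    return x @ tokenizers[modality]

def shared_encoder(tokens):
    """Stand-in for the shared transformer: one unparameterized self-attention pass."""
    scores = tokens @ tokens.T / np.sqrt(D)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ tokens

# Modality-specific task head (here: a 10-class classifier over pooled tokens).
task_heads = {"image_cls": rng.standard_normal((D, 10)) * 0.01}

def forward(modality, x, head):
    h = shared_encoder(tokenize(modality, x))
    return h.mean(axis=0) @ task_heads[head]   # pool tokens, apply task head

def modality_switch_schedule(modalities, n_steps):
    """One plausible reading of iterative modality switching: cycle pre-training
    steps over pairs of modalities rather than training all jointly."""
    pairs = cycle(combinations(modalities, 2))
    return [next(pairs) for _ in range(n_steps)]

logits = forward("image", rng.standard_normal((16, 2048)), "image_cls")
print(logits.shape)  # (10,)
print(modality_switch_schedule(["image", "text", "audio"], 4))
```

The point of the sketch is the shape of the design: per-modality tokenizers are the only modality-specific encoders on the input side, so adding a modality means adding one projection and (optionally) one task head, while the shared encoder weights are reused everywhere.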

Takeaways, Limitations

Takeaways:
  • Presents a new architecture that effectively integrates and processes data from diverse modalities
  • Offers an effective approach to multi-modality, multi-task problems
  • Demonstrates the effectiveness of the proposed pre-training strategy and learning algorithm
  • Achieves state-of-the-art performance across diverse datasets
Limitations:
  • Lacks a detailed analysis of the computational cost and complexity of the proposed method
  • Potential overfitting to certain modalities
  • Lacks specific information about the 25 datasets used (size, distribution, etc.)
  • Applicability to real-world settings requires further research