Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VRoPE: Rotary Position Embedding for Video Large Language Models

Created by
  • Haebom

Author

Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu

Outline

This paper proposes Video Rotary Position Embedding (VRoPE), an adaptation of Rotary Position Embedding (RoPE) for video large language models. Existing RoPE-3D variants encode the spatial and temporal dimensions separately, which introduces positional bias into the attention distribution and disrupts the transition between video and text tokens. VRoPE mitigates this bias with a more balanced encoding strategy that yields a more uniform distribution of spatial attention, and it restructures the position indices so that video tokens transition smoothly into text tokens. Experiments across multiple models show that VRoPE significantly outperforms existing RoPE variants on video understanding, temporal reasoning, and retrieval tasks. The source code is available at https://github.com/johncaged/VRoPE.
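To make the two ideas in the summary concrete, below is a minimal, illustrative sketch of how per-token 3D position indices might be constructed for a [text prefix][video][text suffix] sequence: spatial coordinates are centered so no corner of the frame is systematically favored, and the temporal index continues the running text position so the video-text boundary does not jump in position space. This is an assumption-laden sketch of the general idea, not the paper's exact formulation; the function name `build_vrope_style_indices` and all shapes are hypothetical.

```python
import torch

def build_vrope_style_indices(num_prefix_text, t_frames, h_patches, w_patches, num_suffix_text):
    """Illustrative sketch (not the paper's exact formulas).

    Returns a (num_tokens, 3) tensor of (t, h, w) position indices:
      * text tokens reuse one running 1D position on all three axes;
      * video spatial coordinates are centered (symmetric around 0),
        so the mean spatial offset is ~0 ("balanced" encoding);
      * the video temporal index continues from the text prefix, and the
        text suffix resumes right after the last frame, keeping the
        video-to-text transition smooth.
    """
    # Text prefix: ordinary 1D positions replicated over the three axes.
    prefix_pos = torch.arange(num_prefix_text, dtype=torch.float32)
    prefix = torch.stack([prefix_pos, prefix_pos, prefix_pos], dim=-1)

    # Video tokens: temporal index continues from the prefix,
    # spatial indices are centered around 0.
    t_idx = num_prefix_text + torch.arange(t_frames, dtype=torch.float32)
    h_idx = torch.arange(h_patches, dtype=torch.float32) - (h_patches - 1) / 2.0
    w_idx = torch.arange(w_patches, dtype=torch.float32) - (w_patches - 1) / 2.0
    t_grid, h_grid, w_grid = torch.meshgrid(t_idx, h_idx, w_idx, indexing="ij")
    video = torch.stack([t_grid, h_grid, w_grid], dim=-1).reshape(-1, 3)

    # Text suffix: resume 1D counting right after the last frame index.
    start = num_prefix_text + t_frames
    suffix_pos = start + torch.arange(num_suffix_text, dtype=torch.float32)
    suffix = torch.stack([suffix_pos, suffix_pos, suffix_pos], dim=-1)

    return torch.cat([prefix, video, suffix], dim=0)

if __name__ == "__main__":
    pos = build_vrope_style_indices(num_prefix_text=5, t_frames=2,
                                    h_patches=2, w_patches=2, num_suffix_text=4)
    print(pos.shape)  # torch.Size([5 + 2*2*2 + 4, 3])
```

In practice these (t, h, w) indices would feed the rotary angles of the attention layers; the sketch only shows how a balanced, transition-friendly index layout can be assembled.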

Takeaways, Limitations

Takeaways:
  • A novel approach to the positional encoding problem in video LLMs
  • Mitigates positional bias in the attention distribution and improves video-to-text transitions
  • Improves performance on video understanding, temporal reasoning, and retrieval tasks
  • Demonstrates superior performance over existing RoPE variants
Limitations:
  • Further research is needed on the generalization of the proposed method
  • Performance evaluation on a wider range of video datasets is needed
  • Computational cost and memory usage require analysis