This page curates AI-related papers published worldwide. All content is summarized with Google Gemini, and the page is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
VRoPE: Rotary Position Embedding for Video Large Language Models
Created by
Haebom
Author
Zikang Liu, Longteng Guo, Yepeng Tang, Tongtian Yue, Junxian Cai, Kai Ma, Qingbin Liu, Xi Chen, Jing Liu
Outline
This paper proposes Video Rotary Position Embedding (VRoPE), a video-adapted extension of Rotary Position Embedding (RoPE). The existing RoPE-3D variant encodes the spatial and temporal dimensions separately, which introduces positional bias into the attention distribution and causes confusion at video-to-text transitions. VRoPE mitigates this bias with a more balanced encoding strategy that yields a uniform distribution of spatial attention, and it restructures position indices so that video and text tokens transition smoothly. Experiments across multiple models show that VRoPE significantly outperforms existing RoPE variants on video understanding, temporal reasoning, and retrieval tasks. The source code is available at https://github.com/johncaged/VRoPE.
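To make the two ingredients above concrete, here is a minimal PyTorch sketch of (1) a RoPE-3D-style factorized rotary embedding over (t, h, w) coordinates and (2) a position-index layout in which text tokens continue from the video block rather than jumping, so rotary angles stay smooth at the modality boundary. All function names, the chunking scheme, and the text-index convention are illustrative assumptions for this summary, not the authors' implementation (see the linked repository for that).

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE on the last dim of x (seq, d), given scalar positions (seq,)."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # per-pair frequencies
    ang = pos[:, None].float() * freqs[None, :]                        # (seq, half) rotation angles
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """RoPE-3D-style factorization: split the head dim into three chunks and
    rotate each chunk by one axis of the (t, h, w) coordinate.
    The equal chunk split is an illustrative choice, not the paper's exact one."""
    c = x.shape[-1] // 3 // 2 * 2  # even chunk size per axis
    parts = [rope_1d(x[..., i * c:(i + 1) * c], p) for i, p in enumerate((t, h, w))]
    return torch.cat(parts + [x[..., 3 * c:]], dim=-1)  # leave any remainder unrotated

def mixed_positions(frames: int, height: int, width: int, num_text: int):
    """Coordinates for a video block followed by text tokens. Text indices
    continue right after the video so rotary angles change smoothly at the
    video-to-text boundary instead of jumping."""
    t, h, w = torch.meshgrid(torch.arange(frames), torch.arange(height),
                             torch.arange(width), indexing="ij")
    t, h, w = t.flatten(), h.flatten(), w.flatten()
    # Assumption for illustration: each text token shares one scalar index on
    # all three axes, starting just after the last video frame.
    text = frames + torch.arange(num_text)
    return (torch.cat([t, text]), torch.cat([h, text]), torch.cat([w, text]))

# Usage: 4 frames of a 2x2 patch grid followed by 8 text tokens,
# applied to a 48-dim attention head (16 dims rotated per axis).
t, h, w = mixed_positions(frames=4, height=2, width=2, num_text=8)
q = torch.randn(t.numel(), 48)
q_rot = rope_3d(q, t, h, w)
```

What distinguishes VRoPE from this plain RoPE-3D baseline, per the summary above, is how it rebalances the spatial components so attention is not biased toward particular positions; the exact formulation is in the linked repository.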