Upgraded to an ICML 2025 Oral! Shanghai AI Lab Collaborates with Fudan and HKU to Launch VideoRoPE++ for Longer Video Understanding

Shanghai AI Lab, Fudan University, and HKU jointly release VideoRoPE++, a position-encoding method for longer video comprehension, presented as an oral at ICML 2025.

First author Wei Xilin, a second-year PhD student at Fudan University’s School of Computer Science, specializes in multimodal large models and efficient long-context processing. Mentors: Zang Yuhang and Wang Jiaqi.

1. Background

While rotary position encoding (RoPE) and its variants are widely used for long-context processing, extending 1D RoPE to videos with complex spatiotemporal structures remains a challenge.
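For readers unfamiliar with RoPE, the sketch below shows the standard 1D mechanism the article refers to: each position rotates pairs of query/key features by position-dependent angles, so relative position enters attention through the difference of rotations. This is a generic illustration, not code from the VideoRoPE++ release.

```python
# Minimal 1D RoPE sketch (illustrative, not the VideoRoPE++ implementation).
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply 1D rotary position encoding to x of shape (seq_len, dim)."""
    dim = x.shape[-1]
    # Per-pair frequencies: theta_i = base^(-2i/dim)
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq_len, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                        # split features into pairs
    out = torch.empty_like(x)
    # Rotate each (x1, x2) pair by its position-dependent angle
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(8, 64)                  # 8 tokens, 64-dim head
q_rot = rope_1d(q, torch.arange(8))
```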

The work behind VideoRoPE++ starts from a comprehensive analysis that identifies five key properties needed to apply RoPE effectively to video, properties that previous work overlooked. It also introduces a new benchmark, V-RULER, which exposes how current RoPE variants break down under periodic interference.

Building on this analysis, VideoRoPE++ adopts a 3D structure that preserves spatiotemporal relationships: low-frequency temporal allocation to reduce oscillation, a diagonal layout for spatial symmetry, and adjustable temporal spacing to decouple the time axis from the spatial axes, as sketched below.
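The following is a hypothetical sketch of how such 3D position indices might be assigned, in the spirit of the description above rather than the authors' code. It assumes frames of h x w patches, a temporal scaling factor `delta` for the adjustable time interval, and spatial indices offset symmetrically around each frame's temporal index to realize the diagonal layout; all parameter names and values are illustrative.

```python
# Hypothetical 3D position-index assignment (illustrative sketch, not the
# released VideoRoPE++ code): every patch in frame t shares a scaled temporal
# index t * delta, and its spatial (y, x) indices are centered on that temporal
# index so video tokens stay aligned with the diagonal that text tokens follow.
import torch

def video_position_ids(num_frames: int, h: int, w: int, delta: float = 2.0,
                       start: float = 0.0) -> torch.Tensor:
    """Return (num_frames*h*w, 3) indices: (temporal, vertical, horizontal)."""
    ids = []
    for t in range(num_frames):
        t_idx = start + t * delta                 # adjustable temporal interval
        for y in range(h):
            for x in range(w):
                # spatial indices offset symmetrically around the temporal index
                ids.append([t_idx,
                            t_idx + y - (h - 1) / 2,
                            t_idx + x - (w - 1) / 2])
    return torch.tensor(ids)

pos = video_position_ids(num_frames=4, h=3, w=3, delta=2.0)
print(pos.shape)   # torch.Size([36, 3])
```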

To improve extrapolation beyond the training range, the team proposes YaRN-V, which interpolates only the low-frequency temporal dimensions while keeping the spatial dimensions stable; it excels in long-video tasks such as retrieval and understanding.
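A hedged sketch of that idea is shown below: when positions exceed the training range, only the slow-rotating (low-frequency) temporal rotary frequencies are rescaled, while the remaining frequencies are left untouched. The `cutoff` threshold and `scale` factor here are illustrative assumptions, not values from the paper or its released implementation.

```python
# Sketch of low-frequency-only interpolation in the spirit of YaRN-V
# (illustrative; `cutoff` and `scale` are assumed, not from the paper).
import torch

def yarn_v_frequencies(dim: int, scale: float, cutoff: float = 1e-3,
                       base: float = 10000.0) -> torch.Tensor:
    """Return temporal inverse frequencies with low-frequency-only scaling."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    low_freq = inv_freq < cutoff          # slow-rotating, long-wavelength dims
    # Interpolate (divide by the context-extension factor) only those dims
    return torch.where(low_freq, inv_freq / scale, inv_freq)

# e.g. extending the trained temporal range by 4x:
freqs = yarn_v_frequencies(dim=64, scale=4.0)
```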


2. Analysis

Figure: To demonstrate the importance of frequency allocation, the new benchmark V-RULER includes the subtask Needle Retrieval under Distractors (NRD), where similar images are inserted as interference. VideoRoPE++ shows superior robustness over M-RoPE, especially under distractors.


In long video understanding, VideoRoPE++ outperforms other variants (Vanilla RoPE, TAD-RoPE, M-RoPE) across benchmarks like LongVideoBench, MLVU, and Video-MME, especially at 64k context length, demonstrating its ability to capture long-range dependencies.


In extrapolation tasks, YaRN-V significantly outperforms baselines like NTK-Aware and M-RoPE++, especially in handling super-long positional indices under multimodal interference, proving its robustness in long input scenarios.

3. Conclusion

Effective position encoding requires a 2D/3D structure, careful frequency allocation, spatial symmetry, and temporal index scaling. VideoRoPE++, with its 3D structure, low-frequency temporal allocation, diagonal layout, and YaRN-V extrapolation, excels in long-video tasks.
