Overcoming High-Resolution Image Reasoning Bottlenecks: Fudan University and Nanyang Technological University Propose MGPO, a Visual Grounding-Based Multi-Turn Reinforcement Learning Framework
Fudan and Nanyang Technological University introduce MGPO, a multi-turn reinforcement learning framework leveraging visual grounding to enhance high-resolution image reasoning, validated at ICML 2025.


Authors | Research Team
Editor | ScienceAI
Advanced large multimodal models (LMMs), such as Qwen2.5-VL, typically pair a large language model (LLM) with a native-resolution vision transformer (NaViT). However, high-resolution images generate a massive number of visual tokens, most of which are irrelevant to the task, inflating computational cost and interfering with the capture of key information.
Inspired by biological visual mechanisms, researchers from Fudan University and Nanyang Technological University propose MGPO, a visual grounding-based multi-turn reinforcement learning method. It enables LMMs to predict the coordinates of key regions, crop the corresponding sub-images, and incorporate historical context across multiple turns, achieving precise high-resolution image reasoning. Unlike supervised fine-tuning (SFT), which requires costly grounding annotations, MGPO demonstrates that even without such labels, models can develop robust visual grounding capabilities based solely on feedback about the correctness of the final answer.
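The training signal is deliberately sparse: only the correctness of the final answer is rewarded, and the predicted coordinates themselves receive no direct supervision. A minimal sketch of such a reward, assuming a simple exact-match check (the actual answer-matching rule used in the paper may differ), looks like this:

```python
def answer_reward(predicted_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference, else 0.0.

    Illustrative only; the grounding coordinates from earlier turns get no
    separate reward, so grounding must emerge from answer correctness alone.
    """
    return 1.0 if predicted_answer.strip().lower() == ground_truth.strip().lower() else 0.0
```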
Core Innovations of MGPO:
- Top-down explainable visual reasoning: Provides a question-driven, interpretable visual search mechanism with grounding outputs.
- Breaking pixel limits: Even when high-res images are scaled and become blurry, the model can accurately identify relevant regions and crop clear sub-images for analysis.
- No extra grounding annotations needed: Can be trained directly on standard VQA datasets, with only answer supervision, to develop robust visual grounding.

Addressing the “Cold Start” Problem: Fixed Two-Round Dialogue Template
In practice, LMMs struggle to invoke grounding autonomously during rollout, which slows RL training. To address this, MGPO uses a fixed two-round dialogue template: in the first round, the model predicts the coordinates of the question-relevant region; in the second round, it answers the question using the cropped sub-image.
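A rough sketch of this two-round rollout is shown below. The prompt wording and helper names (`parse_box`, `crop_from_original`, `model.generate`) are illustrative assumptions, not the authors' exact implementation; `crop_from_original` is sketched in the next section.

```python
import re

# Illustrative prompts for the fixed two-round template.
GROUNDING_PROMPT = (
    "First, locate the image region needed to answer the question and output "
    "its bounding box as [x1, y1, x2, y2]."
)
ANSWER_PROMPT = "Now answer the original question using the cropped region above."

def parse_box(text):
    """Extract the first [x1, y1, x2, y2] box from the model's first-round output."""
    nums = re.findall(r"-?\d+", text)
    return tuple(int(n) for n in nums[:4])

def two_round_rollout(model, original_image, resized_image, question):
    # Round 1: the model is always asked to predict the relevant region's coordinates.
    history = [{"role": "user", "content": [resized_image, question, GROUNDING_PROMPT]}]
    box_text = model.generate(history)
    history.append({"role": "assistant", "content": box_text})

    # Crop a clear sub-image from the original high-resolution image.
    sub_image = crop_from_original(original_image, parse_box(box_text), resized_image.size)

    # Round 2: the sub-image plus the full history are fed back to produce the answer.
    history.append({"role": "user", "content": [sub_image, ANSWER_PROMPT]})
    return model.generate(history)
```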

Handling High-Resolution Images: Coordinate Normalization and Sub-Image Cropping
Due to visual token limits, high-resolution images are scaled down before being fed to the model, which loses detail. MGPO locates the relevant region on the scaled image and then crops the corresponding clear sub-image from the original high-resolution image, ensuring accurate reasoning.
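A minimal sketch of this mapping, assuming the box is predicted in pixel coordinates of the resized image and then rescaled back to the original resolution (the paper's exact coordinate convention, e.g. normalized vs. pixel coordinates, may differ):

```python
from PIL import Image

def crop_from_original(original: Image.Image, box, resized_size):
    """Map a box predicted on the resized image back to the original image and crop it."""
    orig_w, orig_h = original.size
    res_w, res_h = resized_size
    x1, y1, x2, y2 = box

    # Rescale coordinates from the downscaled image to the original high-resolution image.
    sx, sy = orig_w / res_w, orig_h / res_h
    x1, x2 = int(x1 * sx), int(x2 * sx)
    y1, y2 = int(y1 * sy), int(y2 * sy)

    # Clamp to image bounds before cropping, so malformed boxes do not raise errors.
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(orig_w, x2), min(orig_h, y2)
    return original.crop((x1, y1, x2, y2))
```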

Experimental Results
1. Comparison of Paradigms
On two high-resolution benchmarks, MME-Realworld (in-distribution) and V* Bench (out-of-distribution), MGPO significantly outperforms both SFT and GRPO, improving over GRPO by 5.4% and 5.2%, respectively. Even when trained on only 21,000 samples with a 7B model, MGPO surpasses OpenAI's GPT-4o and o1 on V* Bench.

2. Emergence of Visual Grounding During RL Training
Statistics over training show that the proportion of valid grounding coordinates generated by MGPO rises markedly, demonstrating that the model can autonomously develop stable and accurate visual grounding based solely on answer-correctness feedback.
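One plausible way to compute such a statistic, assuming the first-round output is parsed for a well-formed bounding box that lies within the image bounds (an illustrative validity check, not necessarily the authors' exact criterion):

```python
import re

BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")

def is_valid_box(text: str, width: int, height: int) -> bool:
    """True if the first-round output contains a well-formed box inside the image."""
    match = BOX_PATTERN.search(text)
    if not match:
        return False
    x1, y1, x2, y2 = map(int, match.groups())
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height

def valid_grounding_ratio(first_round_outputs, width, height) -> float:
    """Fraction of rollouts whose first-round output contains a usable bounding box."""
    valid = sum(is_valid_box(text, width, height) for text in first_round_outputs)
    return valid / max(len(first_round_outputs), 1)
```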

Summary
MGPO activates visual grounding through multi-turn reinforcement learning, effectively addressing issues like visual token redundancy and information loss in high-resolution image processing. It demonstrates that models can develop robust grounding abilities without costly annotations, solely driven by answer correctness feedback.