ICCV 2025 | UV-CoT: A New Breakthrough in Unsupervised Visual Reasoning and Preference Optimization for Image-Level Thought Chains
UV-CoT introduces an unsupervised visual reasoning framework that automatically identifies key image regions and learns image-level chains of thought through preference optimization.


The first author is Zhao Kesen, a PhD student at Nanyang Technological University, specializing in reinforcement learning for multimodal large language models (MLLMs). The paper has been accepted at ICCV 2025.
With the success of Chain-of-Thought (CoT) reasoning in text, researchers are now applying this method to visual understanding to enhance reasoning ability and interpretability.
However, existing models are limited to fixed-granularity, text-level reasoning and cannot dynamically focus on key image regions based on semantic cues. To address this, the paper proposes UV-CoT (Unsupervised Visual Chain-of-Thought), a novel unsupervised visual reasoning framework.
This approach mimics human visual understanding by following a "key regions → reasoning" process (illustrated below), and designs unsupervised data-generation and preference-optimization mechanisms that dynamically focus on key regions, significantly improving spatial perception and multimodal reasoning.

- Paper Title: Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
- Link: https://arxiv.org/abs/2504.18397
- Project Page: https://kesenzhao.github.io/my_project/projects/UV-CoT.html
- Code Repository: https://github.com/kesenzhao/UV-CoT
- Open-source Model: https://huggingface.co/papers/2504.18397
Background: Supervised training is costly
Existing methods rely on supervised fine-tuning (SFT) with large annotated datasets, in which key regions and reasoning steps are marked manually. Challenges include:
- High annotation cost and poor scalability: Labeling key regions and reasoning paths is labor-intensive, especially for complex visual semantics, limiting multi-task and large-scale applications.
- Limited generalization from a single training signal: SFT learns only from the annotated correct samples, ignoring other plausible regions and reasoning paths, which reduces robustness in unseen scenarios.
UV-CoT introduces an automated preference data generation and evaluation process, combined with an improved preference optimization algorithm, Score-DPO (sDPO), guiding models to learn image-level thought chains in an unsupervised manner (illustrated below).

Contribution 1: Unsupervised preference data generation and evaluation
UV-CoT uses the target model and an evaluation model to generate diverse intermediate reasoning responses for image-question pairs, and then constructs preference data from the evaluation scores.
- Response Generation: At each reasoning step, the target model generates n diverse responses, including bounding boxes and intermediate results.
- Response Evaluation: The evaluation model scores each response, and responses are ranked to form preference data, guiding the next reasoning step.
By dynamically generating preference data, UV-CoT reduces reliance on high-quality annotations, enabling unsupervised image-level reasoning.
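To make the pipeline concrete, here is a minimal Python sketch of one generation-and-scoring step. The model interfaces (generate, score) and the helper name build_preference_pairs are illustrative assumptions, not the released UV-CoT code.

```python
# Minimal sketch of one step of unsupervised preference-data generation.
# Model interfaces (generate / score) and helper names are illustrative
# assumptions, not the released UV-CoT code.

def build_preference_pairs(target_model, eval_model, image, question, n=4):
    """Sample n candidate responses (bounding box + intermediate answer),
    score them with the evaluation model, and keep the best/worst pair."""
    candidates = [
        target_model.generate(image, question, temperature=1.0)  # diverse sampling
        for _ in range(n)
    ]
    scored = sorted(
        ((eval_model.score(image, question, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    (best_score, best), (worst_score, worst) = scored[0], scored[-1]
    return {
        "chosen": best,
        "rejected": worst,
        "score_gap": best_score - worst_score,  # later reused by the sDPO loss
        "next_step_context": best,              # preferred response conditions the next step
    }
```

Because both the candidate responses and their scores come from models rather than annotators, this loop can be run over unlabeled image-question pairs at scale.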

Contribution 2: sDPO and iterative learning
UV-CoT employs an improved direct preference optimization algorithm, sDPO, which incorporates preference score differences to better model the importance of key regions, and uses iterative learning to adaptively refine the model.
The sDPO loss function is shown below:

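As a simplified sketch (the exact formulation and notation are given in the paper), the objective can be viewed as a DPO loss whose logit is weighted by the score gap between the chosen response y_w and the rejected response y_l:

```latex
% Simplified sketch of the sDPO objective (assumed general form, not the
% paper's exact notation): the DPO logit is scaled by the score gap s(y_w) - s(y_l).
\mathcal{L}_{\mathrm{sDPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
      \log \sigma\!\left(
        \bigl(s(y_w) - s(y_l)\bigr)\,\beta
        \left(
          \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
      \right)
    \right]
```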
Compared to standard DPO, sDPO introduces preference score differences, quantifying the strength of preferences between responses, improving focus on critical image regions.
Iterative learning (Algorithm 2): The training data is split into m subsets; in each iteration, the current model generates preference data D_i on one subset and is then updated with sDPO, gradually improving robustness and preference alignment.
By dynamically regenerating preference data with the current model, this method alleviates the distribution gap between the training data and the evolving model, enhancing training robustness, as sketched below.
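The following compact sketch of this loop reuses build_preference_pairs from the sketch above; split_into_subsets and sdpo_update are hypothetical helpers standing in for the actual implementation.

```python
# Illustrative sketch of the iterative learning loop (Algorithm 2);
# split_into_subsets and sdpo_update are hypothetical helpers.

def iterative_training(model, eval_model, dataset, m=4):
    subsets = split_into_subsets(dataset, m)  # the dataset is split into m subsets
    for subset in subsets:
        # 1) Generate preference data D_i with the *current* model, so the
        #    data distribution tracks the model being trained.
        pref_data = [
            build_preference_pairs(model, eval_model, ex["image"], ex["question"])
            for ex in subset
        ]
        # 2) Update the model on D_i with the sDPO objective.
        model = sdpo_update(model, pref_data)
    return model
```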

Experimental Highlights
Significant performance improvements (Table 1): On six benchmarks, UV-CoT outperforms the supervised visual chain-of-thought model Visual-CoT-7B and surpasses LLaVA-1.5-7B and other unsupervised models.
Strong generalization and scalability (Table 2): In zero-shot settings, UV-CoT improves performance by an average of 2.5%, and the gain rises to 5.1% when additional unlabeled data is used.
High-resolution scene competence (Table 3): On V* Bench, UV-CoT achieves an average score of 0.402, a 5.5% improvement, with an 8.4% gain on OCR tasks.


High-quality bounding-box generation without relying on an evaluation model (Table 5): under self-evaluation, UV-CoT outperforms LLaVA-1.5-7B (+4.8%) and approaches OmniLMM-12B (-0.2%), and the generated bounding boxes further improve reasoning performance.

Visualization of preferences and reasoning chains: The visualizations below show how preferences are generated and how reasoning chains are constructed, illustrating the interpretability of UV-CoT.


Conclusion
UV-CoT presents an innovative unsupervised visual reasoning framework that automatically identifies key image regions and optimizes reasoning through preference-based feedback, laying a foundation for scalable, efficient multimodal understanding.