Enabling Multimodal Large Models to 'Think Before Drawing'! HKU and Partners Open-Source GoT-R1: Reinforcement Learning Unlocks a New Visual Reasoning Paradigm
The University of Hong Kong (HKU) and partners have open-sourced GoT-R1, a multimodal framework that uses reinforcement learning to strengthen visual generation and reasoning, enabling models to 'think before drawing'.


Multimodal large models have made significant progress in generating high-fidelity, semantically consistent images from complex text prompts. However, they still struggle with instructions that involve precise spatial relationships, multiple object attributes, and complex compositions.
In response, a research team from HKU MMLab, CUHK MMLab, and SenseTime, building on their earlier Generation Chain-of-Thought (GoT) framework, has announced a major advancement: GoT-R1.

The new framework introduces reinforcement learning to substantially strengthen the semantic and spatial reasoning of multimodal models in visual generation tasks, letting them autonomously explore and learn better reasoning strategies. Both GoT and GoT-R1 are fully open source.
- GoT arXiv: https://arxiv.org/pdf/2503.10639
- GoT GitHub: https://github.com/rongyaofang/GoT
- GoT-R1 arXiv: https://arxiv.org/pdf/2505.17022
- GoT-R1 GitHub: https://github.com/gogoduan/GoT-R1
The earlier GoT framework improves image generation accuracy by explicitly planning semantic content and spatial layout before image synthesis, but its reasoning relies mainly on supervised data built from predefined templates, which limits the autonomous discovery of better strategies.
GoT-R1 addresses these limitations by applying reinforcement learning to the semantic-spatial reasoning process, empowering the model to autonomously learn and optimize its own reasoning paths.
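To make the idea concrete, here is a minimal, hypothetical sketch of what such a pre-generation plan could look like; the object names, field names, and bounding-box convention are illustrative assumptions rather than the exact GoT chain format.

```python
# Hypothetical GoT-style plan: semantic content plus a spatial layout,
# written out before any pixels are generated. Boxes are (x1, y1, x2, y2)
# in pixel coordinates on a 512x512 canvas (an assumed convention).
plan = {
    "prompt": "a blue vase to the right of a red apple on a wooden table",
    "reasoning": [
        {"name": "red apple",    "box": (60, 300, 200, 440)},   # left side
        {"name": "blue vase",    "box": (300, 120, 440, 400)},  # right side
        {"name": "wooden table", "box": (0, 380, 512, 512)},    # supporting surface
    ],
}
# The image generator then conditions on this plan; GoT-R1's rewards score
# both the plan itself and how faithfully the final image follows it.
```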

The approach builds a multi-dimensional, MLLM-based reward framework, including:
- Reasoning Process Evaluation Reward (RPR): an MLLM assesses the completeness, fidelity, and logical consistency of the reasoning chain.
- Spatial Alignment Reward (Rspa): the spatial layout planned in the reasoning output is rendered on a virtual canvas, and an MLLM then judges the accuracy of that layout (see the sketch after this list).
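The following is a minimal sketch of the virtual-canvas idea behind the spatial reward, reusing the hypothetical plan format from the earlier snippet; the rendering details and downstream scoring prompt are assumptions, not the exact GoT-R1 implementation.

```python
# Draw each planned object's bounding box and label on a blank canvas so an
# MLLM judge can visually check the planned layout against the prompt.
from PIL import Image, ImageDraw

def render_layout_canvas(reasoning, size=(512, 512)):
    """Render planned boxes as labeled rectangles on a white canvas."""
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)
    for obj in reasoning:
        x1, y1, x2, y2 = obj["box"]
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), obj["name"], fill="black")
    return canvas

# The saved canvas, together with the original prompt, would then be shown
# to an MLLM judge that returns a spatial-alignment score for the layout.
layout = [
    {"name": "red apple", "box": (60, 300, 200, 440)},
    {"name": "blue vase", "box": (300, 120, 440, 400)},
]
render_layout_canvas(layout).save("layout_canvas.png")
```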
These rewards are combined through the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm, enabling GoT-R1 models (e.g., built on Janus-Pro) to actively explore and learn more effective reasoning strategies rather than merely repeating the fixed patterns in the training data.
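As a rough illustration of how GRPO turns such rewards into a learning signal, the sketch below computes group-relative advantages over a sampled group of outputs; the reward names, weights, and group size are assumptions for illustration only.

```python
# GRPO-style advantage computation: sample a group of reasoning chains and
# images for one prompt, score each with a combined reward, and normalize
# rewards within the group so better-than-average samples receive positive
# advantages that weight the policy-gradient update.
import numpy as np

def combined_reward(scores, weights=None):
    """Weighted sum of per-dimension reward scores (weights are assumed)."""
    weights = weights or {"reasoning": 0.5, "spatial": 0.5}
    return sum(weights[k] * scores[k] for k in weights)

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled outputs for the same prompt, each scored on two dimensions.
group_scores = [
    {"reasoning": 0.9, "spatial": 0.8},
    {"reasoning": 0.6, "spatial": 0.3},
    {"reasoning": 0.7, "spatial": 0.9},
    {"reasoning": 0.4, "spatial": 0.5},
]
advantages = grpo_advantages([combined_reward(s) for s in group_scores])
print(advantages)  # better-than-average samples get positive advantages
```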

Experimental results on the challenging T2I-CompBench show that GoT-R1 establishes new state-of-the-art performance, outperforming diffusion models like Stable Diffusion, layout-guided models like Ranni and LayoutGPT, and autoregressive models like Emu3 and Janus-Pro.
Compared with baseline models trained only on supervised GoT data, the reinforcement-learning-optimized GoT-R1-7B achieves up to a 15% improvement across multiple categories, most notably in texture and shape fidelity, demonstrating the effectiveness of autonomous reasoning-path optimization.
To verify that reinforcement learning genuinely enhances the model's reasoning ability, the team used GPT-4o as an evaluator to compare the reasoning chains generated by GoT-R1 against those of the baseline Janus-Pro model. GPT-4o showed a clear preference for GoT-R1's reasoning chains, particularly on complex spatial relationships, favoring them by 84 votes to 16, confirming the improvement in reasoning quality.
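A minimal sketch of such a pairwise MLLM-judge comparison is shown below; the prompt wording and helper function are assumptions, and only the general setup (GPT-4o choosing between two reasoning chains) comes from the text.

```python
# Pairwise judging of two reasoning chains with GPT-4o via the OpenAI API.
# Assumes OPENAI_API_KEY is set in the environment; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

def judge_pair(image_prompt, chain_a, chain_b):
    """Ask GPT-4o which reasoning chain plans the requested image better."""
    question = (
        f"Image prompt:\n{image_prompt}\n\n"
        f"Reasoning chain A:\n{chain_a}\n\n"
        f"Reasoning chain B:\n{chain_b}\n\n"
        "Which chain plans the objects and their spatial layout more "
        "faithfully and logically? Answer with exactly 'A' or 'B'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()
```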

A head-to-head GPT-4o comparison of reasoning chains before and after reinforcement learning highlights the marked improvement in logical clarity and fidelity achieved by GoT-R1.