First Integration of RL and SFT for Efficient Reasoning Model Training
A novel approach combining reinforcement learning (RL) and supervised fine-tuning (SFT) dynamically guides large models, enhancing training efficiency and reasoning capabilities.


Dr. Liu Zirui completed his doctorate in Data Science at City University of Hong Kong, supervised by Professor Zhao Xiangyu and Professor Zhou Dingxuan. He now works at the Huawei Hong Kong Research Institute on RLVR algorithms and diffusion language models. Gong Cheng is pursuing his PhD at City University of Hong Kong under Professor Zhang Qingfu and has completed research internships at the Huawei Hong Kong Research Institute focused on RLVR algorithms.
Next-generation large reasoning models such as OpenAI-o3, DeepSeek-R1, and Kimi-1.5 have made significant progress on complex reasoning. The core recipe, Zero-RL, applies reinforcement learning with verifiable rewards (RLVR) to steadily improve pass@1 on challenging tasks such as math and coding. Compared with supervised fine-tuning (SFT), which relies on high-quality human data or chains of thought distilled from stronger models, RL-based post-training generalizes better when it comes to improving reasoning.
However, RLVR methods such as Group Relative Policy Optimization (GRPO) face two main limitations:
- The gap between training-data difficulty and model capability leads to sparse rewards, which hinders stable learning.
- On-policy RL algorithms have low sample efficiency, which is especially problematic for the small models deployed on edge devices.
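To see why sparse rewards stall GRPO-style updates, note that the group-relative advantage compares each reward to the group's own mean: when every sampled response to a hard prompt is wrong, all rewards are zero, the advantages are exactly zero, and the sample contributes no gradient. A minimal illustration follows; the function name is ours, not from the GHPO codebase.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: each reward normalized against the group mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A problem the model sometimes solves: the group carries a learning signal.
mixed = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(mixed))      # non-zero advantages

# A problem far beyond current capability: every rollout fails, all rewards are 0,
# so every advantage is exactly 0 and the update contributes no gradient.
too_hard = np.zeros(8)
print(group_relative_advantages(too_hard))   # [0. 0. 0. 0. 0. 0. 0. 0.]
```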
To address these issues, the Huawei Hong Kong Research Institute, Noah’s Ark Lab, and City University of Hong Kong jointly developed the GHPO algorithm framework, which integrates online reinforcement learning with imitation learning and adaptively switches between the two.

GHPO consists of two core modules:
- Automatic Difficulty Detection: This module assesses the intrinsic difficulty of each problem to guide subsequent learning. It requires no preset difficulty labels and no auxiliary large model; it relies solely on the distribution of the model’s own sampled solutions, so the difficulty estimate evolves with model capability and keeps training efficient.
- Adaptive Prompt Switching: Guided by the difficulty signal, this module dynamically adjusts how much of the reference solution is blended into the training prompt. It follows a multi-stage guidance strategy, gradually lengthening the injected hint when the model fails to produce a correct response, balancing exploration and guidance (a minimal sketch of this loop follows the list).
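The sketch below shows how these two modules could interact inside a rollout loop. The function and parameter names (`generate_group`, `is_correct`, `hint_ratios`, and so on) are illustrative assumptions rather than the released GHPO code: difficulty is inferred purely from the pass rate of the sampled group, and progressively longer prefixes of the reference solution are injected when the model keeps failing.

```python
from typing import Callable, List

def build_prompt(question: str, solution: str, hint_ratio: float) -> str:
    """Prefix the prompt with the first `hint_ratio` fraction of the reference solution."""
    if hint_ratio <= 0.0:
        return question
    cutoff = int(len(solution) * hint_ratio)
    return f"{question}\n\nPartial solution hint:\n{solution[:cutoff]}"

def ghpo_style_rollout(
    question: str,
    solution: str,
    generate_group: Callable[[str], List[str]],   # samples G responses from the current policy
    is_correct: Callable[[str], bool],            # verifiable reward check (e.g. answer match)
    hint_ratios=(0.0, 0.25, 0.5, 0.75),           # multi-stage guidance schedule (assumed values)
) -> List[str]:
    for ratio in hint_ratios:
        prompt = build_prompt(question, solution, ratio)
        group = generate_group(prompt)
        # Difficulty detection: the pass rate of the sampled group is the only signal used.
        if any(is_correct(resp) for resp in group):
            # The problem is within reach at this guidance level: keep it as an RL sample.
            return group
    # Even the longest hint fails: fall back to imitation on the reference solution.
    return [solution]
```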
The objective function of GHPO can be expressed as shown below:
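The formula below is our schematic sketch, assuming GHPO retains a GRPO-style clipped surrogate and swaps the rollout prompt for its adaptively hint-augmented version, denoted $\tilde{q}$; the exact formulation is given in the GHPO paper.

$$
\mathcal{J}_{\mathrm{GHPO}}(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid \tilde{q})} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big) \right]
$$

where $r_{i,t}(\theta) = \pi_{\theta}(o_{i,t} \mid \tilde{q}, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid \tilde{q}, o_{i,<t})$ is the token-level importance ratio, $\hat{A}_{i,t}$ is the group-relative advantage, and $\tilde{q}$ equals the original question $q$ when at least one rollout in the group is correct, or $q$ concatenated with a partial reference solution otherwise.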

This design provides guidance only when it is needed, preserving valuable exploration on problems within the model’s current ability while supplying adaptive support on harder ones.
Performance Consistently Surpasses GRPO; Code and Data Open-Sourced
GHPO’s implementation builds on the Open-R1 project, using the TRL training framework with vLLM for inference acceleration. The team implemented GHPOTrainer directly within TRL, with plans for upstream integration in the future.
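Since GHPOTrainer is described as living inside TRL, a usage sketch might look like the following. This is hypothetical: the `ghpo` module, `GHPOConfig`, the `hint_ratios` argument, the dataset path, and the column names are our assumptions, not the published interface; the reward function follows the completions-to-list-of-floats convention used by TRL's GRPO trainer.

```python
from datasets import load_dataset
from ghpo import GHPOConfig, GHPOTrainer   # hypothetical module and class names

def accuracy_reward(completions, solution, **kwargs):
    """Verifiable reward: 1.0 if the reference answer appears in the completion, else 0.0.
    Assumes the dataset has a `solution` column passed through as a kwarg."""
    return [1.0 if sol.strip() in comp else 0.0 for comp, sol in zip(completions, solution)]

# Assumed local JSONL file with `prompt` and `solution` fields.
dataset = load_dataset("json", data_files="math_train.jsonl", split="train")

config = GHPOConfig(
    output_dir="qwen2.5-7b-ghpo",
    num_generations=8,                    # group size G used for difficulty detection
    use_vllm=True,                        # vLLM-accelerated rollouts, as in the Open-R1 setup
    hint_ratios=[0.0, 0.25, 0.5, 0.75],   # assumed knob for the multi-stage guidance schedule
)

trainer = GHPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=accuracy_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```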
Experiments with Qwen2.5-7B on six major math benchmarks show that GHPO achieves an average performance improvement of 4.5% over GRPO.

GHPO also exhibits more stable gradient updates during training and can be applied to other models like Qwen2.5-Math-7B, with detailed case studies illustrating its effectiveness.
Summary and Future Outlook: Bridging Reinforcement Learning and SFT
Since the advent of DeepSeek-R1, RL algorithms such as GRPO have become a popular choice for training large models, widely believed to yield stronger generalization. GHPO alleviates the reward-sparsity problem and adaptively switches between RL and imitation learning, offering a new perspective on how the two paradigms can be integrated more deeply.