Breaking Through the Bottleneck of General-Domain Reasoning! Tsinghua NLP Lab's New Reinforcement Learning Method, RLPR
Tsinghua NLP Lab introduces RLPR, a novel reinforcement learning method that enhances reasoning across diverse fields, overcoming previous limitations in natural language understanding.


Yu Tianyu, a first-year PhD student in the Department of Computer Science at Tsinghua University, supervised by Associate Professor Liu Zhiyuan of the Tsinghua NLP Laboratory, focuses on efficient multimodal large models, multimodal alignment, and reinforcement learning. He has published multiple papers at top AI conferences such as CVPR and AAAI, with over 1,000 citations on Google Scholar.
DeepSeek's R1, OpenAI's o1/o3, and other reasoning models demonstrate the enormous potential of RLVR (Reinforcement Learning with Verifiable Rewards).
However, current methods remain largely confined to domains such as mathematics and coding: for answers expressed in the rich diversity of natural language, rule-based verifiers are difficult to extend to general domains.
To address this challenge, Tsinghua NLP Lab proposes a key technique—Reinforcement Learning with Reference Probability Reward (RLPR).

- Paper Title: RLPR: Extrapolating RLVR to General Domains without Verifiers
- Paper Link: https://github.com/OpenBMB/RLPR/blob/main/RLPR_paper.pdf
- GitHub Repository: https://github.com/OpenBMB/RLPR
This technique substantially improves the quality of the probability-based reward (PR), outperforming naive likelihood-based rewards in both performance and training stability.
In addition, RLPR introduces a dynamic filtering mechanism based on the standard deviation of rewards, further enhancing stability and performance. All related code, models, data, and the paper are open source.
Why is PR effective? Exploiting the model's intrinsic self-evaluation
Researchers observed that the probability a large language model (LLM) assigns to the reference answer directly reflects its own assessment of the quality of its reasoning: the more correct the reasoning, the higher the probability given to the reference answer.
For example, when the model's reasoning trace o2 mistakenly ranks option A second, the probability of the correct answer at the corresponding position drops sharply. This indicates that PR accurately captures the model's judgment of its own reasoning quality and is closely tied to answer correctness.

PR example, deeper color indicates higher output probability
Core features of RLPR
Domain-independent, efficient reward generation
Existing RLVR methods typically demand extensive manual effort and domain-specific verification rules. In contrast, RLPR needs only a single forward pass to produce a reward score, based on the mean generation probability of the reference answer, which naturally handles the flexibility of free-form natural language answers.
As shown in the example in the figure, rule-based methods cannot recognize that y2 and y3 are semantically equivalent to the reference answer, whereas RLPR's PR mechanism correctly assigns them higher scores.
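Conceptually, the PR computation is simple. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name, the `probability_reward` function, and the prompt layout are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a probability-based reward (PR): mean token probability of the
# reference answer, conditioned on the question and the model's own reasoning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def probability_reward(question: str, reasoning: str, reference: str) -> float:
    """Mean probability of the reference-answer tokens, one forward pass."""
    prefix_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, ref_ids], dim=1)

    logits = model(input_ids).logits  # [1, seq_len, vocab]
    # Positions whose next-token prediction corresponds to the reference tokens.
    probs = torch.softmax(logits[0, prefix_ids.shape[1] - 1 : -1], dim=-1)
    token_probs = probs.gather(1, ref_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_probs.mean().item()
```

Because the score depends on the probability of the reference answer given the model's reasoning, rather than on string-matching the model's own output, semantically equivalent answers such as y2 and y3 naturally receive high rewards where a rule-based verifier would fail.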

Reward bias correction and dynamic filtering
While the basic PR correlates well with answer quality, it can be skewed by factors unrelated to reasoning, such as question style. RLPR therefore corrects this bias by also computing a control reward from the same prompt without the reasoning steps and taking the difference, removing influences that have nothing to do with reasoning quality.
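A hedged sketch of this difference-based correction is shown below, reusing the `probability_reward` sketch above; the empty-reasoning control and the clipping at zero are assumptions about how such a correction could look, not the paper's exact formula.

```python
# Debiasing sketch: subtract a control score computed without the reasoning,
# so that reward differences reflect the reasoning itself rather than question style.
def debiased_reward(question: str, reasoning: str, reference: str) -> float:
    pr_with_reasoning = probability_reward(question, reasoning, reference)
    pr_without_reasoning = probability_reward(question, "", reference)  # control
    # Only credit the probability gained by the reasoning; keep the reward non-negative.
    return max(0.0, pr_with_reasoning - pr_without_reasoning)
```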

Dynamic filtering based on reward standard deviation
RLPR further employs a dynamic filtering mechanism based on reward standard deviation: prompts whose sampled rewards show low standard deviation, and therefore carry little learning signal, are filtered out, and the filtering threshold is updated online via an exponential moving average (EMA), improving training stability and effectiveness.
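The filtering step could look roughly like the sketch below, which keeps a prompt only if the standard deviation of its rollout rewards exceeds a running threshold; the `StdFilter` class, the EMA factor, and the keep rule are illustrative assumptions.

```python
# Sketch of std-based dynamic filtering with an EMA-updated threshold.
import numpy as np

class StdFilter:
    def __init__(self, init_threshold: float = 0.05, ema: float = 0.9):
        self.threshold = init_threshold
        self.ema = ema

    def keep(self, rewards: list[float]) -> bool:
        std = float(np.std(rewards))
        keep = std > self.threshold
        # Update the threshold as an exponential moving average of observed stds.
        self.threshold = self.ema * self.threshold + (1 - self.ema) * std
        return keep

# Usage: prompts whose rollout rewards barely vary are dropped from the batch.
filt = StdFilter()
batch = [[0.81, 0.79, 0.80, 0.82], [0.2, 0.9, 0.4, 0.7]]
kept = [rewards for rewards in batch if filt.keep(rewards)]
```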
Reliable reward quality and robustness
Using the ROC-AUC metric, the team evaluated the quality of rewards from different sources. The results show that PR already surpasses rule-based and verifier-based rewards at the 0.5B model scale, and its quality further improves to 0.91 ROC-AUC as model capability grows.

PR reward quality exceeds rule-based and verifier-based rewards
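This kind of reward-quality check can be illustrated in a few lines: treat the reward as a score for predicting whether a sampled answer is actually correct, and compute ROC-AUC against ground-truth correctness labels. The toy data below is purely illustrative.

```python
# Measure reward quality as ROC-AUC against binary correctness labels.
from sklearn.metrics import roc_auc_score

is_correct = [1, 0, 1, 1, 0, 0, 1, 0]                          # ground-truth labels
reward = [0.92, 0.31, 0.85, 0.77, 0.40, 0.55, 0.88, 0.22]      # PR scores

print(f"reward ROC-AUC: {roc_auc_score(is_correct, reward):.2f}")
```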
To verify robustness, the team trained Qwen2.5-3B with a variety of prompt templates using RLPR and observed consistent performance improvements across the different training setups.

RLPR shows stable improvements across different training templates
Further experiments on models like Gemma, Llama, and Qwen confirmed RLPR’s ability to consistently enhance reasoning capabilities, outperforming rule-based RLVR baselines.

RLPR consistently improves reasoning across different base models
Summary
RLPR introduces an innovative Prob-to-Reward mechanism that removes RLVR's dependence on domain-specific verifiers. Extensive validation on models such as Gemma, Llama, and Qwen demonstrates its effectiveness and its potential for scaling reinforcement learning to general domains.
More details can be found in the original paper.