Breaking Through the Bottleneck of General-Domain Reasoning! Tsinghua NLP Lab's New Reinforcement Learning Method, RLPR
Tsinghua NLP Lab introduces RLPR, a novel reinforcement learning method that enhances reasoning across diverse fields, overcoming previous limitations in natural language understanding.


Yu Tianyu, a first-year PhD student in the Department of Computer Science at Tsinghua University, supervised by Associate Professor Liu Zhiyuan of the Tsinghua NLP Laboratory, focuses on efficient multimodal large models, multimodal alignment, and reinforcement learning. He has published multiple papers at top AI conferences such as CVPR and AAAI, with over 1,000 citations on Google Scholar.
DeepSeek's R1, OpenAI's o1/o3, and other reasoning models demonstrate the enormous potential of RLVR (Reinforcement Learning with Verifiable Rewards).
However, current methods remain largely confined to domains such as mathematics and coding: for answers expressed in the rich diversity of natural language, rule-based verifiers are difficult to extend to general domains.
To address this challenge, Tsinghua NLP Lab proposes a key technique—Reinforcement Learning with Reference Probability Reward (RLPR).

- Paper Title: RLPR: Extrapolating RLVR to General Domains without Verifiers
- Paper Link: https://github.com/OpenBMB/RLPR/blob/main/RLPR_paper.pdf
- GitHub Repository: https://github.com/OpenBMB/RLPR
This technique substantially improves the quality of the probability-based reward (PR), outperforming naive likelihood-based rewards in both performance and training stability.
In addition, RLPR introduces a dynamic filtering mechanism based on the standard deviation of rewards, further enhancing stability and performance. All related code, models, data, and the paper are open source.
Why is PR effective? Exploiting the model's intrinsic self-evaluation
Researchers observed that the probability a large language model (LLM) assigns to the reference answer directly reflects its own assessment of the quality of its reasoning: the more correct the reasoning, the higher the probability given to the reference answer.
For example, when the model's reasoning trace o2 mistakenly ranks option A second, the probability of the correct answer at the corresponding position drops sharply. This indicates that PR accurately captures the model's judgment of its own reasoning quality and is closely tied to answer correctness.

PR example, deeper color indicates higher output probability
Core features of RLPR
Domain-independent, efficient reward generation
Existing RLVR methods typically demand extensive manual effort and domain-specific verification rules. In contrast, RLPR needs only a single forward pass to produce a reward score, based on the mean generation probability of the reference answer, which naturally handles the flexibility of free-form natural language answers.
As shown in the example in the figure, rule-based methods cannot recognize that y2 and y3 are semantically equivalent to the reference answer, whereas RLPR's PR mechanism correctly assigns them higher scores.
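Conceptually, the PR computation is simple. Below is a minimal sketch, assuming a Hugging Face causal LM; the model name, the `probability_reward` function, and the prompt layout are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a probability-based reward (PR): mean token probability of the
# reference answer, conditioned on the question and the model's own reasoning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumption: any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def probability_reward(question: str, reasoning: str, reference: str) -> float:
    """Mean probability of the reference-answer tokens, one forward pass."""
    prefix_ids = tokenizer(question + reasoning, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, add_special_tokens=False,
                        return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, ref_ids], dim=1)

    logits = model(input_ids).logits  # [1, seq_len, vocab]
    # Positions whose next-token prediction corresponds to the reference tokens.
    probs = torch.softmax(logits[0, prefix_ids.shape[1] - 1 : -1], dim=-1)
    token_probs = probs.gather(1, ref_ids[0].unsqueeze(-1)).squeeze(-1)
    return token_probs.mean().item()
```

Because the score depends on the probability of the reference answer given the model's reasoning, rather than on string-matching the model's own output, semantically equivalent answers such as y2 and y3 naturally receive high rewards where a rule-based verifier would fail.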

Reward bias correction and dynamic filtering
While the basic PR correlates well with answer quality, it can be skewed by factors unrelated to reasoning, such as question style. RLPR therefore corrects this bias by also computing a control reward from the same prompt without the reasoning steps and taking the difference, removing influences that have nothing to do with reasoning quality.
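A hedged sketch of this difference-based correction is shown below, reusing the `probability_reward` sketch above; the empty-reasoning control and the clipping at zero are assumptions about how such a correction could look, not the paper's exact formula.

```python
# Debiasing sketch: subtract a control score computed without the reasoning,
# so that reward differences reflect the reasoning itself rather than question style.
def debiased_reward(question: str, reasoning: str, reference: str) -> float:
    pr_with_reasoning = probability_reward(question, reasoning, reference)
    pr_without_reasoning = probability_reward(question, "", reference)  # control
    # Only credit the probability gained by the reasoning; keep the reward non-negative.
    return max(0.0, pr_with_reasoning - pr_without_reasoning)
```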

Dynamic filtering based on reward standard deviation
RLPR further employs a dynamic filtering mechanism based on reward standard deviation: prompts whose sampled rewards show low standard deviation, and therefore carry little learning signal, are filtered out, and the filtering threshold is updated online via an exponential moving average (EMA), improving training stability and effectiveness.
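The filtering step could look roughly like the sketch below, which keeps a prompt only if the standard deviation of its rollout rewards exceeds a running threshold; the `StdFilter` class, the EMA factor, and the keep rule are illustrative assumptions.

```python
# Sketch of std-based dynamic filtering with an EMA-updated threshold.
import numpy as np

class StdFilter:
    def __init__(self, init_threshold: float = 0.05, ema: float = 0.9):
        self.threshold = init_threshold
        self.ema = ema

    def keep(self, rewards: list[float]) -> bool:
        std = float(np.std(rewards))
        keep = std > self.threshold
        # Update the threshold as an exponential moving average of observed stds.
        self.threshold = self.ema * self.threshold + (1 - self.ema) * std
        return keep

# Usage: prompts whose rollout rewards barely vary are dropped from the batch.
filt = StdFilter()
batch = [[0.81, 0.79, 0.80, 0.82], [0.2, 0.9, 0.4, 0.7]]
kept = [rewards for rewards in batch if filt.keep(rewards)]
```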
Reliable reward quality and robustness
Using the ROC-AUC metric, the team evaluated the quality of rewards from different sources. The results show that PR already surpasses rule-based and verifier-based rewards at the 0.5B model scale, and its quality further improves to 0.91 ROC-AUC as model capability grows.

PR reward quality exceeds rule-based and verifier-based rewards
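This kind of reward-quality check can be illustrated in a few lines: treat the reward as a score for predicting whether a sampled answer is actually correct, and compute ROC-AUC against ground-truth correctness labels. The toy data below is purely illustrative.

```python
# Measure reward quality as ROC-AUC against binary correctness labels.
from sklearn.metrics import roc_auc_score

is_correct = [1, 0, 1, 1, 0, 0, 1, 0]                          # ground-truth labels
reward = [0.92, 0.31, 0.85, 0.77, 0.40, 0.55, 0.88, 0.22]      # PR scores

print(f"reward ROC-AUC: {roc_auc_score(is_correct, reward):.2f}")
```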
To verify robustness, the team trained Qwen2.5-3B with a variety of prompt templates using RLPR and observed consistent performance improvements across the different training setups.

RLPR shows stable improvements across different training templates
Further experiments on models like Gemma, Llama, and Qwen confirmed RLPR’s ability to consistently enhance reasoning capabilities, outperforming rule-based RLVR baselines.

RLPR consistently improves reasoning across different base models
Summary
RLPR introduces an innovative Prob-to-Reward mechanism that removes RLVR's dependence on domain-specific verifiers. Extensive validation on models such as Gemma, Llama, and Qwen demonstrates its effectiveness and its potential for scaling reinforcement learning to general domains.
More details can be found in the original paper.