Princeton Team Releases the Most Powerful Open-Source Mathematical Theorem Prover: 32B Model Surpasses Previous SOTA DeepSeek 671B

A Princeton-led team has released a new open-source theorem-proving model that, at only 32B parameters, outperforms the previous state of the art, DeepSeek-Prover-V2-671B, marking a significant breakthrough in AI mathematical reasoning.

Recently, Princeton University, in collaboration with Tsinghua University, Peking University, Shanghai Jiao Tong University, Stanford University, NVIDIA, Amazon, Meta FAIR, and other top institutions, released a new open-source mathematical theorem-proving model, Goedel-Prover-V2.

This flagship 32B model significantly outperforms the previous state-of-the-art open-source model, DeepSeek-Prover-V2-671B, across multiple automated theorem-proving benchmarks. Meanwhile, the smaller 8B model performs on par with the 671B model on several benchmarks, a substantial gain in parameter efficiency.

Key Achievements

New record on MiniF2F: The 32B flagship model improved Pass@32 accuracy by 8.0 percentage points over the previous SOTA, DeepSeek-Prover-V2-671B.

Small but powerful: The 8B model matches the performance of the 671B SOTA model.

Top on PutnamBench: The model ranks first on the challenging Putnam Mathematical Competition benchmark.

Figure 1: Performance comparison across MiniF2F, PutnamBench, and MathOlympiadBench (including 360 Olympiad-level problems). The x-axis shows different models, and the y-axis shows success rates or number of problems solved.

The figure reports the performance of Goedel-Prover-V2 on the three benchmarks, all measured at Pass@32 (a short sketch of the Pass@k metric follows this list):

  • On all three datasets, the 32B flagship model significantly outperforms previous models such as DeepSeek-Prover-V2-671B and Kimina-Prover-72B, in both standard and self-correction modes.
  • On MiniF2F, the 8B model performs comparably to the much larger DeepSeek-Prover-V2-671B.
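
For context, Pass@32 gives the model 32 sampled attempts per problem and counts the problem as solved if any attempt compiles in Lean. As a general illustration (not the team's evaluation code), the standard unbiased Pass@k estimator from Chen et al. (2021) can be computed like this:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021).

    n: total samples drawn per problem
    c: samples verified correct (e.g., accepted by the Lean compiler)
    k: attempt budget being estimated
    """
    if n - c < k:
        return 1.0  # every size-k subset of samples contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 attempts per problem, 5 verified correct, estimate Pass@32.
print(pass_at_k(n=64, c=5, k=32))  # ≈ 0.974
```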

PutnamBench Rankings

The latest leaderboard shows Goedel-Prover-V2-32B ranked first while using a smaller sampling budget (fewer passes) than prior entries.

Table 1: PutnamBench leaderboard.

Reasoning Scalability During Inference

The scaling curve shows that Goedel-Prover-V2-32B outperforms previous models at every sampling budget tested.

Figure 2: Performance on MiniF2F under varying sampling budgets. The x-axis shows the sampling budget (number of passes), and the y-axis shows the problem-solving success rate.

Technical Methods

The performance of Goedel-Prover-V2 relies on four core techniques:

  • Expert iteration & reinforcement learning (RL): the standard loop of formalizing problems, generating and verifying proofs, training the next-generation model on the newly found proofs, and optimizing with RL (the loop is sketched after this list).
  • Scaffolded data synthesis: automatically generating proof tasks of increasing difficulty to bridge the gap between simple and complex problems, providing dense, informative training signals.
  • Verifier-guided self-correction: the model iteratively revises its proofs using feedback from the Lean compiler, mimicking human self-correction (covered in the same sketch below).
  • Model averaging: combining weights from different training checkpoints to improve robustness and performance, especially at higher sampling counts (a separate weight-averaging sketch follows).
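
As a rough illustration of how the first and third techniques fit together, here is a minimal sketch of one expert-iteration round with verifier-guided self-correction inside it. All helper functions (generate_proof, lean_verify, finetune) are hypothetical stand-ins; the team's actual pipeline, prompts, and RL stage have not been published yet.

```python
from typing import Optional, Tuple

def generate_proof(statement: str, previous: str = "", feedback: str = "") -> str:
    """Hypothetical stand-in for a call to the prover model."""
    raise NotImplementedError

def lean_verify(statement: str, proof: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for a Lean compiler check.
    Returns (success, error messages)."""
    raise NotImplementedError

def prove_with_self_correction(statement: str, max_rounds: int = 2) -> Optional[str]:
    """Verifier-guided self-correction: revise a proof using compiler feedback."""
    proof = generate_proof(statement)
    for _ in range(max_rounds):
        ok, errors = lean_verify(statement, proof)
        if ok:
            return proof
        # Feed the Lean errors back so the model can revise its attempt.
        proof = generate_proof(statement, previous=proof, feedback=errors)
    ok, _ = lean_verify(statement, proof)
    return proof if ok else None

def expert_iteration_round(statements, train_set, finetune):
    """One expert-iteration round: harvest verified proofs, train the next model."""
    for stmt in statements:
        proof = prove_with_self_correction(stmt)
        if proof is not None:
            train_set.append((stmt, proof))   # only compiler-verified proofs kept
    return finetune(train_set)                # next-generation prover
```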
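Model averaging itself is simple to illustrate. Below is a minimal sketch assuming PyTorch-style checkpoints and a uniform average; the team's exact recipe (which checkpoints, what weighting) is an assumption here, not a published detail.

```python
import torch

def average_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average model weights across several training checkpoints."""
    avg: dict[str, torch.Tensor] = {}
    for path in paths:
        state = torch.load(path, map_location="cpu")
        for name, tensor in state.items():
            acc = avg.get(name)
            avg[name] = tensor.float() if acc is None else acc + tensor.float()
    return {name: acc / len(paths) for name, acc in avg.items()}

# Usage (checkpoint file names are hypothetical):
# merged = average_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# torch.save(merged, "averaged_model.pt")
```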

Model and Dataset Downloads

To promote research, the team has released the models and the new MathOlympiadBench benchmark.

Model Downloads

Dataset Download

MathOlympiadBench is a dataset of 360 formalized problems from Olympiad-level mathematics competitions, including the IMO, the IMO Shortlist, and regional contests.

The team states that this release aims to support open-source research, including projects preparing for competitions such as the IMO. A detailed paper will be published in the coming weeks.

Key Researchers:

Yong Lin is a postdoctoral researcher at Princeton, working with Prof. Chi Jin and Prof. Sanjeev Arora on formal mathematical reasoning and the post-training of large models. His work received a NAACL Outstanding Paper Award, and he was selected as an Apple AI Scholar in 2023. Personal homepage

Shange Tang is a PhD student at Princeton working on formal mathematical reasoning and out-of-distribution generalization, advised by Prof. Chi Jin and Prof. Jianqing Fan. Personal homepage

Project Leaders:

Prof. Chi Jin, of Princeton's Department of Electrical and Computer Engineering, focuses on decision-making in machine learning, with the goal of developing agents capable of complex reasoning. His team has recently expanded into large language models to enhance their reasoning capabilities. He has received numerous honors, including a Sloan Research Fellowship and an NSF CAREER Award. Personal homepage
