Don't Rush to Crown OpenAI! Terence Tao: The Value of a 'Gold Medal' Depends on the 'Competition Format'
OpenAI claims its reasoning model achieved gold medal performance at the IMO, but Terence Tao warns that the true value of such a result depends on the evaluation format and standards used, urging caution in interpretation.

Yesterday, OpenAI announced a major milestone: their reasoning model achieved a gold medal-level performance at the International Mathematical Olympiad (IMO).

OpenAI researcher Alexander Wei stated that during evaluation the team strictly followed human competition rules: the model had to read the official problems and write natural-language proofs, with no tools or internet access, within two 4.5-hour exam sessions.

In the assessment, the model solved five of the six problems from IMO 2025, scoring 35 out of 42 points, enough for a gold medal. Each problem was independently graded by three former IMO medalists, and the final scores were settled by consensus.
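For context, a standard IMO scoring convention (our addition, not part of OpenAI's announcement) explains where both numbers come from: each of the six problems is graded on a 0-7 scale, so solving five problems perfectly yields

$$5 \times 7 = 35 \quad\text{points out of}\quad 6 \times 7 = 42.$$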

After the announcement, the AI community buzzed with excitement. Wei also shared the proofs generated by OpenAI's new model.

Proof link: https://github.com/aw31/openai-imo-2025-proofs/
Interestingly, shortly before Wei posted this, AI models had fared poorly on the 2025 IMO problems: Gemini 2.5 Pro scored only 13 points, and OpenAI's o3 (high) just 7. The official announcement of the new model's performance marked a surprising turnaround.

However, Terence Tao cautioned against overinterpreting these results. He argued that without strict, standardized testing conditions, comparing AI models to humans, or to each other, is meaningless, and said he would not comment on AI competition results whose methodology was not openly disclosed before the contest.

Terence Tao: Be Cautious About AI IMO Results
In his blog post, Tao explained that AI capability is often treated as a single yes-or-no metric: task X is either within the ability of current tools or it is not. In reality, AI performance can vary by several orders of magnitude depending on the resources provided, the assistance allowed, and the reporting standards applied.
He used the recent IMO as an example:
- Standard format: each country sends six high school students; over two days, with 4.5 hours per day, each student independently solves three problems per day, without communication, under strict exam conditions.
- Modified scenarios include: allowing more days, rewriting problems in easier equivalent forms, permitting unlimited calculator or internet use, letting students collaborate as a team, providing hints from team leaders, submitting only the best answers (selective submission), or submitting nothing at all if no solutions are found.
In each case the answers are still produced by the students, not the leaders, yet these changes can dramatically affect the success rate, as the sketch below illustrates. A student who could not win a bronze medal under standard conditions might win gold under relaxed ones.
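To put a number on just one of these relaxations, here is a minimal sketch (our own toy model, not from Tao's post) of selective submission: if each independent attempt succeeds with probability p, then making k attempts and submitting only a successful one succeeds with probability 1 - (1 - p)^k.

```python
# Toy model of one relaxed condition from Tao's list: selective submission.
# Assumption (ours, not Tao's): each attempt succeeds independently with probability p.

def best_of_k_success(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k

if __name__ == "__main__":
    p = 0.05  # hypothetical per-attempt success rate on a hard problem
    for k in (1, 10, 100, 1000):
        print(f"k={k:>4}: P(at least one success) = {best_of_k_success(p, k):.3f}")
```

With a per-attempt success rate of just 5%, 100 attempts push the apparent success rate past 99%, which is exactly why the sampling and submission protocol matters when comparing scores.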
Therefore, without a controlled, standardized testing methodology, comparing AI models' IMO performance with one another, or equating it with human results, is unreliable.
As for AI models' self-reported results, Tao emphasized that he would not comment on unverified claims.
Netizens: A Proof That Fills Pages Is Worth Noting
Tao's caution sparked discussion among netizens. Some pointed out that the AI models he had in mind may not have been OpenAI's, so some of his concerns may not apply directly.

For example, OpenAI claims its model made no tool calls.

Some argue that because these models memorize vast amounts of internet data during training, comparing them with humans who have no internet access is unfair.

Others believe the real takeaway is that AI can now sustain reasoning for over an hour in a domain where correctness is hard to verify.

OpenAI's approach remains mysterious: the company says GPT-5 is coming soon, but clarifies that the model that achieved IMO gold is not GPT-5, and that the new model may not be released for months.

Who Is Alexander Wei, the Leader Behind the Mysterious Model?
The reasoning model that won IMO gold was developed by a team led by Alexander Wei. Noam Brown said that Wei took a bold, unconventional approach and achieved results few believed possible.

Wei focuses on improving the reasoning abilities of large language models, especially mathematical reasoning and natural-language proof generation. He himself won a gold medal at the International Olympiad in Informatics (IOI).

He earned his PhD in computer science from UC Berkeley in 2023, advised by Nika Haghtalab, Michael I. Jordan, and Jacob Steinhardt. Before that, he completed his bachelor's and master's degrees at Harvard University in 2020. His research has received honors including a SODA Best Student Paper award and the Rothkopf Prize for work on auction and market design.

Before joining OpenAI, Wei worked at Meta AI (FAIR), Microsoft Research, and D. E. Shaw, gaining extensive research and industry experience.

At Meta AI, he contributed to the development of CICERO, an AI system that achieved top human-level performance in the strategy game Diplomacy, published in Science in 2022.
Wei's team's work has injected confidence into OpenAI at a time when the company faces a talent crunch. Noam Brown also hinted that OpenAI remains a frontier lab, with some of its technology months ahead of other labs; working there means first-hand access to cutting-edge developments. Do you agree?