Shanghai Jiao Tong University / Shanghai AI Lab Guangtao Zhai: When Evaluation Becomes Obsolete, AGI Is Achieved


Shanghai Jiao Tong University professor and Shanghai AI Lab researcher Zhai Guangtao believes that a perfect AI evaluation system is essentially equivalent to artificial general intelligence (AGI).

In the world of AI, numbers seem to be everything. Nearly every week, a new AI model breaks records on benchmarks such as MMLU, SuperCLUE, and HLE. These once-obscure acronyms, formerly confined to academia, now serve as a universal currency for measuring the "IQ" of large models, shaping market reputation and company valuations.

However, behind these victories, a deep confusion is spreading.

“We see models becoming more capable and scoring higher, but we don’t really know what that means for individuals,” Zhai Guangtao from Shanghai Jiao Tong University told Machine Heart. “Apart from simple tasks like text summarization, many real-world problems remain unsolved. Coding ability, after all, serves a limited audience.”

This scholar, with over twenty years of experience in multimedia signal processing, is now turning his attention to the increasingly important field of “AI evaluation.” As the marginal returns of pretraining diminish, evaluation has become a key lever for pushing technological progress and connecting model potential to real-world value.

On the surface, current evaluation systems look rigorous and precise, but Zhai Guangtao believes the race for scores not only fails to measure true intelligence accurately, it also lulls us into collective misjudgment and over-optimism: rising scores mask the flaws of the measuring tools themselves. This has led him to conclude that the dilemma of evaluation is, at its root, a misunderstanding of intelligence. His ultimate insight:

“A perfect evaluation system itself is equivalent to AGI. Therefore, when we no longer need evaluation, AGI may have truly arrived.”

The Illusion of Progress

The primary issue with evaluation systems is “data pollution.” Since most benchmarks are public, their questions and answers are often incorporated into training data for subsequent models.

“Basically, the big companies fold data similar to the evaluation sets into their training data,” Zhai Guangtao said. “This is practically an industry norm; otherwise the models wouldn’t score well.”

This approach is problematic, akin to students getting the exam answers in advance, and it undermines the diagnostic value of evaluation. A 2025 paper titled The Leaderboard Illusion revealed deeper issues: it found systemic flaws in the highly regarded Chatbot Arena platform, with some major model vendors running secret internal tests, selectively publishing only their best results, and exploiting platform rules to harvest large volumes of user preference data. The result is models that overfit platform-specific biases rather than develop genuine general capabilities.
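Contamination of this kind can be probed directly with a verbatim-overlap check. Below is a minimal sketch, assuming plain-text access to the benchmark items and a sample of the training corpus; the `contaminated` helper, the toy data, and the n-gram length are illustrative assumptions, not any benchmark's official decontamination protocol.

```python
# Minimal n-gram overlap check for benchmark contamination (illustrative sketch).
# Assumes benchmark questions and a sample of training text are available as plain strings.

import re

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercase, tokenize on word characters, and return the set of n-grams."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(benchmark_items: list[str], training_docs: list[str], n: int = 13) -> list[int]:
    """Return indices of benchmark items whose n-grams appear verbatim in any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items) if ngrams(item, n) & train_grams]

if __name__ == "__main__":
    bench = ["What is the capital of France? Answer with the city name only."]
    corpus = ["...what is the capital of france? answer with the city name only..."]
    print(contaminated(bench, corpus, n=8))  # -> [0]: this benchmark item leaks into the corpus
```

Labs that report contamination typically use more elaborate variants (fuzzy matching, embedding similarity), but even a crude check like this catches the kind of verbatim leakage described above.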

Second is the limitation of “atomized” ability evaluation.

To quantify capabilities, current systems tend to break complex “intelligence” down into atomic skills: mathematical reasoning, code generation, language understanding, common-sense Q&A, and so on. While convenient for scoring and comparison, this decomposition struggles to delineate a model’s real capability boundary.

“A large model’s ability is like a sponge with many tiny holes,” Zhai Guangtao explained. “The problems detected during evaluation are often ‘holes,’ not the true boundary of the model’s ability. It’s hard to tell whether issues reflect actual limitations or are just due to lack of focus, which could be fixed with simple prompt tuning.”

By decomposing “intelligence” into hundreds of relatively independent atomic skills, the industry may be training models to excel at tests rather than to become integrated, adaptable systems. In real applications, users value not just single-task performance but the ability to make comprehensive judgments and adapt within complex, multi-step workflows.

This disconnect is especially evident in embodied intelligence. For example, a robotic arm driven by a top vision-language-action (VLA) model can “pick up a can on the table” with near-100% success in simulation, yet its zero-shot success rate in the real world may fall below 5%. “This highlights the need for more refined, realistic evaluation systems,” Zhai Guangtao emphasized.

Evaluation also remains one of the few areas where resource-constrained academic researchers feel they can still make a contribution. Yet to meet publication standards, they keep proposing new metrics and new dimensions, most of which go unmaintained once the paper is out, leaving the field severely fragmented.

“A lot of evaluation work is done out of necessity,” Zhai Guangtao admitted. “It becomes a forced choice for publishing papers and claiming innovation.” This not only wastes resources but may also steer AI evaluation toward directions that are technically sophisticated yet of questionable practical value.

This raises a fundamental question: who is qualified to be the judge?

Corporate self-evaluation has clear conflicts of interest; academia, though more neutral, often lacks the resources for long-term maintenance of large-scale testing platforms.

This is where Shanghai AI Lab can offer a distinctive perspective. As a new type of research institution, it can commit long-term resources to building a fair and impartial evaluation platform. “Sustained support for a fair leaderboard that reflects true capabilities is necessary and worthwhile,” Zhai Guangtao said.

Back to a Human-Centered Framework

Faced with current evaluation issues, Zhai Guangtao advocates returning to a “Human-Centered Evaluation” perspective.

“Large models ultimately serve humans,” he stressed. “The core of evaluation should not be a dialogue between models and benchmarks but how much they improve human task efficiency and experience.”

“Why has video quality improved so dramatically over the past twenty years, from blurry online clips to ubiquitous 4K and 8K ultra-HD? Because visual signal processing has always been anchored to the user’s subjective experience (Quality of Experience, QoE). If it looks good, it is good; if it looks bad, it is bad.”

Of course, subjective evaluation faces challenges in reproducibility and standardization. Zhai Guangtao’s team uses statistical methods to address this: “We don’t just look at one person’s experience but gather feedback from a representative group. For example, if the model targets researchers, we test with scientists; for K-12 education, we gather student feedback. Averaging or analyzing multiple responses reduces randomness and makes evaluation fairer.”
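The procedure Zhai describes, averaging ratings from a representative panel, is essentially the Mean Opinion Score (MOS) methodology long used in QoE research. Below is a minimal sketch, assuming each panelist rates an output on a 1–5 scale; the panel data and the normal-approximation confidence interval are illustrative assumptions, not the team’s actual protocol.

```python
# Mean Opinion Score (MOS) with an approximate 95% confidence interval (illustrative sketch).
# Assumes each rater in a representative panel scores an output on a 1-5 scale.

from statistics import mean, stdev
from math import sqrt

def mos_with_ci(ratings: list[float], z: float = 1.96) -> tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("Need at least two ratings to estimate variability.")
    m = mean(ratings)
    half_width = z * stdev(ratings) / sqrt(len(ratings))
    return m, half_width

if __name__ == "__main__":
    # Hypothetical panel: e.g. researchers rating a model's literature summary.
    panel = [4, 5, 3, 4, 4, 5, 4, 3, 4, 5]
    score, ci = mos_with_ci(panel)
    print(f"MOS = {score:.2f} +/- {ci:.2f}")  # averaging over the panel reduces individual randomness
```

For small panels, a Student-t multiplier would be more appropriate than the fixed z value used here.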

He leads the OpenCompass evaluation framework at Shanghai AI Lab, built on three pillars:

  • Information Quality: Is the output accurate, reliable, and free from hallucinations? Does it provide credible sources and in-depth information?
  • Problem-Solving Ability: In multi-step tasks, are the model’s decisions effective and well-planned? Does it handle edge cases well, improving user task completion?
  • Interaction Experience: Is collaboration with the model smooth, efficient, and intuitive? This is often overlooked in traditional offline evaluations.
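One plausible way to turn these three pillars into a single human-centered score is a weighted roll-up over panel ratings. The sketch below is an illustration only: the 1–5 scale, the weights, and the `HumanRating` structure are assumptions made for demonstration, not OpenCompass’s actual scoring scheme.

```python
# Aggregating human ratings along the three pillars into one score (illustrative sketch).
# The weights and the 1-5 scale are assumptions, not the OpenCompass scoring rules.

from dataclasses import dataclass
from statistics import mean

PILLARS = ("information_quality", "problem_solving", "interaction_experience")
WEIGHTS = {"information_quality": 0.4, "problem_solving": 0.4, "interaction_experience": 0.2}

@dataclass
class HumanRating:
    information_quality: float     # accuracy, reliability, absence of hallucination (1-5)
    problem_solving: float         # planning and effectiveness on multi-step tasks (1-5)
    interaction_experience: float  # smoothness and intuitiveness of collaboration (1-5)

def human_centered_score(ratings: list[HumanRating]) -> float:
    """Average each pillar over the panel, then combine pillars with fixed weights."""
    pillar_means = {p: mean(getattr(r, p) for r in ratings) for p in PILLARS}
    return sum(WEIGHTS[p] * pillar_means[p] for p in PILLARS)

if __name__ == "__main__":
    panel = [HumanRating(4, 3, 5), HumanRating(5, 4, 4), HumanRating(4, 4, 4)]
    print(f"Human-centered score: {human_centered_score(panel):.2f}")
```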

To implement this philosophy, the Sainan team designed the “EDGE” approach: Evolving, Dynamic, Granular, Ecosystem. It’s a dynamic, evolving evaluation system that adapts to rapid AI development, with continuous difficulty progression, community sharing, and fine-grained diagnostics.

They also built a high-quality internal question bank of over 100,000 questions, planning to open-source 500 questions quarterly to ensure reproducibility and prevent data pollution.

Furthermore, they propose an even more ambitious goal: integrating training and evaluation.

“Evaluation divorced from training is meaningless,” Zhai Guangtao emphasized. “The ultimate goal is to feed evaluation results back into training.”

The key is an “AI Judger” trained on large-scale human feedback, which guides the base model’s optimization in subsequent training rounds.
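One common pattern for feeding a judge’s evaluations back into training is best-of-n (rejection) sampling: the judge scores several candidate responses per prompt, and only the preferred ones are kept as fine-tuning data. The sketch below assumes generic `generate` and `judge_score` callables standing in for the base model and a judge trained on human feedback; it illustrates the general pattern, not the team’s actual pipeline.

```python
# Best-of-n selection with an AI judger (illustrative sketch of evaluation feeding training).
# `generate` and `judge_score` are placeholder callables standing in for the base model
# and a judge model trained on human feedback; neither refers to a real API.

import random
from typing import Callable

def build_finetune_set(
    prompts: list[str],
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    n_candidates: int = 4,
) -> list[tuple[str, str]]:
    """For each prompt, sample n candidates and keep the one the judger scores highest."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: judge_score(prompt, c))
        dataset.append((prompt, best))  # (prompt, preferred response) pairs for fine-tuning
    return dataset

if __name__ == "__main__":
    # Toy stand-ins: a "model" that appends a random number, a "judger" that prefers longer text.
    toy_generate = lambda p: f"{p} -> answer {random.randint(0, 9)} " * random.randint(1, 3)
    toy_judge = lambda p, c: float(len(c))
    for pair in build_finetune_set(["Summarize the report"], toy_generate, toy_judge):
        print(pair)
```

The same judge scores can alternatively serve as reward signals in a reinforcement-learning-style fine-tuning loop.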

Based on the “45° Law” of safety and performance growth proposed by Zhou Bowen at Shanghai AI Lab, the development of large models should follow a balanced trajectory, with capabilities and safety (robustness, fairness, value alignment) improving in tandem along a 45° line. This provides a clear, healthy path for development, avoiding shortsightedness that neglects risks.
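Read literally, the 45° line means capability and safety scores should rise by roughly equal amounts between releases. The sketch below is a loose operationalization of that reading, assuming both scores sit on a comparable normalized scale; the release history is invented, and this is not Zhou Bowen’s formal definition.

```python
# Checking whether capability and safety grow "along a 45-degree line" (illustrative sketch).
# Assumes both scores are on a comparable normalized scale (here 0-100); the release
# history is invented, and this is a loose reading of the 45-degree law, not its formal form.

def trajectory_slope(releases: list[tuple[float, float]]) -> float:
    """Return total safety gain divided by total capability gain across releases."""
    (cap0, safe0), (cap1, safe1) = releases[0], releases[-1]
    return (safe1 - safe0) / (cap1 - cap0)

if __name__ == "__main__":
    # (capability, safety) per release, both normalized to 0-100.
    history = [(50, 48), (62, 55), (75, 60)]
    slope = trajectory_slope(history)
    print(f"Trajectory slope: {slope:.2f}")
    if slope < 1.0:
        print("Capability is outpacing safety: below the 45-degree line.")
    else:
        print("Safety is keeping pace with (or ahead of) capability.")
```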

The End of Evaluation Is the Beginning of AGI

Returning to the core question: how to measure “intelligence”?

Unlike physical quantities, intelligence has no standard unit or absolute measurement. Every evaluation is a limited observation from a specific angle.

The deeper paradox is the relationship between evaluator and evaluatee. “Evaluation is fundamentally a top-down measure,” Zhai Guangtao pointed out. “Humans can evaluate current AI because, in terms of general intelligence, humans still hold the upper hand. But when an agent surpasses humans in all dimensions, what standard do we use to judge it?”

“A perfect, universal evaluation system itself would be equivalent to AGI.”

“So, when evaluation becomes irrelevant, true AGI may have arrived,” Zhai Guangtao said. “As long as we need to keep measuring with a ruler, we are still far from that stage.”

That evaluation is “no longer needed” does not mean it disappears; rather, humans as external judges will fade away. In the future, perfect self-evaluation will be an intrinsic, self-reflective function embedded within AGI itself.

This points to the next “Holy Grail” of AI development: building the ultimate evaluation system is itself the most crucial path to AGI.

Based on this vision, the Sainan team is designing a grand blueprint to incorporate embodied intelligence, AI safety, hardware-software systems, and vertical applications into a unified, authoritative evaluation framework.

“The goal of Sainan is to become the future standard for measuring intelligence, providing an authoritative reference for healthy AI development, guiding global innovation away from chaotic ‘score-chasing’ toward value-creating, human-centered progress,” Zhai Guangtao said.

In this sense, the work of the Sainan team is not just designing measurement tools but exploring the essence of intelligence itself. Every improvement in evaluation methods pushes the boundary of intelligence; every new dimension adds a necessary perspective to the unknown facets of intelligence.

This may be the deepest paradox and most beautiful vision in AI evaluation: its ultimate success lies in its own disappearance.
