5 Major Models Take the 'Shandong Exam'! Doubao and Gemini Lead in Liberal Arts and Science

Five large AI models took the 2025 Shandong college entrance exam; Doubao topped the liberal arts track and Gemini the science track, showcasing how rapidly AI has progressed on academic assessments.

This year's large models are already aiming for top scores—are they ready for next year's college entrance exam?

Indeed, AI is now close to conquering the college entrance exam.

Recently, five large models participated in the 2025 Shandong college entrance exam, evaluated using the traditional liberal arts and science categories: Doubao Seed 1.6-Thinking scored 683 points to top the liberal arts, while Gemini 2.5 Pro scored 655 points to lead in science.

The evaluation was conducted by the ByteDance Seed team. They assembled five mainstream reasoning models: Seed 1.6-Thinking, DeepSeek-R1-0528, and the foreign models Gemini-2.5-Pro-0605, Claude-Sonnet-4, and OpenAI-o3-high-0416. All were blind-tested comprehensively on the 2025 Shandong exam questions (covering the nationally set core subjects and the elective subjects), with scores computed on a 750-point scale.

To ensure fairness, the team ran every test through each model's API and graded against the official standards. Multiple-choice and fill-in-the-blank questions were machine-graded with manual spot checks, while essay questions were anonymized and scored by two experienced high school teachers, with multiple rounds of quality control.
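
To make the machine-grading step concrete, here is a minimal sketch of a scorer for gaokao-style multi-select questions, where selecting all correct options earns full credit, a correct subset earns partial credit, and any wrong option voids the question. The point values and rubric details are illustrative assumptions; the report's exact grading standards may differ.

```python
# Minimal sketch of machine-grading a multi-select question under the
# common gaokao convention. Point values (6 / 3 / 0) are assumptions
# for illustration, not the evaluation's confirmed rubric.
def grade_multi_select(chosen: set[str], key: set[str],
                       full: float = 6.0, partial: float = 3.0) -> float:
    if chosen - key:          # any wrong option voids the question
        return 0.0
    if chosen == key:         # all correct options selected
        return full
    return partial if chosen else 0.0  # correct but incomplete subset

# Quick self-checks of the three cases
assert grade_multi_select({"A", "C"}, {"A", "C"}) == 6.0
assert grade_multi_select({"A"}, {"A", "C"}) == 3.0
assert grade_multi_select({"A", "B"}, {"A", "C"}) == 0.0
```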

The entire test was conducted without prompt engineering; every input was the original exam question, unmodified. DeepSeek R1 received only the question text, while the other models received both text and images. Final rankings were based on the combined score of the three core subjects plus the three comprehensive-section subjects.
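
To illustrate the protocol, here is a minimal sketch of what such a no-prompt-engineering API call could look like. It assumes an OpenAI-compatible chat completions endpoint; the model name, file path, and helper function are illustrative placeholders, not the Seed team's actual test harness.

```python
# Sketch of the no-prompt-engineering protocol: the question goes in
# verbatim, with an optional scanned image for multimodal models.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode a scanned exam page for multimodal input."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_model(question_text: str, image_path: str | None = None,
              model: str = "gpt-4o") -> str:  # model name is a placeholder
    """Send the original exam question as-is: no system prompt,
    no few-shot examples, no instructions beyond the question itself."""
    content = [{"type": "text", "text": question_text}]
    if image_path:  # text-only models (e.g. DeepSeek R1) skip this branch
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
        })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content
```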

All five models scored over 620 on the liberal arts track. Notably, Doubao's 683 would be competitive for Tsinghua or Peking University under Shandong's admission standards. Science scores varied more: Gemini and Doubao reached the threshold for key universities, while Claude 4 and o3 scored below 600.

Last year, these models barely cleared the baseline in every subject, often producing answers with shallow reasoning and weak logical chains that read like guesswork. Within a year, however, technical advances have dramatically improved their reasoning and problem-solving abilities.

Little variation in core-subject scores; science still lags liberal arts

In the core subjects of Chinese, math, and English, the models performed at the level of top students, with little variation among them. The exception was o3, whose off-topic Chinese essay dragged down its overall score.

In the elective subjects, the models made progress in science, but their science scores still lag well behind their liberal arts scores.

Next, we walk through the detailed results following the report's scoring breakdown.

Chinese: Whoever writes the best essay wins

In this evaluation, Doubao scored 128 in Chinese, ranking first in the subject. Gemini was close behind with 126, while DeepSeek and Claude 4 scored 118 and 117 respectively. o3 managed only 95 because of an off-topic essay.

Overall, the models excelled at multiple-choice and reading comprehension, demonstrating strong language understanding, information extraction, and basic reasoning. Many of the analysis questions follow familiar templates, so the models could produce formulaic answers such as "expresses homesickness."

They also performed well on recitation of classical passages, with all five models earning full marks. Pretraining on vast corpora of classical poetry, textbooks, and past exam questions lets them recall the required lines instantly.

Essay performance varied, however: Gemini scored 52 and Doubao 48, while o3 managed only 20. The main issue was that the model essays lacked depth, critical thinking, and emotional warmth: they stated clear opinions in tidy structures but offered little real insight.

[Image: o3's exam essay]

Mathematics: From failing to scoring over 140 in a year

Deep reasoning capabilities drove a dramatic leap in math scores. In contrast with last year's failing marks, three models now score above 140: DeepSeek R1 with 145, Doubao with 141, and Gemini with 140.

The scores differ somewhat from earlier public evaluations, mainly in the written solution steps, reflecting the inherent randomness of model outputs.

DeepSeek performed well apart from losing points on question 6, which every model missed. Doubao and Gemini made errors on the final, hardest problem. Claude 4 and o3 lost points on the penultimate question, and Claude 4 also dropped options on multi-select items, leaving it last in the subject.

The troublesome question 6 mixed boxes, dashed lines, arrows, and Chinese characters in a single figure, which the models struggled to interpret accurately, highlighting ongoing challenges in image understanding.

[Image: Question 6 from the exam paper]

On the most difficult problem, none of the models produced a flawless solution; they often skipped proof steps or reasoned less than rigorously, indicating room for improvement in complex problem-solving.

English: All scores above 140, with little variation

In English, all five models scored above 140. Gemini was the only one to miss a multiple-choice question; the rest of the lost points came in the writing sections, even though the models produced well-structured, fluent responses.

[Image: translated and reformatted version of the official evaluation table]

Interestingly, Gemini identified the correct answer during its reasoning but then introduced unsupported assumptions, leading it to the wrong choice. In essay writing, Doubao, Gemini, and Claude 4 each scored 12 points for complete, clear, and insightful responses. o3 and DeepSeek scored 11, with minor issues such as pronoun errors or repetitive sentence structures.

Strong in politics, history, and geography; weaker on science image questions

In the liberal arts comprehensive section, Doubao scored 270 points, leading by a wide margin, with standout results in geography (94) and history (92). Its facility with structured material and logical reasoning drove the high scores.

o3 posted balanced results with no major weaknesses, though its politics score suffered from insufficiently deep analysis. Gemini and Claude 4 landed on similar overall scores, with no standout strengths or weaknesses.

DeepSeek's performance was less impressive, with a total of 225, dragged down by history (67), where it failed to recognize the source material in question 18.

In the science comprehensive section, every model fell short of the top-tier threshold: Gemini led with 248, followed by Doubao at 235, while Claude 4 trailed at 198. The lower science scores stem partly from the blurry exam images used in testing, which limited the models' multimodal understanding.

When higher-quality scans were substituted, Doubao's biology and chemistry scores improved by nearly 30 points, lifting its science-track total to 676 and demonstrating the potential of multimodal reasoning.

[Image: example of multimodal input]

In physics, the models often used advanced knowledge beyond the high school curriculum; since no prompt engineering was applied, they were never told which solution methods were allowed.

From underperformer to top scorer in a year: How did large models become academic elites?

Last year, the models scored 562 in liberal arts and 469.5 in science on Henan's college entrance exam paper. Within a year, scores rose by more than 100 points on both tracks, thanks to continuous innovation in reasoning and multimodal processing, most visibly in Gemini, OpenAI's models, and Doubao.

In March, Google launched Gemini 2.5 Pro, which reasons through a chain of thought before producing output, significantly improving its math, science, and coding performance while handling complex inputs from diverse sources: text, audio, images, video, and entire codebases.

OpenAI's o3 is the company's most powerful reasoning model. It can reflect for longer and integrates images natively into its chain of thought, even performing simple manipulations such as cropping and rotating mid-reasoning, enabling more human-like step-by-step perception on complex tasks.

Doubao's 1.6 series, announced half a month earlier, is built on multimodal pretraining divided into three stages: pure text, mixed multimodal training, and long-context training. This strengthens both textual and visual understanding and supports a context length of up to 256K tokens.

Based on this pretraining, the team developed Seed1.6-Thinking, which combines visual understanding and deep reasoning, and is the main participant in the Shandong exam challenge.

Seed1.6-Thinking layers multi-stage reinforcement learning with reward models on top of that pretraining, along with more high-quality data, deeper fusion of the vision and language components, and extended reasoning capabilities.

Should we let large models take the college entrance exam next year?

"AI takes the college entrance exam" has become an annual talking point. In an era when image recognition and NLP were still maturing, standardized tests offered a convenient yardstick for AI progress.

The question has shifted from "Can AI pass the test?" to "How far can AI go?", including whether it can grasp deeper meanings and emotions. In 2025, the models have made significant progress in text comprehension, generation, multimodal understanding, and reasoning: they are beginning to grasp the logic and values behind the questions, to read discipline-specific diagrams, and to answer with genuine depth.

This progress shows up directly in the scores: from barely clearing the admission line last year to competing for top universities in 2025, the models have leapt from "ordinary undergraduate" to "Double First-Class" level within a year. The college entrance exam, once a benchmark for AI intelligence, no longer looks so challenging.

Next year, models like Gemini and Doubao might skip the exam altogether, moving beyond standardized tests toward scientific research, art, programming, and real-world problems without clear answers, taking repetitive work off human shoulders.

We believe that in the near future, large models will become experts across various fields.
