OpenAI's o3 Ranks First on New 'Scientific Question Answering' AI Leaderboard, DeepSeek's R1 Comes Second
OpenAI's o3 leads the latest AI rankings for scientific question answering, with DeepSeek's R1 in second place, highlighting rapid advances in AI's scientific capabilities.


Editor: Bai Caiye
According to a new benchmarking platform, o3, developed by OpenAI, has been rated the best AI tool for answering questions across multiple scientific fields.
Developed by the Allen Institute for AI (AI2) in Seattle, the SciArena platform ranks 23 large language models (LLMs) on their answers to scientific questions, based on votes from 102 researchers.
OpenAI's o3 is rated the top model in natural sciences, healthcare, engineering, and social sciences.
SciArena: https://allenai.org/blog/sciarena
DeepSeek's DeepSeek-R1 ranks second in natural sciences and fourth in engineering. Google's Gemini-2.5-Pro ranks third in natural sciences and fifth in both engineering and healthcare.

Image: Partial ranking display. (Source: SciArena official website)
AI2 researcher Arman Cohan notes that users favor o3 because it provides detailed references and nuanced responses. He adds, however, that it is difficult to explain the performance gaps among models, because their training data and optimization goals are proprietary.
SciArena is a new platform designed to evaluate AI performance on scientific tasks, one of the first to rank models based on crowdsourced feedback. Rahul Shome from the Australian National University comments: "SciArena encourages serious evaluation of LLM-assisted literature tasks."
Randomized Selection
To rank the 23 models, SciArena invites researchers to submit scientific questions. Answers are generated by two randomly selected models, each drawing on Semantic Scholar, an AI-powered research tool also developed by AI2. Users then vote for the better answer, or indicate that there is no meaningful difference or that both answers are poor.
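As an illustration of this pairwise protocol, here is a minimal sketch of how such matchups and votes could be represented and tallied; the model names, vote options, and in-memory vote log are assumptions for illustration, not SciArena's actual implementation.

```python
import random
from collections import Counter

# Hypothetical pool of competing models (placeholder names, not the real leaderboard).
MODELS = ["model-a", "model-b", "model-c"]

# Mirrors the arena-style choices described above: pick A, pick B, tie, or both poor.
VOTE_OPTIONS = ("A", "B", "tie", "both_bad")

def new_matchup():
    """Randomly pair two distinct models for one submitted question."""
    return tuple(random.sample(MODELS, 2))

def record_vote(log, model_a, model_b, vote):
    """Store a single verified user's judgement for one question."""
    if vote not in VOTE_OPTIONS:
        raise ValueError(f"unknown vote: {vote}")
    log.append({"model_a": model_a, "model_b": model_b, "vote": vote})

def win_counts(log):
    """Tally outright wins; ties and 'both bad' votes carry no winner."""
    wins = Counter()
    for entry in log:
        if entry["vote"] == "A":
            wins[entry["model_a"]] += 1
        elif entry["vote"] == "B":
            wins[entry["model_b"]] += 1
    return wins

# Example: one simulated question/vote cycle.
log = []
a, b = new_matchup()
record_vote(log, a, b, "A")
print(win_counts(log))
```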
The platform is open to the public, allowing free question submissions. Users can view answers from both models and vote, but only verified users' votes are counted. The rankings are updated regularly.
Jonathan Kummerfeld from the University of Sydney emphasizes that being able to ask LLMs scientific questions and trust their answers would help researchers stay up to date with recent literature and could surface overlooked research.
Kummerfeld also highlights that the platform promotes AI innovation by providing transparent progress metrics. He notes that it is carefully designed to prevent manipulation, similar to other benchmarks.
He warns that user participation depends on perceived value: "As long as users find it worthwhile, it will succeed; otherwise, engagement may decline." Cohan adds that free access to the latest models is an incentive to participate, and that the references drawn from Semantic Scholar make the responses genuinely useful to researchers.
Shome reminds scientists that LLM-generated texts may conflict with cited papers, misunderstand terminology, or fail to answer accurately. He emphasizes that reading abstracts alone cannot replace reading full papers.
About SciArena
SciArena is an open evaluation platform where researchers compare and vote on different foundation models' performance on scientific literature tasks. It is community-driven, similar to Chatbot Arena, but tailored for scientific exploration.

Image: Distribution of human preference data across scientific disciplines on SciArena as of June 30, 2025. (Source: SciArena official website)
The platform consists of three main parts:
- SciArena Platform: Researchers submit questions, view answers from different models, and vote for the best.
- Leaderboard: Based on community votes and Elo ratings (see the sketch after this list), it provides dynamic performance rankings.
- SciArena-Eval: A meta-evaluation benchmark that uses the collected human preferences to measure how accurately models can judge answer quality.
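For readers unfamiliar with the Elo ratings used by the Leaderboard, the sketch below applies a standard Elo update to a stream of pairwise votes; the starting rating of 1000 and K-factor of 32 are illustrative defaults, not SciArena's published configuration.

```python
# Minimal Elo-style rating sketch for pairwise preference votes.
# The starting rating (1000) and K-factor (32) are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one comparison; score_a is 1 (A wins), 0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new

def leaderboard(votes, start: float = 1000.0):
    """Fold a sequence of (model_a, model_b, score_a) votes into ratings, highest first."""
    ratings = {}
    for a, b, score_a in votes:
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        ratings[a], ratings[b] = update_elo(r_a, r_b, score_a)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Example with a few placeholder votes.
votes = [("o3", "DeepSeek-R1", 1.0), ("Gemini-2.5-Pro", "o3", 0.0), ("DeepSeek-R1", "Gemini-2.5-Pro", 0.5)]
print(leaderboard(votes))
```

Elo-style ratings suit arena-style data because each vote is a head-to-head comparison rather than an absolute score, so rankings can be updated incrementally as new votes arrive.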
Related article: https://www.nature.com/articles/d41586-025-02177-7