ScienceBoard: The First Multimodal Agent Evaluation Environment for Scientific Tasks, Real Interactions, and Automated Assessment

ScienceBoard is the pioneering multimodal evaluation platform for autonomous scientific agents, enabling real interactions, complex tasks, and automated performance measurement to advance AI in scientific research.


Author: Sun Qiushi is a PhD student at the School of Data Science, University of Hong Kong, with a master's degree in Data Science from NUS. His research focuses on Computer-Using Agents and Code Intelligence, with multiple publications at ACL, EMNLP, ICLR, COLING, and other venues. The OS-Copilot team has previously released OS-Atlas, OS-Genesis, and SeeClick, which are widely used in academia and industry.

AI-powered scientific research agents are quietly evolving.

1 Background and Motivation

With rapid advances in LLMs and VLMs, AI's role in natural language processing, programming, and image understanding has expanded. In scientific research, these models are becoming new collaborators in the research workflow.

Initially, AI in science served mainly as an analyzer, helping researchers analyze data, write papers, and generate charts. Now, with the emergence of Computer-Using Agents (CUAs), this role is transforming. These agents can operate software the way humans do (clicking, dragging, entering commands, or writing code), automating control of real scientific software. They are no longer just answering questions but actively collaborating, with real execution capabilities, to complete scientific tasks.

1-1 From Language Understanding to Scientific Execution: New Challenges

In complex scientific scenarios, diverse software tools, long-term workflows, and cross-modal information make “AI truly completing a scientific task” much harder than simple question answering. For example, modeling protein structures requires biological software, star trajectory analysis involves astronomical simulators, and results must be formatted into LaTeX documents. Achieving this requires:

  • Software Operation Skills: Using GUI and CLI to control complex scientific tools;
  • Domain Understanding: Grasping scientific concepts and background knowledge;
  • Cross-Modal Perception and Planning: Reasoning and acting across graphical interfaces, terminal commands, and scientific data.

However, current multimodal agent systems mostly excel at web, e-commerce, and general desktop tasks, and remain in the early stages for scientific applications. A key reason is the lack of a realistic, systematic scientific environment and evaluation benchmark to push agents from "talking and coding" to actually "doing."

1-2 Gaps in Scientific Tasks: Environment and Evaluation

While community efforts like WebArena and OSWorld evaluate CUAs in general scenarios, they lack the complexity of real scientific work. Benchmarks like ScienceQA and SciCode focus on QA and static coding, but real scientific software involves non-standard I/O, complex interfaces, multi-step operations, and configuration before execution—posing unprecedented challenges for agents. We need (1) a reliable environment for autonomous exploration and (2) a multimodal, multi-domain benchmark to assess how much scientific automation can be achieved.


In this context, we introduce ScienceBoard: the first evaluation environment for multimodal autonomous agents focused on scientific tasks, real interactions, and automated assessment, aiming to fundamentally promote progress toward “AI that autonomously completes scientific workflows.”


2 Infrastructure of ScienceBoard: An Interactive Environment for Scientific Tasks

2-1 Multi-Domain Scientific Software Integration

Built on an Ubuntu virtual machine, ScienceBoard integrates multiple open-source scientific software packages, systematically adapted for task interaction via CLI and GUI. Features include:

  • Multi-Domain Software: Six scientific fields, including biochemistry, astronomy, and GIS.
  • Dual-Modal Control: GUI and CLI support, with screenshots, accessibility trees, and set-of-marks inputs, enabling flexible agent interactions.
  • Auto-Initialization: Scripts, configs, and auxiliary data ensure reproducibility and consistent starting points.
  • Reliable Automated Evaluation: Custom evaluation functions support numeric, range, and state comparisons for execution-level assessment.
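
To make the initialization and evaluation pipeline concrete, here is a minimal Python sketch; the config schema, file paths, and function names are illustrative assumptions, not ScienceBoard's actual code:

```python
# Hypothetical task config plus evaluator; all field names, paths, and helpers
# are illustrative assumptions rather than ScienceBoard's real schema.
import math

task_config = {
    "domain": "astronomy",
    "software": "Celestia",
    "instruction": "Set the simulation to Julian date 2460000.5 and show planetary orbits.",
    "init": ["bash setup/reset_snapshot.sh", "celestia --config tasks/astro_example.cfg"],
    "evaluator": {"type": "numeric", "key": "julian_date",
                  "expected": 2460000.5, "tolerance": 1e-3},
}

def evaluate(final_state: dict, spec: dict) -> bool:
    """Score the post-execution software state against the task's expected outcome."""
    if spec["type"] == "numeric":     # exact value within a tolerance
        return math.isclose(final_state[spec["key"]], spec["expected"],
                            abs_tol=spec.get("tolerance", 0.0))
    if spec["type"] == "range":       # value must fall inside [lo, hi]
        lo, hi = spec["range"]
        return lo <= final_state[spec["key"]] <= hi
    if spec["type"] == "state":       # every expected key/value pair must match
        return all(final_state.get(k) == v for k, v in spec["expected"].items())
    raise ValueError(f"Unknown evaluator type: {spec['type']}")

# After the agent finishes, the VM exports the software state and we score it.
print(evaluate({"julian_date": 2460000.5}, task_config["evaluator"]))  # True
```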

2-2 Action Space Design

To enable agents to use a unified interface across tasks, ScienceBoard extends previous work to define a general action space, including:

  • GUI Actions: e.g., CLICK [x, y], SCROLL [Δy], TYPE [“text”]
  • CLI Commands: Input commands in terminal/software and get feedback
  • Other Calls: call_api for external API access, answer for QA responses
  • Flow Control: DONE, FAIL signals to terminate interactions

This design allows diverse agents built with LLM/VLM to perform structured, cross-software, cross-modal interactions within ScienceBoard, achieving universal execution capabilities.
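
To make the unified action space concrete, here is a minimal sketch of how an agent's textual action could be parsed and dispatched to GUI, CLI, or control handlers; the action grammar and regexes below are our own assumptions modeled on the actions listed above, not the official interface:

```python
# Minimal dispatcher for a unified action space; the exact grammar is an
# illustrative assumption, not ScienceBoard's official specification.
import re

def dispatch(action: str) -> dict:
    """Parse one agent-emitted action string into a structured command."""
    if m := re.fullmatch(r"CLICK \[(\d+), ?(\d+)\]", action):
        return {"kind": "gui", "op": "click", "x": int(m[1]), "y": int(m[2])}
    if m := re.fullmatch(r"SCROLL \[(-?\d+)\]", action):
        return {"kind": "gui", "op": "scroll", "dy": int(m[1])}
    if m := re.fullmatch(r'TYPE \["(.*)"\]', action):
        return {"kind": "gui", "op": "type", "text": m[1]}
    if action.startswith("CLI "):
        return {"kind": "cli", "command": action[4:]}
    if action in ("DONE", "FAIL"):
        return {"kind": "control", "signal": action}
    raise ValueError(f"Unrecognized action: {action}")

print(dispatch("CLICK [640, 360]"))
print(dispatch('TYPE ["open 8A3D"]'))
print(dispatch("DONE"))
```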

3 ScienceBoard Benchmark: High-Quality Scientific Tasks Dataset

Based on this environment, ScienceBoard constructs a challenging, standardized benchmark with 169 real scientific tasks across 6 fields, covering software setup, simulation, visualization, data analysis, documentation, and complex workflows.

3-1 Scientific Exploration Tasks

Tasks are designed to be realistic, complex, and measurable, created through a hybrid process of manual design and automated validation, ensuring clarity, rationality, and reproducibility.


3-2 Multi-Dimensional Evaluation Benchmark

The current version includes 169 tasks across fields like:

  • Biochemistry (ChimeraX): Predict protein structures with AlphaFold
  • Astronomy (Celestia): Adjust the Julian date, display planetary orbits
  • Mathematics (KAlgebra): Plot the Bernoulli curve, analyze limits
  • Geography (GRASS GIS): Overlay terrain layers, calculate boundary areas
  • Theorem Proving (Lean 4): Construct formal proofs by induction (a short example follows below)
  • Research Writing (TeXstudio): Write abstracts, insert generated charts
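
To give a flavor of the theorem-proving domain, here is a generic Lean 4 induction proof; this is our own toy example of the style of proof such tasks require, not an actual ScienceBoard task:

```lean
-- Toy Lean 4 induction proof, illustrating the style of the theorem-proving tasks.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                          -- base case: 0 + 0 reduces to 0
  | succ k ih => rw [Nat.add_succ, ih]   -- step: rewrite 0 + (k+1) to succ (0 + k), then apply ih
```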


Tasks are categorized into four difficulty levels:

  • Easy (~54%): Single-step configuration, simple calculations, interface operations
  • Medium (~28%): Multi-step instructions, logic reasoning, cross-modal memory
  • Hard (~17%): Long-horizon planning, precise GUI positioning, multi-program collaboration
  • Open Problems: Unsolved challenging tasks beyond current SOTA capabilities

4 Experiments and Evaluation

We evaluated representative models (commercial, open-source, GUI-based) on ScienceBoard. Results show even top multimodal models are far from mature in real scientific workflows.

4-1 Main Experiments

Success rates:

  • GPT-4o and Claude 3.5 outperform open-source models but achieve only about a 15% success rate on average;
  • Open-source models such as InternVL3 and Qwen2.5-VL outperform some commercial models on certain tasks but lack cross-domain stability;
  • Lightweight GUI action models such as OS-Atlas and UGround are limited on long-horizon and cross-modal tasks.

Analysis shows that completing scientific workflows requires multi-modal reasoning, long-term planning, and domain knowledge, far beyond simple web or app interactions.

Many failures stem from flawed execution strategies rather than knowledge gaps. For example, an agent may correctly understand "export the protein structure" yet fail because it clicks through the interface in the wrong order.

4-2 Decomposing Planning and Action

Further analysis reveals that many agents “know what to do” but “do it poorly.” Models like GPT-4o show strong planning but struggle with precise interface actions, indicating a gap between “thinking” and “doing.”

To address this, we decouple planning and execution, creating modular agents: GPT-4o generates high-level plans, while open-source VLMs or GUI models execute specific actions.
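
As a rough sketch of this planner-executor split (the interfaces, prompts, and helper names below are our own assumptions, not the released agent code):

```python
# Hypothetical planner-executor loop; the model/environment interfaces and prompts
# are illustrative assumptions, not ScienceBoard's released agent implementation.
from typing import Protocol

class Model(Protocol):
    def generate(self, prompt: str, image: bytes | None = None) -> str: ...

class Env(Protocol):
    def screenshot(self) -> bytes: ...
    def step(self, action: str) -> tuple[str, bool]: ...
    def evaluate(self) -> bool: ...

def run_episode(planner: Model, executor: Model, env: Env,
                instruction: str, max_steps: int = 15) -> bool:
    """A strong model drafts high-level steps; a grounding model emits concrete actions."""
    plan = planner.generate(f"Break this scientific task into GUI/CLI steps:\n{instruction}")
    for step in [s for s in plan.splitlines() if s.strip()][:max_steps]:
        action = executor.generate(
            f"Instruction: {instruction}\nCurrent step: {step}\n"
            "Emit exactly one action (CLICK / TYPE / CLI / DONE).",
            image=env.screenshot(),          # ground the action in the current GUI state
        )
        _obs, done = env.step(action)        # execute inside the VM
        if done or action.strip() == "DONE":
            break
    return env.evaluate()                    # automated execution-level scoring
```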


Experiments show this modular approach significantly improves success rates, especially in complex, multi-step scientific software tasks, making the system more robust.

5 Future Outlook

ScienceBoard demonstrates that current agent bottlenecks are not just operational but also knowledge-related. Future directions include enabling agents to truly “understand science,” possibly through resource-based learning or external knowledge integration.

Another promising avenue is multi-agent systems—combining planners, domain experts, and executors into flexible, task-specific teams that can handle data analysis, visualization, and writing, transforming AI into a collaborative scientific partner.

Long-term, ScienceBoard lays the foundation for lab-level AI automation—from virtual assistants to physical robots—making AI-driven scientific discovery a tangible reality.

6 Conclusion

As the first comprehensive multimodal evaluation environment for scientific exploration, ScienceBoard offers a real, interactive research platform with challenging tasks and automated assessment. Despite current models’ limitations, it provides a reproducible, scalable starting point toward fully autonomous AI scientists, illuminating the path forward.
