ICML 2025 | Tsinghua Medical Engineering Platform Proposes 'Full-Cycle' Medical Capability Evaluation Framework for Large Models: MultiCogEval

Tsinghua's MultiCogEval framework systematically assesses large language models' medical abilities across all cognitive levels, addressing real-world clinical challenges.


The MultiCogEval framework was developed by a team led by Professor Wu and Assistant Researcher Liu Xien at Tsinghua University's Interdisciplinary Medical Engineering and Automation Platform, in collaboration with Beijing University of Posts and Telecommunications, iFlytek, and Wunwun Chip. The first author, Zhou Yuxuan, a PhD student in Tsinghua's Department of Electronic Engineering, works on evaluating and optimizing the medical capabilities of large models, and previously proposed the multifaceted evaluation frameworks MultifacetEval (IJCAI 2024) and PretexEval (ICLR 2025).

Medical training traditionally progresses from basic knowledge through clinical practice to residency, mirroring how human cognition develops. Existing evaluations of medical models, by contrast, typically focus on a single task such as question answering or diagnosis, and therefore fail to reflect the multi-layered, full-process capabilities required in real clinical scenarios.

To address this, the team introduced MultiCogEval, a comprehensive framework covering different cognitive levels of medical tasks, enabling systematic evaluation of large models' abilities across the entire clinical process, providing new insights into their strengths and limitations.

Figure 1: The relationship between human medical cognition development and large model evaluation across cognitive levels.

MultiCogEval: A Multi-Cognitive Level Medical Evaluation Framework

Inspired by the training pathway of medical students, MultiCogEval assesses models at three levels (a minimal harness sketch follows Figure 2):

  • Basic Knowledge Mastery: Tests memory and understanding of fundamental medical knowledge using multiple-choice questions from existing benchmarks.
  • Integrated Knowledge Application: Evaluates the ability to apply knowledge in complex clinical tasks, involving multi-step reasoning and decision-making under limited information, designed to mimic real clinical scenarios.
  • Scenario-Based Problem Solving: Assesses active planning and diagnostic decision-making in real clinical environments, simulating tasks like active diagnosis, ordering tests, and making treatment decisions based on incomplete data.

Figure 2: The Multi-Cognitive Level Medical Evaluation Framework, MultiCogEval.
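
To make the three levels concrete, here is a minimal sketch of what an evaluation harness in this spirit could look like. It is illustrative only: `query_model`, the item fields, and the `ScenarioCase` structure are hypothetical stand-ins, not the paper's actual code or data schema.

```python
from dataclasses import dataclass

def query_model(model: str, prompt: str) -> str:
    """Hypothetical hook into whatever chat API serves `model`."""
    raise NotImplementedError

def eval_basic(model: str, mcq_items: list[dict]) -> float:
    """Level 1: accuracy on multiple-choice basic-knowledge items.
    Each item holds a fully formatted stem-plus-options string and a gold letter."""
    correct = 0
    for item in mcq_items:
        reply = query_model(model, item["question"] + "\nReply with one letter.")
        correct += reply.strip().upper().startswith(item["gold"])
    return correct / len(mcq_items)

@dataclass
class ScenarioCase:
    """Level 3 case: the model sees only the chief complaint and must
    actively order tests before committing to a diagnosis."""
    chief_complaint: str
    test_results: dict[str, str]  # revealed only when the model orders that test
    diagnosis: str

def eval_scenario(model: str, case: ScenarioCase, max_turns: int = 5) -> bool:
    """Level 3: a multi-turn loop under incomplete information."""
    context = f"Patient presents with: {case.chief_complaint}"
    for _ in range(max_turns):
        reply = query_model(
            model, context + "\nOrder one test by name, or reply 'DIAGNOSIS: <disease>'."
        ).strip()
        if reply.upper().startswith("DIAGNOSIS:"):
            return case.diagnosis.lower() in reply.lower()
        # Append the ordered test's result (if this case defines one) and loop.
        context += f"\n{reply}: {case.test_results.get(reply, 'unavailable')}"
    return False  # never committed to a diagnosis within the turn budget
```

Mid-level (integrated knowledge application) tasks sit between these two extremes: case-grounded items that still have a fixed answer but demand multi-step reasoning over partial information, so they can be scored like `eval_basic` over harder, case-based items.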

Experimental Results: Clinical Problem-Solving Capabilities of Current Large Models Still Need Improvement

Using this framework, the team systematically evaluated several leading large models, including GPT-4o, DeepSeek-V3, and Llama3-70B. While these models perform well on basic-knowledge tasks (above 60% accuracy), their performance drops sharply (by roughly 20%) on integrated-knowledge tasks and declines further on scenario-based problem solving, where DeepSeek-V3 achieves only 19.4% accuracy on full diagnostic chains. This highlights the gap between what current models know and their ability to handle complex, real-world clinical problems.

Table 1: Evaluation performance of various general large models across different levels in MultiCogEval.

Next, the team compared models with and without domain-specific supervised fine-tuning (SFT). Fine-tuning improved low- and mid-level clinical abilities by up to 15%, but had limited or even negative effects on high-level scenario tasks, indicating that models' active reasoning and decision-making capabilities still leave considerable room for improvement.

Figure 3: Performance comparison of various medical models across different cognitive levels in MultiCogEval.

Finally, the team explored inference-time scaling, finding that models enhanced with reasoning capabilities outperform instruction-tuned counterparts across all levels, especially in mid-level tasks (e.g., DeepSeek-R1 improves by 23.1%). However, high-level clinical reasoning remains a challenge, requiring further research to enable models to actively plan and gather decision-making information in real clinical settings.

Table 2: Performance comparison between inference-augmented models and instruction-tuned models across different tasks.
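
As rough intuition for the harness side of this comparison, a prompting-level proxy for inference-time scaling is to let a model spend explicit reasoning tokens before answering. The sketch below assumes the same hypothetical `query_model` hook and item format as above; it is not the paper's protocol, which compares reasoning-augmented models (e.g., DeepSeek-R1) against their instruction-tuned counterparts at the model level.

```python
def query_model(model: str, prompt: str) -> str:
    """Same hypothetical chat-API hook as in the earlier sketch."""
    raise NotImplementedError

def eval_mcq(model: str, items: list[dict], reason_first: bool) -> float:
    """Score the same items with and without explicit reasoning before answering.
    `reason_first=True` is a crude proxy for spending more inference-time compute."""
    correct = 0
    for item in items:
        prompt = item["question"]
        if reason_first:
            prompt += "\nThink step by step, then finish with 'ANSWER: <letter>'."
        else:
            prompt += "\nReply with only the answer letter."
        reply = query_model(model, prompt)
        # In direct mode 'ANSWER:' is absent, so the whole reply is the answer.
        final = reply.rsplit("ANSWER:", 1)[-1].strip()
        correct += final.upper().startswith(item["gold"])
    return correct / len(items)
```

Running `eval_mcq` once with `reason_first=True` and once with `reason_first=False` on the same items yields a per-task delta analogous in spirit to the model-level comparison in Table 2.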

Conclusion

This study introduces the first comprehensive multi-cognitive level medical evaluation framework, MultiCogEval, systematically assessing models' abilities from basic knowledge to complex clinical reasoning. The findings reveal:

  1. Current models excel at basic knowledge but struggle with higher-level, real-world clinical tasks, especially active information gathering and reasoning.
  2. Fine-tuning improves low- and mid-level abilities but has limited impact on high-level scenario tasks.
  3. Inference-time scaling significantly enhances performance but does not fully solve the challenge of active reasoning in complex clinical environments.

The release of MultiCogEval lays a solid foundation for future development and evaluation of medical large models, aiming to promote more robust, trustworthy, and practical AI-driven healthcare solutions, ultimately building "trustworthy AI doctors."
