Do Multimodal Large Models Truly 'Understand' the World? Revealing Core Knowledge Deficits in MLLMs
This article explores the core knowledge gaps in multimodal large language models (MLLMs), highlighting their limitations in basic cognition and implications for true understanding of the world.


However, a closer look reveals a striking fact: these models often fail at tasks that are simple, intuitive, and easily handled by human children.
Questions like "Is the toy still there after it is covered?", "Does the amount of liquid change when it is poured into a differently shaped container?", and "Will two approaching objects collide?" expose these shortcomings.
Does this mean that the inherent cognitive structure of MLLMs lacks the basic knowledge mechanisms that support early human learning? In other words, do they lack core knowledge?
An ICML 2025 paper with high initial review scores (4, 4, 4, 3) reveals the 'core cognitive blind spots' of MLLMs.
Research from UC San Diego titled "Core Knowledge Deficits in Multi-Modal Language Models" provides a systematic and in-depth analysis of this issue.

- Paper Title: Core Knowledge Deficits in Multi-Modal Language Models
- Link: https://arxiv.org/pdf/2410.10855
Key finding: current mainstream MLLMs broadly lack core cognitive abilities, and these abilities do not emerge naturally from scaling up the models.
To address this, the authors built CoreCognition, a multimodal assessment suite, and proposed a Concept Hacking method that reveals whether models truly understand the core knowledge behind a task or merely guess correctly.
CoreCognition is a cross-modal cognitive evaluation benchmark designed to test models' understanding of fundamental concepts across different stages of cognitive development, inspired by Piaget's developmental theory.
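To make the setup concrete, here is a minimal sketch of how items in a CoreCognition-style benchmark might be represented and scored per concept. The `CoreItem` fields, the `ask_model` callback, and the accuracy breakdown are illustrative assumptions, not the paper's actual data schema or evaluation code.

```python
from dataclasses import dataclass

# Hypothetical item structure for a CoreCognition-style benchmark entry.
# Field names are illustrative; the real dataset schema may differ.
@dataclass
class CoreItem:
    concept: str        # e.g. "object permanence", "conservation", "boundary"
    stage: str          # Piagetian developmental stage, e.g. "sensorimotor"
    image_path: str     # visual stimulus shown to the model
    question: str       # natural-language probe about the scene
    choices: list[str]  # answer options
    answer: int         # index of the correct choice

def score_items(items: list[CoreItem], ask_model) -> dict[str, float]:
    """Compute per-concept accuracy, where ask_model(item) returns a choice index."""
    correct, total = {}, {}
    for item in items:
        total[item.concept] = total.get(item.concept, 0) + 1
        if ask_model(item) == item.answer:
            correct[item.concept] = correct.get(item.concept, 0) + 1
    return {c: correct.get(c, 0) / n for c, n in total.items()}
```

Grouping accuracy by concept (rather than reporting a single overall score) is what lets this kind of benchmark separate, say, object permanence from spatial reasoning.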

Four Major Findings
- Models lack basic cognition: They perform poorly on simple tasks like boundary perception, object permanence, and spatial awareness, indicating a lack of understanding of fundamental world structures.
- High-level reasoning is not grounded in basic cognition: models' advanced reasoning abilities do not rest on solid foundational cognition, which undermines their robustness.
- Scaling models does not significantly improve core cognition: Increasing parameters enhances high-level reasoning but does little for basic cognition, sometimes even degrading it.
- Reasoning models show no clear advantage: System-2 reasoning does not effectively help models learn or infer basic cognition, suggesting a deficiency in pretraining.
Concept Hacking: Revealing 'False Understanding'
To verify whether models truly grasp core concepts, the authors introduced Concept Hacking: constructing paired control and manipulated test items in which the task-critical feature is reversed, so that genuine understanding and superficial guessing lead to different answers.
Results show that many models perform well on the control tasks but fail once the key feature is altered, indicating reliance on shortcuts rather than true comprehension.
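The evaluation logic behind such paired testing can be sketched as a simple comparison: a model that truly grasps the concept should answer both the control item and its manipulated twin correctly, while a model leaning on surface cues will pass the control item and fail the manipulated one. The sketch below reuses the hypothetical `CoreItem` structure from above, and the four outcome labels are illustrative names, not the authors' exact terminology or implementation.

```python
def concept_hacking_report(pairs, ask_model) -> dict[str, int]:
    """
    pairs: list of (control_item, manipulated_item), where the manipulated item
    flips the task-critical feature so that the correct answer reverses.
    ask_model(item) -> predicted choice index.
    Returns counts of four outcome types; correct-on-control but wrong-on-manipulated
    is the signature of shortcut reliance ("false understanding").
    """
    outcomes = {"consistent": 0, "shortcut": 0, "inconsistent": 0, "fails_both": 0}
    for control, manipulated in pairs:
        ok_ctrl = ask_model(control) == control.answer
        ok_mani = ask_model(manipulated) == manipulated.answer
        if ok_ctrl and ok_mani:
            outcomes["consistent"] += 1     # answers track the concept, not surface cues
        elif ok_ctrl and not ok_mani:
            outcomes["shortcut"] += 1       # passes the normal case, breaks when the cue flips
        elif not ok_ctrl and ok_mani:
            outcomes["inconsistent"] += 1   # wrong on the control, right on the flipped case
        else:
            outcomes["fails_both"] += 1
    return outcomes
```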

Implications and Future Directions
- Why it matters: these models lack core knowledge, and scaling alone cannot fix this fundamental gap, echoing Moravec's paradox that tasks trivial for humans are often the hardest for AI.
- Scientific insights: Human cognition is built on core knowledge scaffolds, which current models lack.
- Technical challenges: Future work may involve explicitly injecting common sense like physics and spatial awareness during pretraining, and developing cognitive-guided training mechanisms.
Author Profiles:
- Li Yijiang: Master's in Data Science from Johns Hopkins University, now a first-year PhD student at UC San Diego, focusing on efficient and robust AI learning in multimodal and interactive environments.
- Gao Yingying: Master's from Johns Hopkins University, pursuing a PhD in computer science, specializing in explainability of vision-language models and autonomous navigation for low-vision users.
- Zhao Tianwei: Master's student at Johns Hopkins University, interested in evaluating and enhancing reasoning abilities of multimodal models from a cognitive science perspective.
- Wang Bingyang: MS, BS, and BBA from Emory University, focusing on multimodal fusion and signal extraction from mixed modalities.
- Sun Haoran: MS in Applied Mathematics from Johns Hopkins University, researching medical data science and machine learning in cardiology.
- Luo Dezhi: Senior at University of Michigan, with research experience in psychology and AI, focusing on consciousness and core cognition theories.
- Hokin Deng: Visiting scientist at Carnegie Mellon University, former neural engineer at Johns Hopkins and Meta Reality Labs, co-leader of GrowAI, focusing on interdisciplinary research in AI, neuroscience, and philosophy.