Gary Marcus's Shocking Statement: Building AGI on Pure LLMs Is Completely Hopeless! MIT, Chicago, and Harvard Paper Goes Viral
Gary Marcus criticizes the myth that LLMs understand and reason, citing recent research that exposes fundamental flaws and declaring the end of any hope for AGI built solely on pure LLMs.


Today, renowned AI researcher and cognitive scientist Gary Marcus retweeted a groundbreaking paper by researchers at MIT, the University of Chicago, and Harvard, stating: "The situation with LLMs and their myth of understanding and reasoning has become worse, much worse."
This study documents a pattern of inconsistency the authors call Potemkin understanding (see Figure 1 below) and shows that even top models like o3 frequently make such errors. In Marcus's view, machines that cannot even maintain internal consistency have no path to Artificial General Intelligence (AGI).
As the paper puts it: success on benchmarks only proves Potemkin understanding, an illusion of understanding driven by answers that are fundamentally incompatible with how humans understand the same concepts... These failures reflect not just misunderstanding but deep internal contradictions in concept representations.
Gary Marcus believes this marks the end of any hope of building AGI on pure LLMs alone. He also tagged Geoffrey Hinton, declaring checkmate.

He then continued sharing his thoughts on Twitter, noting that his informal tests suggest models like o3 are less prone to simple Potemkin errors, but not immune to them.
As shown in the left image (below), the model correctly explains the definition of a haiku but incorrectly asserts that "asphalt shimmers" fits the five-syllable requirement for the last line of a haiku; only after follow-up questions (right) does it grudgingly admit the mistake. This underscores the core issue: lack of reliability.

After carefully reviewing the paper, Gary Marcus agrees that its core point is correct — LLMs are indeed prone to contradictions (like the Potemkin errors). However, he finds the specific experimental examples unconvincing.
Based on his own informal experiments (including an example to be published next week), Marcus is convinced that a real problem exists. But further research is needed to understand how widespread it is and how it affects different models.

His views sparked heated discussion. Some asked whether he truly believes LLMs are improving; he answered that they are, but suggested they may have reached a point of diminishing returns.

Others argued that LLMs don't need to understand, only to perform well enough, since even humans don't always understand each other perfectly.
DeepMind senior scientist Prateek Jain commented that the paper and its evaluation methods are interesting. He ran all the examples through Gemini 2.5 Pro, which answered them correctly, and said he is curious how the model performs on the full datasets and what specific errors it makes.

Some critics pointed out that the paper mainly describes a failure mode of current LLMs that is already well known, and questioned why it should be considered a "failure" at all.

Next, let’s explore what this paper actually discusses and whether it truly supports Marcus’s bold claims.
Paper Overview

- Title: Potemkin Understanding in Large Language Models
- Link: https://arxiv.org/pdf/2506.21521
The evaluation of LLMs typically relies on benchmark datasets. But does high performance on carefully curated questions really indicate understanding? This paper introduces a formal framework to explore that question. The key point: benchmarks designed for humans are valid tests of LLMs only if LLMs misunderstand concepts in the same ways humans do. Otherwise, high scores reflect only Potemkin understanding: answers that look correct but conceal a deep conceptual disconnect.
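To make that validity condition concrete, here is a rough formal sketch in our own notation (the paper's exact formalism may differ):

```latex
% Rough sketch, our notation: an "interpretation" f maps every question q about a
% concept to an answer; f* is the correct interpretation, and F_h is the set of
% interpretations (right or wrong) that humans plausibly hold. A question set S
% works as a benchmark for humans when, within F_h, acing S pins down f*:
\[
  \forall f \in \mathcal{F}_h:\quad
  \bigl(\forall q \in S,\; f(q) = f^*(q)\bigr) \;\Longrightarrow\; f = f^*.
\]
% An LLM whose misinterpretation lies outside F_h can satisfy the premise while
% violating the conclusion: it aces S yet still misunderstands the concept.
% That mismatch is what the paper labels "Potemkin understanding".
```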
The authors propose two methods to quantify how prevalent the Potemkin phenomenon is: one based on a specially constructed benchmark spanning three domains (literature, game theory, and cognitive biases), and a general procedure that estimates a lower bound on its prevalence. Results show the phenomenon is widespread across models, tasks, and domains, revealing deep internal contradictions in models' concept representations.
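As a rough illustration of what the first, benchmark-based measurement could look like in practice, here is a minimal sketch. The `ask_model` call and the grading callables are hypothetical stand-ins, not the authors' code; the rate computed below is simply the share of application questions missed among concepts the model defined correctly.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ApplicationTask:
    prompt: str                         # e.g. "Write a haiku whose last line has five syllables."
    is_correct: Callable[[str], bool]   # grader for the model's answer

def potemkin_rate(
    ask_model: Callable[[str], str],                # hypothetical LLM interface
    grade_definition: Callable[[str, str], bool],   # (concept, stated definition) -> correct?
    tasks_by_concept: Dict[str, List[ApplicationTask]],
) -> float:
    """Among concepts the model defines correctly, the fraction of
    application questions (classify / generate / edit) it gets wrong."""
    total, failed = 0, 0
    for concept, tasks in tasks_by_concept.items():
        definition = ask_model(f"In one sentence, define: {concept}")
        if not grade_definition(concept, definition):
            continue  # only concepts the model can define feed into the rate
        for task in tasks:
            answer = ask_model(task.prompt)
            total += 1
            if not task.is_correct(answer):
                failed += 1   # defined the concept, yet failed to apply it
    return failed / total if total else 0.0
```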

Despite high accuracy in defining concepts (94.2%), the models' performance drops sharply when they must apply those concepts, as shown by high Potemkin rates. This gap appears across all models, concepts, and domains studied.

The study also introduces an automated method for assessing a model's internal inconsistency: prompt the model to generate an instance of a concept, then ask the same model whether its own output is an instance of that concept. The disagreements reveal significant internal conflicts.
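A minimal sketch of that self-consistency check, assuming only a generic text-in/text-out model interface rather than the authors' actual harness:

```python
from typing import Callable, Iterable

def incoherence_rate(
    ask_model: Callable[[str], str],   # hypothetical LLM interface returning text
    concepts: Iterable[str],
    trials_per_concept: int = 20,
) -> float:
    """Ask the model to produce an instance of each concept, then ask the same
    model whether its own output is an instance; return the disagreement rate."""
    checks, conflicts = 0, 0
    for concept in concepts:
        for _ in range(trials_per_concept):
            example = ask_model(
                f"Give one example of {concept}. Output only the example."
            )
            verdict = ask_model(
                f"Is the following an example of {concept}? Answer yes or no.\n\n{example}"
            )
            checks += 1
            if not verdict.strip().lower().startswith("yes"):
                conflicts += 1   # the model rejects what it just generated
    return conflicts / checks if checks else 0.0
```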

Results show internal inconsistency scores ranging from 0.02 to 0.64 across models, concepts, and domains, indicating that models often hold conflicting understandings of the same concept. This confirms that the Potemkin phenomenon stems both from misunderstanding concepts and from applying them inconsistently.