Cognitive Test Based on Working Memory Reveals LLM Retrieval Limitations: Accuracy Collapses to 0% as Models Confuse Outdated and Correct Values

A study accepted at an ICML 2025 workshop shows that all large language models (LLMs) struggle with a simple retrieval task: under heavy interference, accuracy falls to 0% and every answer is an outdated or hallucinated value, highlighting fundamental limits on memory and reasoning.


This work uncovers a universal information-retrieval failure affecting all large language models (LLMs). The task is trivial for humans, yet it causes significant errors in every LLM tested, severely impairing global memory and long-horizon reasoning. The related paper has been accepted at the ICML 2025 Workshop on Long Context Foundation Models.

Core Experimental Setup

The task involves providing a model with a stream of dynamically updated key-value pairs, e.g., "BP=120, BP=135, BP=119," then asking: "What is the last value of blood pressure (BP)?"

For humans, this is straightforward: the answer is clearly 119. The pattern is common in finance (account balance changes), healthcare (tracking physiological metrics), and any field that requires tracking dynamic data. Yet all mainstream LLMs tested (GPT-4.1, Llama-4, DeepSeek-V3, Llama-3, Qwen-2.5, and others, spanning parameter scales from 0.6B to 600B+) fail to reliably extract the last value, and their errors follow a clear mathematical pattern.
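To make the setup concrete, here is a minimal sketch of how such a trial can be constructed. The prompt wording, the key name "BP", and the value range are our own illustrative choices, not the paper's exact template.

```python
# Minimal sketch of the core task: a single key receives a stream of updates,
# and the model is asked for the most recent value.
import random

def make_update_stream(key: str, n_updates: int, lo: int = 90, hi: int = 180) -> tuple[str, int]:
    """Return a prompt containing n_updates values for `key` and the correct final value."""
    values = [random.randint(lo, hi) for _ in range(n_updates)]
    updates = ", ".join(f"{key}={v}" for v in values)
    prompt = f"{updates}\nWhat is the last value of {key}? Answer with the number only."
    return prompt, values[-1]

prompt, answer = make_update_stream("BP", n_updates=3)
print(prompt)   # e.g. "BP=120, BP=135, BP=119\nWhat is the last value of BP? ..."
print(answer)   # the correct final value
```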

Key Finding: Universal Decay Curve

  • Accuracy declines log-linearly as the number of updates grows, across every model tested (see the fitting sketch after this list).
  • Under sufficient interference, accuracy bottoms out at 0%: every response is an outdated or hallucinated value rather than the correct final one.
  • This consistent decay pattern across architectures and scales suggests the root issue may lie in the Transformer architecture or its attention mechanisms.
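The log-linear claim can be checked with a simple regression of accuracy against the logarithm of the update count. The accuracy numbers below are hypothetical placeholders; in practice they come from scoring model outputs over many trials.

```python
# Sketch of checking the log-linear decay: accuracy ≈ a - b * log(n_updates).
import numpy as np

n_updates = np.array([1, 2, 5, 10, 25, 50, 100, 200, 400])
accuracy  = np.array([1.00, 0.98, 0.92, 0.81, 0.63, 0.49, 0.31, 0.14, 0.02])  # hypothetical data

# Fit accuracy as a linear function of log(update count).
slope, intercept = np.polyfit(np.log(n_updates), accuracy, deg=1)
pred = intercept + slope * np.log(n_updates)
r2 = 1 - np.sum((accuracy - pred) ** 2) / np.sum((accuracy - accuracy.mean()) ** 2)
print(f"accuracy ~ {intercept:.2f} + {slope:.2f} * ln(n_updates),  R^2 = {r2:.3f}")
```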

Figure 1 shows that when LLMs need to retrieve specific information after many semantically similar distractors, their accuracy drops sharply, following a log-linear decline observed across all mainstream models.

Figure 2 illustrates a basic input example for LLM proactive-interference (PI) testing: the model processes a stream of continuously updated key-value pairs (e.g., the key "visual art" receiving multiple values), and after the updates it must accurately retrieve the final value for each key (shown in bold).

Experimental Setup

The test involves handling 1 to 46 different keys, each updated between 1 and 400 times. The updates are randomly mixed, and the model’s accuracy in retrieving the last value is measured.
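A simplified reconstruction of this multi-key setup is sketched below. It is our own approximation under the stated parameters (1 to 46 keys, 1 to 400 updates each, randomly interleaved); key names, value ranges, and the question wording are illustrative.

```python
# Sketch of the multi-key trial: several keys are each updated many times, the updates are
# shuffled into one stream, and the model must report the final value of every key.
import random

def build_trial(n_keys: int, n_updates_per_key: int, seed: int = 0):
    rng = random.Random(seed)
    keys = [f"key_{i}" for i in range(n_keys)]
    updates = [(k, rng.randint(0, 999)) for k in keys for _ in range(n_updates_per_key)]
    rng.shuffle(updates)                 # interleave updates across keys
    ground_truth = {}
    for k, v in updates:                 # the last occurrence of each key wins
        ground_truth[k] = v
    stream = "\n".join(f"{k} = {v}" for k, v in updates)
    question = "For each key, report only its most recent value."
    return stream + "\n" + question, ground_truth

def score(model_answers: dict, ground_truth: dict) -> float:
    """Fraction of keys whose reported value matches the true final value."""
    correct = sum(model_answers.get(k) == v for k, v in ground_truth.items())
    return correct / len(ground_truth)
```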

Comparison with Humans

This task is inherently simple: (1) it involves no complex search, and (2) it poses no logical difficulty. Humans can easily focus on the latest information and are almost unaffected by earlier data. Error analysis shows that models often extract irrelevant earlier updates instead, indicating that current LLMs struggle to ignore or filter out non-target (outdated) information. Further analysis of the error distribution reveals behavior resembling a limited working-memory capacity: models appear to hold key-value pairs in a finite representational space, and once the number of updates exceeds that capacity, retrieval performance collapses.

Multiple ways of triggering retrieval failure are identified, all following the same logarithmic decay curve: increasing the number of keys queried or the length of the target values significantly degrades accuracy. Analogous interference effects are also documented in human memory experiments.

Phenomenon Explanation: “Unable to Forget”

Large models cannot ignore or forget irrelevant information, which leads to complete retrieval failure. Paradoxically, even straightforward natural-language interventions, such as explicitly highlighting the answer region or instructing the model to "focus on the latest value" or "forget previous information," fail to meaningfully improve performance. The interference effect is strong enough to override explicit natural-language commands and pull the model's attention back to old information. Addressing it therefore likely requires fundamental changes to the architecture or training paradigm, not just prompt engineering.

Why Do LLMs Struggle to Consistently Retrieve Latest Info?

The failure is systemic, not random. As interference increases, errors evolve in stages:

  • Early: Errors mainly come from the most recent distractor values.
  • Mid: Errors spread across the entire text, affecting any part of the input.
  • Late: The output becomes highly dispersed, and many retrieved values never appeared in the input at all, i.e., they are outright hallucinations.

Figure 4 shows the model’s response distribution for a key across 11 update intervals. As updates increase (left to right), the proportion of correct final values (light yellow) drops sharply. Notably, errors tend to cluster around the last update (e.g., bins 10-11) or disperse to earlier updates (bins 1-9), illustrating the collapse of memory retrieval under overload.
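The error taxonomy implied by this analysis can be made concrete with a small classifier. This is our own reconstruction: given the ordered values a key received and the model's reported value, bucket the response as the correct final value, an outdated earlier update, or a hallucination absent from the input.

```python
# Sketch of bucketing a model's answer for one key.
def classify_response(update_values: list[int], reported: int) -> str:
    if reported == update_values[-1]:
        return "correct_final"
    if reported in update_values:
        # Position (1-indexed) of the latest earlier update that matches the reported value.
        pos = len(update_values) - 1 - update_values[::-1].index(reported)
        return f"outdated_update_{pos + 1}_of_{len(update_values)}"
    return "hallucinated"

print(classify_response([120, 135, 119], 119))  # correct_final
print(classify_response([120, 135, 119], 135))  # outdated_update_2_of_3
print(classify_response([120, 135, 119], 142))  # hallucinated
```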

Breakdown of Top-Down Control

Unlike humans, LLM performance on this task is almost unaffected by top-down prompts. Prompt variants that highlight the answer region, instruct the model to "focus on the latest value," or tell it to "forget previous information" all fail to significantly improve accuracy or alter the log-linear decay pattern. Even models using Chain-of-Thought (CoT) reasoning show no improvement, indicating that the interference problem is rooted in the model's architecture or training rather than in prompt design. This points to the need for fundamental innovations in model design, such as new attention mechanisms or explicit anti-interference training signals.

Figure 13 demonstrates that CoT reasoning offers almost no resistance to interference, with performance curves overlapping or even worsening compared to baseline models, confirming that the core issue lies in the model’s fundamental mechanisms.

Figure 10 shows five natural-language intervention strategies, such as instructing the model to "forget" specific keys, focus on subsequent information, self-assess relevance, reset the conversation, or use a mock QA reset, all aimed at combating interference. None of them worked: the same logarithmic decay pattern persisted. Only the mock QA reset (orange dashed line), a deliberately engineered structural "hack," provided some improvement, and even it could not prevent the overall decline in accuracy as information overload increased.
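For orientation, here is a sketch of what such intervention variants might look like as prompt templates. The exact wording used in the paper differs; these strings are illustrative placeholders appended to the update stream before querying the model.

```python
# Sketch of natural-language intervention variants (wording is ours, not the paper's).
INTERVENTIONS = {
    "baseline":        "What is the last value of {key}?",
    "focus_latest":    "Only the most recent value matters. What is the last value of {key}?",
    "forget_previous": "Forget every earlier value of {key}. What is its last value?",
    "self_assess":     "First decide which update is most recent, then report the last value of {key}.",
    "mock_qa_reset":   "Q: Noted, earlier values are cleared.\nA: OK.\nQ: What is the last value of {key}?",
}

def build_prompt(stream: str, key: str, variant: str) -> str:
    """Append the chosen intervention question to a key-value update stream."""
    return stream + "\n" + INTERVENTIONS[variant].format(key=key)
```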

Interference as an Independent Variable

Contrary to the common belief that longer input text dilutes attention, experiments that control interference strength show the performance decline is driven primarily by interference intensity, not text length alone. Even with input length held fixed, increasing interference causes error rates to rise logarithmically. This offers a new perspective on models' poor performance on benchmarks such as DeepMind's MRCR and OpenAI's Open MRCR, which insert many similar items into long texts and thus probe the ability to distinguish similar information. This work gives a more fundamental explanation: the core challenge is the models' inability to handle interference, not merely the volume of data.
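One way to decouple interference from length, sketched below under our own assumptions, is to hold the total number of update lines fixed while varying how many of them target the queried key; filler updates go to unrelated keys so the token count stays roughly constant.

```python
# Sketch of a length-controlled trial: total updates fixed, interference (updates to the
# target key) varied. Key names and value ranges are illustrative.
import random

def build_controlled_trial(target_key: str, n_target_updates: int, total_updates: int, seed: int = 0):
    assert 1 <= n_target_updates <= total_updates
    rng = random.Random(seed)
    filler_keys = [f"filler_{i}" for i in range(8)]
    updates = [(target_key, rng.randint(0, 999)) for _ in range(n_target_updates)]
    updates += [(rng.choice(filler_keys), rng.randint(0, 999))
                for _ in range(total_updates - n_target_updates)]
    rng.shuffle(updates)
    answer = [v for k, v in updates if k == target_key][-1]   # true final value of the target key
    stream = "\n".join(f"{k} = {v}" for k, v in updates)
    return stream + f"\nWhat is the last value of {target_key}?", answer
```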

The work also highlights that failures in long-text retrieval stem from the models' limited anti-interference capacity, which can be quantified precisely with the log-linear decay framework. The findings are directly relevant to applications such as legal and financial tasks, where data are updated frequently, precise extraction is critical, and interference effects can cause significant retrieval failures.

Implications for Cognitive Science

1. Bridge to Cognitive Science: The test adapts a standard cognitive-science paradigm for measuring working-memory capacity and resistance to interference. The results suggest that LLMs exhibit a form of limited capacity similar to human working memory, with anti-interference capacity as a key metric.

2. Current Limitations: The widespread failure of LLMs indicates they lack human-like top-down control to optimize context utilization. Improving this ability is crucial for tasks relying on dynamic data, such as finance and healthcare, and for long reasoning capabilities.

Visual Demonstration

To illustrate the phenomenon, an interactive demo website (https://zhuangzigiantfish.github.io/Unable-to-Forget/) lets users click the "Mix" button to play a 3-4 second animation of the experiment's design and the systematic errors it produces, giving an intuitive understanding of the effect in about 45 seconds.

TL;DR: Core Conclusion

LLMs currently lack human-level top-down information control, especially in scenarios requiring resistance to semantic interference and precise data extraction, leading to unstable performance.

Highlights from the ICML reviews:

1. Revealed a previously unknown LLM retrieval failure phenomenon.

2. Used a cognitively inspired testing approach with significant novelty.

Author Introduction

The study was led by Chupei Wang (first and corresponding author, a physics graduate of the University of Virginia and an interdisciplinary researcher with a background in philosophy) and Jiaqi Sun (co-first and corresponding author, PhD student at the NYU Neuroscience Center, advised by Prof. Xing Tian). Both bring interdisciplinary backgrounds spanning physics, architecture, and philosophy, and approach the study of intelligence by probing the points at which cognitive systems break down.

Special thanks to Zheyang Zheng (Visiting Researcher at Flatiron Institute CCN, PhD student at NYU) and Yilun Kuang (PhD student at NYU CILVR Lab, advised by Yann LeCun) for their key advice and support during the project.
