Groundbreaking Discovery! The 'Aha Moment' in Large Models Is Not Just Showmanship: Internal Information Surges Severalfold!

Recent research reveals that large models' 'aha moments' are real: they are marked by sudden spikes in internal information about the correct answer, and this insight yields methods that significantly boost reasoning performance without extra training.



Li Yong, Associate Professor and doctoral supervisor at Renmin University of China and a recipient of a national-level young-talent program, has long worked on the theoretical foundations of machine learning. He has published over 100 papers, including nearly 50 in top journals and conferences such as JMLR, IEEE TPAMI, Artificial Intelligence, ICML, and NeurIPS.

You've probably seen large models produce phrases like "Hmm...", "Wait, let me think", or "Therefore..." while solving problems, words that read like a human thinking out loud.

But a critical question remains: Do these words truly represent the model 'thinking', or are they just for show? Is it a genuine 'epiphany', or merely a 'smoke screen'?

Now, the evidence is in! A joint team from Renmin University’s Gaoling School of Artificial Intelligence, Shanghai AI Laboratory, UCL, and Dalian University of Technology used information theory to dissect the internal reasoning dynamics of large models, providing convincing answers:

When these 'thinking words' appear, the amount of information about the correct answer carried in the model's hidden representations suddenly surges severalfold!

This is not mere decoration but a real information peak and decision inflection point! Even more exciting, based on this discovery, researchers proposed simple methods to significantly enhance reasoning performance without additional training. The code is open-source!


Key Discovery 1: Revealing 'Information Peaks' in Large Model Reasoning Trajectories

The team tracked the 'brainwaves' (hidden-space representations) of reasoning-focused models such as the DeepSeek-R1 series and QwQ, measuring the mutual information (MI) between the representation at each reasoning step and the correct final answer, and observing how this MI evolves along the trajectory.

Striking phenomenon: Reasoning is not a steady climb but involves intense 'information pulses'! At specific steps, MI values suddenly and significantly spike, forming prominent MI Peaks. These sparse but critical points are like bright signals lighting up dark reasoning paths!
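How might such peaks be detected in practice? The paper's exact MI estimator is not reproduced in this article; as an illustrative stand-in, the sketch below scores the dependence between each step's hidden states and the answer representations with HSIC (a kernel dependence measure commonly used as an MI proxy) and flags sparse spikes with SciPy's peak finder. The toy arrays and the planted peak positions are placeholders, not the paper's data.

```python
import numpy as np
from scipy.signal import find_peaks

def rbf_kernel(X, sigma=None):
    """RBF Gram matrix with the median heuristic for the bandwidth."""
    sq = np.sum(X ** 2, axis=1, keepdims=True)
    d2 = np.clip(sq + sq.T - 2.0 * X @ X.T, 0.0, None)
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]) / 2.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(X, Y):
    """Biased HSIC estimate: a kernel measure of dependence, used here as an MI proxy."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    return np.trace(rbf_kernel(X) @ H @ rbf_kernel(Y) @ H) / (n - 1) ** 2

# Toy stand-ins: hidden[t] holds (n_problems, d) step-t hidden states,
# answer holds (n_problems, d) representations of the correct answers.
rng = np.random.default_rng(0)
T, n, d = 50, 64, 32
answer = rng.normal(size=(n, d))
hidden = rng.normal(size=(T, n, d))
hidden[17] += 0.8 * answer    # plant an "information peak" at step 17
hidden[34] += 0.8 * answer    # ... and another at step 34

mi_curve = np.array([hsic(hidden[t], answer) for t in range(T)])
peaks, _ = find_peaks(mi_curve, prominence=mi_curve.std())
print("MI peaks at steps:", peaks)   # expected: ~[17, 34]
```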


What does this mean? The representations at these MI peaks contain the most crucial information pointing toward the correct answer!

Further, theoretical analysis (Theorems 1 and 2) shows that the more MI is accumulated during reasoning, the tighter the bound on the model's probability of error, i.e., the higher the chance of a correct answer.
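The theorems themselves are not reproduced in this article. For intuition, here is the classical Fano bound, which connects accumulated MI to error probability in exactly this direction (the paper's Theorems 1 and 2 may differ in form):

```latex
% Fano's inequality: error probability vs. mutual information.
% Y: correct answer over a finite candidate set \mathcal{Y};
% H_{1:T}: the model's hidden representations along the reasoning trajectory.
\[
  P\bigl(\hat{Y} \neq Y\bigr)
  \;\ge\;
  \frac{H(Y) - I\bigl(Y;\, H_{1:T}\bigr) - 1}{\log \lvert \mathcal{Y} \rvert}
\]
```

As $I(Y; H_{1:T})$ grows, the lower bound on the error probability shrinks, which matches the claim that more accumulated MI means a better chance of a correct answer.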


Given that MI peaks are common in large reasoning models (LRMs), do non-reasoning LLMs show the same phenomenon?


To explore this, the team compared DeepSeek-R1-Distill models with their non-reasoning counterparts. In the paper's plots (orange curves), the non-reasoning models show smaller MI fluctuations, weaker MI peaks, and lower MI values overall.

This indicates that after reasoning-oriented reinforcement learning training, models encode more information about the correct answer and produce pronounced MI peaks!

Key Discovery 2: 'Thinking Words' = 'Information Peaks' in Language

What semantic information do these MI peaks contain?

Interestingly, when these 'information peaks' are decoded back into human-understandable language, they most often correspond to iconic 'thinking words' (a decoding sketch follows the list below):

  • Reflective/Pause: "Hmm", "Wait"
  • Logical/Transitional: "Therefore", "So"
  • Action: "Let", "First"
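One simple way to perform such decoding is a 'logit lens'-style readout: push a peak hidden state through the model's own final norm and unembedding and inspect the nearest vocabulary tokens. The sketch below assumes a LLaMA/Qwen-style attribute layout in Hugging Face transformers (`model.model.norm`, `model.lm_head`), and the checkpoint name is illustrative; whether the paper decodes peaks exactly this way is not specified here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any LLaMA/Qwen-style causal LM works; the checkpoint choice is illustrative.
name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

@torch.no_grad()
def decode_hidden(h, k=5):
    """Logit-lens readout: final RMSNorm + unembedding, then top-k tokens."""
    logits = model.lm_head(model.model.norm(h))
    return [tok.decode([i]) for i in logits.topk(k).indices.tolist()]

ids = tok("Wait, let me double-check the algebra.", return_tensors="pt").input_ids
out = model(ids, output_hidden_states=True)
# Pretend the last position of the penultimate layer (pre final-norm) is an MI peak.
h_peak = out.hidden_states[-2][0, -1]
print(decode_hidden(h_peak))
```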

For example, the team randomly selected some model outputs: "Wait, let me think differently. Let’s denote...", "Hmm, so I must have made a mistake somewhere. Let me double-check my calculations. First, ..."

The team dubbed these frequently occurring, information-carrying 'thinking words' embedded in the output "thinking tokens". They are not mere decoration but the linguistic surface of information peaks, potentially acting as critical markers or decision points on the reasoning path!

To verify their importance, the team conducted intervention experiments, suppressing these thinking tokens during reasoning.

Validation: Blocking the generation of thinking tokens significantly degraded model performance on math reasoning datasets (GSM8K, MATH, AIME24); in contrast, randomly blocking other words had minimal effect. This confirms that thinking tokens are vital for effective reasoning!
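A minimal sketch of such an intervention, using Hugging Face's `bad_words_ids` generation argument to ban a word list at decode time (`model` and `tok` are reused from the sketch above; the exact banned list and the paper's suppression mechanism are assumptions):

```python
# Ban common "thinking tokens" at decode time; the paper's control condition
# bans random other words instead and compares task accuracy.
thinking_words = ["Wait", " Wait", "Hmm", " Hmm", " Therefore", " So", " First"]
banned = [tok(w, add_special_tokens=False).input_ids for w in thinking_words]

prompt = tok("Solve: 3x + 5 = 20. Show your reasoning.", return_tensors="pt")
out = model.generate(
    **prompt,
    max_new_tokens=256,
    bad_words_ids=banned,            # NoBadWordsLogitsProcessor masks these ids
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```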


Practical Implication: Enhancing Reasoning Without Additional Training Using 'Information Peaks'

Armed with these insights into 'information peaks' and 'thinking tokens', the researchers proposed two practical, training-free methods to boost the reasoning performance of existing LRMs.

Method 1: Representation Recycling (RR)

  • Inspiration: Since MI peaks contain rich information, why not let the model 'digest' it more thoroughly?
  • Approach: During reasoning, when an MI peak is detected, instead of emitting output immediately, feed the corresponding representation back in for an extra round of computation, allowing the model to fully exploit these high-information states (a rough sketch follows this list).
  • Results: On multiple reasoning benchmarks (GSM8K, MATH500, AIME24), RR consistently improved reasoning performance. For example, on the challenging AIME24, accuracy increased by 20% with DeepSeek-R1-Distill-LLaMA-8B, demonstrating that better utilization of these high-information 'epiphanies' unlocks reasoning potential.
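The paper's implementation is not spelled out here; the sketch below shows one plausible reading of 'recycling' a representation: splice the hidden state from the peak step back into the input embedding sequence so the network processes it for one extra round before predicting the next token. The layer choice, the step index `t_peak`, and the embedding-space splice are all assumptions (`model` and `tok` are reused from the earlier sketch).

```python
import torch

@torch.no_grad()
def recycled_next_logits(input_ids, t_peak):
    """Next-token logits after 'recycling' the hidden state at step t_peak:
    the peak representation is appended to the input embeddings and the
    whole sequence is pushed through the model one more time."""
    out = model(input_ids, output_hidden_states=True)
    h_peak = out.hidden_states[-2][:, t_peak]          # penultimate layer, (1, d)
    embeds = model.get_input_embeddings()(input_ids)   # (1, T, d)
    # Hidden size equals embedding size in these models, so the splice is legal.
    embeds = torch.cat([embeds, h_peak.unsqueeze(1)], dim=1)
    return model(inputs_embeds=embeds).logits[:, -1]   # (1, vocab)

ids = tok("Wait, so 3x = 15, which means x =", return_tensors="pt").input_ids
logits = recycled_next_logits(ids, t_peak=0)           # recycle the first step
print(tok.decode([int(logits.argmax())]))
```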

Method 2: Test-time Scaling Based on Thinking Tokens (TTTS)

  • Inspiration: When the token budget for reasoning is increased, how can the model be guided to think more deeply rather than merely producing longer output?
  • Approach: After the initial reasoning pass, if token budget remains, force the model to continue generating from 'thinking words' such as "Therefore", "So", "Wait", or "Hmm", nudging it to reason more deeply with the extra compute (a minimal sketch follows this list).
  • Results: In the reported experiments, TTTS steadily improves reasoning performance as the token budget grows. On GSM8K and MATH500, TTTS outperforms the baseline under the same token limits. On AIME24, the baseline plateaus at 4096 tokens, while TTTS keeps improving and overtakes the baseline at 6144 tokens.
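Here is a minimal sketch of the TTTS idea under the simplest possible policy: whenever the model stops before the budget is spent, strip the end-of-sequence token, splice in a fixed 'thinking word', and resume decoding. The cue choice and scheduling are assumptions; the paper may select cues differently (`model` and `tok` are reused from above).

```python
import torch

@torch.no_grad()
def ttts_generate(prompt, budget=1024, cue=" Wait"):
    """Generate until the token budget is spent; whenever the model stops
    early, splice in a 'thinking word' and let it continue reasoning."""
    ids = tok(prompt, return_tensors="pt").input_ids
    cue_ids = tok(cue, add_special_tokens=False, return_tensors="pt").input_ids
    while ids.shape[1] < budget:
        ids = model.generate(ids, max_new_tokens=budget - ids.shape[1],
                             pad_token_id=tok.eos_token_id)
        if ids.shape[1] >= budget:
            break
        if ids[0, -1].item() == tok.eos_token_id:
            ids = ids[:, :-1]                    # drop EOS so decoding resumes
        ids = torch.cat([ids, cue_ids], dim=1)   # force deeper thinking
    return tok.decode(ids[0], skip_special_tokens=True)

print(ttts_generate("Solve: if 2n + 7 = 19, what is n?"))
```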

Summary

This study reveals the dynamic mechanism of reasoning in LRMs: by tracking mutual information, it first observes the MI Peak phenomenon, providing a new empirical perspective on the 'black box' reasoning process. The key insight is that these peaks correspond to 'thinking tokens'—words that carry critical information and drive the model's reasoning path. Interventions blocking these tokens significantly impair performance, confirming their importance. Based on these findings, two simple, training-free methods—Representation Recycling and Thinking Token-based Test-time Scaling—are proposed to unlock and enhance reasoning capabilities, offering new avenues for understanding and improving large models.
