8B Model Surpasses GPT-4o! Parallel KV Cache Compression Supports 128K Length Extrapolation with the ParallelComp Method

With ParallelComp, an 8B model achieves performance beyond GPT-4o by extrapolating to 128K-token contexts through parallel KV cache compression, enabling efficient long-text reasoning.

Author: Xiong Jing, a first-year PhD student at HKU supervised by Prof. Huang Yi and Prof. Kong Lingpeng, with papers published at ICLR, ICML, NeurIPS, ACL, EMNLP, and TMLR. Homepage: https://menik1126.github.io/

Introduction: Bottlenecks and Breakthroughs in Long-Text Reasoning for Large Models

As large language models (LLMs) grow more capable, understanding and processing ultra-long texts becomes critical. Current models are typically trained on 4K-8K-token contexts and handle them efficiently, but inference on texts beyond 128K tokens runs into memory limits and attention sink issues, leading to information loss and limiting real-world applications.

Efficient long-text inference faces two main bottlenecks: length extrapolation of positional encodings and memory constraints. Positional encoding approaches include frequency-based NTK interpolation and chunking strategies, but their effectiveness at lengths beyond 128K remains uncertain. Memory bottlenecks are addressed via KV cache compression, yet the attention sink phenomenon, in which attention mass concentrates on a few positions near the ends of the text, is still underexplored in this setting.

Figure 1: Attention distribution differences between standard and parallel attention mechanisms, with the latter showing multi-peak sink phenomena in ultra-long texts, requiring systematic analysis and solutions.

Our Innovation: ParallelComp, a New Paradigm for Ultra-Long Text Reasoning

To address these issues, we propose ParallelComp, a training-free long-text processing solution. It combines parallel attention chunking, intelligent KV cache eviction, and attention bias calibration, enabling efficient inference on texts up to 128K tokens.

Figure 2: Overall workflow of ParallelComp

1. Parallel Chunk Attention for Easy Long-Text Processing

ParallelComp splits the input text into chunks sized to the model's maximum length, processes them in parallel together with the query, and reuses positional encodings across chunks, enabling seamless extrapolation to 128K tokens without performance loss (see Figure 2).
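As a rough illustration, the chunking step can be thought of as reshaping one long prompt into a batch of window-sized prompts that all share the same position range. The sketch below is a minimal PyTorch rendition under assumed conventions (the query is appended to every chunk, the last chunk is padded with token id 0); the function and argument names are invented for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def split_into_parallel_chunks(input_ids: torch.Tensor, max_len: int, query_len: int) -> torch.Tensor:
    """Split a long prompt into chunks that each fit the model's trained window.

    The query (the last `query_len` tokens of the prompt) is appended to every
    context chunk, so all chunks can be pre-filled in parallel as one batch,
    each reusing positions 0..max_len-1 instead of positions the model never
    saw during training.
    """
    query = input_ids[-query_len:]
    context = input_ids[:-query_len]
    chunk_size = max_len - query_len
    chunks = []
    for start in range(0, context.numel(), chunk_size):
        piece = context[start:start + chunk_size]
        # Left-pad the final (shorter) piece so every row has the same length
        # and the query still sits at the end of each chunk (pad id 0 assumed).
        if piece.numel() < chunk_size:
            piece = F.pad(piece, (chunk_size - piece.numel(), 0))
        chunks.append(torch.cat([piece, query]))
    return torch.stack(chunks)  # (num_chunks, max_len): ready for batched pre-fill
```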

2. Dual Eviction Mechanisms for Extreme Memory Compression

  • Chunk Eviction: Scores tokens within each chunk based on information content, retaining only the most relevant parts to reduce memory usage.
  • KV Cache Eviction: Uses FlashAttention to dynamically evaluate token importance, automatically discarding less relevant tokens, compressing KV caches at chunk granularity.

These mechanisms reduce inference memory to 64 GB, allowing 128K-token inference via parallel pre-filling and greatly improving batch inference and multi-GPU efficiency.
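To make the eviction idea concrete, here is a minimal sketch of score-based eviction at chunk granularity, assuming per-token attention mass from pre-fill is available (how it is extracted from the attention kernel is abstracted away). The scoring rule, keep ratio, and names are illustrative rather than the paper's exact procedure.

```python
import torch

def evict_kv_cache(keys: torch.Tensor, values: torch.Tensor,
                   attn_scores: torch.Tensor, keep_ratio: float = 0.25):
    """Keep only the most attended tokens in one chunk's KV cache.

    keys, values: (num_heads, chunk_len, head_dim) cached entries for a chunk.
    attn_scores:  (num_heads, query_len, chunk_len) attention weights observed
                  during pre-fill.
    A token's importance is its attention mass summed over heads and query
    positions; only the top `keep_ratio` fraction of tokens survives.
    """
    importance = attn_scores.sum(dim=(0, 1))              # (chunk_len,)
    k = max(1, int(keep_ratio * importance.numel()))
    keep = importance.topk(k).indices.sort().values       # preserve original token order
    return keys[:, keep, :], values[:, keep, :]
```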

Figure 3: Typical attention bias distributions (U-shape, mountain, uniform) in parallel attention, with multi-peak sink phenomena requiring targeted calibration.

3. Attention Bias Calibration to Tackle Multi-peak Sink Phenomena

Analysis reveals that parallel chunking produces multi-peak attention distributions, leaving parts of the context effectively unattended. We propose hierarchical bias calibration, which suppresses tokens that absorb excessive attention, smoothing the distribution, reducing these dead zones, and improving long-text reasoning.
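The sketch below shows one simple way such calibration can be realized: detect tokens that soak up a disproportionate share of attention and damp them before renormalizing. The threshold, damping factor, and function name are illustrative assumptions, not the paper's exact calibration rule.

```python
import torch

def calibrate_attention_bias(attn_weights: torch.Tensor, sink_threshold: float = 0.5,
                             damping: float = 0.1) -> torch.Tensor:
    """Down-weight "sink" tokens that absorb a disproportionate share of attention.

    attn_weights: (num_heads, query_len, key_len) softmax attention weights.
    Tokens whose average attention mass exceeds `sink_threshold` times the
    largest mass are damped, and each row is renormalized so the freed-up
    probability spreads to previously under-attended regions.
    """
    token_mass = attn_weights.mean(dim=(0, 1))                 # (key_len,)
    sinks = token_mass > sink_threshold * token_mass.max()
    calibrated = attn_weights.clone()
    calibrated[:, :, sinks] *= damping                         # suppress, do not zero out
    return calibrated / calibrated.sum(dim=-1, keepdim=True)   # renormalize per query
```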

Layer-wise bias removal strategies further optimize performance, since the attention biases of certain layers are critical for in-context learning and reasoning. Theoretical analysis confirms that sparse attention under parallel chunking balances efficiency and information fidelity, with optimal parameters derived for the best trade-off.

Table 1: Impact of bias token removal at different layers on various tasks

Shallow layers (1-8) are crucial for retrieval tasks; removing bias tokens there weakens performance. Deep layers (9-16) are key for reasoning; removing their bias tokens degrades accuracy. Biases in the middle layers mainly influence comprehension and reading tasks.
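Read as a rule of thumb, these findings could be wired into a simple layer-selection policy: protect the layers whose biases matter for the task at hand and calibrate the rest. The mapping below is only a hypothetical sketch based on the layer ranges above, not the paper's actual configuration.

```python
# Illustrative reading of Table 1: layers whose attention biases should be
# left untouched for a given task (ranges mirror the text above).
CRITICAL_BIAS_LAYERS = {
    "retrieval": set(range(1, 9)),    # shallow layers 1-8
    "reasoning": set(range(9, 17)),   # deep layers 9-16
}

def layers_to_calibrate(task: str, num_layers: int = 16) -> list[int]:
    """Return the layer indices whose bias tokens may be suppressed for `task`."""
    protected = CRITICAL_BIAS_LAYERS.get(task, set())
    return [layer for layer in range(1, num_layers + 1) if layer not in protected]
```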

Our theory indicates that increasing the number and length of chunks makes attention sparser, concentrating it on a few tokens. With properly chosen sparsity parameters and hierarchical calibration, efficiency and information preservation can be balanced.

Table 2: Performance on LongBench

Table 3: Performance on InfiniteBench

Table 4: Perplexity (PPL) performance

Table 5: Speedup ratios and performance analysis

Large-scale experiments show that ParallelComp enables a model with only 8B parameters, trained on an 8K context, to seamlessly extrapolate its inference length from 8K to 128K tokens, with pre-filling acceleration of up to 23.5x. The small model reaches 91.17% of GPT-4o's overall performance on long-text tasks and surpasses GPT-4o, Claude-2, and Kimi-Chat in specific scenarios, demonstrating the method's applicability to resource-constrained environments and real-world deployment.

Conclusion and Future Outlook

ParallelComp offers a system-level leap in long-text reasoning for LLMs: longer contexts, higher efficiency, and a better balance between the two, all without retraining, by leveraging intelligent chunking, eviction, and calibration mechanisms. It is expected to benefit applications such as legal documents, medical literature, and novels, further advancing long-text AI capabilities.

