Enhancing Multimodal Large Models with Game-Driven Data Synthesis: A New Approach to General Reasoning

Research shows that using game code to generate multimodal reasoning data significantly improves the general reasoning ability of large models, making training more cost-effective and scalable.

If I told you that training AI on game scenarios like Sokoban can improve its geometric and chart reasoning, would you believe it?

Recent research from Fudan NLP Lab and ByteDance’s Intelligent Service Team reveals an unexpected insight: Games are not just entertainment but valuable resources for training AI reasoning capabilities.

Introduction: Why is large-scale visual reasoning data so scarce?

High-quality multimodal reasoning data is critically limited, constraining the development of complex reasoning in vision-language models (VLMs). Is there a low-cost, reliable way to generate large-scale data?

Fudan and ByteDance’s research team proposed an innovative idea: using game code to automatically synthesize visual reasoning data.

Figure 1: Representative games in the GameQA dataset—3D reconstruction, Tangram (variant), Sudoku, and Sokoban. Each shows two visual QA examples, including game state images, questions, reasoning steps, and answers.

From game code to reasoning data: The innovative approach of Code2Logic

Why choose game code? The team found that games offer three unique advantages: first, they have well-defined rules and verifiable outcomes; second, game code encodes causal reasoning chains; third, game code can be generated easily by large language models (LLMs) at very low cost.

Based on this insight, they proposed Code2Logic, which uses LLMs to convert implicit reasoning in game code into explicit multimodal reasoning data through three core steps, as shown in Figure 2:

Step 1: Game code construction. Use LLMs (such as Claude 3.5 or GPT-4o) to automatically generate complete game logic, e.g., for Sokoban, from a single prompt.

Step 2: QA template design. Extract reasoning patterns from game code and design corresponding tasks and question-answer templates.

Step 3: Data engine development. Build automated programs that reuse core game logic (e.g., move functions) to generate large-scale QA instances, with reasoning correctness guaranteed by code execution.
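
To make Steps 2 and 3 concrete, here is a minimal sketch of what such a data engine might look like for a single Sokoban task. The template wording, task choice, and function names are illustrative assumptions, not the authors' actual implementation; the point is that the ground-truth answer is computed by executing the game logic, so correctness is guaranteed by code rather than by an LLM:

```python
import random

# Hypothetical QA template for a "legal move" Sokoban task (Step 2).
# The task choice and wording are illustrative, not from the paper.
QUESTION_TEMPLATE = (
    "In the Sokoban grid shown, the player is at row {r}, column {c}. "
    "Is moving {direction} a legal move? Answer yes or no."
)

DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def is_legal_move(grid, player, direction):
    """Core game logic (the Step 1 output) reused by the data engine:
    a move is legal if the target cell is free, or holds a box that
    can be pushed into a free cell behind it."""
    dr, dc = DIRECTIONS[direction]
    r, c = player[0] + dr, player[1] + dc
    if grid[r][c] == "#":                 # wall blocks the move
        return False
    if grid[r][c] == "B":                 # box: check the cell behind it
        return grid[r + dr][c + dc] not in ("#", "B")
    return True                           # free floor cell


def generate_qa(grid, player):
    """Step 3: instantiate one QA pair. The ground-truth answer comes
    from executing the game logic, so it is verifiable by construction."""
    direction = random.choice(list(DIRECTIONS))
    question = QUESTION_TEMPLATE.format(r=player[0], c=player[1],
                                        direction=direction)
    answer = "yes" if is_legal_move(grid, player, direction) else "no"
    return {"question": question, "answer": answer}


# '#' = wall, 'B' = box, '.' = floor; the player stands at (1, 1).
grid = ["#####",
        "#.B.#",
        "#...#",
        "#####"]
print(generate_qa(grid, (1, 1)))
```

Because the engine samples game states and fills templates programmatically, new QA instances can be produced indefinitely once the game code exists, which is what makes the approach cheap to scale.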

Figure 2: Workflow diagram of Code2Logic method

GameQA: An extensible multimodal reasoning dataset

The team used Code2Logic to build GameQA, which has the following key features:

Large-scale and diverse. Covers 4 cognitive ability categories, 30 games, 158 reasoning tasks, and 140,000 QA pairs, as shown in Figures 1 and 3.

Cost-effective and scalable. Once the game code is built, the data engine can generate a virtually unlimited stream of new samples at low cost.

Reasonable difficulty levels. Each task is offered at Easy, Medium, and Hard difficulty, and game state complexity is likewise set at three levels, facilitating systematic evaluation of model capabilities.
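
As a rough illustration of how such difficulty levels could be parameterized (the specific knobs and values below are assumptions for this sketch, not GameQA's published settings), a data engine might expose separate controls for task difficulty and state complexity:

```python
# Hypothetical difficulty configuration for a Sokoban-style data engine.
# Knob names and values are illustrative, not GameQA's actual settings.
STATE_COMPLEXITY = {
    "low":    {"grid_size": 5,  "num_boxes": 1},
    "medium": {"grid_size": 7,  "num_boxes": 2},
    "high":   {"grid_size": 10, "num_boxes": 4},
}

TASK_DIFFICULTY = {
    "easy":   {"reasoning_steps": 1},  # e.g., "is this single move legal?"
    "medium": {"reasoning_steps": 3},  # e.g., "what state results after 3 moves?"
    "hard":   {"reasoning_steps": 6},  # e.g., "find a move sequence to a goal"
}
```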

Figure 3: The 30 games in GameQA, categorized into 4 cognitive ability groups. Out-of-domain games are excluded from training.

Deeper analysis reveals two scaling effects:

  • Increasing game diversity enhances out-of-domain generalization: on unseen games, models trained on 20 games outperform those trained on 4 and 10 games by 1.80% and 1.20%, respectively.
  • Sample diversity correlates with better out-of-domain performance: across different data configurations, training on more diverse samples consistently leads to stronger generalization.

These scaling effects demonstrate that GameQA’s diversity and scalability directly boost the general reasoning capabilities of multimodal models.

Figure 4: Effect of game diversity and sample diversity on model performance. Models trained on 20 games show better out-of-domain generalization and efficiency.

The team further analyzed the bottlenecks in VLM reasoning capabilities, identifying four main weaknesses:

  • 3D spatial perception is a major weakness. In 3D maze games, models often confuse height relations, mistaking objects that appear higher in the 2D image for objects with greater height in 3D space, revealing a fundamental gap in spatial understanding.
  • Recognition and localization of objects are challenging. These difficulties intensify when visual scenes are non-grid-like (e.g., Mahjong or card games).
  • Multiple visual inputs cause errors. In multi-image tasks, models often start out well but are then misled by earlier visual information, producing mismatched image-text associations.
  • Strategy planning is weak. For tasks requiring optimal solutions (such as Sokoban), models lack the human-like intuition for pruning the search space and cannot carry out large-scale search, resulting in poor performance.
Figure 6: GPT-4o confuses object heights in 3D maze scenarios

Conclusion

This study introduces Code2Logic, a novel method for synthesizing multimodal reasoning data from game code. The resulting GameQA dataset is low-cost and scalable, with well-designed difficulty levels and high diversity, making it an ideal resource for training and evaluating multimodal models.

Moreover, the research demonstrates that reinforcement learning on game QA tasks can significantly enhance the general reasoning ability of multimodal models, validating the potential of games as verifiable environments for advancing AI general intelligence.
