ICML 2025 | Latest Advances in Multimodal Understanding and Generation: HKUST and Snap Research Release ThinkDiff, Giving Diffusion Models a Brain

The Hong Kong University of Science and Technology (HKUST) and Snap Research unveil ThinkDiff at ICML 2025, giving diffusion models reasoning abilities with minimal data and advancing multimodal understanding and generation.



Author: Mi Zhenxing, a PhD student at HKUST working on multimodal understanding and 3D/4D reconstruction, currently seeking industry positions.

Since diffusion models such as Stable Diffusion and Flux revolutionized image generation, the field has advanced rapidly, yet these models still lack genuine multimodal reasoning and understanding.

OpenAI’s GPT-4o image generation and Google’s Gemini Pro demonstrate powerful multimodal reasoning, but they rely on massive parameter counts and training data. Achieving similar results with limited resources remains an open challenge.

At ICML 2025, HKUST and Snap Research introduced ThinkDiff, a new approach that teaches diffusion models to think and reason using only a small number of image-text pairs and a few hours of training, opening new avenues for multimodal understanding and generation.


ThinkDiff Algorithm Design

The core idea of ThinkDiff is to transfer the reasoning ability of large vision-language models (VLMs) to diffusion models. By combining a VLM's multimodal reasoning with a diffusion model's high-quality generation, the resulting model can truly understand the logical relationships between images and text prompts and synthesize high-quality images accordingly.

Shared Feature Space between LLM and Diffusion

Recent text-to-image diffusion models such as Flux and Stable Diffusion 3 use the encoders of large language models (e.g., T5) as their text encoders. Because the diffusion decoder and the LLM's own decoder both consume these encoder features, they share a common input feature space, and this shared space is what lets a diffusion model inherit multimodal reasoning from a VLM.
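To make the shared-space idea concrete, here is a minimal sketch assuming Hugging Face Transformers; the model name and the `diffusion_decoder` call are illustrative placeholders, not ThinkDiff's actual code.

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

# A T5 encoder of the kind used by Flux / Stable Diffusion 3 as a text encoder.
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
t5_encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

prompt = "a cat wearing a tiny wizard hat"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # These encoder features live in the space consumed by BOTH the T5 (LLM)
    # decoder and the diffusion decoder, which is the bridge ThinkDiff exploits.
    text_features = t5_encoder(**tokens).last_hidden_state  # [1, seq_len, d_model]

# Hypothetical diffusion decoder consuming the same features as conditioning:
# pred_noise = diffusion_decoder(latents, timestep, encoder_hidden_states=text_features)
```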


Aligning VLM with LLM Decoder

Directly aligning a VLM with the diffusion decoder would require extensive data and costly training. Instead, ThinkDiff uses a proxy task: vision-language pretraining (VLP) that aligns the VLM with the LLM decoder. Once that alignment is learned, the shared feature space naturally aligns the VLM with the diffusion decoder as well.

During training, images and text are fed into the VLM, which autoregressively generates multimodal feature vectors. A lightweight aligner maps these features into the LLM decoder's input space, and the decoder reconstructs the source text under cross-entropy supervision.
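A minimal sketch of this proxy training step follows, assuming the aligner is a small two-layer MLP and `vlm` / `llm_decoder` are stand-ins for the frozen VLM and frozen LLM decoder; their call signatures are illustrative assumptions, not ThinkDiff's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Aligner(nn.Module):
    """Lightweight mapping from VLM token features into the LLM decoder's input space."""
    def __init__(self, vlm_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def proxy_training_step(vlm, aligner, llm_decoder, images, texts, target_ids):
    # 1) The frozen VLM autoregressively produces multimodal token features.
    with torch.no_grad():
        vlm_features = vlm(images, texts)          # [B, N, vlm_dim]
    # 2) The aligner (the only trained module) maps them into the LLM decoder's space.
    cond = aligner(vlm_features)                   # [B, N, llm_dim]
    # 3) The frozen LLM decoder reconstructs the text; cross-entropy supervises the aligner.
    logits = llm_decoder(cond, target_ids)         # teacher-forced, [B, T, vocab]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
```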

After this training, the aligner lets the diffusion decoder inherit the VLM's multimodal reasoning, allowing the model to perform complex reasoning and generation tasks in multimodal contexts.
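At inference time the LLM decoder is simply swapped for the diffusion decoder, since both read the same feature space. A hedged sketch, reusing the stand-in modules above (the `sample` call is a placeholder, not a real API):

```python
import torch

def generate_image(vlm, aligner, diffusion_decoder, images, prompt):
    # Reasoning-rich conditioning features from the VLM, mapped by the trained aligner.
    with torch.no_grad():
        cond = aligner(vlm(images, prompt))
    # The diffusion decoder consumes these features exactly where it would
    # normally take the T5 encoder's text features.
    return diffusion_decoder.sample(condition=cond)
```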


Core Network Design

Aligning the VLM's generated tokens to the diffusion decoder relies on two key designs:

  • Token Alignment: the tokens autoregressively generated by the VLM, which carry its reasoning, are aligned to the diffusion decoder, transferring multimodal reasoning ability.
  • Masked Training: tokens are randomly masked during training to prevent shortcut learning, forcing the aligner to genuinely understand image-text semantics (see the sketch below).
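As referenced above, here is a minimal sketch of the random masking; the mask ratio and the zero-out scheme are illustrative assumptions.

```python
import torch

def mask_tokens(features: torch.Tensor, mask_ratio: float = 0.3) -> torch.Tensor:
    """Randomly drop a fraction of VLM token features before they enter the aligner,
    so the aligner cannot rely on shortcuts and must model image-text semantics."""
    batch, num_tokens, _ = features.shape
    keep = torch.rand(batch, num_tokens, 1, device=features.device) > mask_ratio
    return features * keep.to(features.dtype)
```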

Network Variants

Depending on the vision-language model it builds on, ThinkDiff has two variants (a minimal sketch follows the list):

  • ThinkDiff-LVLM: aligns a large vision-language model to the diffusion decoder, inheriting its multimodal understanding and reasoning.
  • ThinkDiff-CLIP: aligns the CLIP vision encoder to the diffusion decoder, strengthening combined image-text conditioning for generation.
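A hedged sketch of how the two variants differ, where `lvlm` and `clip_vision` stand in for the actual large vision-language model and CLIP vision encoder (names and signatures are assumptions):

```python
def encode_context(variant, images, texts, lvlm=None, clip_vision=None):
    """Produce the features that the shared aligner maps into the diffusion decoder."""
    if variant == "thinkdiff-lvlm":
        # Token features generated by the LVLM, carrying its multimodal reasoning.
        return lvlm(images, texts)
    if variant == "thinkdiff-clip":
        # Semantic image tokens from the CLIP vision encoder, later combined with the text prompt.
        return clip_vision(images)
    raise ValueError(f"unknown variant: {variant}")
```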

Experimental Results

Quantitative Multimodal Understanding and Generation: ThinkDiff-LVLM significantly outperforms existing methods on the CoBSAT benchmark in both generation accuracy and image quality.


Whereas competing methods train on hundreds of GPUs, ThinkDiff-LVLM achieves the best results with only about 5 hours of training on 4 A100 GPUs.


Multimodal Reasoning and Image Generation Results

ThinkDiff-LVLM can deeply reason over multimodal inputs and generate high-quality images, comparable to models like Gemini.

Comparison with Gemini

Gemini:

[Image: Gemini generation results]

Our Approach:

[Image: ThinkDiff generation results]

In multi-image input scenarios, ThinkDiff-CLIP can effectively combine multiple images into coherent outputs.


It can also generate high-quality videos from image and text inputs without any retraining, simply by swapping the diffusion decoder for CogVideo.


Summary

ThinkDiff effectively transfers multimodal reasoning abilities to diffusion models, enabling high-quality and resource-efficient multimodal understanding and generation. In the experiments it surpasses existing open-source and commercial models, and it has broad potential applications in both research and industry.
