SinoAI Launches OmniGen2: One Click Unlocks the Doraemon-Style 'Anywhere Door' of AI Image Generation

OmniGen2 is an open-source unified image generation model that supports multiple tasks, from text-to-image generation to image editing, all driven by natural language instructions.

In September 2024, the ZhiYuan Research Institute released OmniGen, a unified image generation model supporting tasks such as Text-to-Image, Image Editing, and Subject-driven Image Generation. Users can invoke all of these capabilities through multimodal natural language instructions alone, without extra prompts, plugins, or preprocessing modules. Its versatility and simple architecture earned widespread praise from the community. Since then, with the emergence of closed-source multimodal models such as Gemini 2.0 Flash and GPT-4o, building a unified image generation model has become a key focus of both research and application.

Against this backdrop, OmniGen has received a major upgrade with the official launch of OmniGen2. The new generation keeps the architecture simple while significantly improving context understanding, instruction following, and image quality. It inherits the multimodal capabilities of its base model, supporting both image and text generation, and fits naturally into the broader multimodal ecosystem. All model weights, training code, and datasets will be open-sourced, giving the community a foundation for further development and optimization. Together, these features accelerate the transition of unified image generation from concept to reality.

1. Decoupled Architecture with Dual Encoders

OmniGen2 adopts an architecture that decouples the text and image pathways, using ViT and VAE as dual image encoders. Unlike other works, the two encoders operate independently: ViT features feed the MLLM for understanding, while VAE features feed the Diffusion Transformer for generation. This improves image consistency while preserving the model's original text generation capabilities.

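As a rough illustration of this decoupled design (a minimal sketch only; the module names and interfaces below are assumptions, not the released implementation), the two encoders can be wired so that ViT features only ever reach the MLLM and VAE latents only ever reach the diffusion transformer:

```python
# Minimal sketch of the decoupled dual-encoder wiring (illustrative, not the official code).
import torch.nn as nn


class DecoupledOmniGenSketch(nn.Module):
    """Hypothetical skeleton: ViT serves the MLLM, VAE serves the diffusion transformer."""

    def __init__(self, vit, vae, mllm, diffusion_transformer):
        super().__init__()
        self.vit = vit                    # image encoder for understanding (MLLM side)
        self.vae = vae                    # latent encoder for generation (DiT side)
        self.mllm = mllm                  # autoregressive multimodal LM over text + ViT tokens
        self.dit = diffusion_transformer  # diffusion decoder conditioned on MLLM hidden states

    def forward(self, text_tokens, input_images, noisy_latents, timestep):
        # Understanding path: only ViT tokens and text pass through the MLLM.
        vit_tokens = self.vit(input_images)
        condition = self.mllm(text_tokens, vit_tokens)

        # Generation path: VAE latents of the reference images go straight to the DiT,
        # so pixel-level detail does not have to survive a round trip through the MLLM.
        ref_latents = self.vae(input_images)
        return self.dit(noisy_latents, timestep, condition, ref_latents)
```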

2. Rebuilt Data Generation Pipeline

OmniGen2 also tackles the quality and evaluation problems of existing foundational datasets, especially for image editing tasks. The team built a pipeline that derives image editing and in-context reference data from video and image datasets.

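The article only outlines this pipeline, so the following is a hedged sketch of one plausible ingredient: pairing frames from the same video as before/after samples and letting a captioner describe the change as an editing instruction. Every function name here (sample_frame_pair, caption_edit, keep_pair) is hypothetical.

```python
# Hypothetical sketch of mining editing pairs from video (not the released pipeline).
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class EditSample:
    source_frame: np.ndarray   # frame before the change
    target_frame: np.ndarray   # frame after the change
    instruction: str           # natural-language description of the edit


def build_edit_samples(
    video_paths: List[str],
    sample_frame_pair: Callable[[str], Tuple[np.ndarray, np.ndarray]],
    caption_edit: Callable[[np.ndarray, np.ndarray], str],
    keep_pair: Callable[[np.ndarray, np.ndarray], bool],
) -> List[EditSample]:
    """Turn raw videos into (source, target, instruction) triples for editing training."""
    samples: List[EditSample] = []
    for path in video_paths:
        src, tgt = sample_frame_pair(path)  # e.g. two frames of the same scene, seconds apart
        if not keep_pair(src, tgt):         # filter pairs that changed too much or too little
            continue
        samples.append(EditSample(src, tgt, caption_edit(src, tgt)))
    return samples
```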

3. Reflection Mechanism in Image Generation

Inspired by large language models’ self-reflection abilities, OmniGen2 incorporates a reflection strategy into multimodal generation. It constructs reflection data consisting of alternating text and images, including user instructions, generated images, and subsequent reflections.

Each reflection analyzes how the generated image falls short of the original instruction and proposes concrete fixes for the next attempt.

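Schematically, the loop looks something like the sketch below; the generate and reflect callables are placeholders, since the article does not specify the actual interfaces or stop condition.

```python
# Schematic generate-reflect-regenerate loop (interfaces and stop rule are assumptions).
def generate_with_reflection(instruction, generate, reflect, max_rounds=3):
    """Generate an image, critique it against the instruction, and retry with that feedback."""
    history = []                                # alternating (image, reflection) records
    image = generate(instruction, history)
    for _ in range(max_rounds):
        critique = reflect(instruction, image)  # text: what is missing or wrong, and how to fix it
        history.append((image, critique))
        if "no issues" in critique.lower():     # assumed stop signal, not the real criterion
            break
        image = generate(instruction, history)  # next attempt conditioned on past attempts
    return image, history
```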

Trained models show preliminary reflection ability, with future plans to enhance this via reinforcement learning.

Within a week of release, OmniGen2’s GitHub stars exceeded 2000, and related discussions on WeChat and other platforms reached hundreds of thousands of views.

The research demo version is now open for early testing of image editing, in-context image generation, and other features:

Research Demo Link

Using OmniGen2 is simple: just input prompts to unlock rich image editing and generation capabilities.

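To make the prompt-only interaction concrete, here is an input/output sketch. The function below is a placeholder standing in for the real inference entry point in the OmniGen2 repository; its name and signature are assumptions, and the commented calls correspond to the three capabilities described next.

```python
# Placeholder contract sketch; wire this to the official OmniGen2 inference code.
from typing import List, Optional


def omnigen2_generate(prompt: str, images: Optional[List[str]] = None) -> str:
    """Stand-in for an OmniGen2 call: one natural-language instruction plus any
    reference images it mentions; returns a path to the generated image."""
    raise NotImplementedError("replace with the official inference pipeline")


# Text-to-image: no reference images.
# omnigen2_generate("A watercolor painting of a lighthouse at dusk")

# Editing: the instruction refers to the supplied image.
# omnigen2_generate("Remove the person on the left, keep the background", ["photo.jpg"])

# In-context generation: several references composed into one scene.
# omnigen2_generate("Put the cat from the first image onto the sofa in the second image",
#                   ["cat.jpg", "sofa.jpg"])
```
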
1. Natural Language-Based Image Editing

Based on natural language commands alone, OmniGen2 supports local modifications such as adding or removing objects, adjusting colors, changing facial expressions, and replacing backgrounds.

2. Multimodal Context-Aware Image Generation

OmniGen2 can extract specified elements from input images and generate new images around them, for example placing an object or a person into a new scene. At present it is better at preserving object similarity than face similarity.

3. Text-to-Image Generation

OmniGen2 can generate images at arbitrary aspect ratios.

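For callers who want to turn a target aspect ratio into concrete dimensions, a small helper like the one below can be used; the multiple-of-16 rounding is an assumption common to latent diffusion models, not a documented OmniGen2 requirement.

```python
# Sketch: map a target aspect ratio to pixel dimensions on a coarse grid.
# The multiple-of-16 rounding is an assumption, not a documented OmniGen2 constraint.
def resolution_for(aspect_w: int, aspect_h: int, base: int = 1024, multiple: int = 16):
    """Return (width, height) close to `base` on the long side, matching aspect_w:aspect_h."""
    if aspect_w >= aspect_h:
        width, height = base, round(base * aspect_h / aspect_w)
    else:
        width, height = round(base * aspect_w / aspect_h), base
    snap = lambda x: max(multiple, (x // multiple) * multiple)  # round down to the grid
    return snap(width), snap(height)


print(resolution_for(16, 9), resolution_for(9, 16), resolution_for(1, 1))
# (1024, 576) (576, 1024) (1024, 1024)
```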

OmniGen2 has achieved competitive results on existing benchmarks, including text-to-image and image editing. However, for in-context image generation, there is still a lack of comprehensive public benchmarks for systematic evaluation and comparison.

Existing benchmarks often fail to capture real-world scenarios involving multiple input images and cover only a narrow range of contexts and task types. Earlier metrics such as CLIP-I and DINO scores measure image similarity but offer little interpretability, especially in multi-subject scenes.

To address this, the team introduced OmniContext, which includes 8 task categories for evaluating consistency among people, objects, and scenes. Data is constructed using a hybrid approach combining multimodal large language models and manual annotation.

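For context, the CLIP-I baseline mentioned above is essentially cosine similarity between CLIP image embeddings. Below is a minimal sketch using the Hugging Face transformers CLIP implementation; the checkpoint choice and wrapper function are this sketch's own, not the OmniContext evaluation code.

```python
# CLIP-I style score: cosine similarity of CLIP image embeddings (illustrative).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def clip_i(image_a: Image.Image, image_b: Image.Image) -> float:
    """Cosine similarity between the CLIP image embeddings of two images."""
    inputs = processor(images=[image_a, image_b], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)  # shape: (2, embed_dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())


# Example: compare a reference subject image against a generated result.
# score = clip_i(Image.open("reference.png"), Image.open("generated.png"))
```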

OmniGen2 is trained and served with the team's self-developed FlagScale framework, which optimizes inference deployment. By deeply reworking the inference pipeline and integrating the TeaCache caching strategy, inference efficiency improves by 32%, significantly reducing response times and enhancing service performance. The framework also supports one-click elastic deployment across multiple machines, raising overall resource utilization. The team will continue to pursue software-hardware co-optimization to build a high-efficiency inference deployment stack.

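The article does not describe the TeaCache integration in detail; the sketch below only illustrates the general idea behind such step-skipping caches (reuse a recent intermediate result when the timestep conditioning has barely changed), with the threshold and interface chosen for illustration.

```python
# Generic sketch of a step-skipping feature cache for a diffusion sampler.
# Threshold and interface are illustrative; see FlagScale / TeaCache for the real mechanisms.
import torch


class FeatureCache:
    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self._key = None      # timestep embedding of the last full computation
        self._value = None    # cached model output

    def fetch_or_compute(self, t_emb: torch.Tensor, compute):
        """Reuse the cached output when the timestep embedding has barely changed."""
        if self._key is not None:
            drift = (t_emb - self._key).abs().mean() / (self._key.abs().mean() + 1e-8)
            if drift < self.threshold:
                return self._value            # skip the expensive transformer forward pass
        value = compute()                     # full forward pass
        self._key, self._value = t_emb.detach(), value
        return value
```
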
All model weights, training code, and datasets will be open-sourced, accelerating the transition from concept to practical application of a unified image generation model.

OmniGen2 related links:
