Open-Source MetaQuery Arrives! OpenUni Uses Only 1.1B Parameters to Match BLIP3-o-8B, with Data and Code Fully Open-Sourced
With GPT-4o demonstrating impressive multimodal capabilities, unifying visual understanding and image generation in a single model has become a key research trend in AI (see, e.g., MetaQuery and BLIP3-o).
Researchers from Nanyang Technological University's S-Lab and SenseTime's Singapore Research Institute have launched OpenUni, an open-source counterpart of MetaQuery that matches 8B-model performance with only 1.1B parameters, with code, weights, and data fully open-sourced!

- Technical Report: OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
- Institutions: Nanyang Technological University S-Lab, SenseTime Singapore Research Institute
- Authors: Size Wu*, Zhonghua Wu*, Zerui Gong* (*Equal contribution), Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy

OpenUni architecture: a frozen InternVL (understanding) and SANA (generation) bridged by 256 learnable queries and a lightweight 6-layer connector.

Figure 1: OpenUni's performance on generation tasks, demonstrating high parameter efficiency.
Three Core Advantages
1. 🏗️ Simple Architecture
Only a 6-layer connector, significantly fewer than MetaQuery's 24 layers.
2. ⚡ Parameter Efficiency
1.1B parameters achieve a GenEval score of 0.84, comparable to BLIP3-o-8B.
3. 📂 Fully Open Source
Model weights, training code, and all 23 million training samples are publicly available.
Design and Training Strategies
Following MetaQuery's design principles, OpenUni comprises four core components:
- 256 learnable queries that extract conditioning information from the user instruction.
- A frozen InternVL that preserves the original understanding capabilities.
- A 6-layer ViT-style transformer connector.
- The SANA diffusion model for efficient image generation.
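To make the design concrete, here is a minimal PyTorch sketch of the generation bridge. All module names, dimensions, and the MLLM call signature are illustrative assumptions for exposition, not the released implementation:

```python
import torch
import torch.nn as nn

class OpenUniBridge(nn.Module):
    """Illustrative bridge between a frozen MLLM and a diffusion model:
    256 learnable queries are appended to the instruction tokens, the frozen
    MLLM contextualizes them, and a 6-layer ViT-style transformer projects
    the resulting query states into the diffusion conditioning space."""

    def __init__(self, mllm, mllm_dim=2048, cond_dim=2304,
                 num_queries=256, depth=6, num_heads=16):
        super().__init__()
        self.mllm = mllm.eval().requires_grad_(False)  # frozen InternVL
        # Learnable queries that gather conditioning info from the prompt.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)
        block = nn.TransformerEncoderLayer(
            d_model=mllm_dim, nhead=num_heads, batch_first=True,
            norm_first=True)  # pre-norm blocks, as in ViT
        self.connector = nn.TransformerEncoder(block, num_layers=depth)
        self.proj = nn.Linear(mllm_dim, cond_dim)

    def forward(self, text_embeds):
        # text_embeds: (B, T, mllm_dim) embeddings of the user instruction.
        q = self.queries.expand(text_embeds.size(0), -1, -1)
        # Assumed to return last hidden states, shape (B, T + 256, mllm_dim).
        hidden = self.mllm(inputs_embeds=torch.cat([text_embeds, q], dim=1))
        q_states = hidden[:, -self.queries.size(0):]
        return self.proj(self.connector(q_states))  # conditioning for SANA
```

The key design point: all new capacity lives in the 256 queries and the small connector, while neither the MLLM nor the diffusion backbone is modified.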
Model Comparison

* For BLIP3-o, the DiT predicting CLIP features is regarded as a connector.
Key distinctions: fewer connector parameters, a smaller MLLM and diffusion model, and fully open training data.
Two-Stage Training Strategy
Stage 1: Pretraining (23 million image-text pairs)
- Objective: train the learnable queries and the connector.
- Strategy: freeze the MLLM and the diffusion model.
- Data: public datasets, re-captioned with LLMs/MLLMs.
Stage 2: Fine-tuning (60,000 image-text pairs)
- Objective: improve generation quality.
- Strategy: unfreeze the diffusion model and optimize it jointly with the connector.
- Data: the high-quality dataset released by BLIP3-o.
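The freezing schedule is the crux of the recipe. Below is a minimal sketch of which parameters each stage updates; the `bridge` object follows the earlier sketch, `diffusion` stands in for SANA, and the learning rates are assumptions:

```python
import itertools
import torch

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def stage1_optimizer(bridge, diffusion):
    # Stage 1: pretrain on ~23M image-text pairs. Only the learnable queries
    # and the 6-layer connector are updated; everything else stays frozen.
    set_trainable(bridge.mllm, False)
    set_trainable(diffusion, False)
    params = ([bridge.queries]
              + list(bridge.connector.parameters())
              + list(bridge.proj.parameters()))
    return torch.optim.AdamW(params, lr=1e-4)  # lr is an assumption

def stage2_optimizer(bridge, diffusion):
    # Stage 2: fine-tune on ~60k high-quality pairs. The diffusion model is
    # unfrozen and optimized jointly; the MLLM remains frozen throughout.
    set_trainable(diffusion, True)
    params = itertools.chain(
        [bridge.queries],
        bridge.connector.parameters(),
        bridge.proj.parameters(),
        diffusion.parameters())
    return torch.optim.AdamW(params, lr=1e-5)  # lr is an assumption
```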

Performance validation: OpenUni demonstrates excellent parameter efficiency:
- OpenUni-B-512: 1.1B active parameters, GenEval 0.84, on par with BLIP3-o-8B.
- OpenUni-L-512: 3.1B active parameters, GenEval 0.85.
- OpenUni-L-1024: 3.1B active parameters, GenEval 0.86, best among open-source unified models.
Other Benchmarks
- DPG-Bench: OpenUni-L-1024 scored 83.08, surpassing all MetaQuery and BLIP3-o variants.
- WISE: OpenUni-L achieved 0.52, comparable to the public-data version of BLIP3-o-8B.
Generation Results

Figure 2: Diverse high-quality images generated by OpenUni-L-1024.
Multimodal Understanding Capabilities
Because the InternVL3 MLLM is kept frozen, OpenUni fully inherits its understanding abilities:

Figure 3: Multimodal understanding demonstration of OpenUni-L.
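Since no MLLM weight is ever updated, understanding requests can bypass the generation modules entirely. A hypothetical dispatch sketch (all function and argument names are assumptions):

```python
def run(task, prompt_embeds, mllm_chat, bridge, sana_sample):
    """Hypothetical dispatch: understanding never touches the new modules."""
    if task == "understand":
        # The frozen MLLM is used as-is, so answers are identical to those
        # of the base InternVL3 model.
        return mllm_chat(prompt_embeds)
    # Generation: 256 query states from the connector condition SANA sampling.
    return sana_sample(condition=bridge(prompt_embeds))
```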
Understanding Benchmarks

Application Prospects and Open Source Contributions
OpenUni provides comprehensive open resources:
- ✅ Model weights - all training checkpoints
- ✅ Training code - the complete training pipeline
- ✅ Training data - 23 million image-text pairs, including re-generated captions
- ✅ Detailed documentation - training configs and reproduction guides
The research team notes current limitations:
- Limited ability to render text in generated images.
- The largest model is built on a 2B MLLM and a 1.6B diffusion model, leaving room for scaling up.
- Image-to-image generation is planned for future versions.
- GenEval has limitations: under its fixed prompt paradigm, scores improve significantly after fine-tuning on GPT-4o-distilled data (BLIP3o-60k), raising doubts about whether it truly measures the capability of a unified model; the same concern applies to other unified models such as Show-o, Janus, Harmon, and Bagel.
Summary
OpenUni offers a simple yet powerful baseline for unified multimodal models. Its minimal architecture and efficient parameter utilization demonstrate:
- Fewer parameters can achieve competitive performance.
- Simpler designs can be more effective.
- Open sourcing promotes community research and innovation.
As an ongoing project, OpenUni provides a clear, reproducible, and easily extendable baseline for the research community.