Open-Source MetaQuery Arrives! OpenUni Matches BLIP3-o-8B with Only 1.1B Parameters, with Fully Open-Source Data and Code

With GPT-4o demonstrating impressive multimodal capabilities, unifying visual understanding and image generation in a single model has become a key research trend in AI, exemplified by MetaQuery and BLIP3-o.

Researchers from Nanyang Technological University's S-Lab and SenseTime's Singapore Research Institute have launched OpenUni, an open-source counterpart of MetaQuery that matches the performance of an 8B model with only 1.1B parameters, and fully open-sources its code, weights, and data.

OpenUni architecture

OpenUni architecture: a frozen InternVL (understanding) is bridged to SANA (generation) through 256 learnable queries and a lightweight 6-layer connector.

OpenUni architecture details

Figure 1: OpenUni's performance on generation tasks, demonstrating high parameter efficiency.

Three Core Advantages

1. πŸ—οΈ Simple Architecture

Only a 6-layer connector, far shallower than MetaQuery's 24 layers.

2. ⚑ Parameter Efficiency

With just 1.1B parameters, OpenUni reaches a GenEval score of 0.84, comparable to BLIP3-o-8B.

3. πŸ“‚ Fully Open Source

Model weights, training code, and the 23-million-pair training dataset are all publicly available.

Design and Training Strategies

Following MetaQuery's design principles, OpenUni includes four core components:

  1. 256 learnable queries that extract conditioning information from user instructions.
  2. A frozen InternVL that preserves its original understanding capabilities.
  3. A 6-layer ViT-style transformer connector.
  4. A SANA diffusion model for efficient image generation.
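As a rough sketch, the four components above can be wired together as follows. This is a hypothetical PyTorch rendition, not the released implementation: the module interfaces, dimensions, and the way query states are extracted are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OpenUniBridge(nn.Module):
    """Illustrative sketch of OpenUni's design (not the released code)."""

    def __init__(self, mllm, diffusion, hidden_dim=1024, num_queries=256,
                 num_layers=6, num_heads=16):
        super().__init__()
        self.mllm = mllm            # frozen InternVL (understanding)
        self.diffusion = diffusion  # SANA diffusion model (generation)
        # 256 learnable queries that gather condition info from the prompt
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # 6-layer ViT-style transformer connector
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.connector = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, prompt_embeds):
        b = prompt_embeds.shape[0]
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # the frozen MLLM attends over [prompt; queries]; only the hidden
        # states at the query positions are kept as conditioning input
        hidden = self.mllm(torch.cat([prompt_embeds, queries], dim=1))
        cond = self.connector(hidden[:, -queries.shape[1]:])
        # the diffusion model generates an image conditioned on the queries
        return self.diffusion(cond)
```

The key design choice is that only the queries and the small connector carry the burden of translating between the two frozen models, which is what keeps the trainable footprint small.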

Model Comparison

Model comparison

* For BLIP3-o, the DiT predicting CLIP features is regarded as a connector.

Key features: fewer connector parameters, a smaller MLLM and diffusion model, and fully open training data.

Two-Stage Training Strategy

Stage 1: Pretraining (23 million image-text pairs)

  • Objective: train learnable queries and connectors.
  • Strategy: freeze MLLM and diffusion models.
  • Data: public datasets, re-captioned with LLMs/MLLMs.

Stage 2: Fine-tuning (60,000 image-text pairs)

  • Objective: improve generation quality.
  • Strategy: unfreeze the diffusion model and jointly optimize it with the connector.
  • Data: high-quality datasets contributed by BLIP3-o.
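The freezing schedule of the two stages can be sketched with a small helper. This is hypothetical code; the attribute names `mllm`, `connector`, `queries`, and `diffusion` are illustrative, not taken from the released repository.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parts per OpenUni's two-stage recipe (illustrative).

    Stage 1: only the learnable queries and the connector are trained;
             the MLLM and the diffusion model stay frozen.
    Stage 2: the diffusion model is additionally unfrozen for joint
             optimization; the MLLM remains frozen throughout.
    """
    for p in model.mllm.parameters():
        p.requires_grad = False            # MLLM frozen in both stages
    for p in model.connector.parameters():
        p.requires_grad = True             # connector trained in both stages
    model.queries.requires_grad = True     # learnable queries always trained
    for p in model.diffusion.parameters():
        p.requires_grad = (stage == 2)     # unfrozen only for fine-tuning
```

Keeping the MLLM frozen in both stages is what lets OpenUni inherit InternVL's understanding abilities unchanged.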
Training process

Performance validation: OpenUni demonstrates excellent parameter efficiency:

  • OpenUni-B-512: 1.1B active parameters, GenEval 0.84, on par with BLIP3-o-8B.
  • OpenUni-L-512: 3.1B active parameters, GenEval 0.85.
  • OpenUni-L-1024: 3.1B active parameters, GenEval 0.86, best among open-source unified models.

Other Benchmarks

  • DPG-Bench: OpenUni-L-1024 scored 83.08, surpassing all MetaQuery and BLIP3-o variants.
  • WISE: OpenUni-L achieved 0.52, comparable to BLIP3-o-8B (public data version).

Generation Results

Generated images

Figure 2: Diverse high-quality images generated by OpenUni-L-1024.

Multimodal Understanding Capabilities

By keeping InternVL3 frozen, OpenUni inherits its understanding abilities:

Multimodal understanding

Figure 3: Multimodal understanding demonstration of OpenUni-L.

Understanding Benchmarks

Understanding benchmarks

Application Prospects and Open Source Contributions

OpenUni provides comprehensive open resources:

  • βœ… Model weights - all training checkpoints
  • βœ… Training code - complete training pipeline
  • βœ… Training data - 23 million pairs, including re-generated captions
  • βœ… Detailed documentation - training configs and reproduction guides

The research team notes current limitations:

  • Limited ability to render text in generated images.
  • The largest model uses a 2B MLLM and a 1.6B diffusion model, leaving room to scale up.
  • Image-to-image generation is planned for future versions.
  • GenEval has limitations: its fixed prompt paradigm, the large score gains brought by GPT-4o-distilled data (BLIP3o-60K), and doubts about whether it truly measures the capabilities of unified models (cf. Show-o, Janus, Harmon, Bagel).

Summary

OpenUni offers a simple yet powerful baseline for unified multimodal models. Its minimal architecture and efficient parameter utilization demonstrate:

  • Fewer parameters can achieve competitive performance.
  • Simpler designs can be more effective.
  • Open sourcing promotes community research and innovation.

As an ongoing project, OpenUni provides a clear, reproducible, and easily extendable baseline for the research community.
