Open-Source MetaQuery Arrives! OpenUni Matches BLIP3-o-8B with Only 1.1B Parameters, with Fully Open-Source Data and Code

With GPT-4o demonstrating impressive multimodal capabilities, unifying visual understanding and image generation in a single model has become a key research trend in AI, exemplified by MetaQuery and BLIP3-o.

Researchers from Nanyang Technological University's S-Lab and SenseTime's Singapore Research Institute have launched OpenUni, an open-source counterpart of MetaQuery that matches the performance of an 8B model with only 1.1B parameters, and fully open-sources its code, weights, and data.

OpenUni architecture

OpenUni architecture: a frozen InternVL (understanding) is bridged to SANA (generation) through 256 learnable queries and a lightweight 6-layer connector.

OpenUni architecture details

Figure 1: OpenUni's performance on generation tasks, demonstrating high parameter efficiency.

Three Core Advantages

1. πŸ—οΈ Simple Architecture

Only a 6-layer connector, far shallower than MetaQuery's 24 layers.

2. ⚑ Parameter Efficiency

With just 1.1B parameters, OpenUni reaches a GenEval score of 0.84, comparable to BLIP3-o-8B.

3. πŸ“‚ Fully Open Source

Model weights, training code, and the 23-million-pair training dataset are all publicly available.

Design and Training Strategies

Following MetaQuery's design principles, OpenUni includes four core components:

  1. 256 learnable queries that extract conditioning information from user instructions.
  2. A frozen InternVL that preserves its original understanding capabilities.
  3. A 6-layer ViT-style transformer connector.
  4. A SANA diffusion model for efficient image generation.
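As a rough sketch, the four components above can be wired together as follows. This is a hypothetical PyTorch rendition, not the released implementation: the module interfaces, dimensions, and the way query states are extracted are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OpenUniBridge(nn.Module):
    """Illustrative sketch of OpenUni's design (not the released code)."""

    def __init__(self, mllm, diffusion, hidden_dim=1024, num_queries=256,
                 num_layers=6, num_heads=16):
        super().__init__()
        self.mllm = mllm            # frozen InternVL (understanding)
        self.diffusion = diffusion  # SANA diffusion model (generation)
        # 256 learnable queries that gather condition info from the prompt
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        # 6-layer ViT-style transformer connector
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.connector = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, prompt_embeds):
        b = prompt_embeds.shape[0]
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        # the frozen MLLM attends over [prompt; queries]; only the hidden
        # states at the query positions are kept as conditioning input
        hidden = self.mllm(torch.cat([prompt_embeds, queries], dim=1))
        cond = self.connector(hidden[:, -queries.shape[1]:])
        # the diffusion model generates an image conditioned on the queries
        return self.diffusion(cond)
```

The key design choice is that only the queries and the small connector carry the burden of translating between the two frozen models, which is what keeps the trainable footprint small.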

Model Comparison

Model comparison

* For BLIP3-o, the DiT predicting CLIP features is regarded as a connector.

Key features: fewer connector parameters, a smaller MLLM and diffusion model, and fully open training data.

Two-Stage Training Strategy

Stage 1: Pretraining (23 million image-text pairs)

  • Objective: train learnable queries and connectors.
  • Strategy: freeze MLLM and diffusion models.
  • Data: public datasets, re-captioned with LLMs/MLLMs.

Stage 2: Fine-tuning (60,000 image-text pairs)

  • Objective: improve generation quality.
  • Strategy: unfreeze the diffusion model and jointly optimize it with the connector.
  • Data: high-quality datasets contributed by BLIP3-o.
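The freezing schedule of the two stages can be sketched with a small helper. This is hypothetical code; the attribute names `mllm`, `connector`, `queries`, and `diffusion` are illustrative, not taken from the released repository.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Toggle trainable parts per OpenUni's two-stage recipe (illustrative).

    Stage 1: only the learnable queries and the connector are trained;
             the MLLM and the diffusion model stay frozen.
    Stage 2: the diffusion model is additionally unfrozen for joint
             optimization; the MLLM remains frozen throughout.
    """
    for p in model.mllm.parameters():
        p.requires_grad = False            # MLLM frozen in both stages
    for p in model.connector.parameters():
        p.requires_grad = True             # connector trained in both stages
    model.queries.requires_grad = True     # learnable queries always trained
    for p in model.diffusion.parameters():
        p.requires_grad = (stage == 2)     # unfrozen only for fine-tuning
```

Keeping the MLLM frozen in both stages is what lets OpenUni inherit InternVL's understanding abilities unchanged.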
Training process

Performance validation: OpenUni demonstrates excellent parameter efficiency:

  • OpenUni-B-512: 1.1B active parameters, GenEval 0.84, on par with BLIP3-o-8B.
  • OpenUni-L-512: 3.1B active parameters, GenEval 0.85.
  • OpenUni-L-1024: 3.1B active parameters, GenEval 0.86, best among open-source unified models.

Other Benchmarks

  • DPG-Bench: OpenUni-L-1024 scored 83.08, surpassing all MetaQuery and BLIP3-o variants.
  • WISE: OpenUni-L achieved 0.52, comparable to BLIP3-o-8B (public data version).

Generation Results

Generated images

Figure 2: Diverse high-quality images generated by OpenUni-L-1024.

Multimodal Understanding Capabilities

By keeping InternVL3 frozen, OpenUni inherits its understanding abilities:

Multimodal understanding

Figure 3: Multimodal understanding demonstration of OpenUni-L.

Understanding Benchmarks

Understanding benchmarks

Application Prospects and Open Source Contributions

OpenUni provides comprehensive open resources:

  • βœ… Model weights - all training checkpoints
  • βœ… Training code - complete training pipeline
  • βœ… Training data - 23 million pairs, including re-generated captions
  • βœ… Detailed documentation - training configs and reproduction guides

The research team notes current limitations:

  • Limited ability to render text in generated images.
  • The largest model uses a 2B MLLM and a 1.6B diffusion model, leaving room to scale up.
  • Image-to-image generation is planned for future versions.
  • GenEval has limitations: its fixed prompt paradigm, the large score gains brought by GPT-4o-distilled data (BLIP3o-60K), and doubts about whether it truly measures the capabilities of unified models (cf. Show-o, Janus, Harmon, Bagel).

Summary

OpenUni offers a simple yet powerful baseline for unified multimodal models. Its minimal architecture and efficient parameter utilization demonstrate:

  • Fewer parameters can achieve competitive performance.
  • Simpler designs can be more effective.
  • Open sourcing promotes community research and innovation.

As an ongoing project, OpenUni provides a clear, reproducible, and easily extendable baseline for the research community.
