Google Open-Sources Gemma 3n: The Most Powerful Multimodal Model Under 10 Billion Parameters, Running on 2GB of Memory

Google has released and open-sourced Gemma 3n, a compact, high-performance multimodal model that runs in as little as 2GB of RAM, advancing edge AI capabilities for mobile devices.

Edge devices are getting AI models built on a new architecture.

Early Friday morning, Google officially released and open-sourced the new multimodal large model Gemma 3n.

Google states that Gemma 3n marks a significant leap for on-device AI, bringing powerful multimodal capabilities to smartphones, tablets, and laptops, capabilities that were previously available only in advanced cloud-based models.

Features of Gemma 3n include:

  • Multimodal Design: Natively supports image, audio, video, and text inputs, with text outputs.
  • Device Optimization: Available in two effective-parameter sizes, E2B and E4B. Although their raw parameter counts are 5B and 8B respectively, architectural innovations let them run in as little as 2GB (E2B) and 3GB (E4B) of RAM (see the example after this list).
  • Architectural Breakthroughs: Features the new MatFormer architecture, Per-Layer Embedding (PLE), and optimized audio and visual encoders based on MobileNet-v5.
  • Quality Improvements: Achieves high performance in multilingual understanding (140+ languages), mathematics, coding, and reasoning. The E4B version’s LM Arena score exceeds 1300, a first for models under 10B parameters.
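As a quick illustration, the sketch below runs one of the instruction-tuned checkpoints through the Hugging Face Transformers pipeline with a mixed image-and-text prompt. The checkpoint names (google/gemma-3n-E2B-it / google/gemma-3n-E4B-it), the image-text-to-text task, and the image URL are assumptions based on the public release, not details taken from this article.

```python
# Minimal sketch: running Gemma 3n via Hugging Face Transformers (assumes a
# recent transformers release with Gemma 3n support; model name and image URL
# are placeholders).
import torch
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",            # Gemma 3n is exposed as a multimodal pipeline
    model="google/gemma-3n-E2B-it",  # E2B (~2 GB RAM); swap in google/gemma-3n-E4B-it (~3 GB)
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the model's text reply
```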

Google emphasizes that this kind of on-device performance breakthrough required rethinking the model design. At its core is the new MatFormer (Matryoshka Transformer) architecture, a nested transformer design akin to Russian nesting dolls.

MatFormer: One Model, Multiple Sizes

MatFormer allows a large model to contain smaller, fully functional sub-models, enabling flexible inference and deployment. While training the E4B model (4B effective parameters), the nested 2B-effective sub-model is optimized simultaneously, providing two powerful use cases:

  1. Pre-extracted models: Developers can download the high-performance E4B model or the faster E2B sub-model for inference.
  2. Custom sizes via Mix-n-Match: Create models sized between E2B and E4B by adjusting the feed-forward network hidden size per layer and selectively skipping layers, enabling deployments tailored to specific hardware constraints (see the sketch after this list).
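A toy sketch of the nesting idea (not Google's implementation): the feed-forward weights are trained once at full width, and a smaller sub-model simply uses a prefix slice of the hidden dimension, which is also the knob Mix-n-Match turns per layer.

```python
# Toy sketch of the MatFormer "nested" idea (not Google's implementation): one
# set of feed-forward weights, from which a smaller model is taken as a prefix
# slice of the hidden dimension; Mix-n-Match varies the slice per layer and can
# skip layers entirely.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedFFN(nn.Module):
    def __init__(self, d_model: int, d_hidden_full: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden_full)   # shared weights, full width
        self.down = nn.Linear(d_hidden_full, d_model)

    def forward(self, x: torch.Tensor, d_hidden: int) -> torch.Tensor:
        # Use only the first `d_hidden` hidden units: the "small" model is
        # literally nested inside the "large" one, with no separate weights.
        h = F.gelu(F.linear(x, self.up.weight[:d_hidden], self.up.bias[:d_hidden]))
        return F.linear(h, self.down.weight[:, :d_hidden], self.down.bias)

ffn = NestedFFN(d_model=512, d_hidden_full=2048)
x = torch.randn(1, 8, 512)
y_large = ffn(x, d_hidden=2048)  # "E4B-like" path: full feed-forward width
y_small = ffn(x, d_hidden=1024)  # "E2B-like" path: half-width slice of the same weights
print(y_large.shape, y_small.shape)  # both torch.Size([1, 8, 512])
```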

Future plans include enabling dynamic switching between E4B and E2B inference paths for real-time performance optimization, although this feature is not yet available.

Per-Layer Embedding (PLE): Unlocking Memory Efficiency

PLE allows the per-layer embedding parameters to be loaded and computed on the CPU instead of occupying accelerator memory. Although the total parameter counts are 5B (E2B) and 8B (E4B), only the core transformer weights (roughly 2B and 4B) need to reside in accelerator memory (VRAM), greatly improving deployment efficiency.

With per-layer embeddings, Gemma 3n E2B keeps only about 2B parameters resident in the AI accelerator, making it highly efficient for edge deployment.
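A toy sketch of the memory split that PLE enables, purely illustrative rather than Gemma 3n's actual embedding fusion: the bulky per-layer embedding tables stay in host RAM while only the core transformer weights occupy accelerator memory, and each layer copies over just the small vectors it needs.

```python
# Toy sketch of the Per-Layer Embedding (PLE) memory layout (illustrative only,
# not Gemma 3n's actual architecture): bulky per-layer embedding tables stay in
# host RAM, only the core transformer weights sit in accelerator memory, and
# each layer copies over just the small vectors it needs for the current tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_layers, vocab, ple_dim, d_model = 4, 32_000, 256, 512
device = "cuda" if torch.cuda.is_available() else "cpu"

ple_tables = [torch.randn(vocab, ple_dim) for _ in range(n_layers)]  # stays on CPU
core_layers = [nn.Linear(d_model + ple_dim, d_model).to(device) for _ in range(n_layers)]

token_ids = torch.tensor([[3, 17, 42]])        # toy input token ids
h = torch.randn(1, 3, d_model, device=device)  # hidden states on the accelerator

for layer, table in zip(core_layers, ple_tables):
    ple_vecs = table[token_ids].to(device)                # fetch only this layer's vectors
    h = F.gelu(layer(torch.cat([h, ple_vecs], dim=-1)))   # fuse with the hidden states
print(h.shape)  # torch.Size([1, 3, 512])
```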

KV Cache Sharing: Faster Long-Context Processing

To handle the long sequences produced by audio and video streams, Gemma 3n introduces KV Cache Sharing: keys and values computed by the middle attention layers are shared directly with the upper layers, which shortens time-to-first-token and roughly doubles prefill speed compared with the previous generation.
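A toy sketch of the cache-sharing idea, illustrative only (the layer split and projections are invented, and each layer's real inputs would differ): keys and values computed at a middle layer are reused by every layer above it, so prefill runs fewer projections and stores a smaller cache.

```python
# Toy sketch of KV cache sharing during prefill (illustrative only): K/V from a
# middle layer are reused by the layers above it, shrinking the cache and the
# projection work done while ingesting a long audio/video sequence.
import torch
import torch.nn as nn

n_layers, share_from, d = 8, 4, 64
k_proj = [nn.Linear(d, d) for _ in range(n_layers)]
v_proj = [nn.Linear(d, d) for _ in range(n_layers)]

x = torch.randn(1, 4096, d)  # a long prefill sequence, e.g. audio/video tokens
kv_cache = []
for i in range(n_layers):
    if i < share_from:
        kv_cache.append((k_proj[i](x), v_proj[i](x)))  # lower layers compute their own K/V
    else:
        kv_cache.append(kv_cache[share_from - 1])      # upper layers reuse the middle layer's K/V

distinct = len({id(kv) for kv in kv_cache})
print(f"{n_layers} layers, {distinct} distinct K/V caches")  # 8 layers, 4 distinct K/V caches
```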

Audio Understanding: Speech to Text and Translation

Gemma 3n uses an audio encoder derived from the Universal Speech Model (USM), generating one token for every 160 ms of audio, which enables high-quality on-device speech recognition and speech translation, with particularly strong results between English and Spanish, French, Italian, and Portuguese.

Example prompt for speech translation (a user turn opens with the chat-template marker <bos><start_of_turn>user):
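A minimal sketch of a complete prompt in this format, assuming the standard Gemma chat-template markers; the instruction wording and the audio placeholder token are illustrative, not confirmed syntax:

```
<bos><start_of_turn>user
Transcribe the following speech segment in Spanish, then translate it into English:
<start_of_audio><end_of_turn>
<start_of_turn>model
```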

At launch, Gemma 3n can process audio clips of up to 30 seconds (about 188 tokens at one token per 160 ms), with support for longer, low-latency streams expected in future updates.

MobileNet-V5: The Most Advanced Visual Encoder

Gemma 3n also features the new MobileNet-V5-300M vision encoder, optimized for edge devices, supporting multiple input resolutions (256×256, 512×512, and 768×768) and capable of real-time video processing at up to 60 fps on Google Pixel devices.

Innovations include:

  • An advanced architecture built on MobileNet-V4 blocks, including Universal Inverted Bottlenecks, scaled into a hybrid, deep pyramid model.
  • A multi-scale fusion VLM adapter that improves the accuracy and efficiency of the visual tokens passed to the language model.

Compared with the baseline SoViT encoder, MobileNet-V5-300M achieves a 13× speedup (with quantization) on the Google Pixel Edge TPU, uses 46% fewer parameters, and has a 4× smaller memory footprint, while delivering higher accuracy on vision-language tasks.

Google promises more details in the upcoming MobileNet-V5 technical report.

Google DeepMind first announced Gemma 3n at I/O on May 20, highlighting its small size, high performance, and low memory footprint. The Gemma series has surpassed 160 million downloads since its initial release last year.
