Google Gemma 4 26B-A4B
Open Source · Multimodal
Google's MoE variant of Gemma 4 with 26B total parameters but only 3.8B active, delivering 97% of the 31B model's quality at significantly lower compute cost.
Overview
Gemma 4 26B-A4B is the mixture-of-experts variant in the Gemma 4 family, released on April 3, 2026, under Apache 2.0. It uses a fine-grained MoE architecture with 128 small experts, of which 8 are activated per token alongside 1 always-on shared expert — so only 3.8B parameters fire during each forward pass, despite the model having 26B parameters in total. This design delivers 97% of the dense 31B model's quality while dramatically reducing inference compute.
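The routing scheme above can be sketched in a few lines. This is a simplified illustration of top-k softmax gating using the expert counts from this page (128 experts, 8 routed + 1 shared) — not Gemma's actual implementation, whose router internals are not described here.

```python
# Illustrative sketch of fine-grained MoE routing: a router scores all
# 128 experts per token, the top 8 run, and 1 shared expert always runs.
# Expert counts are from the article; the routing logic is a toy example.
import math

NUM_EXPERTS = 128
TOP_K = 8

def route(logits):
    """Pick the top-k experts by router score and softmax-normalize
    their weights so the selected experts' outputs can be mixed."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:TOP_K]
    exp = [math.exp(logits[i]) for i in top]
    total = sum(exp)
    return {i: e / total for i, e in zip(top, exp)}

# Fake router output for one token: expert 127 scores highest.
weights = route([0.01 * i for i in range(NUM_EXPERTS)])
assert len(weights) == TOP_K  # 8 routed experts; the shared expert runs regardless
```

Because only the selected experts' feed-forward weights are multiplied per token, per-token compute scales with the 3.8B active parameters rather than the 26B total.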
The key trade-off to understand is memory versus compute. All 26B parameters must reside in memory (it loads like a large model), but it runs like a small one during inference. This means you need the same VRAM as a full 26B model, but you get throughput closer to a 4B model. For API providers and cloud deployments, this is an excellent deal — OpenRouter offers it at $0.13 per million input tokens, making it the cheapest frontier-quality model available.
Benchmark performance validates the 97% claim. AIME 2026 comes in at 88.3% versus the 31B's 89.2%, GPQA Diamond at 82.3% versus 84.3%, and LiveCodeBench v6 at 77.1% versus 80%. The Arena ELO of 1441 places it firmly in the competitive tier, and MMLU Pro at 82.6% shows broad general knowledge. For most practical tasks, the quality difference from the full 31B is imperceptible.
Like its 31B sibling, it benefits from full Apache 2.0 licensing and day-1 ecosystem support across llama.cpp, Ollama, vLLM, and other popular frameworks. The MoE architecture makes it particularly well-suited for high-throughput serving scenarios where you need to handle many concurrent requests efficiently — the lower per-token compute means more requests per GPU-second.
Release Date
2026-04-03
Parameters
26B MoE (3.8B active)
Context Window
256K tokens
Input Price
Free (Google AI Studio); $0.13 / 1M tokens (OpenRouter)
Output Price
$0.40 / 1M tokens
Speed
99 tokens/sec
Benchmarks
| Benchmark | Score | Max |
|---|---|---|
| AIME 2026 | 88.3% | 100% |
| GPQA Diamond | 82.3% | 100% |
| MMLU Pro | 82.6% | 100% |
| LiveCodeBench v6 | 77.1% | 100% |
| Arena ELO | 1441 | 2000 |
Capabilities
MoE Efficiency
128 experts with 8+1 active per token — only 3.8B params fire per forward pass from 26B total
Near-Full Quality
97% of the dense 31B model's performance across math, science, and coding benchmarks
Cost-Effective API
$0.13/MTok on OpenRouter, the cheapest frontier-quality model available
Math Reasoning
AIME 2026 88.3%, within 1 point of the full 31B dense model
High Throughput
MoE architecture enables efficient concurrent request handling for production serving
Open Source
Full Apache 2.0 license with day-1 support for llama.cpp, Ollama, and vLLM
Getting Started
Run Locally with Ollama
Install Ollama, then run `ollama run gemma4:26b-a4b` — note: requires same VRAM as a 26B model despite 3.8B active params
API Access
Use OpenRouter at $0.13/MTok input or Google AI Studio (free tier) for immediate cloud access
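For cloud access, a hedged sketch of calling the model through OpenRouter's OpenAI-compatible chat completions endpoint. The model slug `google/gemma-4-26b-a4b` is my guess at the naming convention, not confirmed by this page; check OpenRouter's model list for the real identifier.

```python
# Build (but don't send) a chat completions request to OpenRouter.
# Assumption: the slug "google/gemma-4-26b-a4b" is hypothetical.
import json
import os
import urllib.request

def build_request(prompt):
    payload = {
        "model": "google/gemma-4-26b-a4b",  # hypothetical slug
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )

req = build_request("What is 2 + 2?")
# urllib.request.urlopen(req) would send it; omitted here so the sketch
# has no network side effects and no API-key requirement.
```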
Use llama.cpp
Download the GGUF from Unsloth on Hugging Face, run with `llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf`
Deploy for Production
Use vLLM for high-throughput serving — the MoE architecture shines under concurrent load
Pros & Cons
Strengths
- 97% of 31B quality with 3.8B active params
- Cheapest frontier-quality API ($0.13/MTok)
- Apache 2.0 license
- MoE efficiency for high throughput
Weaknesses
- All 26B params must be in memory (loads like a large model)
- Slower than expected on some GPUs (e.g., 11 tok/s locally where comparable Qwen models reach 60+)
- Framework-level MoE optimizations still lagging
- Reported crashes in some inference pipelines