Google Gemma 4 26B-A4B

Open Source · Multimodal

Google

Google's MoE variant of Gemma 4 with 26B total parameters but only 3.8B active, delivering 97% of the 31B model's quality at significantly lower compute cost.

Overview

Gemma 4 26B-A4B is the mixture-of-experts variant in the Gemma 4 family, released on April 3, 2026, under Apache 2.0. It uses a fine-grained MoE architecture with 128 small experts, of which 8 plus 1 shared expert are active per token — meaning only 3.8B parameters fire during each forward pass, despite having 26B total parameters. This design delivers 97% of the dense 31B model's quality while dramatically reducing inference compute.
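The arithmetic behind these numbers can be sketched as follows. The FLOP ratio is a back-of-envelope estimate (roughly 2 FLOPs per active parameter per generated token), not a published figure:

```python
# Back-of-envelope MoE compute math using the figures quoted above.

TOTAL_PARAMS_B = 26.0   # all parameters resident in memory
ACTIVE_PARAMS_B = 3.8   # parameters used per forward pass
N_EXPERTS = 128         # fine-grained expert pool
ACTIVE_EXPERTS = 8      # routed experts per token
SHARED_EXPERTS = 1      # always-on shared expert

active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"Active fraction per token: {active_fraction:.1%}")

# Rough per-token FLOP ratio vs the dense path (~2 FLOPs per param):
dense_flops = 2 * TOTAL_PARAMS_B
moe_flops = 2 * ACTIVE_PARAMS_B
print(f"Compute reduction: {dense_flops / moe_flops:.1f}x fewer FLOPs/token")
```

So each token touches roughly 15% of the weights, which is where the "runs like a small model" behavior comes from.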

The key trade-off to understand is memory versus compute. All 26B parameters must reside in memory (it loads like a large model), but it runs like a small one during inference. This means you need the same VRAM as a full 26B model, but you get throughput closer to a 4B model. For API providers and cloud deployments, this is an excellent deal — OpenRouter offers it at $0.13 per million input tokens, making it the cheapest frontier-quality model available.
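As a rough illustration of those rates, here is a minimal cost estimator; the `request_cost` helper is hypothetical, using the $0.13-input and $0.40-output per-million-token prices quoted in this page:

```python
# Hypothetical cost estimator using the OpenRouter prices quoted above.

INPUT_PRICE = 0.13   # USD per 1M input tokens
OUTPUT_PRICE = 0.40  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single request."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# e.g. a long-context job: 100K tokens in, 2K tokens out
print(f"${request_cost(100_000, 2_000):.4f}")
```

At these rates even context-heavy workloads stay in the sub-cent range per request.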

Benchmark performance validates the 97% claim. AIME 2026 comes in at 88.3% versus the 31B's 89.2%, GPQA Diamond at 82.3% versus 84.3%, and LiveCodeBench v6 at 77.1% versus 80%. The Arena ELO of 1441 places it firmly in the competitive tier, and MMLU Pro at 82.6% shows broad general knowledge. For most practical tasks, the quality difference from the full 31B is imperceptible.
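A quick check of the retention claim, taking the ratio of each A4B score above to the corresponding dense-31B score:

```python
# Sanity-check of the "97% of the 31B" claim from the benchmark
# numbers quoted above: (26B-A4B score, dense 31B score).

scores = {
    "AIME 2026":        (88.3, 89.2),
    "GPQA Diamond":     (82.3, 84.3),
    "LiveCodeBench v6": (77.1, 80.0),
}

ratios = {name: a4b / dense for name, (a4b, dense) in scores.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.1%} of dense-31B score")

mean_retention = sum(ratios.values()) / len(ratios)
print(f"Mean retention: {mean_retention:.1%}")  # ~97.7%
```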

Like its 31B sibling, it benefits from full Apache 2.0 licensing and day-1 ecosystem support across llama.cpp, Ollama, vLLM, and other popular frameworks. The MoE architecture makes it particularly well-suited for high-throughput serving scenarios where you need to handle many concurrent requests efficiently — the lower per-token compute means more requests per GPU-second.

Release Date

2026-04-03

Parameters

26B MoE (3.8B active)

Context Window

256K tokens

Input Price

Free (Google AI Studio); $0.13 / 1M tokens via OpenRouter

Output Price

$0.40 / 1M tokens

Speed

99 tokens/sec

Benchmarks

| Benchmark | Score | Max |
|---|---|---|
| AIME 2026 | 88.3% | 100% |
| GPQA Diamond | 82.3% | 100% |
| MMLU Pro | 82.6% | 100% |
| LiveCodeBench v6 | 77.1% | 100% |
| Arena ELO | 1441 | 2000 |

Capabilities

MoE Efficiency

128 experts with 8+1 active per token — only 3.8B params fire per forward pass from 26B total

Near-Full Quality

97% of the dense 31B model's performance across math, science, and coding benchmarks

Cost-Effective API

$0.13/MTok on OpenRouter, the cheapest frontier-quality model available

Math Reasoning

AIME 2026 88.3%, within 1 point of the full 31B dense model

High Throughput

MoE architecture enables efficient concurrent request handling for production serving

Open Source

Full Apache 2.0 license with day-1 support for llama.cpp, Ollama, and vLLM

Getting Started

1

Run Locally with Ollama

Install Ollama, then run `ollama run gemma4:26b-a4b` — note: it requires the same VRAM as a dense 26B model despite only 3.8B active params

2

API Access

Use OpenRouter at $0.13/MTok input or Google AI Studio (free tier) for immediate cloud access
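A minimal sketch of an OpenRouter request (the API is OpenAI-compatible). The model slug `google/gemma-4-26b-a4b` is an assumption for illustration — check OpenRouter's model list for the actual identifier:

```python
# Sketch of an OpenRouter chat-completions request using only the
# standard library. The model slug below is a hypothetical placeholder.
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Assemble (but do not send) a chat request for the model."""
    payload = {
        "model": "google/gemma-4-26b-a4b",  # hypothetical slug
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_request("Explain mixture-of-experts routing in two sentences.", "sk-...")
# resp = urllib.request.urlopen(req)  # uncomment with a real API key
```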

3

Use llama.cpp

Download the GGUF from Unsloth on Hugging Face, run with `llama-server -m gemma-4-26b-a4b-Q4_K_M.gguf`

4

Deploy for Production

Use vLLM for high-throughput serving — the MoE architecture shines under concurrent load
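The VRAM note in step 1 can be sized with a rough weights-only estimate. The bits-per-parameter figures below are approximations for common llama.cpp quant formats, and KV cache plus runtime overhead come on top:

```python
# Rough VRAM estimate for loading all 26B parameters at common
# quantization levels (weights only; excludes KV cache and overhead).

PARAMS = 26e9  # every expert must be resident, despite 3.8B active

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed to hold the weights at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

# Approximate effective bits/param for each format (assumed values):
for name, bits in [("FP16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
```

This is why the model loads like a 26B model even though it computes like a ~4B one.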

Pros & Cons

Strengths

  • 97% of 31B quality with only 3.8B active params
  • Cheapest frontier-quality API ($0.13/MTok input)
  • Apache 2.0 license
  • MoE efficiency for high throughput

Weaknesses

  • All 26B params must be in memory (loads like a large model)
  • Slower than expected on some GPUs (reports of 11 tok/s vs 60+ for comparable Qwen MoE models)
  • Framework-level MoE optimization still lagging
  • Crashes reported in some inference pipelines

Best For

Budget API inference · High-throughput serving · Best value open-source model
