Google Gemma 4 31B
Open Source · Multimodal
Google's flagship open-source 31B-parameter model with an Apache 2.0 license, offering frontier-level reasoning capabilities that fit on a single consumer GPU.
Overview
Gemma 4 31B is Google DeepMind's flagship open-source model, released on April 2, 2026, under a genuine Apache 2.0 license with no usage restrictions. Built with the same underlying technology as Gemini 3, it packs 31B parameters into a dense architecture (not a mixture-of-experts), meaning all parameters participate in every forward pass. This is a deliberate architectural choice that maximizes reasoning quality per parameter at the cost of higher compute per token.
The attention mechanism alternates between local sliding-window attention (1024-token windows) and global attention layers. This hybrid approach gives the model efficient processing of nearby context while maintaining the ability to attend to information anywhere in its 256K token context window. The result is a model that handles both short conversational exchanges and long document analysis without the quality degradation that purely local attention models suffer on longer contexts.
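The alternating local/global pattern can be sketched with boolean attention masks. A minimal illustration in plain Python, using a toy window of 4 tokens (the model's actual windows are 1024 tokens):

```python
def sliding_window_mask(seq_len: int, window: int) -> list:
    # True where query i may attend to key j: causal AND within the window
    return [[(j <= i) and (j > i - window) for j in range(seq_len)]
            for i in range(seq_len)]

def global_causal_mask(seq_len: int) -> list:
    # Standard causal mask: query i attends to every key j <= i
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

local_m = sliding_window_mask(8, 4)
global_m = global_causal_mask(8)

# The last of 8 tokens sees only the 4 most recent keys locally, but all 8 globally.
print(sum(local_m[-1]), sum(global_m[-1]))  # -> 4 8
```

Interleaving layers of each type keeps most of the attention cost proportional to the window size, while the periodic global layers preserve long-range retrieval across the full context.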
The performance jump from Gemma 3 is staggering: AIME 2026 went from 20.8% to 89.2% — a 4.3x improvement in competition-level math reasoning. It scores 84.3% on GPQA Diamond, 85.2% on MMLU Pro, and 80% on LiveCodeBench v6, putting it in the same tier as closed-source models that cost orders of magnitude more to run. At Q4_K_M quantization (recommended by Unsloth), the model fits in approximately 20GB of VRAM, making it runnable on a single RTX 4090 or equivalent.
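The ~20GB figure is easy to sanity-check with back-of-envelope arithmetic. A sketch assuming Q4_K_M averages roughly 4.5 bits per weight (the exact figure varies by tensor type) and counting weights only:

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    # Weight memory only, in GB (1 GB = 1e9 bytes);
    # excludes KV cache and runtime buffers, which add a few more GB.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

q4 = weight_vram_gb(31, 4.5)    # ~4.5 bits/weight is a rough Q4_K_M average
fp16 = weight_vram_gb(31, 16)   # unquantized half precision, for contrast
print(round(q4, 1), round(fp16, 1))  # -> 17.4 62.0
```

About 17.4GB of weights plus KV cache and buffers lands near the quoted ~20GB, comfortably inside a 24GB RTX 4090, whereas the fp16 model would need multiple GPUs.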
Day-1 ecosystem support was comprehensive: llama.cpp, Ollama, LM Studio, vLLM, MLX, and Hugging Face Transformers all had working implementations at launch, with GGUF quantizations from Unsloth available immediately. This breadth of tooling support, combined with the truly permissive license, makes Gemma 4 31B the strongest open-source model available for both local deployment and commercial applications.
Release Date
2026-04-02
Parameters
31B
Context Window
256K tokens
Input Price
Free
Output Price
$0.40 / 1M tokens
Speed
102 tokens/sec
Benchmarks
| Benchmark | Score | Max |
|---|---|---|
| AIME 2026 | 89.2% | 100% |
| GPQA Diamond | 84.3% | 100% |
| MMLU Pro | 85.2% | 100% |
| LiveCodeBench v6 | 80% | 100% |
| Arena ELO | 1452 | 2000 |
| AA Index | 39% | 100% |
Capabilities
Math Reasoning
AIME 2026 89.2%, a 4.3x jump from Gemma 3's 20.8% — frontier-level competition math
Science Reasoning
GPQA Diamond 84.3%, competitive with closed-source models on graduate-level science
Code Generation
LiveCodeBench v6 80%, strong real-world coding performance across languages
Vision Understanding
Native image input support for document analysis, chart reading, and visual QA
Local Deployment
Fits in ~20GB VRAM at Q4_K_M quantization, runnable on a single RTX 4090
Broad Ecosystem
Day-1 support for llama.cpp, Ollama, LM Studio, vLLM, MLX, and Transformers
Getting Started
Run Locally with Ollama
Install Ollama, then run `ollama run gemma4:31b` — Q4_K_M quantization is applied automatically
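Once the model is pulled, Ollama also exposes a local REST API (`POST http://localhost:11434/api/generate`). A minimal sketch of the request body, assuming the `gemma4:31b` tag from the step above:

```python
import json

payload = {
    "model": "gemma4:31b",   # tag from `ollama run` above
    "prompt": "Explain sliding-window attention in two sentences.",
    "stream": False,         # one JSON reply instead of a token stream
}
body = json.dumps(payload)

# Sending it requires a running Ollama server, e.g.:
#   curl http://localhost:11434/api/generate -d "$BODY"
print(body)
```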
Use llama.cpp
Download the GGUF file from Unsloth on Hugging Face, then run with `llama-server -m gemma-4-31b-Q4_K_M.gguf`
API Access
Use Google AI Studio (free tier) or OpenRouter ($0.14/MTok input) for cloud inference without local hardware
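OpenRouter serves an OpenAI-compatible chat-completions endpoint (`POST https://openrouter.ai/api/v1/chat/completions` with a Bearer API key). A sketch of the request body; the model id `google/gemma-4-31b` is an assumption to verify against OpenRouter's model list:

```python
import json

request = {
    "model": "google/gemma-4-31b",   # hypothetical id; check OpenRouter's catalog
    "messages": [
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
    "max_tokens": 512,
}

# Sent with the header: Authorization: Bearer $OPENROUTER_API_KEY
print(json.dumps(request, indent=2))
```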
Fine-Tune
Full Apache 2.0 license allows unrestricted fine-tuning — use Unsloth or Hugging Face TRL for efficient LoRA training
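LoRA keeps fine-tuning cheap because it trains only two low-rank factors per adapted weight matrix, adding rank × (d_in + d_out) parameters each. The arithmetic below is illustrative only: the hidden size, layer count, and choice of adapted projections are assumptions, not Gemma 4's published architecture.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds factors A (rank x d_in) and B (d_out x rank) beside the frozen weight
    return rank * (d_in + d_out)

hidden, layers, rank = 5120, 48, 16                  # assumed dimensions, for illustration
per_layer = 4 * lora_params(hidden, hidden, rank)    # q, k, v, o projections
total = layers * per_layer
print(total, round(100 * total / 31e9, 3))  # -> 31457280 0.101
```

Roughly 31M trainable parameters, about 0.1% of the 31B total, which is why LoRA fine-tuning fits on the same single GPU that runs inference.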
Pros & Cons
Strengths
- +Apache 2.0 (truly open, no restrictions)
- +Best open-source reasoning on single GPU
- +89.2% AIME 2026 (math reasoning)
- +Fits on 24GB GPU with Q4 quantization
- +Day-1 support for llama.cpp, Ollama, vLLM
- +Strong vision capabilities
Weaknesses
- -Slower local inference than Qwen 3.5 (~25 vs ~35 tok/s on 4090)
- -No SWE-bench Verified score yet
- -Known infinite-loop bug when processing document screenshots
- -Overly strict guardrails can trigger repetitive refusal loops
- -Function calling error rate 2-3x higher than Claude/GPT