Google Gemma 4 31B

Open Source · Multimodal

Google

Google's flagship open-source 31B-parameter model, released under the Apache 2.0 license, offering frontier-level reasoning that fits on a single consumer GPU.

Overview

Gemma 4 31B is Google DeepMind's flagship open-source model, released on April 2, 2026, under a genuine Apache 2.0 license with no usage restrictions. Built with the same underlying technology as Gemini 3, it packs its 31B parameters into a dense architecture rather than a mixture-of-experts, meaning all parameters participate in every forward pass. This is a deliberate architectural choice that maximizes reasoning quality per parameter at the cost of higher compute per token.

The attention mechanism alternates between local sliding-window attention (1024-token windows) and global attention layers. This hybrid approach gives the model efficient processing of nearby context while maintaining the ability to attend to information anywhere in its 256K token context window. The result is a model that handles both short conversational exchanges and long document analysis without the quality degradation that purely local attention models suffer on longer contexts.
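The alternating local/global pattern can be illustrated with a toy causal-mask builder. This is a sketch only: the 1024-token window comes from the text above, but the exact local-to-global layer interleaving in Gemma 4 is not specified here, and the tiny sequence/window sizes below are purely illustrative.

```python
# Toy illustration of the two causal attention masks a hybrid
# local/global model alternates between. Window size and layer
# interleaving in the real model are not reproduced here.

def local_mask(seq_len, window):
    """Causal sliding-window mask: token i sees tokens (i-window+1)..i."""
    return [[max(0, i - window + 1) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]

def global_mask(seq_len):
    """Causal global mask: token i sees all tokens 0..i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

# With a window of 4, token 5 cannot attend to token 0 in a local
# layer, but a global layer restores that long-range access.
lm = local_mask(8, 4)
gm = global_mask(8)
print(lm[5][0], lm[5][2])  # False True
print(gm[5][0])            # True
```

The design trade-off the paragraph describes falls out directly: local layers cost O(seq_len × window) attention work, while the interleaved global layers preserve access to anything in the 256K context.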

The performance jump from Gemma 3 is staggering: AIME 2026 went from 20.8% to 89.2% — a 4.3x improvement in competition-level math reasoning. It scores 84.3% on GPQA Diamond, 85.2% on MMLU Pro, and 80% on LiveCodeBench v6, putting it in the same tier as closed-source models that cost orders of magnitude more to run. At Q4_K_M quantization (recommended by Unsloth), the model fits in approximately 20GB of VRAM, making it runnable on a single RTX 4090 or equivalent.
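The ~20GB figure checks out with back-of-envelope arithmetic. The bits-per-weight value below is an assumption (Q4_K_M mixes quantization types and averages roughly 4.8–4.9 bits per weight); KV cache and runtime buffers account for the rest of the 20GB.

```python
# Back-of-envelope VRAM estimate for a Q4_K_M quantized model.
# 4.85 bits/weight is an assumed rough average for Q4_K_M;
# KV cache and runtime overhead come on top of the weights.

def quantized_size_gb(params_b, bits_per_weight):
    """Approximate weight storage in GB for params_b billion parameters."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights = quantized_size_gb(31, 4.85)
print(round(weights, 1))  # 18.8 (GB of weights, before KV cache)
```

About 18.8GB of weights plus a few GB of KV cache lands at the ~20GB the card cites, which is why a 24GB RTX 4090 is the stated floor.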

Day-1 ecosystem support was comprehensive: llama.cpp, Ollama, LM Studio, vLLM, MLX, and Hugging Face Transformers all had working implementations at launch, with GGUF quantizations from Unsloth available immediately. This breadth of tooling support, combined with the truly permissive license, makes Gemma 4 31B the strongest open-source model available for both local deployment and commercial applications.

Release Date

2026-04-02

Parameters

31B

Context Window

256K tokens

Input Price

Free

Output Price

$0.4 / 1M tokens

Speed

102 tokens/sec

Benchmarks

  • AIME 2026: 89.2% / 100%
  • GPQA Diamond: 84.3% / 100%
  • MMLU Pro: 85.2% / 100%
  • LiveCodeBench v6: 80% / 100%
  • Arena ELO: 1452 / 2000
  • AA Index: 39% / 100%

Capabilities

Math Reasoning

AIME 2026 89.2%, a 4.3x jump from Gemma 3's 20.8% — frontier-level competition math

Science Reasoning

GPQA Diamond 84.3%, competitive with closed-source models on graduate-level science

Code Generation

LiveCodeBench v6 80%, strong real-world coding performance across languages

Vision Understanding

Native image input support for document analysis, chart reading, and visual QA

Local Deployment

Fits in ~20GB VRAM at Q4_K_M quantization, runnable on a single RTX 4090

Broad Ecosystem

Day-1 support for llama.cpp, Ollama, LM Studio, vLLM, MLX, and Transformers

Getting Started

1

Run Locally with Ollama

Install Ollama, then run `ollama run gemma4:31b` — Q4_K_M quantization is applied automatically

2

Use llama.cpp

Download the GGUF file from Unsloth on Hugging Face, then run with `llama-server -m gemma-4-31b-Q4_K_M.gguf`
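Once `llama-server` is running, it exposes an OpenAI-compatible chat endpoint. The sketch below only builds the request payload; the URL assumes llama-server's default port 8080, and llama-server typically ignores the `model` field and serves whichever GGUF it was started with.

```python
import json

# Sketch of a chat request to llama-server's OpenAI-compatible
# /v1/chat/completions endpoint. Port 8080 is llama-server's default;
# the model name here is illustrative, not authoritative.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "gemma-4-31b-Q4_K_M",
    "messages": [{"role": "user", "content": "Explain sliding-window attention."}],
    "max_tokens": 256,
    "temperature": 0.7,
}
body = json.dumps(payload)
# To send it for real: POST `body` to URL with Content-Type: application/json.
print(json.loads(body)["messages"][0]["role"])  # user
```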

3

API Access

Use Google AI Studio (free tier) or OpenRouter ($0.14/MTok input) for cloud inference without local hardware
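At the listed rates ($0.14 per 1M input tokens via OpenRouter, $0.40 per 1M output tokens), per-request cost is simple arithmetic; the prompt/answer sizes below are just an example.

```python
# Cost estimate at the rates quoted in this section:
# $0.14 / 1M input tokens, $0.40 / 1M output tokens.

def request_cost(input_tokens, output_tokens,
                 in_rate=0.14, out_rate=0.40):
    """Dollar cost of one request at per-million-token rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. an 8K-token prompt with a 1K-token answer:
print(f"${request_cost(8000, 1000):.5f}")  # $0.00152
```

Even a long-context request stays well under a cent, which is what "budget API inference" means in practice here.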

4

Fine-Tune

Full Apache 2.0 license allows unrestricted fine-tuning — use Unsloth or Hugging Face TRL for efficient LoRA training
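Why LoRA makes fine-tuning a 31B model tractable is easy to see from parameter counts: LoRA trains two low-rank matrices per adapted weight instead of the weight itself. The hidden size, layer count, and target-module choices below are illustrative assumptions, not Gemma 4's published dimensions.

```python
# Rough trainable-parameter count for a LoRA adapter.
# LoRA learns an update B @ A for each adapted d_out x d_in weight,
# with B: d_out x r and A: r x d_in, i.e. r*(d_out + d_in) params.
# Dimensions below are ASSUMED for illustration, not Gemma 4's real config.

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

hidden = 5120          # assumed hidden size
layers = 48            # assumed layer count
rank = 16              # common LoRA rank
targets_per_layer = 4  # e.g. q/k/v/o projections, each hidden x hidden here

total = layers * targets_per_layer * lora_params(hidden, hidden, rank)
print(f"{total/1e6:.1f}M trainable params")  # 31.5M trainable params
```

Tens of millions of trainable parameters versus 31 billion frozen ones is the reason tools like Unsloth can fine-tune this model on a single consumer GPU.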

Pros & Cons

Strengths

  • +Apache 2.0 (truly open, no restrictions)
  • +Best open-source reasoning on single GPU
  • +89.2% AIME 2026 (math reasoning)
  • +Fits on 24GB GPU with Q4 quantization
  • +Day-1 support for llama.cpp, Ollama, vLLM
  • +Strong vision capabilities

Weaknesses

  • -Slower local inference than Qwen 3.5 (~25 vs ~35 tok/s on 4090)
  • -No SWE-bench Verified score yet
  • -Known bug: can enter infinite generation loops on document screenshots
  • -Overly strict guardrails can trigger repeated-refusal loops
  • -Function calling error rate 2-3x higher than Claude/GPT

Best For

  • Local AI deployment on a single GPU
  • Open-source reasoning tasks
  • Budget API inference
  • Math and science reasoning
