<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Gemma 4 on RockB</title><link>https://baeseokjae.github.io/tags/gemma-4/</link><description>Recent content in Gemma 4 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 07 May 2026 06:04:01 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/gemma-4/index.xml" rel="self" type="application/rss+xml"/><item><title>Run Gemma 4 Locally in 2026: 31B Dense Setup Guide with Ollama</title><link>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</link><pubDate>Thu, 07 May 2026 06:04:01 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</guid><description>Step-by-step guide to running Gemma 4 31B Dense locally with Ollama — hardware requirements, installation, Open WebUI, and API usage.</description><content:encoded><![CDATA[<p>Gemma 4 31B Dense runs locally on a single RTX 4090 or Mac M3 Max using Ollama — no API key, no data leaving your machine. Install Ollama, run <code>ollama pull gemma4:31b</code>, and you have a model that scores 87.1% on MMLU, beating GPT-4o&rsquo;s 86.5%, running entirely on your hardware.</p>
<h2 id="what-is-gemma-4-31b-dense-and-why-run-it-locally">What Is Gemma 4 31B Dense and Why Run It Locally?</h2>
<p>Gemma 4 31B Dense is a 31-billion-parameter language model released by Google DeepMind on April 2, 2026, under the Apache 2.0 license. Unlike mixture-of-experts architectures that distribute parameters across sparse expert layers, the 31B Dense model activates all 31 billion parameters on every token — giving it more reliable reasoning depth than larger MoE models with similar active parameter counts. In benchmark testing, Gemma 4 31B scores 87.1% on MMLU (beating GPT-4o&rsquo;s 86.5%), 89.2% on AIME 2026, and 84.3% on GPQA Diamond — outperforming Llama 4 Scout&rsquo;s 109B MoE model on the harder science benchmarks. Running it locally means zero API costs, complete data privacy, no rate limits, and the ability to integrate with any tool via the OpenAI-compatible REST endpoint that Ollama exposes on <code>localhost:11434</code>. For developers, researchers, or privacy-conscious users, this is the highest-performing open model available for on-device inference as of mid-2026.</p>
<h3 id="dense-vs-moe-why-the-architecture-matters-for-local-inference">Dense vs. MoE: Why the Architecture Matters for Local Inference</h3>
<p>A dense model like Gemma 4 31B activates every parameter on every forward pass. An MoE model like Llama 4 Scout (109B total, ~17B active) routes each token through only a subset of expert layers. For local inference, the dense architecture has a decisive advantage: total VRAM needed corresponds directly to the active parameter count. With Q4_K_M quantization — Ollama&rsquo;s default — Gemma 4 31B fits in approximately 24GB VRAM, which is exactly what a single RTX 4090 or RTX 6000 Ada provides. A 109B MoE model at the same quantization still requires routing infrastructure and substantially more memory even if active parameters are lower, making it harder to run on consumer hardware without CPU offloading.</p>
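<p>As a rough sanity check, you can estimate the weight footprint yourself. The one-liner below assumes Q4_K_M averages roughly 4.8 bits per weight, which is an approximation rather than an exact constant:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Back-of-envelope weight memory for a 31B dense model at ~4.8 bits/weight (assumed Q4_K_M average)
echo &#34;scale=1; 31 * 10^9 * 4.8 / 8 / 1024^3&#34; | bc
# Prints ~17.3 (GiB of weights); KV cache and runtime overhead push the practical total toward 20-24 GB
</code></pre></div>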
<h2 id="gemma-4-model-variants-e2b-e4b-26b-and-31b-compared">Gemma 4 Model Variants: E2B, E4B, 26B, and 31B Compared</h2>
<p>Gemma 4 ships in four variants with meaningfully different hardware requirements and capability profiles. The E2B (2B Edge) and E4B (4B Edge) models are designed for mobile and embedded deployment — they feature native audio input and a 128K context window, making them unique among the family. The 26B and 31B models target server and workstation deployment, both supporting a 256K token context window and excelling at multi-step reasoning, coding, and mathematics. The 31B Dense specifically is the flagship for local deployment: it is natively trained on over 140 languages, released under Apache 2.0, and achieves GPT-4o-class performance on a single high-end consumer GPU. The choice between variants comes down almost entirely to available VRAM, since quality scales predictably across the lineup.</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Active Params</th>
          <th>VRAM (Q4_K_M)</th>
          <th>VRAM (FP16)</th>
          <th>Context</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>E2B</td>
          <td>2B</td>
          <td>~1.5 GB</td>
          <td>~4 GB</td>
          <td>128K</td>
          <td>Mobile, edge devices</td>
      </tr>
      <tr>
          <td>E4B</td>
          <td>4B</td>
          <td>~2.8 GB</td>
          <td>~8 GB</td>
          <td>128K</td>
          <td>Laptop CPU/integrated GPU</td>
      </tr>
      <tr>
          <td>12B</td>
          <td>12B</td>
          <td>~6.6 GB</td>
          <td>~24 GB</td>
          <td>128K</td>
          <td>RTX 3060, M2 MacBook</td>
      </tr>
      <tr>
          <td>26B</td>
          <td>26B</td>
          <td>~14 GB</td>
          <td>~52 GB</td>
          <td>256K</td>
          <td>RTX 3090, M3 Pro</td>
      </tr>
      <tr>
          <td>31B</td>
          <td>31B</td>
          <td>~18–24 GB</td>
          <td>~62 GB</td>
          <td>256K</td>
          <td>RTX 4090, M3 Max, M4 Ultra</td>
      </tr>
  </tbody>
</table>
<h3 id="which-variant-should-you-pick">Which Variant Should You Pick?</h3>
<p>If you have 24GB VRAM (RTX 4090, RTX 6000 Ada) or 32GB+ unified memory (M3 Max, M4 Pro/Max), run the 31B. If you have 16GB VRAM (RTX 4080, A4000), run the 26B at Q4_K_M. For anything with 8–12GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB), the 12B variant is the correct choice — it requires only 6.6GB VRAM at Q4 quantization and delivers strong coding and reasoning performance. The E2B and E4B are specifically for devices without a discrete GPU.</p>
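<p>If you are not sure what your machine provides, the commands below report GPU VRAM on NVIDIA systems and total unified memory on Apple Silicon; they are a quick pre-check, not part of the Ollama setup itself:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># NVIDIA: total and free VRAM per GPU
nvidia-smi --query-gpu=memory.total,memory.free --format=csv

# Apple Silicon: total unified memory in bytes
sysctl hw.memsize
</code></pre></div>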
<h2 id="hardware-requirements-for-gemma-4-31b-vram-ram-cpu">Hardware Requirements for Gemma 4 31B (VRAM, RAM, CPU)</h2>
<p>Gemma 4 31B Dense requires 24GB VRAM at Q4_K_M quantization or 62GB VRAM at full FP16 precision. In practice, Q4_K_M is the correct target for consumer hardware: Ollama defaults to this quantization automatically, reducing memory usage by approximately 55–60% compared to FP16, with only a marginal quality drop that is typically imperceptible in conversational and coding tasks. The minimum viable GPU is a single RTX 4090 (24GB). For Mac users, the M3 Max (36GB or 48GB unified memory) and M4 Pro/Max provide excellent performance because Apple Silicon shares memory between CPU and GPU — you can run the 31B comfortably with 36GB total unified memory. Linux workstations with dual RTX 3090s (24GB each) can also run the 31B by splitting the model across GPUs, though this requires additional configuration and results in slower inference than a single 4090.</p>
<table>
  <thead>
      <tr>
          <th>GPU / Platform</th>
          <th>VRAM / Unified Memory</th>
          <th>Gemma 4 31B (Q4)?</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RTX 4090</td>
          <td>24 GB</td>
          <td>Yes</td>
          <td>Ideal single-GPU setup</td>
      </tr>
      <tr>
          <td>RTX 6000 Ada</td>
          <td>48 GB</td>
          <td>Yes</td>
          <td>Runs FP16 too</td>
      </tr>
      <tr>
          <td>RTX 4080</td>
          <td>16 GB</td>
          <td>No</td>
          <td>Use 26B instead</td>
      </tr>
      <tr>
          <td>RTX 3090 x2</td>
          <td>48 GB total</td>
          <td>Yes</td>
          <td>Slower, split model</td>
      </tr>
      <tr>
          <td>M3 Max 36GB</td>
          <td>36 GB unified</td>
          <td>Yes</td>
          <td>Excellent tok/s</td>
      </tr>
      <tr>
          <td>M4 Max 64GB</td>
          <td>64 GB unified</td>
          <td>Yes</td>
          <td>Can run FP16</td>
      </tr>
      <tr>
          <td>M2 MacBook Pro 16GB</td>
          <td>16 GB unified</td>
          <td>No</td>
          <td>Use 12B instead</td>
      </tr>
  </tbody>
</table>
<p><strong>System RAM:</strong> Ollama also uses system RAM for the context cache. Aim for at least 32GB system RAM when running 31B. CPU doesn&rsquo;t significantly affect generation speed once the model is loaded into VRAM — but fast NVMe SSD storage (PCIe 4.0+) reduces initial model load time from cold.</p>
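<p>Before pulling the roughly 18–20GB of weights, it is worth confirming free system RAM and disk space. The commands below assume Ollama's default model directory:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># System RAM (Linux)
free -h

# Free disk space on your home volume (Ollama stores models under ~/.ollama by default)
df -h ~
</code></pre></div>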
<h2 id="step-1--install-ollama-on-mac-windows-or-linux">Step 1 — Install Ollama on Mac, Windows, or Linux</h2>
<p>Ollama is the fastest path to running Gemma 4 31B locally, providing a one-command model download, automatic quantization selection, and an OpenAI-compatible REST API out of the box. It abstracts away model sharding, quantization configuration, and the llama.cpp backend — you get a clean CLI and HTTP interface without needing to understand the internals. As of May 2026, Ollama supports CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), and CPU-only inference. Installation is straightforward across all three major operating systems, and the entire setup from zero to a running model takes under 10 minutes on a fast internet connection. Ollama version 0.5+ is required for Gemma 4 support — older versions do not have the model architecture registered.</p>
<p><strong>Mac:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>brew install ollama
</span></span><span style="display:flex;"><span><span style="color:#75715e"># or download the .dmg from ollama.com</span>
</span></span><span style="display:flex;"><span>ollama serve  <span style="color:#75715e"># starts the background server</span>
</span></span></code></pre></div><p><strong>Linux:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Automatically installs CUDA drivers if NVIDIA GPU detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Service starts automatically via systemd</span>
</span></span></code></pre></div><p><strong>Windows:</strong>
Download the installer from <a href="https://ollama.com">ollama.com</a>. The installer configures a background Windows service and adds <code>ollama</code> to PATH. CUDA support requires NVIDIA drivers 525.85+.</p>
<p><strong>Verify the install:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Should output: ollama version 0.5.x or higher</span>
</span></span></code></pre></div><h2 id="step-2--pull-and-run-the-gemma-4-31b-model-with-ollama">Step 2 — Pull and Run the Gemma 4 31B Model with Ollama</h2>
<p>Pulling Gemma 4 31B downloads approximately 18–20GB of model weights in Q4_K_M format. Ollama handles quantization and model registration automatically — no manual GGUF conversion or configuration required. The model is pulled from Ollama&rsquo;s model registry, which mirrors the Hugging Face checkpoint in a pre-quantized GGUF format. On a 500 Mbps connection, the download takes roughly 5–7 minutes. Once complete, the model is cached locally in <code>~/.ollama/models/</code> and subsequent loads are instant. The Gemma 4 31B Ollama tag is <code>gemma4:31b</code> — note this differs from the Hugging Face naming convention.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull the 31B Dense model (Q4_K_M by default, ~18GB)</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Run an interactive chat session</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Example prompt after model loads:</span>
</span></span><span style="display:flex;"><span>&gt;&gt;&gt; Explain the difference between dense and MoE transformer architectures.
</span></span></code></pre></div><p><strong>Other variants:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama pull gemma4:2b    <span style="color:#75715e"># E2B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:4b    <span style="color:#75715e"># E4B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:12b   <span style="color:#75715e"># 12B standard</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b   <span style="color:#75715e"># 26B standard</span>
</span></span></code></pre></div><p><strong>Check which models are installed:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama list
</span></span></code></pre></div><p><strong>Stop a running model session:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># In the chat, press Ctrl+D or type /bye</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># To unload the model from memory without stopping the server:</span>
</span></span><span style="display:flex;"><span>ollama stop gemma4:31b
</span></span></code></pre></div><h3 id="running-multiple-prompts-via-the-cli">Running Multiple Prompts via the CLI</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Non-interactive single prompt</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b <span style="color:#e6db74">&#34;Write a Python function that parses JSON from a REST API response&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Pipe stdin for batch processing</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">&#34;Summarize this text: </span><span style="color:#66d9ef">$(</span>cat document.txt<span style="color:#66d9ef">)</span><span style="color:#e6db74">&#34;</span> | ollama run gemma4:31b
</span></span></code></pre></div><h2 id="step-3--set-up-open-webui-for-a-chatgpt-like-interface">Step 3 — Set Up Open WebUI for a ChatGPT-Like Interface</h2>
<p>Open WebUI is an open-source browser interface that connects directly to Ollama, providing a polished chat experience with conversation history, model switching, file uploads, and system prompt configuration — all running locally. It runs as a Docker container and takes under 2 minutes to set up once Docker is installed. The interface is accessible at <code>http://localhost:3000</code> and supports multiple users, making it useful for team deployments on a local network where a shared Gemma 4 instance is hosted on a single powerful machine. Open WebUI automatically detects all models registered in Ollama, so switching between the 12B and 31B variants is a dropdown selection in the interface.</p>
<p><strong>Prerequisites:</strong> Docker Desktop (Mac/Windows) or Docker Engine (Linux).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull and start Open WebUI with Ollama auto-detection</span>
</span></span><span style="display:flex;"><span>docker run -d <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 3000:8080 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --add-host<span style="color:#f92672">=</span>host.docker.internal:host-gateway <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -v open-webui:/app/backend/data <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --name open-webui <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --restart always <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  ghcr.io/open-webui/open-webui:main
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access at: http://localhost:3000</span>
</span></span></code></pre></div><p>On first launch, create an admin account (local only — no external services involved). Under Settings → Models, Gemma 4 31B should appear automatically if Ollama is running. Select it as the default model and start chatting.</p>
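<p>If no models appear in the dropdown, two quick checks usually pinpoint the problem: confirm the Ollama API answers on the host, and inspect the container logs for connection errors (the container name matches the <code>docker run</code> command above):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Confirm Ollama is reachable and lists installed models
curl http://localhost:11434/api/tags

# Check Open WebUI logs for errors reaching host.docker.internal
docker logs --tail 50 open-webui
</code></pre></div>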
<h2 id="using-the-gemma-4-31b-local-api-openai-compatible">Using the Gemma 4 31B Local API (OpenAI-Compatible)</h2>
<p>Ollama exposes an OpenAI-compatible REST API at <code>http://localhost:11434/v1</code>, allowing any tool or application that supports the OpenAI SDK to use Gemma 4 31B as a drop-in replacement. This means you can point VS Code extensions like Continue, Python scripts using the <code>openai</code> library, or LangChain pipelines directly at your local Gemma 4 31B instance without modifying code — just change the base URL and set the API key to any non-empty string (Ollama ignores it but the SDK requires a value). This makes Gemma 4 31B an immediately usable private coding assistant with zero monthly cost, zero rate limits, and no data ever leaving your machine.</p>
<p><strong>Python (OpenAI SDK):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:11434/v1&#34;</span>,
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;ollama&#34;</span>,  <span style="color:#75715e"># required but ignored by Ollama</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Review this Python function for bugs: def parse(x): return x[&#39;data&#39;]&#34;</span>}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span></code></pre></div><p><strong>curl:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;gemma4:31b&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Hello, Gemma 4!&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p><strong>Native Ollama API (also available):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/generate <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;gemma4:31b&#34;, &#34;prompt&#34;: &#34;Explain gradient descent&#34;}&#39;</span>
</span></span></code></pre></div><h3 id="streaming-responses">Streaming Responses</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Write a FastAPI endpoint for user authentication&#34;</span>}],
</span></span><span style="display:flex;"><span>    stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>    print(chunk<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>delta<span style="color:#f92672">.</span>content <span style="color:#f92672">or</span> <span style="color:#e6db74">&#34;&#34;</span>, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>)
</span></span></code></pre></div><h2 id="gemma-4-31b-benchmarks-how-it-stacks-up-against-gpt-4o-and-llama-4">Gemma 4 31B Benchmarks: How It Stacks Up Against GPT-4o and Llama 4</h2>
<p>Gemma 4 31B Dense achieves state-of-the-art results for its parameter count, posting 87.1% on MMLU versus GPT-4o&rsquo;s 86.5% — a meaningful reversal given the cost difference (free vs. API pricing). On GPQA Diamond, a graduate-level science benchmark that measures genuine reasoning depth, Gemma 4 31B scores 84.3%, compared to Llama 4 Scout&rsquo;s 74.3% despite Scout having a 109B total parameter count. The AIME 2026 score of 89.2% places it among the top tier of math-capable models available to run without an API. As of April 2026, Gemma 4 31B ranks #3 on the Chatbot Arena (LMSYS) leaderboard — the only fully open model in the top five. This makes it the strongest option for teams that need GPT-4o-class reasoning performance in an air-gapped or privacy-first deployment.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Gemma 4 31B</th>
          <th>GPT-4o</th>
          <th>Llama 4 Scout (109B MoE)</th>
          <th>Claude 3.7 Sonnet</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MMLU</td>
          <td>87.1%</td>
          <td>86.5%</td>
          <td>83.2%</td>
          <td>88.3%</td>
      </tr>
      <tr>
          <td>GPQA Diamond</td>
          <td>84.3%</td>
          <td>83.4%</td>
          <td>74.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>AIME 2026</td>
          <td>89.2%</td>
          <td>83.1%</td>
          <td>67.4%</td>
          <td>86.5%</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>85.4%</td>
          <td>87.0%</td>
          <td>79.3%</td>
          <td>86.1%</td>
      </tr>
      <tr>
          <td>Arena Rank</td>
          <td>#3</td>
          <td>#2</td>
          <td>#7</td>
          <td>#1</td>
      </tr>
  </tbody>
</table>
<p><em>Benchmarks sourced from Google DeepMind release notes and third-party evaluations, April–May 2026.</em></p>
<h2 id="optimization-tips-quantization-gpu-layers-and-context-window-tuning">Optimization Tips: Quantization, GPU Layers, and Context Window Tuning</h2>
<p>Ollama&rsquo;s default Q4_K_M quantization is the right choice for most users, reducing VRAM usage by 55–60% versus FP16 with minimal quality degradation. But beyond quantization format, there are several settings worth tuning to maximize performance on your specific hardware. The most impactful variable is GPU layer offloading (<code>num_gpu</code>) — Ollama automatically offloads as many layers as fit in VRAM, but you can override this with a <code>Modelfile</code>. Context window size (<code>num_ctx</code>) also directly affects VRAM usage: Gemma 4 31B supports 256K tokens, but setting a 4K or 8K context for coding tasks frees significant memory for additional parallel requests.</p>
<p><strong>Create a custom Modelfile for tuned inference:</strong></p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 392 169"
      >
      <g transform='translate(8,16)'>
<circle cx='216' cy='80' r='6' stroke='currentColor' fill='#fff'></circle>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='0' y='84' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='100' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='0' y='132' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='148' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='8' y='52' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='8' y='100' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='8' y='148' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='16' y='84' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='16' y='100' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='16' y='132' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='16' y='148' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='24' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='24' y='100' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='24' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='148' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='32' y='52' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='32' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='32' y='100' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='32' y='132' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='32' y='148' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='84' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='40' y='100' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='132' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='40' y='148' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='48' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='48' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='100' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='48' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='148' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='56' y='100' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='56' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='56' y='148' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='64' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='64' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='64' y='100' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='64' y='132' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='64' y='148' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='72' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='72' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='80' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='80' y='100' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='80' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='80' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='88' y='52' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='88' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='96' y='100' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='148' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='104' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='104' y='100' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='148' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='112' y='84' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='112' y='100' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='112' y='132' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='112' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='120' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='100' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='120' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='120' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='128' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='128' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='148' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='136' y='84' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='136' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='144' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='144' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='144' y='148' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='152' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='152' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='152' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='152' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='160' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='160' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='160' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='168' y='36' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='168' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='176' y='84' fill='currentColor' style='font-size:1em'>G</text>
<text text-anchor='middle' x='176' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='176' y='148' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='184' y='84' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='184' y='132' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='184' y='148' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='192' y='84' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='192' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='192' y='148' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='200' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='208' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='216' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='216' y='132' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='224' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='224' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='232' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='232' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='240' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='248' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='248' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='256' y='84' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='256' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='264' y='36' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='264' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='272' y='36' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='272' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='272' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='280' y='36' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='280' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='288' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='296' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='296' y='84' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='296' y='132' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='304' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='304' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='312' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='312' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='312' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='320' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='320' y='84' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='320' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='328' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='328' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='336' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='336' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='336' y='132' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='344' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='344' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='344' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='352' y='36' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='352' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='352' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='360' y='36' fill='currentColor' style='font-size:1em'>K</text>
<text text-anchor='middle' x='360' y='84' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='368' y='36' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='368' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='376' y='84' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p><strong>Build and run the custom model:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama create gemma4-coding -f Modelfile
</span></span><span style="display:flex;"><span>ollama run gemma4-coding
</span></span></code></pre></div><p><strong>Quantization options and trade-offs:</strong></p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>VRAM (31B)</th>
          <th>Quality</th>
          <th>Speed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FP16</td>
          <td>~62 GB</td>
          <td>Best</td>
          <td>Fastest per token</td>
      </tr>
      <tr>
          <td>Q8_0</td>
          <td>~33 GB</td>
          <td>Near-lossless</td>
          <td>Fast</td>
      </tr>
      <tr>
          <td>Q4_K_M</td>
          <td>~18–24 GB</td>
          <td>Good (default)</td>
          <td>Good</td>
      </tr>
      <tr>
          <td>Q4_0</td>
          <td>~17 GB</td>
          <td>Slightly lower</td>
          <td>Slightly faster</td>
      </tr>
      <tr>
          <td>Q3_K_M</td>
          <td>~14 GB</td>
          <td>Acceptable</td>
          <td>Fast on low VRAM</td>
      </tr>
  </tbody>
</table>
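<p>Context size can also be adjusted per request without building a custom Modelfile: Ollama's native API accepts an <code>options</code> object on each call. The values below are illustrative, not recommended defaults:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash">curl http://localhost:11434/api/generate \
  -d &#39;{
    &#34;model&#34;: &#34;gemma4:31b&#34;,
    &#34;prompt&#34;: &#34;Summarize the trade-offs of Q4_K_M quantization&#34;,
    &#34;options&#34;: {&#34;num_ctx&#34;: 8192, &#34;temperature&#34;: 0.2}
  }&#39;
</code></pre></div>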
<h3 id="monitoring-gpu-utilization">Monitoring GPU Utilization</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># NVIDIA</span>
</span></span><span style="display:flex;"><span>watch -n <span style="color:#ae81ff">1</span> nvidia-smi
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Mac (using powermetrics or mactop)</span>
</span></span><span style="display:flex;"><span>mactop
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama model status</span>
</span></span><span style="display:flex;"><span>ollama ps
</span></span></code></pre></div><h2 id="common-errors-and-fixes-when-running-gemma-4-31b-locally">Common Errors and Fixes When Running Gemma 4 31B Locally</h2>
<p>Most failures when running Gemma 4 31B locally fall into three categories: insufficient VRAM causing OOM errors, Ollama version mismatches that predate Gemma 4 support, and port conflicts preventing the API from starting. These are all straightforward to diagnose and fix — Ollama&rsquo;s error messages are specific enough to point directly to the root cause in most cases. The most common mistake is attempting to run the 31B model on a GPU with less than 20GB VRAM without adjusting quantization. The second most common is running Ollama 0.4.x, which predates the <code>gemma4</code> model tag and returns a &ldquo;model not found&rdquo; error regardless of what you pull.</p>
<p><strong>Error: <code>CUDA out of memory</code> or <code>error: model requires more system memory</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check available VRAM</span>
</span></span><span style="display:flex;"><span>nvidia-smi --query-gpu<span style="color:#f92672">=</span>memory.free,memory.total --format<span style="color:#f92672">=</span>csv
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Solution: Force a lower quantization by pulling a specific GGUF tag</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b-q3_k_m  <span style="color:#75715e"># ~14GB VRAM</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or switch to the 26B model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b
</span></span></code></pre></div><p><strong>Error: <code>model &quot;gemma4:31b&quot; not found</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama version (needs 0.5+)</span>
</span></span><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Update Ollama</span>
</span></span><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh  <span style="color:#75715e"># Linux</span>
</span></span><span style="display:flex;"><span>brew upgrade ollama  <span style="color:#75715e"># Mac</span>
</span></span></code></pre></div><p><strong>Error: <code>listen tcp :11434: bind: address already in use</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Another process is using port 11434</span>
</span></span><span style="display:flex;"><span>lsof -i :11434
</span></span><span style="display:flex;"><span>kill -9 &lt;PID&gt;
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>Slow generation speed (&lt; 5 tok/s on RTX 4090)</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Verify GPU is being used, not CPU</span>
</span></span><span style="display:flex;"><span>ollama ps  <span style="color:#75715e"># shows active model and runner type</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># If showing &#34;cpu&#34; runner, CUDA drivers may not be detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Verify the driver is visible, then restart the Ollama server:</span>
</span></span><span style="display:flex;"><span>nvidia-smi
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>Model loads but produces garbage output</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Corrupted model file — re-pull</span>
</span></span><span style="display:flex;"><span>ollama rm gemma4:31b
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p>The following questions address the most common issues and misconceptions when setting up Gemma 4 31B locally with Ollama. Hardware compatibility is the most frequent stumbling block — specifically the gap between a model&rsquo;s FP16 memory footprint and its quantized footprint. Gemma 4 31B at Q4_K_M requires roughly 18–24GB VRAM, not the 62GB you would need for FP16, which changes the hardware requirements dramatically. Other common points of confusion include the model variant naming (no &ldquo;27B&rdquo; variant exists in Gemma 4), offline operation capabilities (the model runs entirely air-gapped after the initial download completes), CPU fallback behavior when no compatible GPU is present, and licensing terms for commercial deployments. The Apache 2.0 license makes Gemma 4 31B fully usable in production environments without royalties or usage restrictions, which distinguishes it from some other open-weight models with more restrictive non-commercial terms.</p>
<h3 id="does-gemma-4-31b-require-an-internet-connection-after-download">Does Gemma 4 31B require an internet connection after download?</h3>
<p>No. Once <code>ollama pull gemma4:31b</code> completes, the model runs entirely offline. Ollama stores the weights in <code>~/.ollama/models/</code> and inference happens locally with no network calls. You can disconnect your machine from the internet and the model continues to work normally.</p>
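<p>To confirm the weights really are stored locally (and see how much disk they occupy), check the default model directory and the registered models:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Disk usage of locally stored model weights
du -sh ~/.ollama/models

# Installed models and their sizes
ollama list
</code></pre></div>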
<h3 id="can-i-run-gemma-4-31b-on-a-cpu-without-a-gpu">Can I run Gemma 4 31B on a CPU without a GPU?</h3>
<p>Yes, but it will be very slow. Ollama falls back to CPU inference automatically if no compatible GPU is detected. Expect 1–3 tokens per second on a modern desktop CPU versus 30–60+ tokens per second on an RTX 4090. For practical use, a GPU with at least 20GB VRAM is strongly recommended for the 31B variant.</p>
<h3 id="what-is-the-difference-between-gemma-4-31b-and-gemma-4-27b">What is the difference between Gemma 4 31B and Gemma 4 27B?</h3>
<p>There is no official &ldquo;27B&rdquo; variant in the Gemma 4 family. The lineup is E2B, E4B, 12B, 26B, and 31B. Some confusion arises because the earlier Gemma 2 generation shipped a 27B model. Gemma 4 31B is the top-tier dense model in the current release.</p>
<h3 id="how-do-i-update-gemma-4-to-a-newer-version-when-google-releases-one">How do I update Gemma 4 to a newer version when Google releases one?</h3>
<p>Run <code>ollama pull gemma4:31b</code> again. Ollama checks the registry for a newer manifest and downloads only the changed layers if an update is available. You can also use <code>ollama pull gemma4:latest</code> to always fetch the most recent Gemma 4 variant automatically.</p>
<h3 id="is-gemma-4-31b-safe-to-use-in-production-with-real-user-data">Is Gemma 4 31B safe to use in production with real user data?</h3>
<p>Gemma 4 31B is Apache 2.0 licensed, so commercial use is permitted without restriction. For production deployments handling sensitive user data, running it locally with Ollama is actually the privacy-correct approach — no data is sent to third-party servers. However, like all language models, it can produce hallucinations and should not be used for safety-critical decisions without human review and output validation.</p>
]]></content:encoded></item><item><title>Gemma 4 Review 2026: Google's Best Open-Source Model Yet?</title><link>https://baeseokjae.github.io/posts/gemma-4-review-2026/</link><pubDate>Thu, 07 May 2026 03:04:21 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gemma-4-review-2026/</guid><description>Gemma 4 review: benchmarks, model variants, Apache 2.0 license, and how it stacks up against Llama 4, GPT-4, and Claude in 2026.</description><content:encoded><![CDATA[<p>Gemma 4 is Google DeepMind&rsquo;s 2026 open-source model family — four model sizes from 2B (phone-optimized) to 31B dense, all under Apache 2.0, scoring 89.2% on AIME 2026 and ranking #3 on the Arena AI leaderboard. If you&rsquo;re evaluating open-weight models for production use today, Gemma 4 is the most commercially viable and technically competitive option available.</p>
<h2 id="what-is-gemma-4-googles-open-source-flagship-explained">What Is Gemma 4? Google&rsquo;s Open-Source Flagship Explained</h2>
<p>Gemma 4 is Google DeepMind&rsquo;s fourth-generation open-weight language model family, released on April 2, 2026, designed to cover the full deployment spectrum — from on-device inference on smartphones to large-scale server workloads. Unlike prior Gemma generations, Gemma 4 ships with genuine frontier-model performance: the 31B dense variant scores 84.3% on GPQA Diamond, outperforming Meta&rsquo;s Llama 4 Scout (109B) at 74.3%, and reaching 89.2% on the AIME 2026 math benchmark — a figure that was 20.8% just one generation earlier. The model family is multimodal (vision + audio input on edge models), multilingual (140+ languages), and supports context windows up to 256K tokens. Since Google&rsquo;s first Gemma release, developers have downloaded Gemma models over 400 million times, and the Gemmaverse now includes over 100,000 community-created fine-tunes and variants. That ecosystem depth means production-grade LoRA adapters, GGUF quants, and tool integrations are available day one — not months later. Gemma 4 is the model to benchmark any other open-weight model against in 2026.</p>
<h2 id="gemma-4-model-variants-e2b-e4b-26b-moe-and-31b-dense">Gemma 4 Model Variants: E2B, E4B, 26B MoE, and 31B Dense</h2>
<p>Gemma 4 ships as four distinct model sizes, each targeting a different hardware tier. The E2B (2B parameters) and E4B (4B parameters) are edge-optimized models built for mobile, IoT, and Raspberry Pi — the E2B achieves 3,700 prefill and 31 decode tokens per second on a Qualcomm Dragonwing IQ8 NPU, making real-time on-device inference viable for the first time in a frontier-class model family. Both edge variants support 128K context and multimodal input including audio. The 26B Mixture-of-Experts (MoE) model activates a fraction of its total parameters per forward pass, offering a better compute-per-quality tradeoff for mid-tier GPU servers — it ranks #6 on the Arena AI text leaderboard. The 31B Dense model is the flagship, activating all 31 billion parameters on each pass and delivering the best single-model quality of the family; it holds Arena AI #3 and beats models three to ten times its parameter count in head-to-head benchmark comparisons. All four models are distributed under Apache 2.0 with no monthly active user (MAU) restrictions, making them drop-in replacements for proprietary APIs in commercial products.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Context</th>
          <th>Best For</th>
          <th>Arena Rank</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>E2B</td>
          <td>2B</td>
          <td>128K</td>
          <td>Mobile / IoT</td>
          <td>—</td>
      </tr>
      <tr>
          <td>E4B</td>
          <td>4B</td>
          <td>128K</td>
          <td>Edge servers / Raspberry Pi</td>
          <td>—</td>
      </tr>
      <tr>
          <td>26B MoE</td>
          <td>26B total</td>
          <td>256K</td>
          <td>Mid-tier GPU workloads</td>
          <td>#6</td>
      </tr>
      <tr>
          <td>31B Dense</td>
          <td>31B</td>
          <td>256K</td>
          <td>Best quality, production API</td>
          <td>#3</td>
      </tr>
  </tbody>
</table>
<h2 id="key-features--multimodal-multilingual-and-256k-context">Key Features — Multimodal, Multilingual, and 256K Context</h2>
<p>Gemma 4 is the first Gemma generation to treat multimodality and multilingualism as first-class features rather than add-ons. The model was natively trained on over 140 languages — not post-trained via translation alignment — which means it generalizes better to low-resource languages like Swahili or Tagalog without the performance cliff common in English-centric models. Larger variants (26B MoE and 31B Dense) support a 256K token context window, enabling full-book RAG, multi-file code analysis, and long-form document summarization without chunking. Edge variants (E2B, E4B) handle images and audio as input, useful for mobile applications that need a local vision-language model without cloud round-trips. The model supports structured output modes (JSON schema enforcement), tool calling, and an agentic execution format compatible with LangChain, LlamaIndex, and Google&rsquo;s own Agent Development Kit (ADK). Practically speaking, this means Gemma 4 slots directly into existing LLM pipelines — you can swap a Gemini or GPT-4 API call for a self-hosted Gemma 4 endpoint with minimal prompt engineering changes.</p>
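<p>As a concrete sketch of that swap (assuming Ollama is serving Gemma 4 on its default port and the <code>gemma4:31b</code> tag is pulled), an existing OpenAI-style integration only needs a different base URL and model name:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Call the local OpenAI-compatible endpoint exposed by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma4:31b",
        "messages": [
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "List three uses of a 256K context window."}
        ],
        "temperature": 0.2
      }'
# In client libraries, the equivalent change is pointing the base URL at
# http://localhost:11434/v1 and setting the model name to gemma4:31b
</code></pre></div>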
<h3 id="256k-context-in-practice">256K Context in Practice</h3>
<p>The 256K context window means you can feed a full codebase, a legal contract library, or a year&rsquo;s worth of customer support tickets in a single prompt. In practice, retrieval quality degrades more gracefully than GPT-4 Turbo&rsquo;s over the 100K–200K range on &ldquo;lost in the middle&rdquo; evaluations — Gemma 4 maintains 82% retrieval accuracy at the 200K position versus GPT-4 Turbo&rsquo;s 71%. That&rsquo;s a meaningful difference for RAG-heavy applications where context length isn&rsquo;t just a checkbox.</p>
<h2 id="gemma-4-benchmark-results-how-good-is-it-really">Gemma 4 Benchmark Results: How Good Is It Really?</h2>
<p>Gemma 4&rsquo;s benchmark numbers represent the largest single-generation leap in the open-weight model ecosystem since the original Llama 2 release. On AIME 2026 (college-level math olympiad), the 31B model scores 89.2% — compared to Gemma 3&rsquo;s 20.8%, that&rsquo;s a 68-point jump in one generation. On LiveCodeBench v6 (competitive coding), Gemma 4 scores 80.0% vs 29.1% for Gemma 3 and 77.1% for Llama 4. On Codeforces ELO (programming contest simulation), the model went from 110 to 2,150 — moving from hobbyist-level to expert competitive programmer. MMLU (broad knowledge across 57 subjects) comes in at 87.1%, beating GPT-4&rsquo;s 86.5% while running entirely on local hardware at zero marginal API cost. GPQA Diamond (doctoral-level science questions) sits at 84.3%, a 10-point lead over Llama 4 Scout. These aren&rsquo;t cherry-picked metrics — Gemma 4&rsquo;s gains are consistent across math, science, coding, and language tasks.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Gemma 4 31B</th>
          <th>Gemma 3</th>
          <th>Llama 4 Scout</th>
          <th>GPT-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AIME 2026</td>
          <td><strong>89.2%</strong></td>
          <td>20.8%</td>
          <td>~75%</td>
          <td>~72%</td>
      </tr>
      <tr>
          <td>LiveCodeBench v6</td>
          <td><strong>80.0%</strong></td>
          <td>29.1%</td>
          <td>77.1%</td>
          <td>~74%</td>
      </tr>
      <tr>
          <td>GPQA Diamond</td>
          <td><strong>84.3%</strong></td>
          <td>—</td>
          <td>74.3%</td>
          <td>79.4%</td>
      </tr>
      <tr>
          <td>MMLU</td>
          <td><strong>87.1%</strong></td>
          <td>—</td>
          <td>~82%</td>
          <td>86.5%</td>
      </tr>
      <tr>
          <td>Codeforces ELO</td>
          <td><strong>2,150</strong></td>
          <td>110</td>
          <td>~1,900</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h3 id="whats-behind-the-gemma-3--gemma-4-leap">What&rsquo;s Behind the Gemma 3 → Gemma 4 Leap?</h3>
<p>The jump from 20.8% to 89.2% AIME isn&rsquo;t mysterious — Google invested heavily in two areas: chain-of-thought alignment using reinforcement learning from verifiable rewards (RLVR), and synthetic math data generation at scale. The same approach drove similar gains in Gemini 2.0 Flash Thinking. Essentially, Google solved the same problem OpenAI solved with o1, then distilled the reasoning capability into an open-weight model available to anyone with a GPU.</p>
<h2 id="gemma-4-vs-llama-4-vs-gpt-4-vs-claude--who-wins">Gemma 4 vs Llama 4 vs GPT-4 vs Claude — Who Wins?</h2>
<p>Gemma 4 is the most competitive open-weight model in 2026, but &ldquo;wins&rdquo; depends heavily on the task and your deployment constraints. Against Llama 4 Scout (109B, Meta&rsquo;s midrange model), Gemma 4 31B is smaller, faster to serve, and scores higher on every benchmark listed above — and while Llama 4 carries a 700M MAU commercial restriction, Gemma 4 has none. Against GPT-4, Gemma 4 31B matches or slightly exceeds performance on most benchmarks while costing nothing in API fees if self-hosted. The caveat: GPT-4 has better tooling, broader third-party integration, and no self-hosting burden. Against Claude 3.5 Sonnet, Gemma 4 trails on multi-step reasoning chains and creative writing tasks but is competitive on coding and factual recall. Against Qwen 3.5 27B (the strongest China-origin open model), Gemma 4 loses on SWE-bench Verified — Qwen&rsquo;s software engineering performance is currently superior — but Gemma 4 leads on multilingual tasks and edge deployment options.</p>
<table>
  <thead>
      <tr>
          <th>Use Case</th>
          <th>Winner</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>On-device / mobile</td>
          <td><strong>Gemma 4 E2B/E4B</strong></td>
          <td>Only frontier-grade model optimized for NPUs</td>
      </tr>
      <tr>
          <td>Math / science reasoning</td>
          <td><strong>Gemma 4 31B</strong></td>
          <td>89.2% AIME, 84.3% GPQA</td>
      </tr>
      <tr>
          <td>Software engineering tasks</td>
          <td><strong>Qwen 3.5 27B</strong></td>
          <td>Higher SWE-bench Verified score</td>
      </tr>
      <tr>
          <td>No-restriction commercial use</td>
          <td><strong>Gemma 4</strong></td>
          <td>Apache 2.0, no MAU cap</td>
      </tr>
      <tr>
          <td>Least operational burden</td>
          <td><strong>GPT-4 / Claude</strong></td>
          <td>No self-hosting needed</td>
      </tr>
      <tr>
          <td>Multilingual NLP</td>
          <td><strong>Gemma 4</strong></td>
          <td>140+ natively trained languages</td>
      </tr>
  </tbody>
</table>
<h2 id="on-device-and-edge-deployment-running-gemma-4-locally">On-Device and Edge Deployment: Running Gemma 4 Locally</h2>
<p>Gemma 4 is the only open model family in 2026 that genuinely spans from phones to data center servers under a single Apache 2.0 license. On a Qualcomm Dragonwing IQ8 NPU, the E2B model achieves 3,700 prefill tokens per second and 31 decode tokens per second — fast enough for real-time chat, live transcription assistance, and local document QA without cloud round-trips. On a MacBook Pro M3 with 36GB unified memory, the 31B dense model runs at approximately 25 tokens per second with llama.cpp&rsquo;s Metal backend, making it comfortable for developer use. On an NVIDIA RTX 4090 (24GB VRAM), the 31B model fits in 4-bit quantization and runs at ~55 tokens per second, suitable for local API servers. Day-one support spans Hugging Face Transformers, Ollama, vLLM, llama.cpp, and NVIDIA NIM — no custom inference infrastructure is required. For privacy-sensitive applications (healthcare, legal, finance), the ability to run a GPT-4-class model with zero data leaving the premises is the decisive factor, and Gemma 4 is the only model family that delivers this at every hardware tier.</p>
<h3 id="quick-start-with-ollama">Quick Start with Ollama</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull and run Gemma 4 31B locally</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span><span style="display:flex;"><span>ollama run gemma4:31b <span style="color:#e6db74">&#34;Explain quantum entanglement in 3 sentences&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Edge model for Raspberry Pi / low-memory devices</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:e4b
</span></span><span style="display:flex;"><span>ollama run gemma4:e4b
</span></span></code></pre></div><p>The E4B variant runs on 8GB RAM, making it viable on a Raspberry Pi 5 or any machine with 8GB+ of memory.</p>
<h2 id="apache-20-license--why-it-matters-for-developers-and-enterprises">Apache 2.0 License — Why It Matters for Developers and Enterprises</h2>
<p>Apache 2.0 is the gold standard for open-source commercial use, and Gemma 4&rsquo;s adoption of it without any active user restrictions is the most commercially significant licensing decision in the open-weight model space since Falcon&rsquo;s relicensing under Apache 2.0. Meta&rsquo;s Llama 4 license caps commercial use at 700 million monthly active users — a restriction that affects only a handful of companies today but signals Meta&rsquo;s intent to extract licensing revenue as models become infrastructure. Mistral&rsquo;s licenses have historically included use-case carve-outs. Gemma 4 imposes none of these restrictions. You can build a commercial product, embed it in enterprise software, redistribute model weights, and fine-tune for any vertical without royalty payments, revenue share, or usage caps. For startups especially, this matters: you&rsquo;re not betting your product&rsquo;s legal foundation on a company&rsquo;s continued goodwill or future license amendments. For enterprises whose legal teams require OSI-approved licenses in vendor dependency review, Apache 2.0 clears that bar cleanly — and Gemma 4 is the best-performing Apache 2.0 model available in 2026. The Gemmaverse&rsquo;s 100,000+ community variants also mean that if you need a fine-tuned model for your vertical (medical, legal, code), there&rsquo;s almost certainly an Apache 2.0 derivative already available on Hugging Face.</p>
<h2 id="gemma-4-limitations-and-weaknesses-you-should-know">Gemma 4 Limitations and Weaknesses You Should Know</h2>
<p>Gemma 4 is the best open-weight model in 2026, but it has real limitations that should inform deployment decisions. First, there is no native speech output — the E2B and E4B models accept audio input but cannot generate audio, requiring a separate TTS pipeline for voice applications. Second, the model has a fixed knowledge cutoff with no internet access; for applications requiring real-time information retrieval, you&rsquo;ll need to wire up a RAG pipeline or tool-call layer. Third, self-hosting shifts operational responsibility to you: fine-tuning, weight management, serving infrastructure, uptime, and security are all your problem. That&rsquo;s valuable for privacy and cost at scale, but it&rsquo;s a meaningful engineering overhead compared to a managed API. Fourth, on SWE-bench Verified (real-world software engineering tasks), Gemma 4 trails Qwen 3.5 27B — if software engineering automation is your primary use case, Qwen deserves evaluation. Fifth, while Codeforces ELO is strong at 2,150, complex multi-file refactoring and codebase-level reasoning remain areas where Claude 3.7 Sonnet and GPT-4.1 pull ahead. These are real tradeoffs, not dealbreakers — but understanding them prevents over-application of the model.</p>
<h3 id="known-limitations-summary">Known Limitations Summary</h3>
<ul>
<li>No audio output (input only on E2B/E4B)</li>
<li>Fixed knowledge cutoff, no web access</li>
<li>Self-hosting burden: infra, updates, and security are on you</li>
<li>Trails Qwen 3.5 27B on SWE-bench Verified</li>
<li>Complex multi-file refactoring: Claude 3.7 Sonnet still leads</li>
</ul>
<h2 id="who-should-use-gemma-4-practical-recommendations">Who Should Use Gemma 4? Practical Recommendations</h2>
<p>Gemma 4 is the right choice for four specific developer and enterprise profiles, and the wrong choice for two others. If you are building mobile or edge AI applications, Gemma 4 E2B/E4B is the only production-grade option — no other frontier model family runs on Qualcomm NPUs at 3,700 prefill tokens per second. If you are building privacy-sensitive applications in healthcare, legal, or finance where data cannot leave your infrastructure, the 31B dense model delivers GPT-4-class performance with zero cloud dependency. If you are a startup or enterprise that needs Apache 2.0 with no user caps, Gemma 4 is the only frontier model that qualifies. If you need strong multilingual support for 140+ languages, Gemma 4&rsquo;s native language training beats every other open-weight alternative. Gemma 4 is the wrong choice if you need zero operational overhead — in that case, the managed Claude or GPT-4 APIs are simpler. It&rsquo;s also the wrong first choice if software engineering automation (automated code review, PR generation, issue resolution) is your core use case; benchmark Qwen 3.5 27B alongside Gemma 4 before committing.</p>
<p><strong>Recommended for:</strong></p>
<ul>
<li>Mobile / IoT / edge AI deployments</li>
<li>Privacy-first applications (HIPAA, GDPR, finance)</li>
<li>Commercial products needing Apache 2.0 at any scale</li>
<li>Multilingual NLP applications</li>
<li>Math, science, and coding assistants</li>
</ul>
<p><strong>Consider alternatives for:</strong></p>
<ul>
<li>Automated software engineering (evaluate Qwen 3.5 27B)</li>
<li>Zero-infrastructure managed API (use Claude or GPT-4)</li>
</ul>
<h2 id="final-verdict-is-gemma-4-googles-best-open-source-model-yet">Final Verdict: Is Gemma 4 Google&rsquo;s Best Open-Source Model Yet?</h2>
<p>Gemma 4 is definitively Google&rsquo;s best open-source model and the strongest open-weight model family released in 2026. The combination of 89.2% AIME performance, Arena AI #3 ranking, a 256K context window, genuine edge deployment to phones and IoT devices, and an unrestricted Apache 2.0 license has no equivalent in the open-weight ecosystem. The Gemma 3 → Gemma 4 leap — driven by RLVR training and synthetic reasoning data — demonstrates that Google has solved the reasoning gap that made Gemma 3 a second-tier choice. The 400M+ download history and 100,000+ community variants mean production infrastructure, tooling, and domain-specific fine-tunes exist now. If you were waiting for an open-weight model that could realistically replace a proprietary API for most production workloads, Gemma 4 is that model. The primary caveat is operational: self-hosting is still non-trivial, and for teams without ML infrastructure expertise, the managed API path remains more practical despite the cost and privacy tradeoffs. But for developers and enterprises who have made the infrastructure investment, Gemma 4 is the model to run in 2026.</p>
<p><strong>Bottom line:</strong> If you&rsquo;re evaluating open-weight models today, start with Gemma 4 31B. It outperforms everything at its parameter count, holds a license that never expires or changes, and runs on hardware you probably already have.</p>
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Is Gemma 4 free to use commercially?</strong>
Yes. Gemma 4 is released under Apache 2.0 with no active user caps, no revenue share, and no royalty requirements. You can build and ship commercial products using Gemma 4 weights without any licensing fees or usage restrictions.</p>
<p><strong>How does Gemma 4 compare to Llama 4?</strong>
Gemma 4 31B outperforms Llama 4 Scout (109B) on GPQA Diamond (84.3% vs 74.3%), LiveCodeBench v6 (80.0% vs 77.1%), and AIME 2026. Gemma 4 also has no MAU commercial restrictions vs Llama 4&rsquo;s 700M MAU cap, and it genuinely supports on-device deployment which Llama 4 does not.</p>
<p><strong>Can Gemma 4 run on a laptop?</strong>
Yes. The E4B model runs on 8GB RAM (laptop-viable), the 26B MoE runs well on a machine with 24GB+ RAM or VRAM, and the 31B Dense runs on a MacBook Pro M3 with 36GB unified memory at ~25 tokens/second with Ollama.</p>
<p><strong>What is Gemma 4&rsquo;s context window?</strong>
The 26B MoE and 31B Dense models support 256K tokens. The edge models (E2B, E4B) support 128K tokens. At 256K, the model can process approximately 200,000 words — roughly three full novels — in a single prompt.</p>
<p><strong>Does Gemma 4 support multimodal inputs?</strong>
Yes. The E2B and E4B edge models accept image and audio inputs. The 26B MoE and 31B Dense models accept image inputs. None of the current Gemma 4 variants generate audio or image outputs — text output only.</p>
]]></content:encoded></item></channel></rss>