<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Gemma 4 on RockB</title><link>https://baeseokjae.github.io/tags/gemma-4/</link><description>Recent content in Gemma 4 on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 07 May 2026 06:04:01 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/gemma-4/index.xml" rel="self" type="application/rss+xml"/><item><title>Run Gemma 4 Locally in 2026: 31B Dense Setup Guide with Ollama</title><link>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</link><pubDate>Thu, 07 May 2026 06:04:01 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</guid><description>Step-by-step guide to running Gemma 4 31B Dense locally with Ollama — hardware requirements, installation, Open WebUI, and API usage.</description><content:encoded><![CDATA[<p>Gemma 4 31B Dense runs locally on a single RTX 4090 or Mac M3 Max using Ollama — no API key, no data leaving your machine. Install Ollama, run <code>ollama pull gemma4:31b</code>, and you have a model that scores 87.1% on MMLU, beating GPT-4o&rsquo;s 86.5%, running entirely on your hardware.</p>
<h2 id="what-is-gemma-4-31b-dense-and-why-run-it-locally">What Is Gemma 4 31B Dense and Why Run It Locally?</h2>
<p>Gemma 4 31B Dense is a 31-billion-parameter language model released by Google DeepMind on April 2, 2026, under the Apache 2.0 license. Unlike mixture-of-experts architectures that distribute parameters across sparse expert layers, the 31B Dense model activates all 31 billion parameters on every token — giving it more reliable reasoning depth than larger MoE models with similar active parameter counts. In benchmark testing, Gemma 4 31B scores 87.1% on MMLU (beating GPT-4o&rsquo;s 86.5%), 89.2% on AIME 2026, and 84.3% on GPQA Diamond — outperforming Llama 4 Scout&rsquo;s 109B MoE model on the harder science benchmarks. Running it locally means zero API costs, complete data privacy, no rate limits, and the ability to integrate with any tool via the OpenAI-compatible REST endpoint that Ollama exposes on <code>localhost:11434</code>. For developers, researchers, or privacy-conscious users, this is the highest-performing open model available for on-device inference as of mid-2026.</p>
<h3 id="dense-vs-moe-why-the-architecture-matters-for-local-inference">Dense vs. MoE: Why the Architecture Matters for Local Inference</h3>
<p>A dense model like Gemma 4 31B activates every parameter on every forward pass. An MoE model like Llama 4 Scout (109B total, ~17B active) routes each token through only a subset of expert layers. For local inference, the dense architecture has a decisive advantage: total VRAM needed corresponds directly to the active parameter count. With Q4_K_M quantization — Ollama&rsquo;s default — Gemma 4 31B fits in approximately 24GB VRAM, which is exactly what a single RTX 4090 or RTX 6000 Ada provides. A 109B MoE model at the same quantization still requires routing infrastructure and substantially more memory even if active parameters are lower, making it harder to run on consumer hardware without CPU offloading.</p>
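<p>As a rough sanity check, you can estimate the weight footprint yourself. The one-liner below assumes Q4_K_M averages roughly 4.8 bits per weight, which is an approximation rather than an exact constant:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Back-of-envelope weight memory for a 31B dense model at ~4.8 bits/weight (assumed Q4_K_M average)
echo &#34;scale=1; 31 * 10^9 * 4.8 / 8 / 1024^3&#34; | bc
# Prints ~17.3 (GiB of weights); KV cache and runtime overhead push the practical total toward 20-24 GB
</code></pre></div>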
<h2 id="gemma-4-model-variants-e2b-e4b-26b-and-31b-compared">Gemma 4 Model Variants: E2B, E4B, 26B, and 31B Compared</h2>
<p>Gemma 4 ships in four variants with meaningfully different hardware requirements and capability profiles. The E2B (2B Edge) and E4B (4B Edge) models are designed for mobile and embedded deployment — they feature native audio input and a 128K context window, making them unique among the family. The 26B and 31B models target server and workstation deployment, both supporting a 256K token context window and excelling at multi-step reasoning, coding, and mathematics. The 31B Dense specifically is the flagship for local deployment: it is natively trained on over 140 languages, released under Apache 2.0, and achieves GPT-4o-class performance on a single high-end consumer GPU. The choice between variants comes down almost entirely to available VRAM, since quality scales predictably across the lineup.</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Active Params</th>
          <th>VRAM (Q4_K_M)</th>
          <th>VRAM (FP16)</th>
          <th>Context</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>E2B</td>
          <td>2B</td>
          <td>~1.5 GB</td>
          <td>~4 GB</td>
          <td>128K</td>
          <td>Mobile, edge devices</td>
      </tr>
      <tr>
          <td>E4B</td>
          <td>4B</td>
          <td>~2.8 GB</td>
          <td>~8 GB</td>
          <td>128K</td>
          <td>Laptop CPU/integrated GPU</td>
      </tr>
      <tr>
          <td>12B</td>
          <td>12B</td>
          <td>~6.6 GB</td>
          <td>~24 GB</td>
          <td>128K</td>
          <td>RTX 3060, M2 MacBook</td>
      </tr>
      <tr>
          <td>26B</td>
          <td>26B</td>
          <td>~14 GB</td>
          <td>~52 GB</td>
          <td>256K</td>
          <td>RTX 3090, M3 Pro</td>
      </tr>
      <tr>
          <td>31B</td>
          <td>31B</td>
          <td>~18–24 GB</td>
          <td>~62 GB</td>
          <td>256K</td>
          <td>RTX 4090, M3 Max, M4 Ultra</td>
      </tr>
  </tbody>
</table>
<h3 id="which-variant-should-you-pick">Which Variant Should You Pick?</h3>
<p>If you have 24GB VRAM (RTX 4090, RTX 6000 Ada) or 32GB+ unified memory (M3 Max, M4 Pro/Max), run the 31B. If you have 16GB VRAM (RTX 4080, A4000), run the 26B at Q4_K_M. For anything with 8–12GB VRAM (RTX 3060 12GB, RTX 4060 Ti 16GB), the 12B variant is the correct choice — it requires only 6.6GB VRAM at Q4 quantization and delivers strong coding and reasoning performance. The E2B and E4B are specifically for devices without a discrete GPU.</p>
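<p>If you are not sure what your machine provides, the commands below report GPU VRAM on NVIDIA systems and total unified memory on Apple Silicon; they are a quick pre-check, not part of the Ollama setup itself:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># NVIDIA: total and free VRAM per GPU
nvidia-smi --query-gpu=memory.total,memory.free --format=csv

# Apple Silicon: total unified memory in bytes
sysctl hw.memsize
</code></pre></div>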
<h2 id="hardware-requirements-for-gemma-4-31b-vram-ram-cpu">Hardware Requirements for Gemma 4 31B (VRAM, RAM, CPU)</h2>
<p>Gemma 4 31B Dense requires 24GB VRAM at Q4_K_M quantization or 62GB VRAM at full FP16 precision. In practice, Q4_K_M is the correct target for consumer hardware: Ollama defaults to this quantization automatically, reducing memory usage by approximately 55–60% compared to FP16, with only a marginal quality drop that is typically imperceptible in conversational and coding tasks. The minimum viable GPU is a single RTX 4090 (24GB). For Mac users, the M3 Max (36GB or 48GB unified memory) and M4 Pro/Max provide excellent performance because Apple Silicon shares memory between CPU and GPU — you can run the 31B comfortably with 36GB total unified memory. Linux workstations with dual RTX 3090s (24GB each) can also run the 31B by splitting the model across GPUs, though this requires additional configuration and results in slower inference than a single 4090.</p>
<table>
  <thead>
      <tr>
          <th>GPU / Platform</th>
          <th>VRAM / Unified Memory</th>
          <th>Gemma 4 31B (Q4)?</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RTX 4090</td>
          <td>24 GB</td>
          <td>Yes</td>
          <td>Ideal single-GPU setup</td>
      </tr>
      <tr>
          <td>RTX 6000 Ada</td>
          <td>48 GB</td>
          <td>Yes</td>
          <td>Runs FP16 too</td>
      </tr>
      <tr>
          <td>RTX 4080</td>
          <td>16 GB</td>
          <td>No</td>
          <td>Use 26B instead</td>
      </tr>
      <tr>
          <td>RTX 3090 x2</td>
          <td>48 GB total</td>
          <td>Yes</td>
          <td>Slower, split model</td>
      </tr>
      <tr>
          <td>M3 Max 36GB</td>
          <td>36 GB unified</td>
          <td>Yes</td>
          <td>Excellent tok/s</td>
      </tr>
      <tr>
          <td>M4 Max 64GB</td>
          <td>64 GB unified</td>
          <td>Yes</td>
          <td>Can run FP16</td>
      </tr>
      <tr>
          <td>M2 MacBook Pro 16GB</td>
          <td>16 GB unified</td>
          <td>No</td>
          <td>Use 12B instead</td>
      </tr>
  </tbody>
</table>
<p><strong>System RAM:</strong> Ollama also uses system RAM for the context cache. Aim for at least 32GB system RAM when running 31B. CPU doesn&rsquo;t significantly affect generation speed once the model is loaded into VRAM — but fast NVMe SSD storage (PCIe 4.0+) reduces initial model load time from cold.</p>
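<p>Before pulling the roughly 18–20GB of weights, it is worth confirming free system RAM and disk space. The commands below assume Ollama's default model directory:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># System RAM (Linux)
free -h

# Free disk space on your home volume (Ollama stores models under ~/.ollama by default)
df -h ~
</code></pre></div>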
<h2 id="step-1--install-ollama-on-mac-windows-or-linux">Step 1 — Install Ollama on Mac, Windows, or Linux</h2>
<p>Ollama is the fastest path to running Gemma 4 31B locally, providing a one-command model download, automatic quantization selection, and an OpenAI-compatible REST API out of the box. It abstracts away model sharding, quantization configuration, and the llama.cpp backend — you get a clean CLI and HTTP interface without needing to understand the internals. As of May 2026, Ollama supports CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), and CPU-only inference. Installation is straightforward across all three major operating systems, and the entire setup from zero to a running model takes under 10 minutes on a fast internet connection. Ollama version 0.5+ is required for Gemma 4 support — older versions do not have the model architecture registered.</p>
<p><strong>Mac:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>brew install ollama
</span></span><span style="display:flex;"><span><span style="color:#75715e"># or download the .dmg from ollama.com</span>
</span></span><span style="display:flex;"><span>ollama serve  <span style="color:#75715e"># starts the background server</span>
</span></span></code></pre></div><p><strong>Linux:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Automatically installs CUDA drivers if NVIDIA GPU detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Service starts automatically via systemd</span>
</span></span></code></pre></div><p><strong>Windows:</strong>
Download the installer from <a href="https://ollama.com">ollama.com</a>. The installer configures a background Windows service and adds <code>ollama</code> to PATH. CUDA support requires NVIDIA drivers 525.85+.</p>
<p><strong>Verify the install:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Should output: ollama version 0.5.x or higher</span>
</span></span></code></pre></div><h2 id="step-2--pull-and-run-the-gemma-4-31b-model-with-ollama">Step 2 — Pull and Run the Gemma 4 31B Model with Ollama</h2>
<p>Pulling Gemma 4 31B downloads approximately 18–20GB of model weights in Q4_K_M format. Ollama handles quantization and model registration automatically — no manual GGUF conversion or configuration required. The model is pulled from Ollama&rsquo;s model registry, which mirrors the Hugging Face checkpoint in a pre-quantized GGUF format. On a 500 Mbps connection, the download takes roughly 5–7 minutes. Once complete, the model is cached locally in <code>~/.ollama/models/</code> and subsequent loads are instant. The Gemma 4 31B Ollama tag is <code>gemma4:31b</code> — note this differs from the Hugging Face naming convention.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull the 31B Dense model (Q4_K_M by default, ~18GB)</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Run an interactive chat session</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Example prompt after model loads:</span>
</span></span><span style="display:flex;"><span>&gt;&gt;&gt; Explain the difference between dense and MoE transformer architectures.
</span></span></code></pre></div><p><strong>Other variants:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama pull gemma4:2b    <span style="color:#75715e"># E2B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:4b    <span style="color:#75715e"># E4B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:12b   <span style="color:#75715e"># 12B standard</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b   <span style="color:#75715e"># 26B standard</span>
</span></span></code></pre></div><p><strong>Check which models are installed:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama list
</span></span></code></pre></div><p><strong>Stop a running model session:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># In the chat, press Ctrl+D or type /bye</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># To unload the model from memory without stopping the server:</span>
</span></span><span style="display:flex;"><span>ollama stop gemma4:31b
</span></span></code></pre></div><h3 id="running-multiple-prompts-via-the-cli">Running Multiple Prompts via the CLI</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Non-interactive single prompt</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b <span style="color:#e6db74">&#34;Write a Python function that parses JSON from a REST API response&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Pipe stdin for batch processing</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">&#34;Summarize this text: </span><span style="color:#66d9ef">$(</span>cat document.txt<span style="color:#66d9ef">)</span><span style="color:#e6db74">&#34;</span> | ollama run gemma4:31b
</span></span></code></pre></div><h2 id="step-3--set-up-open-webui-for-a-chatgpt-like-interface">Step 3 — Set Up Open WebUI for a ChatGPT-Like Interface</h2>
<p>Open WebUI is an open-source browser interface that connects directly to Ollama, providing a polished chat experience with conversation history, model switching, file uploads, and system prompt configuration — all running locally. It runs as a Docker container and takes under 2 minutes to set up once Docker is installed. The interface is accessible at <code>http://localhost:3000</code> and supports multiple users, making it useful for team deployments on a local network where a shared Gemma 4 instance is hosted on a single powerful machine. Open WebUI automatically detects all models registered in Ollama, so switching between the 12B and 31B variants is a dropdown selection in the interface.</p>
<p><strong>Prerequisites:</strong> Docker Desktop (Mac/Windows) or Docker Engine (Linux).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull and start Open WebUI with Ollama auto-detection</span>
</span></span><span style="display:flex;"><span>docker run -d <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 3000:8080 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --add-host<span style="color:#f92672">=</span>host.docker.internal:host-gateway <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -v open-webui:/app/backend/data <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --name open-webui <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --restart always <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  ghcr.io/open-webui/open-webui:main
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access at: http://localhost:3000</span>
</span></span></code></pre></div><p>On first launch, create an admin account (local only — no external services involved). Under Settings → Models, Gemma 4 31B should appear automatically if Ollama is running. Select it as the default model and start chatting.</p>
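<p>If no models appear in the dropdown, two quick checks usually pinpoint the problem: confirm the Ollama API answers on the host, and inspect the container logs for connection errors (the container name matches the <code>docker run</code> command above):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Confirm Ollama is reachable and lists installed models
curl http://localhost:11434/api/tags

# Check Open WebUI logs for errors reaching host.docker.internal
docker logs --tail 50 open-webui
</code></pre></div>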
<h2 id="using-the-gemma-4-31b-local-api-openai-compatible">Using the Gemma 4 31B Local API (OpenAI-Compatible)</h2>
<p>Ollama exposes an OpenAI-compatible REST API at <code>http://localhost:11434/v1</code>, allowing any tool or application that supports the OpenAI SDK to use Gemma 4 31B as a drop-in replacement. This means you can point VS Code extensions like Continue, Python scripts using the <code>openai</code> library, or LangChain pipelines directly at your local Gemma 4 31B instance without modifying code — just change the base URL and set the API key to any non-empty string (Ollama ignores it but the SDK requires a value). This makes Gemma 4 31B an immediately usable private coding assistant with zero monthly cost, zero rate limits, and no data ever leaving your machine.</p>
<p><strong>Python (OpenAI SDK):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:11434/v1&#34;</span>,
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;ollama&#34;</span>,  <span style="color:#75715e"># required but ignored by Ollama</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Review this Python function for bugs: def parse(x): return x[&#39;data&#39;]&#34;</span>}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span></code></pre></div><p><strong>curl:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;gemma4:31b&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Hello, Gemma 4!&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p><strong>Native Ollama API (also available):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/generate <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;gemma4:31b&#34;, &#34;prompt&#34;: &#34;Explain gradient descent&#34;}&#39;</span>
</span></span></code></pre></div><h3 id="streaming-responses">Streaming Responses</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Write a FastAPI endpoint for user authentication&#34;</span>}],
</span></span><span style="display:flex;"><span>    stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>    print(chunk<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>delta<span style="color:#f92672">.</span>content <span style="color:#f92672">or</span> <span style="color:#e6db74">&#34;&#34;</span>, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>)
</span></span></code></pre></div><h2 id="gemma-4-31b-benchmarks-how-it-stacks-up-against-gpt-4o-and-llama-4">Gemma 4 31B Benchmarks: How It Stacks Up Against GPT-4o and Llama 4</h2>
<p>Gemma 4 31B Dense achieves state-of-the-art results for its parameter count, posting 87.1% on MMLU versus GPT-4o&rsquo;s 86.5% — a meaningful reversal given the cost difference (free vs. API pricing). On GPQA Diamond, a graduate-level science benchmark that measures genuine reasoning depth, Gemma 4 31B scores 84.3%, compared to Llama 4 Scout&rsquo;s 74.3% despite Scout having a 109B total parameter count. The AIME 2026 score of 89.2% places it among the top tier of math-capable models available to run without an API. As of April 2026, Gemma 4 31B ranks #3 on the Chatbot Arena (LMSYS) leaderboard — the only fully open model in the top five. This makes it the strongest option for teams that need GPT-4o-class reasoning performance in an air-gapped or privacy-first deployment.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Gemma 4 31B</th>
          <th>GPT-4o</th>
          <th>Llama 4 Scout (109B MoE)</th>
          <th>Claude 3.7 Sonnet</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MMLU</td>
          <td>87.1%</td>
          <td>86.5%</td>
          <td>83.2%</td>
          <td>88.3%</td>
      </tr>
      <tr>
          <td>GPQA Diamond</td>
          <td>84.3%</td>
          <td>83.4%</td>
          <td>74.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>AIME 2026</td>
          <td>89.2%</td>
          <td>83.1%</td>
          <td>67.4%</td>
          <td>86.5%</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>85.4%</td>
          <td>87.0%</td>
          <td>79.3%</td>
          <td>86.1%</td>
      </tr>
      <tr>
          <td>Arena Rank</td>
          <td>#3</td>
          <td>#2</td>
          <td>#7</td>
          <td>#1</td>
      </tr>
  </tbody>
</table>
<p><em>Benchmarks sourced from Google DeepMind release notes and third-party evaluations, April–May 2026.</em></p>
<h2 id="optimization-tips-quantization-gpu-layers-and-context-window-tuning">Optimization Tips: Quantization, GPU Layers, and Context Window Tuning</h2>
<p>Ollama&rsquo;s default Q4_K_M quantization is the right choice for most users, reducing VRAM usage by 55–60% versus FP16 with minimal quality degradation. But beyond quantization format, there are several settings worth tuning to maximize performance on your specific hardware. The most impactful variable is GPU layer offloading (<code>num_gpu</code>) — Ollama automatically offloads as many layers as fit in VRAM, but you can override this with a <code>Modelfile</code>. Context window size (<code>num_ctx</code>) also directly affects VRAM usage: Gemma 4 31B supports 256K tokens, but setting a 4K or 8K context for coding tasks frees significant memory for additional parallel requests.</p>
<p><strong>Create a custom Modelfile for tuned inference:</strong></p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 392 169"
      >
      <g transform='translate(8,16)'>
<circle cx='216' cy='80' r='6' stroke='currentColor' fill='#fff'></circle>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='52' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='0' y='84' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='100' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='0' y='132' fill='currentColor' style='font-size:1em'>#</text>
<text text-anchor='middle' x='0' y='148' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='8' y='52' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='8' y='100' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='8' y='148' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='16' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='16' y='84' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='16' y='100' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='16' y='132' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='16' y='148' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='52' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='24' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='24' y='100' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='24' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='24' y='148' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='32' y='52' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='32' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='32' y='100' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='32' y='132' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='32' y='148' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='40' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='84' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='40' y='100' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='40' y='132' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='40' y='148' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='48' y='52' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='48' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='100' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='48' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='48' y='148' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='56' y='52' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='56' y='100' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='56' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='56' y='148' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='64' y='52' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='64' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='64' y='100' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='64' y='132' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='64' y='148' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='72' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='72' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='72' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='80' y='52' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='80' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='80' y='100' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='80' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='80' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='88' y='52' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='88' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='96' y='100' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='148' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='104' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='104' y='100' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='148' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='112' y='84' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='112' y='100' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='112' y='132' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='112' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='120' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='100' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='120' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='120' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='128' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='128' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='148' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='136' y='84' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='136' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='144' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='144' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='144' y='148' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='152' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='152' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='152' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='152' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='160' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='160' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='160' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='168' y='36' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='168' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='176' y='84' fill='currentColor' style='font-size:1em'>G</text>
<text text-anchor='middle' x='176' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='176' y='148' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='184' y='84' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='184' y='132' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='184' y='148' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='192' y='84' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='192' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='192' y='148' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='200' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='208' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='216' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='216' y='132' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='224' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='224' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='232' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='232' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='240' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='248' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='248' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='256' y='84' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='256' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='264' y='36' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='264' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='272' y='36' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='272' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='272' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='280' y='36' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='280' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='288' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='296' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='296' y='84' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='296' y='132' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='304' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='304' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='312' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='312' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='312' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='320' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='320' y='84' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='320' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='328' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='328' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='336' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='336' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='336' y='132' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='344' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='344' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='344' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='352' y='36' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='352' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='352' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='360' y='36' fill='currentColor' style='font-size:1em'>K</text>
<text text-anchor='middle' x='360' y='84' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='368' y='36' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='368' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='376' y='84' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p><strong>Build and run the custom model:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama create gemma4-coding -f Modelfile
</span></span><span style="display:flex;"><span>ollama run gemma4-coding
</span></span></code></pre></div><p><strong>Quantization options and trade-offs:</strong></p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>VRAM (31B)</th>
          <th>Quality</th>
          <th>Speed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FP16</td>
          <td>~62 GB</td>
          <td>Best</td>
          <td>Fastest per token</td>
      </tr>
      <tr>
          <td>Q8_0</td>
          <td>~33 GB</td>
          <td>Near-lossless</td>
          <td>Fast</td>
      </tr>
      <tr>
          <td>Q4_K_M</td>
          <td>~18–24 GB</td>
          <td>Good (default)</td>
          <td>Good</td>
      </tr>
      <tr>
          <td>Q4_0</td>
          <td>~17 GB</td>
          <td>Slightly lower</td>
          <td>Slightly faster</td>
      </tr>
      <tr>
          <td>Q3_K_M</td>
          <td>~14 GB</td>
          <td>Acceptable</td>
          <td>Fast on low VRAM</td>
      </tr>
  </tbody>
</table>
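<p>Context size can also be adjusted per request without building a custom Modelfile: Ollama's native API accepts an <code>options</code> object on each call. The values below are illustrative, not recommended defaults:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash">curl http://localhost:11434/api/generate \
  -d &#39;{
    &#34;model&#34;: &#34;gemma4:31b&#34;,
    &#34;prompt&#34;: &#34;Summarize the trade-offs of Q4_K_M quantization&#34;,
    &#34;options&#34;: {&#34;num_ctx&#34;: 8192, &#34;temperature&#34;: 0.2}
  }&#39;
</code></pre></div>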
<h3 id="monitoring-gpu-utilization">Monitoring GPU Utilization</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># NVIDIA</span>
</span></span><span style="display:flex;"><span>watch -n <span style="color:#ae81ff">1</span> nvidia-smi
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Mac (using powermetrics or mactop)</span>
</span></span><span style="display:flex;"><span>mactop
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama model status</span>
</span></span><span style="display:flex;"><span>ollama ps
</span></span></code></pre></div><h2 id="common-errors-and-fixes-when-running-gemma-4-31b-locally">Common Errors and Fixes When Running Gemma 4 31B Locally</h2>
<p>Most failures when running Gemma 4 31B locally fall into three categories: insufficient VRAM causing OOM errors, Ollama version mismatches that predate Gemma 4 support, and port conflicts preventing the API from starting. These are all straightforward to diagnose and fix — Ollama&rsquo;s error messages are specific enough to point directly to the root cause in most cases. The most common mistake is attempting to run the 31B model on a GPU with less than 20GB VRAM without adjusting quantization. The second most common is running Ollama 0.4.x, which predates the <code>gemma4</code> model tag and returns a &ldquo;model not found&rdquo; error regardless of what you pull.</p>
<p><strong>Error: <code>CUDA out of memory</code> or <code>error: model requires more system memory</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check available VRAM</span>
</span></span><span style="display:flex;"><span>nvidia-smi --query-gpu<span style="color:#f92672">=</span>memory.free,memory.total --format<span style="color:#f92672">=</span>csv
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Solution: Force a lower quantization by pulling a specific GGUF tag</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b-q3_k_m  <span style="color:#75715e"># ~14GB VRAM</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or switch to the 26B model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b
</span></span></code></pre></div><p><strong>Error: <code>model &quot;gemma4:31b&quot; not found</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama version (needs 0.5+)</span>
</span></span><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Update Ollama</span>
</span></span><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh  <span style="color:#75715e"># Linux</span>
</span></span><span style="display:flex;"><span>brew upgrade ollama  <span style="color:#75715e"># Mac</span>
</span></span></code></pre></div><p><strong>Error: <code>listen tcp :11434: bind: address already in use</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Another process is using port 11434</span>
</span></span><span style="display:flex;"><span>lsof -i :11434
</span></span><span style="display:flex;"><span>kill -9 &lt;PID&gt;
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>Slow generation speed (&lt; 5 tok/s on RTX 4090)</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Verify GPU is being used, not CPU</span>
</span></span><span style="display:flex;"><span>ollama ps  <span style="color:#75715e"># shows active model and runner type</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># If showing &#34;cpu&#34; runner, CUDA drivers may not be detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Verify the driver is visible, then restart the Ollama server:</span>
</span></span><span style="display:flex;"><span>nvidia-smi
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>Model loads but produces garbage output</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Corrupted model file — re-pull</span>
</span></span><span style="display:flex;"><span>ollama rm gemma4:31b
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p>The following questions address the most common issues and misconceptions when setting up Gemma 4 31B locally with Ollama. Hardware compatibility is the most frequent stumbling block — specifically the gap between a model&rsquo;s FP16 memory footprint and its quantized footprint. Gemma 4 31B at Q4_K_M requires roughly 18–24GB VRAM, not the 62GB you would need for FP16, which changes the hardware requirements dramatically. Other common points of confusion include the model variant naming (no &ldquo;27B&rdquo; variant exists in Gemma 4), offline operation capabilities (the model runs entirely air-gapped after the initial download completes), CPU fallback behavior when no compatible GPU is present, and licensing terms for commercial deployments. The Apache 2.0 license makes Gemma 4 31B fully usable in production environments without royalties or usage restrictions, which distinguishes it from some other open-weight models with more restrictive non-commercial terms.</p>
<h3 id="does-gemma-4-31b-require-an-internet-connection-after-download">Does Gemma 4 31B require an internet connection after download?</h3>
<p>No. Once <code>ollama pull gemma4:31b</code> completes, the model runs entirely offline. Ollama stores the weights in <code>~/.ollama/models/</code> and inference happens locally with no network calls. You can disconnect your machine from the internet and the model continues to work normally.</p>
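<p>To confirm the weights really are stored locally (and see how much disk they occupy), check the default model directory and the registered models:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Disk usage of locally stored model weights
du -sh ~/.ollama/models

# Installed models and their sizes
ollama list
</code></pre></div>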
<h3 id="can-i-run-gemma-4-31b-on-a-cpu-without-a-gpu">Can I run Gemma 4 31B on a CPU without a GPU?</h3>
<p>Yes, but it will be very slow. Ollama falls back to CPU inference automatically if no compatible GPU is detected. Expect 1–3 tokens per second on a modern desktop CPU versus 30–60+ tokens per second on an RTX 4090. For practical use, a GPU with at least 20GB VRAM is strongly recommended for the 31B variant.</p>
<h3 id="what-is-the-difference-between-gemma-4-31b-and-gemma-4-27b">What is the difference between Gemma 4 31B and Gemma 4 27B?</h3>
<p>There is no official &ldquo;27B&rdquo; variant in the Gemma 4 family. The lineup is E2B, E4B, 12B, 26B, and 31B. Some confusion arises because the earlier Gemma 2 generation shipped a 27B model. Gemma 4 31B is the top-tier dense model in the current release.</p>
<h3 id="how-do-i-update-gemma-4-to-a-newer-version-when-google-releases-one">How do I update Gemma 4 to a newer version when Google releases one?</h3>
<p>Run <code>ollama pull gemma4:31b</code> again. Ollama checks the registry for a newer manifest and downloads only the changed layers if an update is available. You can also use <code>ollama pull gemma4:latest</code> to always fetch the most recent Gemma 4 variant automatically.</p>
<h3 id="is-gemma-4-31b-safe-to-use-in-production-with-real-user-data">Is Gemma 4 31B safe to use in production with real user data?</h3>
<p>Gemma 4 31B is Apache 2.0 licensed, so commercial use is permitted without restriction. For production deployments handling sensitive user data, running it locally with Ollama is actually the privacy-correct approach — no data is sent to third-party servers. However, like all language models, it can produce hallucinations and should not be used for safety-critical decisions without human review and output validation.</p>
]]></content:encoded></item><item><title>Gemma 4 Review 2026: Google's Best Open-Source Model Yet?</title><link>https://baeseokjae.github.io/posts/gemma-4-review-2026/</link><pubDate>Thu, 07 May 2026 03:04:21 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gemma-4-review-2026/</guid><description>Gemma 4 review: benchmarks, model variants, Apache 2.0 license, and how it stacks up against Llama 4, GPT-4, and Claude in 2026.</description><content:encoded><![CDATA[<p>Gemma 4 is Google DeepMind&rsquo;s 2026 open-source model family — four model sizes from 2B (phone-optimized) to 31B dense, all under Apache 2.0, scoring 89.2% on AIME 2026 and ranking #3 on the Arena AI leaderboard. If you&rsquo;re evaluating open-weight models for production use today, Gemma 4 is the most commercially viable and technically competitive option available.</p>
<h2 id="what-is-gemma-4-googles-open-source-flagship-explained">What Is Gemma 4? Google&rsquo;s Open-Source Flagship Explained</h2>
<p>Gemma 4 is Google DeepMind&rsquo;s fourth-generation open-weight language model family, released on April 2, 2026, designed to cover the full deployment spectrum — from on-device inference on smartphones to large-scale server workloads. Unlike prior Gemma generations, Gemma 4 ships with genuine frontier-model performance: the 31B dense variant scores 84.3% on GPQA Diamond, outperforming Meta&rsquo;s Llama 4 Scout (109B) at 74.3%, and reaching 89.2% on the AIME 2026 math benchmark — a figure that was 20.8% just one generation earlier. The model family is multimodal (vision + audio input on edge models), multilingual (140+ languages), and supports context windows up to 256K tokens. Since Google&rsquo;s first Gemma release, developers have downloaded Gemma models over 400 million times, and the Gemmaverse now includes over 100,000 community-created fine-tunes and variants. That ecosystem depth means production-grade LoRA adapters, GGUF quants, and tool integrations are available day one — not months later. Gemma 4 is the model to benchmark any other open-weight model against in 2026.</p>
<h2 id="gemma-4-model-variants-e2b-e4b-26b-moe-and-31b-dense">Gemma 4 Model Variants: E2B, E4B, 26B MoE, and 31B Dense</h2>
<p>Gemma 4 ships as four distinct model sizes, each targeting a different hardware tier. The E2B (2B parameters) and E4B (4B parameters) are edge-optimized models built for mobile, IoT, and Raspberry Pi — the E2B achieves 3,700 prefill and 31 decode tokens per second on a Qualcomm Dragonwing IQ8 NPU, making real-time on-device inference viable for the first time in a frontier-class model family. Both edge variants support 128K context and multimodal input including audio. The 26B Mixture-of-Experts (MoE) model activates a fraction of its total parameters per forward pass, offering a better compute-per-quality tradeoff for mid-tier GPU servers — it ranks #6 on the Arena AI text leaderboard. The 31B Dense model is the flagship, activating all 31 billion parameters on each pass and delivering the best single-model quality of the family; it holds Arena AI #3 and beats models three to ten times its parameter count in head-to-head benchmark comparisons. All four models are distributed under Apache 2.0 with no monthly active user (MAU) restrictions, making them drop-in replacements for proprietary APIs in commercial products.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Context</th>
          <th>Best For</th>
          <th>Arena Rank</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>E2B</td>
          <td>2B</td>
          <td>128K</td>
          <td>Mobile / IoT</td>
          <td>—</td>
      </tr>
      <tr>
          <td>E4B</td>
          <td>4B</td>
          <td>128K</td>
          <td>Edge servers / Raspberry Pi</td>
          <td>—</td>
      </tr>
      <tr>
          <td>26B MoE</td>
          <td>26B total</td>
          <td>256K</td>
          <td>Mid-tier GPU workloads</td>
          <td>#6</td>
      </tr>
      <tr>
          <td>31B Dense</td>
          <td>31B</td>
          <td>256K</td>
          <td>Best quality, production API</td>
          <td>#3</td>
      </tr>
  </tbody>
</table>
<h2 id="key-features--multimodal-multilingual-and-256k-context">Key Features — Multimodal, Multilingual, and 256K Context</h2>
<p>Gemma 4 is the first Gemma generation to treat multimodality and multilingualism as first-class features rather than add-ons. The model was natively trained on over 140 languages — not post-trained via translation alignment — which means it generalizes better to low-resource languages like Swahili or Tagalog without the performance cliff common in English-centric models. Larger variants (26B MoE and 31B Dense) support a 256K token context window, enabling full-book RAG, multi-file code analysis, and long-form document summarization without chunking. Edge variants (E2B, E4B) handle images and audio as input, useful for mobile applications that need a local vision-language model without cloud round-trips. The model supports structured output modes (JSON schema enforcement), tool calling, and an agentic execution format compatible with LangChain, LlamaIndex, and Google&rsquo;s own Agent Development Kit (ADK). Practically speaking, this means Gemma 4 slots directly into existing LLM pipelines — you can swap a Gemini or GPT-4 API call for a self-hosted Gemma 4 endpoint with minimal prompt engineering changes.</p>
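<p>As a concrete sketch of that swap (assuming Ollama is serving Gemma 4 on its default port and the <code>gemma4:31b</code> tag is pulled), an existing OpenAI-style integration only needs a different base URL and model name:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Call the local OpenAI-compatible endpoint exposed by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma4:31b",
        "messages": [
          {"role": "system", "content": "You are a concise assistant."},
          {"role": "user", "content": "List three uses of a 256K context window."}
        ],
        "temperature": 0.2
      }'
# In client libraries, the equivalent change is pointing the base URL at
# http://localhost:11434/v1 and setting the model name to gemma4:31b
</code></pre></div>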
<h3 id="256k-context-in-practice">256K Context in Practice</h3>
<p>The 256K context window means you can feed a full codebase, a legal contract library, or a year&rsquo;s worth of customer support tickets in a single prompt. In practice, retrieval quality degrades more gracefully than GPT-4 Turbo&rsquo;s over the 100K–200K range on &ldquo;lost in the middle&rdquo; evaluations — Gemma 4 maintains 82% retrieval accuracy at the 200K position versus GPT-4 Turbo&rsquo;s 71%. That&rsquo;s a meaningful difference for RAG-heavy applications where context length isn&rsquo;t just a checkbox.</p>
<h2 id="gemma-4-benchmark-results-how-good-is-it-really">Gemma 4 Benchmark Results: How Good Is It Really?</h2>
<p>Gemma 4&rsquo;s benchmark numbers represent the largest single-generation leap in the open-weight model ecosystem since the original Llama 2 release. On AIME 2026 (college-level math olympiad), the 31B model scores 89.2% — compared to Gemma 3&rsquo;s 20.8%, that&rsquo;s a 68-point jump in one generation. On LiveCodeBench v6 (competitive coding), Gemma 4 scores 80.0% vs 29.1% for Gemma 3 and 77.1% for Llama 4. On Codeforces ELO (programming contest simulation), the model went from 110 to 2,150 — moving from hobbyist-level to expert competitive programmer. MMLU (broad knowledge across 57 subjects) comes in at 87.1%, beating GPT-4&rsquo;s 86.5% while running entirely on local hardware at zero marginal API cost. GPQA Diamond (doctoral-level science questions) sits at 84.3%, a 10-point lead over Llama 4 Scout. These aren&rsquo;t cherry-picked metrics — Gemma 4&rsquo;s gains are consistent across math, science, coding, and language tasks.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Gemma 4 31B</th>
          <th>Gemma 3</th>
          <th>Llama 4 Scout</th>
          <th>GPT-4</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AIME 2026</td>
          <td><strong>89.2%</strong></td>
          <td>20.8%</td>
          <td>~75%</td>
          <td>~72%</td>
      </tr>
      <tr>
          <td>LiveCodeBench v6</td>
          <td><strong>80.0%</strong></td>
          <td>29.1%</td>
          <td>77.1%</td>
          <td>~74%</td>
      </tr>
      <tr>
          <td>GPQA Diamond</td>
          <td><strong>84.3%</strong></td>
          <td>—</td>
          <td>74.3%</td>
          <td>79.4%</td>
      </tr>
      <tr>
          <td>MMLU</td>
          <td><strong>87.1%</strong></td>
          <td>—</td>
          <td>~82%</td>
          <td>86.5%</td>
      </tr>
      <tr>
          <td>Codeforces ELO</td>
          <td><strong>2,150</strong></td>
          <td>110</td>
          <td>~1,900</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<h3 id="whats-behind-the-gemma-3--gemma-4-leap">What&rsquo;s Behind the Gemma 3 → Gemma 4 Leap?</h3>
<p>The jump from 20.8% to 89.2% AIME isn&rsquo;t mysterious — Google invested heavily in two areas: chain-of-thought alignment using reinforcement learning from verifiable rewards (RLVR), and synthetic math data generation at scale. The same approach drove similar gains in Gemini 2.0 Flash Thinking. Essentially, Google solved the same problem OpenAI solved with o1, then distilled the reasoning capability into an open-weight model available to anyone with a GPU.</p>
<h2 id="gemma-4-vs-llama-4-vs-gpt-4-vs-claude--who-wins">Gemma 4 vs Llama 4 vs GPT-4 vs Claude — Who Wins?</h2>
<p>Gemma 4 is the most competitive open-weight model in 2026, but &ldquo;wins&rdquo; depends heavily on the task and your deployment constraints. Against Llama 4 Scout (109B, Meta&rsquo;s midrange model), Gemma 4 31B is smaller, faster to serve, and scores higher on every benchmark listed above — and while Llama 4 carries a 700M MAU commercial restriction, Gemma 4 has none. Against GPT-4, Gemma 4 31B matches or slightly exceeds performance on most benchmarks while costing nothing in API fees if self-hosted. The caveat: GPT-4 has better tooling, broader third-party integration, and no self-hosting burden. Against Claude 3.5 Sonnet, Gemma 4 trails on multi-step reasoning chains and creative writing tasks but is competitive on coding and factual recall. Against Qwen 3.5 27B (the strongest China-origin open model), Gemma 4 loses on SWE-bench Verified — Qwen&rsquo;s software engineering performance is currently superior — but Gemma 4 leads on multilingual tasks and edge deployment options.</p>
<table>
  <thead>
      <tr>
          <th>Use Case</th>
          <th>Winner</th>
          <th>Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>On-device / mobile</td>
          <td><strong>Gemma 4 E2B/E4B</strong></td>
          <td>Only frontier-grade model optimized for NPUs</td>
      </tr>
      <tr>
          <td>Math / science reasoning</td>
          <td><strong>Gemma 4 31B</strong></td>
          <td>89.2% AIME, 84.3% GPQA</td>
      </tr>
      <tr>
          <td>Software engineering tasks</td>
          <td><strong>Qwen 3.5 27B</strong></td>
          <td>Higher SWE-bench Verified score</td>
      </tr>
      <tr>
          <td>No-restriction commercial use</td>
          <td><strong>Gemma 4</strong></td>
          <td>Apache 2.0, no MAU cap</td>
      </tr>
      <tr>
          <td>Least operational burden</td>
          <td><strong>GPT-4 / Claude</strong></td>
          <td>No self-hosting needed</td>
      </tr>
      <tr>
          <td>Multilingual NLP</td>
          <td><strong>Gemma 4</strong></td>
          <td>140+ natively trained languages</td>
      </tr>
  </tbody>
</table>
<h2 id="on-device-and-edge-deployment-running-gemma-4-locally">On-Device and Edge Deployment: Running Gemma 4 Locally</h2>
<p>Gemma 4 is the only open model family in 2026 that genuinely spans from phones to data center servers under a single Apache 2.0 license. On a Qualcomm Dragonwing IQ8 NPU, the E2B model achieves 3,700 prefill tokens per second and 31 decode tokens per second — fast enough for real-time chat, live transcription assistance, and local document QA without cloud round-trips. On a MacBook Pro M3 with 36GB unified memory, the 31B dense model runs at approximately 25 tokens per second with llama.cpp&rsquo;s Metal backend, making it comfortable for developer use. On an NVIDIA RTX 4090 (24GB VRAM), the 31B model fits in 4-bit quantization and runs at ~55 tokens per second, suitable for local API servers. Day-one support spans Hugging Face Transformers, Ollama, vLLM, llama.cpp, and NVIDIA NIM — no custom inference infrastructure is required. For privacy-sensitive applications (healthcare, legal, finance), the ability to run a GPT-4-class model with zero data leaving the premises is the decisive factor, and Gemma 4 is the only model family that delivers this at every hardware tier.</p>
<h3 id="quick-start-with-ollama">Quick Start with Ollama</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull and run Gemma 4 31B locally</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span><span style="display:flex;"><span>ollama run gemma4:31b <span style="color:#e6db74">&#34;Explain quantum entanglement in 3 sentences&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Edge model for Raspberry Pi / low-memory devices</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:e4b
</span></span><span style="display:flex;"><span>ollama run gemma4:e4b
</span></span></code></pre></div><p>The E4B variant runs on 8GB RAM, making it viable on a Raspberry Pi 5 or any machine with 8GB+ of memory.</p>
<h2 id="apache-20-license--why-it-matters-for-developers-and-enterprises">Apache 2.0 License — Why It Matters for Developers and Enterprises</h2>
<p>Apache 2.0 is the gold standard for open-source commercial use, and Gemma 4&rsquo;s adoption of it without any active user restrictions is the most commercially significant licensing decision in the open-weight model space since Falcon&rsquo;s relicensing under Apache 2.0. Meta&rsquo;s Llama 4 license caps commercial use at 700 million monthly active users — a restriction that affects only a handful of companies today but signals Meta&rsquo;s intent to extract licensing revenue as models become infrastructure. Mistral&rsquo;s licenses have historically included use-case carve-outs. Gemma 4 imposes none of these restrictions. You can build a commercial product, embed it in enterprise software, redistribute model weights, and fine-tune for any vertical without royalty payments, revenue share, or usage caps. For startups especially, this matters: you&rsquo;re not betting your product&rsquo;s legal foundation on a company&rsquo;s continued goodwill or future license amendments. For enterprises whose legal teams require OSI-approved licenses in vendor dependency review, Apache 2.0 clears that bar cleanly — and Gemma 4 is the best-performing Apache 2.0 model available in 2026. The Gemmaverse&rsquo;s 100,000+ community variants also mean that if you need a fine-tuned model for your vertical (medical, legal, code), there&rsquo;s almost certainly an Apache 2.0 derivative already available on Hugging Face.</p>
<h2 id="gemma-4-limitations-and-weaknesses-you-should-know">Gemma 4 Limitations and Weaknesses You Should Know</h2>
<p>Gemma 4 is the best open-weight model in 2026, but it has real limitations that should inform deployment decisions. First, there is no native speech output — the E2B and E4B models accept audio input but cannot generate audio, requiring a separate TTS pipeline for voice applications. Second, the model has a fixed knowledge cutoff with no internet access; for applications requiring real-time information retrieval, you&rsquo;ll need to wire up a RAG pipeline or tool-call layer. Third, self-hosting shifts operational responsibility to you: fine-tuning, weight management, serving infrastructure, uptime, and security are all your problem. That&rsquo;s valuable for privacy and cost at scale, but it&rsquo;s a meaningful engineering overhead compared to a managed API. Fourth, on SWE-bench Verified (real-world software engineering tasks), Gemma 4 trails Qwen 3.5 27B — if software engineering automation is your primary use case, Qwen deserves evaluation. Fifth, while Codeforces ELO is strong at 2,150, complex multi-file refactoring and codebase-level reasoning remain areas where Claude 3.7 Sonnet and GPT-4.1 pull ahead. These are real tradeoffs, not dealbreakers — but understanding them prevents over-application of the model.</p>
<h3 id="known-limitations-summary">Known Limitations Summary</h3>
<ul>
<li>No audio output (input only on E2B/E4B)</li>
<li>Fixed knowledge cutoff, no web access</li>
<li>Self-hosting burden: infra, updates, and security are on you</li>
<li>Trails Qwen 3.5 27B on SWE-bench Verified</li>
<li>Complex multi-file refactoring: Claude 3.7 Sonnet still leads</li>
</ul>
<h2 id="who-should-use-gemma-4-practical-recommendations">Who Should Use Gemma 4? Practical Recommendations</h2>
<p>Gemma 4 is the right choice for four specific developer and enterprise profiles, and the wrong choice for two others. If you are building mobile or edge AI applications, Gemma 4 E2B/E4B is the only production-grade option — no other frontier model family runs on Qualcomm NPUs at 3,700 prefill tokens per second. If you are building privacy-sensitive applications in healthcare, legal, or finance where data cannot leave your infrastructure, the 31B dense model delivers GPT-4-class performance with zero cloud dependency. If you are a startup or enterprise that needs Apache 2.0 with no user caps, Gemma 4 is the only frontier model that qualifies. If you need strong multilingual support for 140+ languages, Gemma 4&rsquo;s native language training beats every other open-weight alternative. Gemma 4 is the wrong choice if you need zero operational overhead — in that case, the managed Claude or GPT-4 APIs are simpler. It&rsquo;s also the wrong first choice if software engineering automation (automated code review, PR generation, issue resolution) is your core use case; benchmark Qwen 3.5 27B alongside Gemma 4 before committing.</p>
<p><strong>Recommended for:</strong></p>
<ul>
<li>Mobile / IoT / edge AI deployments</li>
<li>Privacy-first applications (HIPAA, GDPR, finance)</li>
<li>Commercial products needing Apache 2.0 at any scale</li>
<li>Multilingual NLP applications</li>
<li>Math, science, and coding assistants</li>
</ul>
<p><strong>Consider alternatives for:</strong></p>
<ul>
<li>Automated software engineering (evaluate Qwen 3.5 27B)</li>
<li>Zero-infrastructure managed API (use Claude or GPT-4)</li>
</ul>
<h2 id="final-verdict-is-gemma-4-googles-best-open-source-model-yet">Final Verdict: Is Gemma 4 Google&rsquo;s Best Open-Source Model Yet?</h2>
<p>Gemma 4 is definitively Google&rsquo;s best open-source model and the strongest open-weight model family released in 2026. The combination of 89.2% AIME performance, Arena AI #3 ranking, a 256K context window, genuine edge deployment to phones and IoT devices, and an unrestricted Apache 2.0 license has no equivalent in the open-weight ecosystem. The Gemma 3 → Gemma 4 leap — driven by RLVR training and synthetic reasoning data — demonstrates that Google has solved the reasoning gap that made Gemma 3 a second-tier choice. The 400M+ download history and 100,000+ community variants mean production infrastructure, tooling, and domain-specific fine-tunes exist now. If you were waiting for an open-weight model that could realistically replace a proprietary API for most production workloads, Gemma 4 is that model. The primary caveat is operational: self-hosting is still non-trivial, and for teams without ML infrastructure expertise, the managed API path remains more practical despite the cost and privacy tradeoffs. But for developers and enterprises who have made the infrastructure investment, Gemma 4 is the model to run in 2026.</p>
<p><strong>Bottom line:</strong> If you&rsquo;re evaluating open-weight models today, start with Gemma 4 31B. It outperforms everything at its parameter count, holds a license that never expires or changes, and runs on hardware you probably already have.</p>
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>Is Gemma 4 free to use commercially?</strong>
Yes. Gemma 4 is released under Apache 2.0 with no active user caps, no revenue share, and no royalty requirements. You can build and ship commercial products using Gemma 4 weights without any licensing fees or usage restrictions.</p>
<p><strong>How does Gemma 4 compare to Llama 4?</strong>
Gemma 4 31B outperforms Llama 4 Scout (109B) on GPQA Diamond (84.3% vs 74.3%), LiveCodeBench v6 (80.0% vs 77.1%), and AIME 2026. Gemma 4 also has no MAU commercial restrictions vs Llama 4&rsquo;s 700M MAU cap, and it genuinely supports on-device deployment which Llama 4 does not.</p>
<p><strong>Can Gemma 4 run on a laptop?</strong>
Yes. The E4B model runs on 8GB RAM (laptop-viable), the 26B MoE runs well on a machine with 24GB+ RAM or VRAM, and the 31B Dense runs on a MacBook Pro M3 with 36GB unified memory at ~25 tokens/second with Ollama.</p>
<p><strong>What is Gemma 4&rsquo;s context window?</strong>
The 26B MoE and 31B Dense models support 256K tokens. The edge models (E2B, E4B) support 128K tokens. At 256K, the model can process approximately 200,000 words — roughly three full novels — in a single prompt.</p>
<p><strong>Does Gemma 4 support multimodal inputs?</strong>
Yes. The E2B and E4B edge models accept image and audio inputs. The 26B MoE and 31B Dense models accept image inputs. None of the current Gemma 4 variants generate audio or image outputs — text output only.</p>
]]></content:encoded></item></channel></rss>