<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>AI Setup on RockB</title><link>https://baeseokjae.github.io/tags/ai-setup/</link><description>Recent content in AI Setup on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 07 May 2026 06:04:01 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/ai-setup/index.xml" rel="self" type="application/rss+xml"/><item><title>Run Gemma 4 Locally in 2026: 31B Dense Setup Guide with Ollama</title><link>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</link><pubDate>Thu, 07 May 2026 06:04:01 +0000</pubDate><guid>https://baeseokjae.github.io/posts/gemma-4-local-setup-guide-2026/</guid><description>Step-by-step guide to running Gemma 4 31B Dense locally with Ollama — hardware requirements, installation, Open WebUI, and API usage.</description><content:encoded><![CDATA[<p>Gemma 4 31B Dense runs locally on a single RTX 4090 or Mac M3 Max using Ollama — no API key, no data leaving your machine. Install Ollama, run <code>ollama pull gemma4:31b</code>, and you have a model that scores 87.1% on MMLU, beating GPT-4o&rsquo;s 86.5%, running entirely on your hardware.</p>
<h2 id="what-is-gemma-4-31b-dense-and-why-run-it-locally">What Is Gemma 4 31B Dense and Why Run It Locally?</h2>
<p>Gemma 4 31B Dense is a 31-billion-parameter language model released by Google DeepMind on April 2, 2026, under the Apache 2.0 license. Unlike mixture-of-experts architectures that distribute parameters across sparse expert layers, the 31B Dense model activates all 31 billion parameters on every token — giving it more reliable reasoning depth than larger MoE models with similar active parameter counts. In benchmark testing, Gemma 4 31B scores 87.1% on MMLU (beating GPT-4o&rsquo;s 86.5%), 89.2% on AIME 2026, and 84.3% on GPQA Diamond — outperforming Llama 4 Scout&rsquo;s 109B MoE model on the harder science benchmarks. Running it locally means zero API costs, complete data privacy, no rate limits, and the ability to integrate with any tool via the OpenAI-compatible REST endpoint that Ollama exposes on <code>localhost:11434</code>. For developers, researchers, or privacy-conscious users, this is the highest-performing open model available for on-device inference as of mid-2026.</p>
<h3 id="dense-vs-moe-why-the-architecture-matters-for-local-inference">Dense vs. MoE: Why the Architecture Matters for Local Inference</h3>
<p>A dense model like Gemma 4 31B activates every parameter on every forward pass. An MoE model like Llama 4 Scout (109B total, ~17B active) routes each token through only a subset of expert layers. For local inference, the dense architecture has a decisive advantage: total VRAM needed corresponds directly to the active parameter count. With Q4_K_M quantization — Ollama&rsquo;s default — Gemma 4 31B fits in approximately 24GB VRAM, which is exactly what a single RTX 4090 provides (and well within the RTX 6000 Ada&rsquo;s 48GB). A 109B MoE model at the same quantization must still hold all 109 billion parameters in memory even though only ~17B are active on any given token, making it far harder to run on consumer hardware without CPU offloading.</p>
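<p>This rule of thumb is easy to sanity-check with arithmetic. A minimal sketch in Python; the ~4.5 effective bits per weight for Q4_K_M and the flat 2GB allowance for cache and activations are rough assumptions, not official figures:</p>

```python
def model_vram_gb(params_billions: float, bits_per_weight: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weights plus a flat allowance for KV cache/activations."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# Gemma 4 31B at Q4_K_M (~4.5 effective bits per weight)
print(round(model_vram_gb(31, 4.5), 1))   # 19.4 -> fits a 24GB RTX 4090
# Same model at FP16 (16 bits per weight)
print(round(model_vram_gb(31, 16.0), 1))  # 64.0 -> workstation-class memory only
```

<p>At Q4_K_M this lands around 19GB of weights plus working memory, which is why the table below lists ~18–24GB for the 31B and why a 24GB card is the practical floor.</p>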
<h2 id="gemma-4-model-variants-e2b-e4b-26b-and-31b-compared">Gemma 4 Model Variants: E2B, E4B, 12B, 26B, and 31B Compared</h2>
<p>Gemma 4 ships in five variants with meaningfully different hardware requirements and capability profiles. The E2B (2B Edge) and E4B (4B Edge) models are designed for mobile and embedded deployment; they feature native audio input and a 128K context window, making them unique among the family. The 12B is the mid-range option, sized for a single 8–12GB consumer GPU. The 26B and 31B models target server and workstation deployment, both supporting a 256K token context window and excelling at multi-step reasoning, coding, and mathematics. The 31B Dense specifically is the flagship for local deployment: it is natively trained on over 140 languages, released under Apache 2.0, and achieves GPT-4o-class performance on a single high-end consumer GPU. The choice between variants comes down almost entirely to available VRAM, since quality scales predictably across the lineup.</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Active Params</th>
          <th>VRAM (Q4_K_M)</th>
          <th>VRAM (FP16)</th>
          <th>Context</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>E2B</td>
          <td>2B</td>
          <td>~1.5 GB</td>
          <td>~4 GB</td>
          <td>128K</td>
          <td>Mobile, edge devices</td>
      </tr>
      <tr>
          <td>E4B</td>
          <td>4B</td>
          <td>~2.8 GB</td>
          <td>~8 GB</td>
          <td>128K</td>
          <td>Laptop CPU/integrated GPU</td>
      </tr>
      <tr>
          <td>12B</td>
          <td>12B</td>
          <td>~6.6 GB</td>
          <td>~24 GB</td>
          <td>128K</td>
          <td>RTX 3060, M2 MacBook</td>
      </tr>
      <tr>
          <td>26B</td>
          <td>26B</td>
          <td>~14 GB</td>
          <td>~52 GB</td>
          <td>256K</td>
          <td>RTX 3090, M3 Pro</td>
      </tr>
      <tr>
          <td>31B</td>
          <td>31B</td>
          <td>~18–24 GB</td>
          <td>~62 GB</td>
          <td>256K</td>
          <td>RTX 4090, M3 Max, M4 Ultra</td>
      </tr>
  </tbody>
</table>
<h3 id="which-variant-should-you-pick">Which Variant Should You Pick?</h3>
<p>If you have 24GB VRAM (RTX 4090, RTX 6000 Ada) or 32GB+ unified memory (M3 Max, M4 Pro/Max), run the 31B. If you have 16GB VRAM (RTX 4080, A4000), run the 26B at Q4_K_M. For anything with 8–12GB VRAM (RTX 3060 12GB, RTX 4060 Ti 8GB), the 12B variant is the correct choice: it requires only 6.6GB VRAM at Q4 quantization and delivers strong coding and reasoning performance. The E2B and E4B are specifically for devices without a discrete GPU.</p>
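<p>The decision rule above collapses into a small helper. A sketch only; the thresholds mirror this guide&rsquo;s recommendations for discrete-GPU VRAM (unified-memory Macs need extra headroom for the OS), and the function name is hypothetical:</p>

```python
def pick_gemma4_variant(vram_gb: float) -> str:
    """Map available discrete-GPU VRAM to the Gemma 4 variant this guide recommends."""
    if vram_gb >= 24:
        return "gemma4:31b"
    if vram_gb >= 16:
        return "gemma4:26b"
    if vram_gb >= 8:
        return "gemma4:12b"
    return "gemma4:4b"  # E4B for CPU/integrated-GPU machines

print(pick_gemma4_variant(24))  # RTX 4090 -> gemma4:31b
print(pick_gemma4_variant(12))  # RTX 3060 -> gemma4:12b
```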
<h2 id="hardware-requirements-for-gemma-4-31b-vram-ram-cpu">Hardware Requirements for Gemma 4 31B (VRAM, RAM, CPU)</h2>
<p>Gemma 4 31B Dense requires 24GB VRAM at Q4_K_M quantization or 62GB VRAM at full FP16 precision. In practice, Q4_K_M is the correct target for consumer hardware: Ollama defaults to this quantization automatically, reducing memory usage by approximately 55–60% compared to FP16, with only a marginal quality drop that is typically imperceptible in conversational and coding tasks. The minimum viable GPU is a single RTX 4090 (24GB). For Mac users, the M3 Max (36GB or 48GB unified memory) and M4 Pro/Max provide excellent performance because Apple Silicon shares memory between CPU and GPU — you can run the 31B comfortably with 36GB total unified memory. Linux workstations with dual RTX 3090s (24GB each) can also run the 31B by splitting the model across GPUs, though this requires additional configuration and results in slower inference than a single 4090.</p>
<table>
  <thead>
      <tr>
          <th>GPU / Platform</th>
          <th>VRAM / Unified Memory</th>
          <th>Gemma 4 31B (Q4)?</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RTX 4090</td>
          <td>24 GB</td>
          <td>Yes</td>
          <td>Ideal single-GPU setup</td>
      </tr>
      <tr>
          <td>RTX 6000 Ada</td>
          <td>48 GB</td>
          <td>Yes</td>
          <td>Runs FP16 too</td>
      </tr>
      <tr>
          <td>RTX 4080</td>
          <td>16 GB</td>
          <td>No</td>
          <td>Use 26B instead</td>
      </tr>
      <tr>
          <td>RTX 3090 x2</td>
          <td>48 GB total</td>
          <td>Yes</td>
          <td>Slower, split model</td>
      </tr>
      <tr>
          <td>M3 Max 36GB</td>
          <td>36 GB unified</td>
          <td>Yes</td>
          <td>Excellent tok/s</td>
      </tr>
      <tr>
          <td>M4 Max 64GB</td>
          <td>64 GB unified</td>
          <td>Yes</td>
          <td>Can run FP16</td>
      </tr>
      <tr>
          <td>M2 MacBook Pro 16GB</td>
          <td>16 GB unified</td>
          <td>No</td>
          <td>Use 12B instead</td>
      </tr>
  </tbody>
</table>
<p><strong>System RAM:</strong> Ollama also uses system RAM for the context cache. Aim for at least 32GB system RAM when running 31B. CPU doesn&rsquo;t significantly affect generation speed once the model is loaded into VRAM — but fast NVMe SSD storage (PCIe 4.0+) reduces initial model load time from cold.</p>
<h2 id="step-1--install-ollama-on-mac-windows-or-linux">Step 1 — Install Ollama on Mac, Windows, or Linux</h2>
<p>Ollama is the fastest path to running Gemma 4 31B locally, providing a one-command model download, automatic quantization selection, and an OpenAI-compatible REST API out of the box. It abstracts away model sharding, quantization configuration, and the llama.cpp backend — you get a clean CLI and HTTP interface without needing to understand the internals. As of May 2026, Ollama supports CUDA (NVIDIA), ROCm (AMD), Metal (Apple Silicon), and CPU-only inference. Installation is straightforward across all three major operating systems, and the entire setup from zero to running model takes under 10 minutes on a fast internet connection. Ollama version 0.5+ is required for Gemma 4 support — older versions do not have the model architecture registered.</p>
<p><strong>Mac:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>brew install ollama
</span></span><span style="display:flex;"><span><span style="color:#75715e"># or download the .dmg from ollama.com</span>
</span></span><span style="display:flex;"><span>ollama serve  <span style="color:#75715e"># starts the background server</span>
</span></span></code></pre></div><p><strong>Linux:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Automatically installs CUDA drivers if NVIDIA GPU detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Service starts automatically via systemd</span>
</span></span></code></pre></div><p><strong>Windows:</strong>
Download the installer from <a href="https://ollama.com">ollama.com</a>. The installer configures a background Windows service and adds <code>ollama</code> to PATH. CUDA support requires NVIDIA drivers 525.85+.</p>
<p><strong>Verify the install:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Should output: ollama version 0.5.x or higher</span>
</span></span></code></pre></div><h2 id="step-2--pull-and-run-the-gemma-4-31b-model-with-ollama">Step 2 — Pull and Run the Gemma 4 31B Model with Ollama</h2>
<p>Pulling Gemma 4 31B downloads approximately 18–20GB of model weights in Q4_K_M format. Ollama handles quantization and model registration automatically — no manual GGUF conversion or configuration required. The model is pulled from Ollama&rsquo;s model registry, which mirrors the Hugging Face checkpoint in a pre-quantized GGUF format. On a 500 Mbps connection, the download takes roughly 5–7 minutes. Once complete, the model is cached locally in <code>~/.ollama/models/</code> and subsequent loads are instant. The Gemma 4 31B Ollama tag is <code>gemma4:31b</code> — note this differs from the Hugging Face naming convention.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull the 31B Dense model (Q4_K_M by default, ~18GB)</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Run an interactive chat session</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Example prompt after model loads:</span>
</span></span><span style="display:flex;"><span>&gt;&gt;&gt; Explain the difference between dense and MoE transformer architectures.
</span></span></code></pre></div><p><strong>Other variants:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama pull gemma4:2b    <span style="color:#75715e"># E2B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:4b    <span style="color:#75715e"># E4B edge model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:12b   <span style="color:#75715e"># 12B standard</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b   <span style="color:#75715e"># 26B standard</span>
</span></span></code></pre></div><p><strong>Check which models are installed:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama list
</span></span></code></pre></div><p><strong>Stop a running model session:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># In the chat, press Ctrl+D or type /bye</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># To unload the model from memory (the background server keeps running):</span>
</span></span><span style="display:flex;"><span>ollama stop gemma4:31b
</span></span></code></pre></div><h3 id="running-multiple-prompts-via-the-cli">Running Multiple Prompts via the CLI</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Non-interactive single prompt</span>
</span></span><span style="display:flex;"><span>ollama run gemma4:31b <span style="color:#e6db74">&#34;Write a Python function that parses JSON from a REST API response&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Pipe stdin for batch processing</span>
</span></span><span style="display:flex;"><span>echo <span style="color:#e6db74">&#34;Summarize this text: </span><span style="color:#66d9ef">$(</span>cat document.txt<span style="color:#66d9ef">)</span><span style="color:#e6db74">&#34;</span> | ollama run gemma4:31b
</span></span></code></pre></div><h2 id="step-3--set-up-open-webui-for-a-chatgpt-like-interface">Step 3 — Set Up Open WebUI for a ChatGPT-Like Interface</h2>
<p>Open WebUI is an open-source browser interface that connects directly to Ollama, providing a polished chat experience with conversation history, model switching, file uploads, and system prompt configuration — all running locally. It runs as a Docker container and takes under 2 minutes to set up once Docker is installed. The interface is accessible at <code>http://localhost:3000</code> and supports multiple users, making it useful for team deployments on a local network where a shared Gemma 4 instance is hosted on a single powerful machine. Open WebUI automatically detects all models registered in Ollama, so switching between the 12B and 31B variants is a dropdown selection in the interface.</p>
<p><strong>Prerequisites:</strong> Docker Desktop (Mac/Windows) or Docker Engine (Linux).</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Pull and start Open WebUI with Ollama auto-detection</span>
</span></span><span style="display:flex;"><span>docker run -d <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 3000:8080 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --add-host<span style="color:#f92672">=</span>host.docker.internal:host-gateway <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -v open-webui:/app/backend/data <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --name open-webui <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --restart always <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  ghcr.io/open-webui/open-webui:main
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Access at: http://localhost:3000</span>
</span></span></code></pre></div><p>On first launch, create an admin account (local only — no external services involved). Under Settings → Models, Gemma 4 31B should appear automatically if Ollama is running. Select it as the default model and start chatting.</p>
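<p>To confirm from a script which models Open WebUI will see, you can query Ollama&rsquo;s native <code>GET /api/tags</code> endpoint, which lists everything registered locally. A sketch; the sample payload below is a trimmed, illustrative version of the response shape:</p>

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from Ollama's GET /api/tags response body."""
    return [m["name"] for m in json.loads(tags_json)["models"]]

# In practice, fetch this from http://localhost:11434/api/tags;
# the payload here is an abbreviated example of its shape.
sample = '{"models": [{"name": "gemma4:31b"}, {"name": "gemma4:12b"}]}'
print(installed_models(sample))  # ['gemma4:31b', 'gemma4:12b']
```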
<h2 id="using-the-gemma-4-31b-local-api-openai-compatible">Using the Gemma 4 31B Local API (OpenAI-Compatible)</h2>
<p>Ollama exposes an OpenAI-compatible REST API at <code>http://localhost:11434/v1</code>, allowing any tool or application that supports the OpenAI SDK to use Gemma 4 31B as a drop-in replacement. This means you can point VS Code extensions like Continue, Python scripts using the <code>openai</code> library, or LangChain pipelines directly at your local Gemma 4 31B instance without modifying code — just change the base URL and set the API key to any non-empty string (Ollama ignores it but the SDK requires a value). This makes Gemma 4 31B an immediately usable private coding assistant with zero monthly cost, zero rate limits, and no data ever leaving your machine.</p>
<p><strong>Python (OpenAI SDK):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:11434/v1&#34;</span>,
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;ollama&#34;</span>,  <span style="color:#75715e"># required but ignored by Ollama</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Review this Python function for bugs: def parse(x): return x[&#39;data&#39;]&#34;</span>}
</span></span><span style="display:flex;"><span>    ]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span></code></pre></div><p><strong>curl:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;gemma4:31b&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Hello, Gemma 4!&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p><strong>Native Ollama API (also available):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/generate <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;gemma4:31b&#34;, &#34;prompt&#34;: &#34;Explain gradient descent&#34;}&#39;</span>
</span></span></code></pre></div><h3 id="streaming-responses">Streaming Responses</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>stream <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gemma4:31b&#34;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Write a FastAPI endpoint for user authentication&#34;</span>}],
</span></span><span style="display:flex;"><span>    stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>,
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>    print(chunk<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>delta<span style="color:#f92672">.</span>content <span style="color:#f92672">or</span> <span style="color:#e6db74">&#34;&#34;</span>, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>)
</span></span></code></pre></div><h2 id="gemma-4-31b-benchmarks-how-it-stacks-up-against-gpt-4o-and-llama-4">Gemma 4 31B Benchmarks: How It Stacks Up Against GPT-4o and Llama 4</h2>
<p>Gemma 4 31B Dense achieves state-of-the-art results for its parameter count, posting 87.1% on MMLU versus GPT-4o&rsquo;s 86.5% — a meaningful reversal given the cost difference (free vs. API pricing). On GPQA Diamond, a graduate-level science benchmark that measures genuine reasoning depth, Gemma 4 31B scores 84.3%, compared to Llama 4 Scout&rsquo;s 74.3% despite Scout having a 109B total parameter count. The AIME 2026 score of 89.2% places it among the top tier of math-capable models available to run without an API. As of April 2026, Gemma 4 31B ranks #3 on the Chatbot Arena (LMSYS) leaderboard — the only fully open model in the top five. This makes it the strongest option for teams that need GPT-4o-class reasoning performance in an air-gapped or privacy-first deployment.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Gemma 4 31B</th>
          <th>GPT-4o</th>
          <th>Llama 4 Scout (109B MoE)</th>
          <th>Claude 3.7 Sonnet</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MMLU</td>
          <td>87.1%</td>
          <td>86.5%</td>
          <td>83.2%</td>
          <td>88.3%</td>
      </tr>
      <tr>
          <td>GPQA Diamond</td>
          <td>84.3%</td>
          <td>83.4%</td>
          <td>74.3%</td>
          <td>84.8%</td>
      </tr>
      <tr>
          <td>AIME 2026</td>
          <td>89.2%</td>
          <td>83.1%</td>
          <td>67.4%</td>
          <td>86.5%</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>85.4%</td>
          <td>87.0%</td>
          <td>79.3%</td>
          <td>86.1%</td>
      </tr>
      <tr>
          <td>Arena Rank</td>
          <td>#3</td>
          <td>#2</td>
          <td>#7</td>
          <td>#1</td>
      </tr>
  </tbody>
</table>
<p><em>Benchmarks sourced from Google DeepMind release notes and third-party evaluations, April–May 2026.</em></p>
<h2 id="optimization-tips-quantization-gpu-layers-and-context-window-tuning">Optimization Tips: Quantization, GPU Layers, and Context Window Tuning</h2>
<p>Ollama&rsquo;s default Q4_K_M quantization is the right choice for most users, reducing VRAM usage by 55–60% versus FP16 with minimal quality degradation. But beyond quantization format, there are several settings worth tuning to maximize performance on your specific hardware. The most impactful variable is GPU layer offloading (<code>num_gpu</code>) — Ollama automatically offloads as many layers as fit in VRAM, but you can override this with a <code>Modelfile</code>. Context window size (<code>num_ctx</code>) also directly affects VRAM usage: Gemma 4 31B supports 256K tokens, but setting a 4K or 8K context for coding tasks frees significant memory for additional parallel requests.</p>
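<p>The context-window/VRAM tradeoff can be quantified. The sketch below uses assumed architecture numbers (48 layers, 8 KV heads, head dimension 128, FP16 cache) purely for illustration, since Gemma 4&rsquo;s actual layer geometry isn&rsquo;t covered here:</p>

```python
def kv_cache_gb(ctx_tokens: int, layers: int = 48, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x precision, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return ctx_tokens * per_token_bytes / 1e9

print(round(kv_cache_gb(8_192), 2))    # 1.61 -> an 8K coding context is cheap
print(round(kv_cache_gb(262_144), 1))  # 51.5 -> the full 256K window is not
```

<p>Under these assumptions the cache alone at the full 256K window dwarfs the quantized weights, which is why trimming <code>num_ctx</code> is the single biggest memory lever after quantization.</p>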
<p><strong>Create a custom Modelfile for tuned inference:</strong></p>



<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-text" data-lang="text">FROM gemma4:31b

# Set context window to 8K for coding tasks
PARAMETER num_ctx 8192

# Force all model layers onto the GPU
PARAMETER num_gpu 99

# Temperature tuned for code generation
PARAMETER temperature 0.2
</code></pre></div>
<text text-anchor='middle' x='80' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='88' y='52' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='88' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='88' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='96' y='52' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='84' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='96' y='100' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='96' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='148' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='104' y='4' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='104' y='52' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='104' y='100' fill='currentColor' style='font-size:1em'>_</text>
<text text-anchor='middle' x='104' y='148' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='112' y='4' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='112' y='52' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='112' y='84' fill='currentColor' style='font-size:1em'>y</text>
<text text-anchor='middle' x='112' y='100' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='112' y='132' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='112' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='120' y='52' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='120' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='100' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='120' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='120' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='128' y='52' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='128' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='100' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='128' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='148' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='136' y='84' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='136' y='148' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='144' y='52' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='144' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='144' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='144' y='148' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>w</text>
<text text-anchor='middle' x='152' y='52' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='152' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='152' y='100' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='152' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='152' y='148' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='160' y='52' fill='currentColor' style='font-size:1em'>9</text>
<text text-anchor='middle' x='160' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='160' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='160' y='148' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='168' y='36' fill='currentColor' style='font-size:1em'>(</text>
<text text-anchor='middle' x='168' y='52' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='168' y='132' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='176' y='84' fill='currentColor' style='font-size:1em'>G</text>
<text text-anchor='middle' x='176' y='132' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='176' y='148' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='184' y='84' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='184' y='132' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='184' y='148' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>f</text>
<text text-anchor='middle' x='192' y='84' fill='currentColor' style='font-size:1em'>U</text>
<text text-anchor='middle' x='192' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='192' y='148' fill='currentColor' style='font-size:1em'>1</text>
<text text-anchor='middle' x='200' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='200' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='208' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='208' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='216' y='36' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='216' y='132' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='224' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='224' y='84' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='224' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='232' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='232' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='232' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='240' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='240' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='248' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='248' y='84' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='256' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='256' y='84' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='256' y='132' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='264' y='36' fill='currentColor' style='font-size:1em'>4</text>
<text text-anchor='middle' x='264' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='264' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='272' y='36' fill='currentColor' style='font-size:1em'>8</text>
<text text-anchor='middle' x='272' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='272' y='132' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='280' y='36' fill='currentColor' style='font-size:1em'>,</text>
<text text-anchor='middle' x='280' y='132' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='288' y='84' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='288' y='132' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='296' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='296' y='84' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='296' y='132' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='304' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='304' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='312' y='36' fill='currentColor' style='font-size:1em'>x</text>
<text text-anchor='middle' x='312' y='84' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='312' y='132' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='320' y='36' fill='currentColor' style='font-size:1em'>:</text>
<text text-anchor='middle' x='320' y='84' fill='currentColor' style='font-size:1em'>-</text>
<text text-anchor='middle' x='320' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='328' y='84' fill='currentColor' style='font-size:1em'>d</text>
<text text-anchor='middle' x='328' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='336' y='36' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='336' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='336' y='132' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='344' y='36' fill='currentColor' style='font-size:1em'>5</text>
<text text-anchor='middle' x='344' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='344' y='132' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='352' y='36' fill='currentColor' style='font-size:1em'>6</text>
<text text-anchor='middle' x='352' y='84' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='352' y='132' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='360' y='36' fill='currentColor' style='font-size:1em'>K</text>
<text text-anchor='middle' x='360' y='84' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='368' y='36' fill='currentColor' style='font-size:1em'>)</text>
<text text-anchor='middle' x='368' y='84' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='376' y='84' fill='currentColor' style='font-size:1em'>)</text>
</g>

    </svg>
  
</div>
<p><strong>Build and run the custom model:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama create gemma4-coding -f Modelfile
</span></span><span style="display:flex;"><span>ollama run gemma4-coding
</span></span></code></pre></div><p><strong>Quantization options and trade-offs:</strong></p>
<table>
  <thead>
      <tr>
          <th>Format</th>
          <th>VRAM (31B)</th>
          <th>Quality</th>
          <th>Speed</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FP16</td>
          <td>~62 GB</td>
          <td>Best</td>
          <td>Slowest (memory-bandwidth-bound)</td>
      </tr>
      <tr>
          <td>Q8_0</td>
          <td>~33 GB</td>
          <td>Near-lossless</td>
          <td>Fast</td>
      </tr>
      <tr>
          <td>Q4_K_M</td>
          <td>~18–24 GB</td>
          <td>Good (default)</td>
          <td>Good</td>
      </tr>
      <tr>
          <td>Q4_0</td>
          <td>~17 GB</td>
          <td>Slightly lower</td>
          <td>Slightly faster</td>
      </tr>
      <tr>
          <td>Q3_K_M</td>
          <td>~14 GB</td>
          <td>Acceptable</td>
          <td>Fast on low VRAM</td>
      </tr>
  </tbody>
</table>
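<p>The VRAM figures above follow directly from parameter count times bits per weight. A minimal sketch of that arithmetic, assuming effective bits-per-weight values of roughly 8.5 for Q8_0 and 4.85 for Q4_K_M (approximations that include the GGUF formats' scale metadata; real usage adds KV-cache and activation overhead on top):</p>

```shell
#!/usr/bin/env sh
# Weights-only VRAM estimate: parameters (billions) x bits-per-weight / 8.
# estimate_vram_gb <params_in_billions> <bits_per_weight>
estimate_vram_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.0f\n", p * b / 8 }'
}

estimate_vram_gb 31 16    # FP16   -> 62
estimate_vram_gb 31 8.5   # Q8_0   -> 33 (8 bits plus quantization scales)
estimate_vram_gb 31 4.85  # Q4_K_M -> 19 (low end of the ~18-24 GB range)
```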
<h3 id="monitoring-gpu-utilization">Monitoring GPU Utilization</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># NVIDIA</span>
</span></span><span style="display:flex;"><span>watch -n <span style="color:#ae81ff">1</span> nvidia-smi
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Mac (using powermetrics or mactop)</span>
</span></span><span style="display:flex;"><span>mactop
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama model status</span>
</span></span><span style="display:flex;"><span>ollama ps
</span></span></code></pre></div><h2 id="common-errors-and-fixes-when-running-gemma-4-31b-locally">Common Errors and Fixes When Running Gemma 4 31B Locally</h2>
<p>Most failures when running Gemma 4 31B locally fall into three categories: insufficient VRAM causing OOM errors, Ollama version mismatches that predate Gemma 4 support, and port conflicts preventing the API from starting. These are all straightforward to diagnose and fix — Ollama&rsquo;s error messages are specific enough to point directly to the root cause in most cases. The most common mistake is attempting to run the 31B model on a GPU with less than 20GB VRAM without adjusting quantization. The second most common is running Ollama 0.4.x, which predates the <code>gemma4</code> model tag and returns a &ldquo;model not found&rdquo; error regardless of what you pull.</p>
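<p>A quick preflight can rule out the version mismatch before anything is pulled. This is a sketch, not part of Ollama itself; <code>version_ge</code> is a hypothetical helper built on <code>sort -V</code>, and the check degrades gracefully when <code>ollama</code> is not on the PATH:</p>

```shell
#!/usr/bin/env sh
# version_ge <have> <need>: succeed if <have> is at least <need> (uses sort -V)
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Gemma 4 tags require Ollama 0.5 or newer; older builds return "model not found"
have=$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n1)
if version_ge "${have:-0}" "0.5.0"; then
  echo "Ollama $have is new enough for the gemma4 tags"
else
  echo "Ollama ${have:-unknown} predates gemma4 support; update it first"
fi
```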
<p><strong>Error: <code>CUDA out of memory</code> or <code>error: model requires more system memory</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check available VRAM</span>
</span></span><span style="display:flex;"><span>nvidia-smi --query-gpu<span style="color:#f92672">=</span>memory.free,memory.total --format<span style="color:#f92672">=</span>csv
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Solution: Force a lower quantization by pulling a specific GGUF tag</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b-q3_k_m  <span style="color:#75715e"># ~14GB VRAM</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or switch to the 26B model</span>
</span></span><span style="display:flex;"><span>ollama pull gemma4:26b
</span></span></code></pre></div><p><strong>Error: <code>model &quot;gemma4:31b&quot; not found</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama version (needs 0.5+)</span>
</span></span><span style="display:flex;"><span>ollama --version
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Update Ollama</span>
</span></span><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh  <span style="color:#75715e"># Linux</span>
</span></span><span style="display:flex;"><span>brew upgrade ollama  <span style="color:#75715e"># Mac</span>
</span></span></code></pre></div><p><strong>Error: <code>listen tcp :11434: bind: address already in use</code></strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Another process is using port 11434</span>
</span></span><span style="display:flex;"><span>lsof -i :11434
</span></span><span style="display:flex;"><span>kill -9 &lt;PID&gt;
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>Slow generation speed (&lt; 5 tok/s on RTX 4090)</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Verify GPU is being used, not CPU</span>
</span></span><span style="display:flex;"><span>ollama ps  <span style="color:#75715e"># shows active model and runner type</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># If showing &#34;cpu&#34; runner, CUDA drivers may not be detected</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check the server logs for GPU detection messages:</span>
</span></span><span style="display:flex;"><span>journalctl -u ollama -f  <span style="color:#75715e"># Linux (systemd); on Mac see ~/.ollama/logs/server.log</span>
</span></span></code></pre></div><p><strong>Model loads but produces garbage output</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Corrupted model file — re-pull</span>
</span></span><span style="display:flex;"><span>ollama rm gemma4:31b
</span></span><span style="display:flex;"><span>ollama pull gemma4:31b
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p>The following questions address the most common issues and misconceptions when setting up Gemma 4 31B locally with Ollama. Hardware compatibility is the most frequent stumbling block — specifically the gap between a model&rsquo;s FP16 memory footprint and its quantized footprint. Gemma 4 31B at Q4_K_M requires roughly 18–24GB VRAM, not the 62GB you would need for FP16, which changes the hardware requirements dramatically. Other common points of confusion include the model variant naming (no &ldquo;27B&rdquo; variant exists in Gemma 4), offline operation capabilities (the model runs entirely air-gapped after the initial download completes), CPU fallback behavior when no compatible GPU is present, and licensing terms for commercial deployments. The Apache 2.0 license makes Gemma 4 31B fully usable in production environments without royalties or usage restrictions, which distinguishes it from some other open-weight models with more restrictive non-commercial terms.</p>
<h3 id="does-gemma-4-31b-require-an-internet-connection-after-download">Does Gemma 4 31B require an internet connection after download?</h3>
<p>No. Once <code>ollama pull gemma4:31b</code> completes, the model runs entirely offline. Ollama stores the weights in <code>~/.ollama/models/</code> and inference happens locally with no network calls. You can disconnect your machine from the internet and the model continues to work normally.</p>
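<p>The offline guarantee extends to the API: requests go to the loopback interface only. Here is a hedged sketch of a local call to Ollama's <code>/api/generate</code> endpoint, with a small helper (<code>build_payload</code>, hypothetical, shown for illustration) so the JSON body can be inspected before sending:</p>

```shell
#!/usr/bin/env sh
# build_payload <model> <prompt>: assemble the JSON body for /api/generate
# (naive quoting; fine for simple prompts without quotes or backslashes)
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Only send if the local server is actually up; nothing leaves the machine.
if curl -fsS http://localhost:11434/api/tags >/dev/null 2>&1; then
  build_payload "gemma4:31b" "Why is the sky blue?" |
    curl -s http://localhost:11434/api/generate -d @-
else
  echo "Ollama is not listening on localhost:11434; start it with: ollama serve"
fi
```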
<h3 id="can-i-run-gemma-4-31b-on-a-cpu-without-a-gpu">Can I run Gemma 4 31B on a CPU without a GPU?</h3>
<p>Yes, but it will be very slow. Ollama falls back to CPU inference automatically if no compatible GPU is detected. Expect 1–3 tokens per second on a modern desktop CPU versus 30–60+ tokens per second on an RTX 4090. For practical use, a GPU with at least 20GB VRAM is strongly recommended for the 31B variant.</p>
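<p>Those throughput numbers are easy to verify yourself: Ollama's <code>/api/generate</code> response reports <code>eval_count</code> (tokens generated) and <code>eval_duration</code> (nanoseconds), so tokens per second is a one-line calculation. A minimal sketch:</p>

```shell
#!/usr/bin/env sh
# toks_per_sec <eval_count> <eval_duration_ns>: throughput from an
# /api/generate response (eval_duration is reported in nanoseconds)
toks_per_sec() {
  awk -v c="$1" -v d="$2" 'BEGIN { printf "%.1f\n", c * 1e9 / d }'
}

toks_per_sec 128 3200000000   # 128 tokens in 3.2 s -> 40.0 tok/s (GPU-class)
toks_per_sec 128 64000000000  # 128 tokens in 64 s  -> 2.0 tok/s (CPU-class)
```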
<h3 id="what-is-the-difference-between-gemma-4-31b-and-gemma-4-27b">What is the difference between Gemma 4 31B and Gemma 4 27B?</h3>
<p>There is no official &ldquo;27B&rdquo; variant in the Gemma 4 family. The lineup is E2B, E4B, 12B, 26B, and 31B. Some confusion arises because earlier Gemma 2 had a 27B model. Gemma 4 31B is the top-tier dense model in the current release.</p>
<h3 id="how-do-i-update-gemma-4-to-a-newer-version-when-google-releases-one">How do I update Gemma 4 to a newer version when Google releases one?</h3>
<p>Run <code>ollama pull gemma4:31b</code> again. Ollama checks the registry for a newer manifest and downloads only the changed layers if an update is available. Note that <code>gemma4:latest</code> resolves to whatever tag the registry marks as the default, which is not necessarily the 31B variant, so pin the explicit <code>31b</code> tag when that is the model you want.</p>
<h3 id="is-gemma-4-31b-safe-to-use-in-production-with-real-user-data">Is Gemma 4 31B safe to use in production with real user data?</h3>
<p>Gemma 4 31B is Apache 2.0 licensed, so commercial use is permitted without restriction. For production deployments handling sensitive user data, running it locally with Ollama is actually the privacy-correct approach — no data is sent to third-party servers. However, like all language models, it can produce hallucinations and should not be used for safety-critical decisions without human review and output validation.</p>
]]></content:encoded></item></channel></rss>