<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Coding Agent on RockB</title><link>https://baeseokjae.github.io/tags/coding-agent/</link><description>Recent content in Coding Agent on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Thu, 30 Apr 2026 06:04:25 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/coding-agent/index.xml" rel="self" type="application/rss+xml"/><item><title>Devstral Small 2 Local Setup Guide 2026: Run Mistral Coding Agent on Your Laptop</title><link>https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/</link><pubDate>Thu, 30 Apr 2026 06:04:25 +0000</pubDate><guid>https://baeseokjae.github.io/posts/devstral-small-2-local-setup-guide-2026/</guid><description>Step-by-step guide to running Devstral Small 2 (24B, 68% SWE-bench) locally via Ollama, vLLM, or llama.cpp — zero API fees, full privacy.</description><content:encoded><![CDATA[<p>Devstral Small 2 is a 24B-parameter coding model from Mistral AI that scores 68% on SWE-bench Verified and runs on a single 24GB GPU or a Mac M-series with 32GB unified memory — making it the first cloud-grade coding agent most developers can realistically self-host. This guide covers three setup paths: Ollama for beginners, vLLM for production teams, and llama.cpp for CPU-only or low-VRAM machines.</p>
<h2 id="what-is-devstral-small-2">What Is Devstral Small 2?</h2>
<p>Devstral Small 2 is Mistral AI&rsquo;s open-weight coding specialist, released December 10, 2025 under the Apache 2.0 license. With 24 billion parameters and a 256K-token context window, it achieves 68.0% on SWE-bench Verified — a real-world benchmark measuring a model&rsquo;s ability to resolve open GitHub issues autonomously. That puts it on par with models up to five times its parameter count, including closed-source proprietary systems. Because it ships under Apache 2.0, you can run it locally with no API fees, no data leaving your machine, and no usage restrictions — even in commercial projects. The model is fine-tuned specifically on agentic coding workflows: reading multi-file codebases, writing patches, running tool calls, and self-correcting from test failures. Devstral Small 2 outperforms Qwen 3 Coder Flash (30B) despite being a smaller model, and its larger sibling Devstral 2 (123B) hits 72.2%, compared to Claude Sonnet 4.5&rsquo;s 77.2% — at up to 7x lower cost per coding task. For teams or individuals who need a capable coding agent without cloud dependency, Devstral Small 2 is the most practical choice available today.</p>
<h3 id="how-does-it-compare-to-github-copilot">How Does It Compare to GitHub Copilot?</h3>
<p>GitHub Copilot is a cloud-only SaaS product that sends your code to Microsoft/OpenAI servers for every completion. Devstral Small 2 runs entirely on your hardware — no telemetry, no subscription, no code leaving your machine. For developers working with proprietary codebases, HIPAA-sensitive data, or airgapped environments, this distinction is not a preference but a requirement. In terms of raw capability, Devstral Small 2 handles multi-file edits, shell tool use, and autonomous bug-fix loops that Copilot&rsquo;s autocomplete model is not designed for.</p>
<h2 id="hardware-requirements--can-your-laptop-run-it">Hardware Requirements — Can Your Laptop Run It?</h2>
<p>At full FP16 precision the 24B weights alone need roughly 48GB of memory, so local deployments run quantized builds: plan on at least 24GB VRAM on a discrete GPU, or 32GB unified memory on Apple Silicon Macs. With Q4_K_M quantization — which preserves 95%+ of model quality while shrinking the weights to about 14GB — you can run it on a single RTX 3090, RTX 4090, or any GPU with 16GB+ VRAM, though performance is faster with 24GB. Apple M1, M2, M3, and M4 chips with 32GB unified memory can run the model without any discrete GPU because macOS treats CPU and GPU memory as a shared pool. Windows and Linux users with 16GB VRAM GPUs should use GGUF quantization via llama.cpp for the best performance-to-memory tradeoff. CPU-only inference is possible but slow — expect 1–3 tokens per second on a modern AMD or Intel machine with 64GB RAM, which is usable for testing but not for interactive coding sessions. The model&rsquo;s 256K context window is memory-hungry: fitting a 128K context requires approximately 20GB VRAM even at Q4. For most local setups, keep context under 32K unless you have 24GB+ VRAM available. A quick check of what your machine actually has is shown after the table below.</p>
<table>
  <thead>
      <tr>
          <th>Hardware</th>
          <th>VRAM / RAM</th>
          <th>Recommended Method</th>
          <th>Speed (est.)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RTX 4090</td>
          <td>24GB VRAM</td>
          <td>Ollama / vLLM</td>
          <td>25–40 tok/s</td>
      </tr>
      <tr>
          <td>RTX 3090</td>
          <td>24GB VRAM</td>
          <td>Ollama (Q4_K_M)</td>
          <td>15–25 tok/s</td>
      </tr>
      <tr>
          <td>RTX 4080</td>
          <td>16GB VRAM</td>
          <td>llama.cpp Q4_K_M</td>
          <td>10–18 tok/s</td>
      </tr>
      <tr>
          <td>Mac M3 Max 96GB</td>
          <td>96GB unified</td>
          <td>Ollama (Metal)</td>
          <td>20–35 tok/s</td>
      </tr>
      <tr>
          <td>Mac M2 Pro 32GB</td>
          <td>32GB unified</td>
          <td>Ollama (Metal)</td>
          <td>8–15 tok/s</td>
      </tr>
      <tr>
          <td>CPU only (64GB RAM)</td>
          <td>64GB RAM</td>
          <td>llama.cpp Q4_K_M</td>
          <td>1–3 tok/s</td>
      </tr>
  </tbody>
</table>
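<p>Before picking a method, confirm what your machine actually has. These are standard system commands rather than anything Devstral-specific; use the line that matches your OS:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># NVIDIA GPU: total and free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# Apple Silicon: unified memory size
sysctl -n hw.memsize | awk &#39;{printf &#34;%.0f GB unified memory\n&#34;, $1/1024/1024/1024}&#39;

# Linux, CPU-only path: system RAM
free -g | awk &#39;/Mem:/ {print $2 &#34; GB RAM&#34;}&#39;
</code></pre></div>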
<h2 id="method-1-ollama-setup-easiest--3-commands">Method 1: Ollama Setup (Easiest — 3 Commands)</h2>
<p>Ollama is the fastest path to running Devstral Small 2 locally — it handles model download, quantization selection, and an OpenAI-compatible API server in three commands. Install Ollama from the official site for your OS (macOS, Linux, or Windows WSL2), then pull and run the model. No manual CUDA configuration is required; Ollama detects your GPU automatically and selects the appropriate backend. The Q4_K_M quantization variant is used by default for Devstral Small 2 on Ollama, delivering the best balance of speed and quality for consumer hardware. Once the model is running, Ollama exposes a local HTTP server at <code>localhost:11434</code> with an OpenAI-compatible <code>/v1/chat/completions</code> endpoint, which means any tool that works with OpenAI models — LangChain, LlamaIndex, Continue.dev, OpenHands — can be pointed at your local Devstral instance with a single URL change. The model download is approximately 14GB for the Q4_K_M variant.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Step 1: Install Ollama (macOS example)</span>
</span></span><span style="display:flex;"><span>brew install ollama
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 2: Start the Ollama service</span>
</span></span><span style="display:flex;"><span>ollama serve
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 3: Pull and run Devstral Small 2</span>
</span></span><span style="display:flex;"><span>ollama run devstral-small-2
</span></span></code></pre></div><p>For Linux, replace the brew command with:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span></code></pre></div><p>Once running, test the API directly:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;devstral-small-2&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Write a Python function to parse a CSV file&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><h3 id="configure-ollama-for-longer-context">Configure Ollama for Longer Context</h3>
<p>By default Ollama caps context at 2048 tokens for memory safety. To unlock the full 256K window (or a practical 32K for most tasks), create a custom Modelfile:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>cat &gt; Modelfile <span style="color:#e6db74">&lt;&lt; &#39;EOF&#39;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">FROM devstral-small-2
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">PARAMETER num_ctx 32768
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">EOF</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>ollama create devstral-32k -f Modelfile
</span></span><span style="display:flex;"><span>ollama run devstral-32k
</span></span></code></pre></div><h2 id="method-2-vllm-server-setup-production-grade">Method 2: vLLM Server Setup (Production-Grade)</h2>
<p>vLLM is the preferred deployment method for teams running Devstral Small 2 as a shared inference server — it offers higher throughput, continuous batching, and PagedAttention memory management that Ollama does not implement. vLLM requires a CUDA-compatible GPU on Linux (Windows is not supported), and Python 3.9+. The setup installs via pip and takes under five minutes once CUDA drivers are configured. vLLM exposes an OpenAI-compatible REST API on port 8000 by default, making it a drop-in replacement for the OpenAI endpoint in any application. For production deployments, vLLM&rsquo;s continuous batching can serve multiple concurrent coding sessions without the memory overhead of spawning separate model instances per request. At 24B parameters with A100-80GB hardware, expect 60–80 tokens per second with batched inference. On a single RTX 4090, expect 25–35 tok/s single-user throughput. vLLM also supports tensor parallelism across multiple GPUs — useful if you have two 16GB cards and want to treat them as 32GB combined.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install vLLM</span>
</span></span><span style="display:flex;"><span>pip install vllm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Launch the server (single GPU)</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model mistralai/Devstral-Small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --served-model-name devstral-small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8000</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">32768</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># For tensor parallelism across 2 GPUs:</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model mistralai/Devstral-Small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --tensor-parallel-size <span style="color:#ae81ff">2</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">65536</span>
</span></span></code></pre></div><p>Test the vLLM server:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:8000/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;devstral-small-2&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Review this Python function for bugs&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><h3 id="quantization-with-vllm">Quantization with vLLM</h3>
<p>Note that <code>--dtype float16</code> and <code>--gpu-memory-utilization</code> are not quantization: vLLM still loads the full ~48GB FP16 weights, so a 16GB card cannot hold the model this way. If you are limited to 16GB VRAM, either point vLLM at a quantized checkpoint (it supports AWQ and GPTQ weights via <code>--quantization</code>, if such a build is published for Devstral Small 2) or use the llama.cpp GGUF path in Method 3. On a 24GB card, the following flags keep memory headroom under control:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model mistralai/Devstral-Small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --dtype float16 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --gpu-memory-utilization 0.90 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">16384</span>
</span></span></code></pre></div><h2 id="method-3-llamacpp--gguf-cpu-or-low-vram-machines">Method 3: llama.cpp + GGUF (CPU or Low-VRAM Machines)</h2>
<p>llama.cpp with GGUF quantization is the right method when you have less than 16GB VRAM or want to run Devstral Small 2 entirely on CPU. GGUF is a portable model format that supports mixed-precision quantization — different layers use different bit depths — which means the model fits in less memory while preserving most of the accuracy on higher-reasoning tasks. The Q4_K_M quantization variant is the community consensus for best quality-to-size ratio: it reduces the 24B model from ~48GB (FP16) to approximately 14GB, while maintaining 95%+ task accuracy on coding benchmarks. For machines with 8GB VRAM, you can offload 20 layers to GPU and run the rest on CPU (<code>--n-gpu-layers 20</code>), which gives 3–8 tok/s — slow but functional. llama.cpp also supports Metal on macOS, making it an alternative to Ollama for Mac users who want more control over layer offloading configuration. Install from source or use a prebuilt binary from the llama.cpp releases page.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Build llama.cpp from source (Linux/Mac)</span>
</span></span><span style="display:flex;"><span>git clone https://github.com/ggerganov/llama.cpp
</span></span><span style="display:flex;"><span>cd llama.cpp
</span></span><span style="display:flex;"><span>cmake -B build -DGGML_CUDA<span style="color:#f92672">=</span>ON  <span style="color:#75715e"># use DGGML_METAL=ON for Mac</span>
</span></span><span style="display:flex;"><span>cmake --build build --config Release -j<span style="color:#66d9ef">$(</span>nproc<span style="color:#66d9ef">)</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Download the GGUF file (Q4_K_M recommended)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># From HuggingFace: mistralai/Devstral-Small-2-GGUF</span>
</span></span><span style="display:flex;"><span>wget https://huggingface.co/mistralai/Devstral-Small-2-GGUF/resolve/main/devstral-small-2-q4_k_m.gguf
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Run inference (GPU offload)</span>
</span></span><span style="display:flex;"><span>./build/bin/llama-server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model devstral-small-2-q4_k_m.gguf <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --n-gpu-layers <span style="color:#ae81ff">35</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --ctx-size <span style="color:#ae81ff">16384</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8080</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># CPU-only (no --n-gpu-layers flag)</span>
</span></span><span style="display:flex;"><span>./build/bin/llama-server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model devstral-small-2-q4_k_m.gguf <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --ctx-size <span style="color:#ae81ff">8192</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8080</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --threads <span style="color:#66d9ef">$(</span>nproc<span style="color:#66d9ef">)</span>
</span></span></code></pre></div><h3 id="choose-the-right-gguf-quantization">Choose the Right GGUF Quantization</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>File Size</th>
          <th>VRAM/RAM</th>
          <th>Quality vs FP16</th>
          <th>Best For</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Q2_K</td>
          <td>~9GB</td>
          <td>12GB</td>
          <td>~85%</td>
          <td>Testing only</td>
      </tr>
      <tr>
          <td>Q4_K_M</td>
          <td>~14GB</td>
          <td>16GB</td>
          <td>~95%</td>
          <td><strong>Recommended</strong></td>
      </tr>
      <tr>
          <td>Q5_K_M</td>
          <td>~17GB</td>
          <td>20GB</td>
          <td>~97%</td>
          <td>24GB VRAM systems</td>
      </tr>
      <tr>
          <td>Q8_0</td>
          <td>~25GB</td>
          <td>28GB</td>
          <td>~99%</td>
          <td>Near-lossless quality</td>
      </tr>
      <tr>
          <td>F16</td>
          <td>~48GB</td>
          <td>50GB</td>
          <td>100%</td>
          <td>Dedicated A100/H100</td>
      </tr>
  </tbody>
</table>
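<p>To grab a variant other than the Q4_K_M file downloaded above, the Hugging Face CLI can fetch a single file. Treat this as a sketch: the repository and file names follow the naming pattern used earlier in this guide and may differ for the actual upload.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Download one specific GGUF variant (example: Q5_K_M)
pip install -U &#34;huggingface_hub[cli]&#34;
huggingface-cli download mistralai/Devstral-Small-2-GGUF \
  devstral-small-2-q5_k_m.gguf --local-dir ./models
</code></pre></div>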
<h2 id="connect-to-mistral-vibe-cli-for-terminal-coding">Connect to Mistral Vibe CLI for Terminal Coding</h2>
<p>Mistral Vibe CLI is Mistral&rsquo;s official terminal-native coding agent — the equivalent of Cursor or Claude Code, but designed to work with local models via any OpenAI-compatible endpoint. Once Devstral Small 2 is running via Ollama or vLLM, you point Vibe CLI at your local server and get a full agentic coding session in your terminal: multi-file edits, shell command execution, test running, and autonomous bug-fix loops — all without leaving the command line or sending code to the cloud. The CLI reads your project directory, maintains a conversation about the codebase, and can autonomously apply patches across multiple files when given a high-level task like &ldquo;fix all failing tests&rdquo; or &ldquo;add OAuth2 to the auth module.&rdquo; This is the closest local equivalent to GitHub Copilot Workspace or Devin for developers who prioritize privacy.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install Mistral Vibe CLI</span>
</span></span><span style="display:flex;"><span>pip install mistral-vibe
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Configure for local Ollama endpoint</span>
</span></span><span style="display:flex;"><span>cat &gt; ~/.vibe/config.toml <span style="color:#e6db74">&lt;&lt; &#39;EOF&#39;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">[model]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">provider = &#34;openai-compatible&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">base_url = &#34;http://localhost:11434/v1&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">model = &#34;devstral-small-2&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">api_key = &#34;ollama&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">[agent]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">auto_approve = false
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">max_file_changes = 50
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">EOF</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Start a coding session</span>
</span></span><span style="display:flex;"><span>cd /your/project
</span></span><span style="display:flex;"><span>vibe <span style="color:#e6db74">&#34;Refactor the authentication module to use JWT tokens&#34;</span>
</span></span></code></pre></div><p>For vLLM backend, change <code>base_url</code> to <code>http://localhost:8000/v1</code>.</p>
<h3 id="auto-approval-mode">Auto-Approval Mode</h3>
<p>For trusted projects where you want fully autonomous operation, add the following to the <code>[agent]</code> section of <code>~/.vibe/config.toml</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>[<span style="color:#ae81ff">agent]</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">auto_approve = true</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">allowed_tools = [&#34;read_file&#34;, &#34;write_file&#34;, &#34;run_shell&#34;]</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">denied_tools = [&#34;delete_file&#34;, &#34;network_request&#34;]</span>
</span></span></code></pre></div><p>Auto-approval is off by default. Enable it only in isolated dev environments — it will execute shell commands without confirmation.</p>
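<p>One way to contain an auto-approved session is a throwaway container, so that whatever the agent executes cannot touch the host. The sketch below is illustrative rather than an official workflow: it assumes the <code>mistral-vibe</code> package and the Ollama config created above, and <code>--network host</code> behaves as shown on Linux only.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Disposable sandbox for an auto-approved agent session (illustrative sketch)
docker run --rm -it \
  --network host \
  -v &#34;$PWD&#34;:/workspace \
  -v ~/.vibe:/root/.vibe:ro \
  -w /workspace \
  python:3.11-slim \
  bash -c &#34;pip install mistral-vibe &amp;&amp; vibe &#39;Fix all failing tests&#39;&#34;
</code></pre></div>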
<h2 id="integrate-devstral-small-2-with-openhands">Integrate Devstral Small 2 with OpenHands</h2>
<p>OpenHands (formerly OpenDevin) is an open-source autonomous software engineering platform that wraps a coding model in a Docker-based sandbox, giving it persistent file access, shell execution, and browser automation. Integrating Devstral Small 2 with OpenHands gives you a self-hosted version of Devin-like autonomous coding — the agent can read your codebase, run tests, browse documentation, write multi-file patches, and iterate until the tests pass, all locally. OpenHands supports any OpenAI-compatible backend, making the Ollama or vLLM integration straightforward. The combination of Devstral Small 2&rsquo;s 68% SWE-bench score with OpenHands&rsquo;s tool execution scaffolding is currently one of the most capable local autonomous coding setups available. Setup requires Docker and about 5 minutes of configuration.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install OpenHands</span>
</span></span><span style="display:flex;"><span>pip install openhands-ai
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Or run via Docker (recommended)</span>
</span></span><span style="display:flex;"><span>docker pull ghcr.io/all-hands-ai/openhands:latest
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Launch with local Devstral endpoint</span>
</span></span><span style="display:flex;"><span>docker run -it <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -e LLM_API_KEY<span style="color:#f92672">=</span>ollama <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -e LLM_BASE_URL<span style="color:#f92672">=</span>http://host.docker.internal:11434/v1 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -e LLM_MODEL<span style="color:#f92672">=</span>devstral-small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -v /your/workspace:/workspace <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -p 3000:3000 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  ghcr.io/all-hands-ai/openhands:latest
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Open http://localhost:3000 in your browser</span>
</span></span></code></pre></div><p>For vLLM backend, replace <code>LLM_BASE_URL</code> with <code>http://host.docker.internal:8000/v1</code>. On Linux, also pass <code>--add-host host.docker.internal:host-gateway</code> to <code>docker run</code> so the container can resolve the host; Docker Desktop on macOS and Windows handles this automatically.</p>
<h3 id="openhands-task-examples">OpenHands Task Examples</h3>
<p>Once connected, you can give OpenHands high-level tasks:</p>
<ul>
<li>&ldquo;Add unit tests for all functions in <code>src/auth/</code>&rdquo;</li>
<li>&ldquo;Find and fix the memory leak in the websocket handler&rdquo;</li>
<li>&ldquo;Migrate this codebase from Python 2 to Python 3&rdquo;</li>
<li>&ldquo;Implement the GitHub issue #47 according to the spec in SPEC.md&rdquo;</li>
</ul>
<h2 id="devstral-small-2-vs-cloud-alternatives-is-local-worth-it">Devstral Small 2 vs. Cloud Alternatives: Is Local Worth It?</h2>
<p>Running Devstral Small 2 locally is worth it under specific conditions, and not worth it under others — the decision depends on your usage volume, hardware, and privacy requirements. At $0.10/$0.30 per million input/output tokens via the Mistral API, the hosted version is already among the cheapest coding models available, so a casual user running a few million tokens per month spends only a dollar or two and will never recoup a GPU purchase on token savings alone. Against Mistral&rsquo;s own pricing, a single RTX 4090 ($1,500–1,800) breaks even only after several billion output tokens; the raw-cost case is far stronger against pricier hosted models such as Claude Sonnet 4.5 (roughly $15 per million output tokens, per the table below), where breakeven arrives around 100 million output tokens. For teams with sustained high token volume, local deployment still eliminates per-token costs entirely after the hardware investment. For privacy-sensitive workloads — healthcare software, financial systems, proprietary algorithms — local deployment is not a cost decision but a compliance requirement. The 256K context window also becomes a cost driver at scale: processing a 50K-token codebase context with every request at API prices adds up quickly; locally, the only marginal cost is electricity.</p>
<table>
  <thead>
      <tr>
          <th>Factor</th>
          <th>Local (Devstral Small 2)</th>
          <th>Cloud (Mistral API)</th>
          <th>Cloud (Claude Sonnet 4.5)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SWE-bench Score</td>
          <td>68.0%</td>
          <td>68.0%</td>
          <td>77.2%</td>
      </tr>
      <tr>
          <td>Cost per 1M output tokens</td>
          <td>$0 (after hardware)</td>
          <td>$0.30</td>
          <td>~$15.00</td>
      </tr>
      <tr>
          <td>Privacy</td>
          <td>100% on-device</td>
          <td>Mistral data policy</td>
          <td>Anthropic data policy</td>
      </tr>
      <tr>
          <td>Context window</td>
          <td>256K</td>
          <td>256K</td>
          <td>200K</td>
      </tr>
      <tr>
          <td>Setup time</td>
          <td>5–30 min</td>
          <td>Instant</td>
          <td>Instant</td>
      </tr>
      <tr>
          <td>Latency</td>
          <td>8–40 tok/s</td>
          <td>50–100 tok/s</td>
          <td>40–80 tok/s</td>
      </tr>
      <tr>
          <td>Offline capability</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
<p>Cloud wins on latency and zero setup. Local wins on privacy, cost at scale, and offline operation.</p>
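<p>The cost comparison is easy to sanity-check with a few lines of arithmetic. The GPU price and per-token rates below are illustrative assumptions taken from the table above, not measured figures:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Rough breakeven: output tokens needed before a local GPU pays for itself
GPU_COST=1600        # USD, e.g. an RTX 4090
MISTRAL_PER_M=0.30   # USD per 1M output tokens (hosted Devstral Small 2)
CLAUDE_PER_M=15.00   # USD per 1M output tokens (Claude-class pricing, approx.)

awk -v c=&#34;$GPU_COST&#34; -v m=&#34;$MISTRAL_PER_M&#34; -v s=&#34;$CLAUDE_PER_M&#34; &#39;BEGIN {
  printf &#34;vs hosted Devstral: %.1f billion output tokens\n&#34;, c/m/1000
  printf &#34;vs Claude-class:    %.0f million output tokens\n&#34;, c/s
}&#39;
</code></pre></div>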
<h2 id="troubleshooting-common-setup-issues">Troubleshooting Common Setup Issues</h2>
<p>Most Devstral Small 2 setup failures fall into four categories: CUDA version mismatch, insufficient VRAM, context length errors, and model format issues. Diagnosing these systematically saves significant time compared to trial-and-error. CUDA mismatches are the most common issue on Linux — Ollama requires CUDA 11.8+ and vLLM requires CUDA 12.1+. Check your driver version with <code>nvidia-smi</code> and ensure it supports your installed CUDA toolkit. VRAM errors appear as &ldquo;CUDA out of memory&rdquo; exceptions during model loading; the fix is either to reduce <code>--max-model-len</code> (for vLLM) or switch to a lower quantization variant. Context length errors during inference mean your prompt exceeds the configured <code>num_ctx</code> (Ollama) or <code>--max-model-len</code> (vLLM) — increase these parameters or reduce your prompt size. On macOS, model download stalls are often caused by Ollama&rsquo;s background service not starting; run <code>ollama serve</code> explicitly in a terminal and check for permission errors in <code>~/.ollama/logs/</code>.</p>
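<p>Before digging into a specific failure, a quick environment check narrows the search. The last line assumes PyTorch is installed, which it will be on any machine where vLLM was set up:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Driver version and the highest CUDA runtime it supports
nvidia-smi | head -n 5
# Installed CUDA toolkit version (if nvcc is on PATH)
nvcc --version | grep release
# What PyTorch / vLLM will actually see
python3 -c &#34;import torch; print(torch.version.cuda, torch.cuda.is_available())&#34;
</code></pre></div>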
<p><strong>Issue: &ldquo;CUDA out of memory&rdquo; on model load</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Check actual VRAM usage</span>
</span></span><span style="display:flex;"><span>nvidia-smi --query-gpu<span style="color:#f92672">=</span>memory.used,memory.free --format<span style="color:#f92672">=</span>csv
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># vLLM: reduce model length</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model mistralai/Devstral-Small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">8192</span>  <span style="color:#75715e"># reduce from default</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Ollama: use smaller quantization</span>
</span></span><span style="display:flex;"><span>ollama pull devstral-small-2:q4_0  <span style="color:#75715e"># smaller than q4_k_m</span>
</span></span></code></pre></div><p><strong>Issue: Ollama not detecting GPU on Linux</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Verify CUDA driver</span>
</span></span><span style="display:flex;"><span>nvidia-smi
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check Ollama GPU detection</span>
</span></span><span style="display:flex;"><span>ollama run devstral-small-2 --verbose 2&gt;&amp;<span style="color:#ae81ff">1</span> | grep -i gpu
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Reinstall with GPU support</span>
</span></span><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span></code></pre></div><p><strong>Issue: Slow inference on Mac (&lt; 5 tok/s)</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Ensure Metal is enabled (should be automatic)</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Check if running CPU fallback</span>
</span></span><span style="display:flex;"><span>ollama run devstral-small-2 --verbose 2&gt;&amp;<span style="color:#ae81ff">1</span> | grep -i metal
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Increase num_ctx to use Metal more efficiently</span>
</span></span><span style="display:flex;"><span>ollama run devstral-small-2 --num-ctx <span style="color:#ae81ff">8192</span>
</span></span></code></pre></div><p><strong>Issue: vLLM tokenizer error on Devstral Small 2</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install correct tokenizer dependencies</span>
</span></span><span style="display:flex;"><span>pip install transformers&gt;<span style="color:#f92672">=</span>4.40.0 sentencepiece
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Use trust_remote_code flag</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model mistralai/Devstral-Small-2 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --trust-remote-code
</span></span></code></pre></div><h2 id="faq">FAQ</h2>
<p><strong>Can Devstral Small 2 run on a laptop with 16GB RAM and no dedicated GPU?</strong>
Yes, but slowly. Using llama.cpp with Q4_K_M quantization and a context of 4096 tokens, expect 0.5–2 tokens per second on a modern CPU. This is functional for testing or occasional queries but too slow for interactive coding sessions. A Mac with 32GB unified memory is significantly better than a PC with 16GB system RAM and no discrete GPU — the Mac&rsquo;s memory bandwidth is purpose-built for this workload.</p>
<p><strong>Is the Apache 2.0 license really free for commercial use?</strong>
Yes. Apache 2.0 permits use, modification, and distribution in commercial software without royalties; the main obligations are retaining the license text, copyright notices, and any NOTICE file when you redistribute. There are no &ldquo;non-commercial only&rdquo; restrictions. You can embed Devstral Small 2 in a product, use it internally in a company, or sell a service built on it — all without contacting Mistral.</p>
<p><strong>How does Devstral Small 2 handle multi-file edits?</strong>
Devstral Small 2 is fine-tuned specifically for agentic tool use, including multi-file read/write operations. When connected to a tool-calling harness (Vibe CLI, OpenHands, or a custom agent), it can read multiple files into context, reason about dependencies, and output structured patches for each file. Without a tool harness, it&rsquo;s a standard chat model that requires you to paste file contents manually.</p>
<p><strong>What context length should I use for everyday coding tasks?</strong>
For most coding tasks — reviewing a function, writing a new module, fixing a bug — 8K–16K tokens is sufficient. The 256K window is most valuable for large refactors where you need to hold an entire codebase in context simultaneously. Using large context windows increases memory usage and inference latency, so set <code>num_ctx</code> to the minimum needed for your actual task.</p>
<p><strong>Can I fine-tune Devstral Small 2 on my own codebase?</strong>
Yes, under Apache 2.0 you have full rights to fine-tune the model. The recommended path is QLoRA fine-tuning via Unsloth or Axolotl — both support the Devstral Small 2 architecture and can run fine-tuning on a single 24GB GPU. A small dataset of 500–2,000 examples from your codebase is typically enough to meaningfully improve performance on domain-specific patterns. Unsloth&rsquo;s GGUF export pipeline lets you convert the fine-tuned model back to llama.cpp format for local deployment.</p>
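<p>If you would rather use llama.cpp&rsquo;s own tooling than Unsloth&rsquo;s export pipeline for that last conversion step, it looks roughly like this. Treat it as a sketch: the script and binary names come from recent llama.cpp releases and have changed between versions, and the checkpoint path is a placeholder.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Convert a fine-tuned Hugging Face checkpoint to GGUF, then quantize it
python3 llama.cpp/convert_hf_to_gguf.py ./my-finetuned-devstral \
  --outfile devstral-ft-f16.gguf
./llama.cpp/build/bin/llama-quantize devstral-ft-f16.gguf \
  devstral-ft-q4_k_m.gguf Q4_K_M
</code></pre></div>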
]]></content:encoded></item></channel></rss>