<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Open-Weight on RockB</title><link>https://baeseokjae.github.io/tags/open-weight/</link><description>Recent content in Open-Weight on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 09 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/open-weight/index.xml" rel="self" type="application/rss+xml"/><item><title>Qwen 3.5 Coding Guide: Open-Weight Model That Rivals GPT-5</title><link>https://baeseokjae.github.io/posts/qwen-3-5-coding-guide-2026/</link><pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/qwen-3-5-coding-guide-2026/</guid><description>Complete guide to Qwen 3.5 Coder: benchmark performance, model sizes, self-hosting hardware requirements, VS Code integration, and cost-performance comparison with GPT-5.</description><content:encoded><![CDATA[<p>Qwen 3.5 Coder is Alibaba&rsquo;s latest open-weight code generation model family, spanning 0.5B to 72B parameters, and it is the first open-source coding model to come within 3-5% of GPT-5 on production benchmarks while carrying an Apache 2.0 license. For engineering teams burning $5–30 per million tokens on frontier API calls, that gap is closing fast enough to demand a hard look at the numbers.</p>
<h2 id="qwen-35-coder-2026-the-open-weight-model-closing-the-gap-on-gpt-5">Qwen 3.5 Coder 2026: The Open-Weight Model Closing the Gap on GPT-5</h2>
<p>Open-source AI coding model adoption grew 140% in 2025, reaching 2.3 million developers worldwide, and Qwen models alone accumulated 4.7 million downloads from Hugging Face in Q1 2026. That level of adoption is not driven by enthusiasm — it is driven by benchmark results that are forcing enterprises to reassess proprietary API spend. The Qwen 3.5 Coder 72B scores 61.8% on LiveCodeBench 2026, compared to GPT-5&rsquo;s 64.2%, a gap that narrows further on domain-specific tasks like web development and data science pipelines. Alibaba&rsquo;s release strategy is deliberate: the full model family ships under Apache 2.0 with no per-user fees, no usage caps, and no vendor lock-in. The architecture builds on Qwen2.5-Coder&rsquo;s proven transformer base, adding deeper code understanding through expanded training on GitHub repositories, competitive programming datasets, and documentation corpora across 90+ languages. For most engineering teams, the choice between Qwen 3.5 and GPT-5 is no longer a quality question — it is a cost and control question, and Qwen is winning on both dimensions for a growing share of workloads.</p>
<p>The competitive pressure Qwen 3.5 applies to the proprietary model market is structural, not incidental. When a 72B open-weight model scores within three percentage points of the world&rsquo;s best closed model on a live, contamination-resistant benchmark, the premium for closed access becomes very hard to justify at scale. Alibaba has committed to quarterly model updates through 2026, which means the gap will continue closing every three months while API pricing from OpenAI and Anthropic remains tied to per-token economics. For teams running millions of completions per day, this trajectory is decisive.</p>
<h2 id="model-sizes-and-variants-which-qwen-35-should-you-use">Model Sizes and Variants: Which Qwen 3.5 Should You Use?</h2>
<p>Qwen 3.5 Coder ships in seven size tiers — 0.5B, 1.5B, 4B, 7B, 14B, 32B, and 72B parameters — and selecting the right tier is the most consequential infrastructure decision you will make before deployment. The 70% of enterprises testing open-source coding models that cite cost reduction as their primary motivation should start at 7B: it delivers 59.2% on LiveCodeBench, runs on a single consumer GPU with 24GB VRAM, and handles the majority of autocomplete, docstring generation, and test scaffolding tasks without quality loss perceptible to developers. The 14B tier hits the sweet spot for teams that need stronger reasoning on multi-file refactors while staying within a single A100 80GB. The 32B is the right choice for agentic coding workflows where the model must plan, edit, and verify across a large codebase — it fits on two A100s and outperforms GPT-4o on most structured generation tasks. The 72B is reserved for teams that need near-frontier performance on complex algorithmic problems and can provision four A100 80GB GPUs or equivalent cloud hardware.</p>
<p>Edge deployments and developer laptops are served by the 0.5B, 1.5B, and 4B tiers, which are available as GGUF and GPTQ quantized checkpoints. These quantized variants run on Apple Silicon Macs and NVIDIA RTX 3090/4090 cards without meaningful accuracy degradation on standard completion tasks. If your primary use case is inline autocomplete in an editor, the 4B GGUF model delivers response latency under 150ms locally while consuming roughly 6GB of GPU memory. For CI/CD pipeline integration where latency is less critical than throughput, the 14B or 32B tier running on a dedicated inference server is the standard architecture.</p>
<table>
  <thead>
      <tr>
          <th>Model Size</th>
          <th>LiveCodeBench</th>
          <th>VRAM Required</th>
          <th>Primary Use Case</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.5B / 1.5B</td>
          <td>~35-40%</td>
          <td>2–4 GB</td>
          <td>Edge / embedded</td>
      </tr>
      <tr>
          <td>4B</td>
          <td>~45%</td>
          <td>6 GB</td>
          <td>Local autocomplete</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>59.2%</td>
          <td>14 GB</td>
          <td>Solo developer workflows</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>~60%</td>
          <td>28 GB</td>
          <td>Team code review + refactor</td>
      </tr>
      <tr>
          <td>32B</td>
          <td>~61%</td>
          <td>2x A100 80GB</td>
          <td>Agentic pipelines</td>
      </tr>
      <tr>
          <td>72B</td>
          <td>61.8%</td>
          <td>4x A100 80GB</td>
          <td>Near-frontier tasks</td>
      </tr>
  </tbody>
</table>
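<p>To turn the table into a concrete starting point, the free VRAM on the host is usually the deciding input. The following is a minimal sketch, assuming an NVIDIA GPU and the <code>nvidia-smi</code> CLI; the thresholds mirror the table above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Suggest a Qwen 3.5 Coder tier from free GPU memory (thresholds from the table above)
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | sort -nr | head -1)
free_mib=${free_mib:-0}   # fall back to 0 if no NVIDIA GPU is detected

if   [ "$free_mib" -ge 28000 ]; then echo "14B FP16 (or larger, multi-GPU)"
elif [ "$free_mib" -ge 14000 ]; then echo "7B FP16"
elif [ "$free_mib" -ge 6000  ]; then echo "4B GGUF (quantized)"
else                                 echo "0.5B/1.5B GGUF (quantized)"
fi</code></pre></div>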
<h2 id="benchmark-performance-humaneval-livecodebench-and-real-world-coding">Benchmark Performance: HumanEval, LiveCodeBench, and Real-World Coding</h2>
<p>Qwen2.5-Coder scored 83.5% on HumanEval, outperforming GPT-4&rsquo;s 80.2% on the same benchmark — a result that made the model the most downloaded open-source coding checkpoint on Hugging Face for three consecutive months. HumanEval, however, is a solved benchmark: the problems are well-known, and training set contamination is a persistent concern for any model released in 2025 or later. LiveCodeBench is the more credible signal because it uses competitive programming problems published after each model&rsquo;s training cutoff, eliminating the possibility of memorization. On LiveCodeBench 2026, Qwen 3.5 Coder 72B scores 61.8% against GPT-5&rsquo;s 64.2% — a 2.4 percentage point gap that represents an extraordinary result for a freely available model. The 7B variant&rsquo;s 59.2% is equally significant: a model you can run on a $500 GPU scores within five points of the most capable closed model available.</p>
<p>Real-world coding performance diverges from benchmark results in predictable ways. Qwen 3.5 excels at statically typed languages — TypeScript, Rust, Go, Java — where its training corpus is dense and the compiler provides unambiguous feedback. Performance on dynamic languages like Python and Ruby is strong but slightly less consistent on complex metaprogramming tasks. The 90+ language support is genuine: the model generates idiomatic COBOL, Fortran, and Haskell at a quality level that matches specialized fine-tuned models. For data science workflows involving Pandas, NumPy, and SQL generation, Qwen 3.5 matches or exceeds GPT-4o on accuracy and is significantly more consistent on schema-aware SQL generation against large database schemas.</p>
<h2 id="apache-20-license-what-free-commercial-use-actually-means">Apache 2.0 License: What Free Commercial Use Actually Means</h2>
<p>Apache 2.0 licensing grants unlimited commercial use, redistribution, and modification of the model weights with no per-user fees, no revenue sharing, and no requirement to open-source derivative products. For Qwen 3.5 Coder, this means a startup can fine-tune the 72B model on proprietary code, ship it as a commercial product, charge customers for access to the resulting system, and owe Alibaba nothing. That is a fundamentally different business model than the API-access paradigm where every token you generate contributes to OpenAI&rsquo;s or Anthropic&rsquo;s revenue. The 70% of enterprises testing open-source coding models that cite cost reduction as their primary motivation are responding rationally to this structure: a one-time hardware cost and ongoing electricity expense replaces an open-ended variable API bill that scales with usage.</p>
<p>The practical implications extend beyond cost. Apache 2.0 means your model weights are assets you own, not a service you subscribe to. You can pin to a specific checkpoint, audit every byte of the weights for security compliance, run the model in an air-gapped environment, and avoid the policy changes and deprecation schedules that closed API providers impose unilaterally. For enterprises in healthcare, finance, and government — where data residency and model auditability are regulatory requirements — Apache 2.0 licensing is not just a cost consideration, it is often the only legally compliant path. The license also allows you to build proprietary fine-tunes: teams running Qwen 3.5 on internal codebases consistently report 8-15% accuracy improvements on domain-specific tasks after fine-tuning on as few as 10,000 high-quality examples.</p>
<h2 id="self-hosting-qwen-35-hardware-requirements-and-setup-guide">Self-Hosting Qwen 3.5: Hardware Requirements and Setup Guide</h2>
<p>Running Qwen 3.5 Coder 72B in full precision requires four A100 80GB GPUs — a configuration available as a bare-metal server or as an on-demand cloud instance on AWS, GCP, and Lambda Labs. The 32B model runs on two A100 80GB GPUs and serves most agentic coding workflows without the four-GPU overhead. For teams that need the 72B capability without the four-GPU cost, GPTQ 4-bit quantization reduces memory requirements to approximately 40GB total VRAM, enabling deployment on two A100 40GB instances at a modest accuracy cost. The most cost-effective production deployment pattern for teams with variable load is spot/preemptible instances with a queue-based inference service: you pay on-demand rates only during peak hours and drain the queue on cheaper spot capacity during off-peak windows.</p>
<p>The recommended serving stack is vLLM with PagedAttention for production workloads. vLLM provides an OpenAI-compatible API endpoint out of the box, which means any tool or framework that supports OpenAI&rsquo;s API — including LangChain, LlamaIndex, and most IDE extensions — connects to a self-hosted Qwen 3.5 instance with a single endpoint URL change.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install vLLM</span>
</span></span><span style="display:flex;"><span>pip install vllm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Serve Qwen 3.5 Coder 32B with OpenAI-compatible API</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model Qwen/Qwen3.5-Coder-32B-Instruct <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --tensor-parallel-size <span style="color:#ae81ff">2</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">32768</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --host 0.0.0.0 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8000</span>
</span></span></code></pre></div><p>For the 72B model, set <code>--tensor-parallel-size 4</code>. For quantized GGUF models on consumer hardware, llama.cpp provides equivalent serving capability:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Serve 4-bit quantized 7B model with llama.cpp</span>
</span></span><span style="display:flex;"><span>./llama-server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model qwen3.5-coder-7b-q4_k_m.gguf <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --ctx-size <span style="color:#ae81ff">32768</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --n-gpu-layers <span style="color:#ae81ff">999</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8080</span>
</span></span></code></pre></div><p>Ollama is the lowest-friction path for local development. Running <code>ollama pull qwen3.5-coder:7b</code> downloads the model in under five minutes on any system with 16GB of RAM and a supported GPU; the Ollama background service then exposes it through an OpenAI-compatible endpoint on port 11434. For teams wanting a full inference management layer, Hugging Face&rsquo;s Text Generation Inference (TGI) and NVIDIA&rsquo;s Triton Inference Server are production-grade alternatives with metrics endpoints, batching controls, and Kubernetes operator support.</p>
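<p>Once the pull completes, a single request against Ollama&rsquo;s OpenAI-compatible endpoint confirms the server is live. A minimal sketch; the model tag matches the pull command above and the prompt is illustrative:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Query the local Ollama server through its OpenAI-compatible endpoint
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-coder:7b",
    "messages": [
      {"role": "user", "content": "Write a Python function that reverses a linked list."}
    ],
    "max_tokens": 256
  }'</code></pre></div>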
<h2 id="integrating-qwen-35-with-vs-code-cursor-and-windsurf">Integrating Qwen 3.5 with VS Code, Cursor, and Windsurf</h2>
<p>All three major AI-native editors — VS Code, Cursor, and Windsurf — support Qwen 3.5 Coder through OpenAI-compatible API configuration, which is the critical architectural advantage of the vLLM and llama.cpp serving approaches. Qwen 3.5 Coder integrates with VS Code via the Continue extension, which is the most popular open-source AI coding assistant for VS Code with over 2 million installs. Once your Qwen instance is running on vLLM at <code>http://localhost:8000</code>, configuring Continue takes under two minutes.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#75715e">// .continue/config.json
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;models&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;title&#34;</span>: <span style="color:#e6db74">&#34;Qwen 3.5 Coder 32B&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;provider&#34;</span>: <span style="color:#e6db74">&#34;openai&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;Qwen/Qwen3.5-Coder-32B-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;apiBase&#34;</span>: <span style="color:#e6db74">&#34;http://localhost:8000/v1&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;apiKey&#34;</span>: <span style="color:#e6db74">&#34;not-required&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ],
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;tabAutocompleteModel&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;title&#34;</span>: <span style="color:#e6db74">&#34;Qwen 3.5 Coder 7B (autocomplete)&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;provider&#34;</span>: <span style="color:#e6db74">&#34;openai&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;Qwen/Qwen3.5-Coder-7B-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;apiBase&#34;</span>: <span style="color:#e6db74">&#34;http://localhost:8080/v1&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;apiKey&#34;</span>: <span style="color:#e6db74">&#34;not-required&#34;</span>
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Cursor supports custom model endpoints under Settings &gt; Models &gt; Add Model. Point the base URL to your vLLM instance and set the model name to match the loaded checkpoint. Cursor&rsquo;s agent mode — where the model reads, edits, and runs code autonomously — works with Qwen 3.5 32B and 72B at quality levels close to GPT-4o for most refactoring and debugging tasks. Windsurf&rsquo;s Cascade feature similarly accepts custom OpenAI-compatible endpoints, making Qwen 3.5 a drop-in replacement for the default model in agentic flows.</p>
<p>The 128K context window in Qwen2.5-Coder (carried forward into the 3.5 series) is essential for these editor integrations: it allows the model to hold entire codebases in context during multi-file edits, which is the primary quality bottleneck for agentic coding workflows. Editor integrations that previously required GPT-4 Turbo&rsquo;s 128K window for large-file tasks now work equally well with a self-hosted Qwen 3.5 instance at zero marginal cost per completion.</p>
<h2 id="qwen-35-vs-gpt-5-vs-claude-opus-4-the-cost-performance-trade-off">Qwen 3.5 vs GPT-5 vs Claude Opus 4: The Cost-Performance Trade-off</h2>
<p>GPT-5 costs between $5 and $30 per million tokens depending on the tier — a range that translates to $500–$3,000 per day for a team processing 100 million tokens daily. Qwen 3.5 Coder 72B running on four A100 80GB GPUs on Lambda Labs costs approximately $12/hour for on-demand capacity, or roughly $290/day for continuous 24-hour operation. At 100 million tokens per day, that works out to roughly $3/million tokens all-in, a 2-10x cost reduction against GPT-5 API pricing with a 3-5% performance gap on benchmark tasks; push daily volume past roughly 300 million tokens with continuous batching and the all-in cost falls below $1/million, a 5-30x reduction. For the 7B model on a single A100, the economics are even more dramatic: roughly $2/hour yields token costs approaching $0.10/million at sustained high utilization.</p>
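<p>The arithmetic behind those figures is simple enough to check directly. A quick sketch using the on-demand rate and the two utilization levels discussed above:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Self-hosted cost per million tokens = daily GPU cost / daily volume in millions
gpu_cost_per_day=290        # 4x A100 80GB at ~$12/hour, 24h operation
awk -v c="$gpu_cost_per_day" 'BEGIN {
  printf "at 100M tokens/day: $%.2f/M\n", c / 100
  printf "at 300M tokens/day: $%.2f/M\n", c / 300
}'
# -&gt; $2.90/M at conservative utilization, $0.97/M with continuous batching</code></pre></div>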
<p>Claude Opus 4 occupies a different position: it offers superior reasoning on complex multi-step problems and excels at nuanced code review and architectural planning, but it carries API pricing comparable to GPT-5 and cannot be self-hosted. Claude Code — Anthropic&rsquo;s official CLI — is tightly integrated with Claude models and does not support external model endpoints, which means teams wanting the Claude Code workflow are locked into Claude API pricing. Qwen 3.5 can be integrated via any OpenAI-compatible client, giving teams full flexibility to build Claude Code-style agentic workflows without Anthropic&rsquo;s per-token pricing. The practical decision framework is straightforward: use GPT-5 or Claude Opus 4 for tasks where the 3-5% quality gap is decisive (complex algorithmic design, critical security code review), and use Qwen 3.5 for the 80-90% of coding tasks where the gap is imperceptible and cost is the dominant variable.</p>
<p>The total cost of ownership calculation must also include fine-tuning. Qwen 3.5&rsquo;s Apache 2.0 license enables fine-tuning on proprietary code at no additional licensing cost. A team that invests 40 GPU-hours in fine-tuning Qwen 3.5 14B on their internal codebase routinely closes the benchmark gap with frontier models on their specific domain — at which point the cost-performance calculus becomes unambiguous.</p>
<h2 id="who-should-use-qwen-35-coder">Who Should Use Qwen 3.5 Coder?</h2>
<p>The 2.3 million developers who adopted open-source coding models in 2025 are not a monolithic group, and Qwen 3.5 is not the right choice for every use case. It is, however, the right choice for a specific and large set of teams: those running high-volume coding automation where API costs are material, those with data residency or air-gap requirements that preclude cloud API usage, those who want to fine-tune on proprietary code without licensing complications, and those building products where the model is a core component rather than a third-party service. Startups building AI coding assistants, developer tools, or code search products should default to Qwen 3.5 as their foundation model — the Apache 2.0 license is the only commercially viable option when your product monetizes the model&rsquo;s output directly.</p>
<p>Enterprise engineering teams running internal developer productivity tools are the second major adopter group. If your team processes more than 50 million tokens per month — roughly 25 developers using an AI coding assistant heavily — the economics of self-hosting Qwen 3.5 on dedicated GPU infrastructure turn favorable, with the hardware cost paying back within 3-6 months. Teams under that threshold may find managed inference APIs for Qwen 3.5 (available through Together AI, Fireworks AI, and Replicate) offer the best balance of cost and operational simplicity. Individual developers who want a fully private, local coding assistant on a modern GPU laptop should use the Qwen 3.5 Coder 7B or 14B GGUF quantized models via Ollama — they are the most capable models available for purely local execution as of mid-2026.</p>
<hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: Can I use Qwen 3.5 Coder commercially without paying any licensing fees?</strong>
Yes. Qwen 3.5 Coder is released under Apache 2.0, which grants unlimited commercial use, redistribution, and modification of the model weights. You can build and sell products powered by Qwen 3.5 without paying any licensing fees to Alibaba. You are responsible only for your infrastructure costs.</p>
<p><strong>Q: What is the minimum hardware to run Qwen 3.5 Coder locally for personal development?</strong>
The 7B GGUF 4-bit quantized model fits in roughly 8GB of GPU memory, so an NVIDIA RTX 3090, 4090, or any recent Apple Silicon Mac with 16GB unified memory qualifies; budget 14GB of VRAM to run the 7B unquantized in FP16. The 4B model fits in roughly 6GB. Both are available via Ollama with a single pull command and deliver sub-200ms autocomplete latency on modern hardware.</p>
<p><strong>Q: How does Qwen 3.5 perform on languages other than Python and JavaScript?</strong>
Qwen 3.5 Coder supports 90+ programming languages with genuine training coverage. Performance on Rust, Go, TypeScript, Java, and C++ is comparable to its Python performance on structured tasks. For specialized languages like COBOL, Fortran, and Haskell, Qwen 3.5 is the strongest open-weight option available and generates idiomatic code in these languages at a level that matches most proprietary models.</p>
<p><strong>Q: Can Qwen 3.5 Coder be fine-tuned on private codebases, and does Apache 2.0 allow this?</strong>
Yes on both counts. Apache 2.0 explicitly permits fine-tuning, and derivative models — including fine-tuned checkpoints trained on proprietary code — do not need to be open-sourced or shared. Standard fine-tuning approaches (LoRA, QLoRA, full fine-tuning) all work with Qwen 3.5 checkpoints. Teams routinely report 8-15% accuracy gains on domain-specific tasks after fine-tuning on 10,000+ high-quality examples.</p>
<p><strong>Q: How does Qwen 3.5 integrate with existing CI/CD pipelines and code review systems?</strong>
Qwen 3.5 exposes an OpenAI-compatible API when served via vLLM or llama.cpp, which means any tooling that supports OpenAI&rsquo;s API — GitHub Actions, GitLab CI, Jenkins plugins, LangChain-based review bots — connects to it with a single endpoint configuration change. For automated code review, the 32B or 72B model running as a persistent inference service handles concurrent PR review requests at production throughput without the latency variability of cloud API calls during peak hours.</p>
]]></content:encoded></item><item><title>Mistral Small 4 Review 2026: EU-Compliant, Open-Weight, $0.40/M Input</title><link>https://baeseokjae.github.io/posts/mistral-small-4-review-2026/</link><pubDate>Fri, 08 May 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/mistral-small-4-review-2026/</guid><description>Mistral Small 4 is a 119B MoE model unifying reasoning, vision, and coding in a single open-weight release with full GDPR compliance and Apache 2.0 licensing.</description><content:encoded><![CDATA[<p>Mistral Small 4 ships as an Apache 2.0 open-weight model with 119B total parameters and only 6.5B active per token through a 128-expert Mixture-of-Experts architecture. It handles reasoning, vision, and coding through a single endpoint, replaces three separate Mistral models, and is priced at $0.40/M input tokens through the Mistral API.</p>
<h2 id="mistral-small-4-review-2026-the-eu-compliant-open-weight-model">Mistral Small 4 Review 2026: The EU-Compliant Open-Weight Model</h2>
<p>Mistral Small 4 scores 28 on the AA Intelligence Index and outperforms GPT-OSS 120B on LiveCodeBench while generating outputs that are 20% shorter — a combination that matters directly for production cost. Released by Mistral AI, a Paris-based company, the model inherits EU data residency by default: API traffic stays inside the European Union without any additional configuration, which makes it the first credible option for teams with GDPR-sensitive workloads that do not want to negotiate Standard Contractual Clauses with US cloud providers. Beyond compliance, the Apache 2.0 license removes all royalty and usage restrictions, meaning the same weights can be fine-tuned, redistributed, and embedded in commercial products without legal overhead. The model replaces Magistral for reasoning tasks, Pixtral for vision tasks, and Devstral for code tasks. It achieves 40% lower end-to-end latency and 3x higher throughput compared to Mistral Small 3, which makes it viable not just as a quality upgrade but as a direct cost reduction for teams already running Mistral in production. The model ID on the Mistral API is <code>mistral-small-2603</code> and weights are available on Hugging Face at 242 GB in BF16.</p>
<h2 id="architecture-how-119b-parameters-with-65b-active-works">Architecture: How 119B Parameters with 6.5B Active Works</h2>
<p>The 119B total parameter count is the ceiling, not the runtime cost — each forward pass through Mistral Small 4 activates only 6.5B parameters because the model uses a Mixture-of-Experts (MoE) design with 128 specialist sub-networks, four of which handle any given token. This architecture is the same family as Mixtral 8x7B and Mixtral 8x22B, but scaled and retrained to cover multimodal inputs alongside text generation. The practical consequence is that inference compute scales with active parameters (6.5B), not total parameters (119B), which is why the model can deliver throughput 3x higher than Mistral Small 3 without requiring proportionally larger GPU clusters. The router — the component that selects which four experts handle each token — is trained end-to-end alongside the expert weights, so routing decisions are task-aware rather than fixed. Grouped-Query Attention (GQA) is applied across the architecture to reduce KV-cache memory pressure during long-context generation, which is critical for the 256K-token context window. The 256K window exceeds Claude Haiku 4.5 at 200K and GPT-4o Mini at 128K. The BF16 weight file totals 242 GB, which sets the floor for self-hosting memory requirements regardless of which GPU configuration is chosen. The <code>reasoning_effort</code> parameter — set to <code>low</code>, <code>medium</code>, or <code>high</code> — controls how many reasoning steps the model expands before producing output, giving engineers a direct handle on the cost-quality tradeoff within a single endpoint.</p>
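<p>The 242 GB checkpoint size and the active-parameter ratio both fall out of simple arithmetic: BF16 stores two bytes per parameter, and four of 128 experts fire per token. A quick check:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># BF16 = 2 bytes per parameter; 119B parameters set the memory floor for self-hosting
awk 'BEGIN {
  printf "weights: ~%.0f GB (checkpoint ships at 242 GB with embeddings and router)\n", 119e9 * 2 / 1e9
  printf "active per token: %.1f%% of total parameters\n", 6.5 / 119 * 100
}'
# -&gt; ~238 GB of expert weights; 5.5% of parameters active per forward pass</code></pre></div>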
<h2 id="reasoning-vision-and-coding-three-jobs-in-one-model">Reasoning, Vision, and Coding: Three Jobs in One Model</h2>
<p>Before Mistral Small 4, Mistral&rsquo;s product line split reasoning-heavy tasks to Magistral, image understanding to Pixtral, and code generation to Devstral — three separate model endpoints with separate pricing, separate version management, and separate integration overhead. Mistral Small 4 collapses all three into a single set of weights and a single API call, which simplifies architecture significantly for teams running mixed workloads. On coding specifically, the model reaches 92% on HumanEval while producing outputs 20% shorter than GPT-OSS 120B on LiveCodeBench, which is relevant because output token cost compounds at scale. Native image input supports document analysis, chart reading, and visual QA without routing to a secondary endpoint. The vision capability is not a bolt-on adapter — it is trained into the base model, which means image and text reasoning can interleave within a single context window. Reasoning depth is controlled through the <code>reasoning_effort</code> parameter: <code>low</code> for fast classification and routing tasks, <code>medium</code> for general generation, and <code>high</code> for multi-step debugging or proof-level reasoning. This parameter is set per API call, not at deployment time, so a single production deployment can serve all three workload profiles without spawning separate model instances. For teams that previously maintained separate fine-tunes or endpoints for code, vision, and reasoning tasks, consolidation onto Mistral Small 4 also reduces fine-tuning and evaluation surface area.</p>
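<p>In practice the consolidation shows up as one request body per task, with <code>reasoning_effort</code> selected per call. A minimal sketch against the Mistral chat completions endpoint; treating <code>reasoning_effort</code> as a top-level request field is an assumption based on the per-call behavior described above, and the prompt is illustrative:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># One endpoint for reasoning, vision, and code; per-call reasoning depth
# (reasoning_effort placement in the request body is assumed)
curl -s https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-2603",
    "reasoning_effort": "high",
    "messages": [
      {"role": "user", "content": "Explain why this recursion fails on empty input and propose a fix."}
    ]
  }'</code></pre></div>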
<h2 id="benchmark-performance-livecodebench-intelligence-index-and-real-tasks">Benchmark Performance: LiveCodeBench, Intelligence Index, and Real Tasks</h2>
<p>Mistral Small 4 records an AA Intelligence Index composite score of 28, which positions it as the strongest open-weight small model across the tasks measured in that benchmark suite as of May 2026. On LiveCodeBench — a coding benchmark that evaluates models on problems posted after common training cutoffs, reducing data contamination risk — Mistral Small 4 outperforms GPT-OSS 120B despite having fewer active parameters, and does so with outputs that are 20% shorter on average. Shorter correct outputs matter because output tokens are priced higher than input tokens across all major providers; a model that answers concisely without sacrificing accuracy directly reduces billing. On HumanEval, the model scores 92%, matching Claude Haiku 3.5 and Qwen 2.5 72B on that benchmark. The AA Long-Context Reasoning (LCR) metric shows a more distinctive result: Mistral Small 4 achieves a score of 0.72 using approximately 1,600 output characters, while Qwen-series models reach comparable scores using 5,800 to 6,100 characters — a 3.5x to 4x verbosity gap that translates directly into output token cost differences. Throughput improvements over Mistral Small 3 are measured at 3x requests-per-second, and end-to-end latency is 40% lower, which matters for latency-sensitive applications like real-time code completion or interactive document analysis.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Mistral Small 4</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AA Intelligence Index</td>
          <td>28</td>
          <td>Composite score</td>
      </tr>
      <tr>
          <td>HumanEval</td>
          <td>92%</td>
          <td>Matches Haiku 3.5, Qwen 2.5 72B</td>
      </tr>
      <tr>
          <td>LiveCodeBench</td>
          <td>Beats GPT-OSS 120B</td>
          <td>20% shorter outputs</td>
      </tr>
      <tr>
          <td>AA LCR</td>
          <td>0.72 (1,600 chars)</td>
          <td>Qwen comparable score needs 5,800+ chars</td>
      </tr>
      <tr>
          <td>Latency vs Small 3</td>
          <td>-40%</td>
          <td>End-to-end completion time</td>
      </tr>
      <tr>
          <td>Throughput vs Small 3</td>
          <td>3x</td>
          <td>Requests per second</td>
      </tr>
  </tbody>
</table>
<h2 id="pricing-040m-input-vs-gpt-and-claude-alternatives">Pricing: $0.40/M Input vs GPT and Claude Alternatives</h2>
<p>At $0.40/M input tokens through the Mistral API — with $0.60/M output — Mistral Small 4 undercuts Claude Haiku 4.5 at $1.00/M input by a meaningful margin, and competes directly with GPT-4o Mini on cost while adding open-weight portability and EU data residency that GPT-4o Mini cannot match. Some API configurations and partner tiers list the input price as $0.15/M; the $0.40/M figure reflects standard public API pricing with reasoning capability enabled. Output is priced at $0.60/M across configurations. For a team processing 100M input tokens per month, the input-side difference between Mistral Small 4 and Claude Haiku 4.5 is $60 per month, and the gap widens sharply once output is counted, since Haiku&rsquo;s output is priced at $5.00/M against Mistral&rsquo;s $0.60/M — meaningful at startup scale and significant at enterprise scale where monthly token volumes run into the billions. Qwen 2.5 72B is available through several inference providers at similar or lower cost, but the LCR verbosity gap means effective output cost is higher per task completed. GPT-4o Mini matches Mistral Small 4 on input price at some tiers but carries a proprietary license, US data routing by default, and no self-hosting option. The Apache 2.0 license on Mistral Small 4 means the pricing comparison for self-hosted deployments reduces entirely to GPU infrastructure cost, with zero model licensing fee added.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Input $/M</th>
          <th>Output $/M</th>
          <th>License</th>
          <th>EU Residency</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mistral Small 4</td>
          <td>$0.40</td>
          <td>$0.60</td>
          <td>Apache 2.0</td>
          <td>Default</td>
      </tr>
      <tr>
          <td>Claude Haiku 4.5</td>
          <td>$1.00</td>
          <td>$5.00</td>
          <td>Proprietary</td>
          <td>Requires config</td>
      </tr>
      <tr>
          <td>GPT-4o Mini</td>
          <td>$0.15–0.40</td>
          <td>$0.60</td>
          <td>Proprietary</td>
          <td>Requires config</td>
      </tr>
      <tr>
          <td>Qwen 2.5 72B</td>
          <td>varies</td>
          <td>varies</td>
          <td>Tongyi (restricted)</td>
          <td>No</td>
      </tr>
  </tbody>
</table>
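<p>At a concrete monthly volume, the table translates to the following input-side figures, which is where the $60 gap against Haiku 4.5 cited above comes from:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Monthly input cost at 100M tokens/month, standard public rates from the table
awk 'BEGIN {
  printf "Mistral Small 4:  $%.0f\n", 100 * 0.40
  printf "Claude Haiku 4.5: $%.0f\n", 100 * 1.00
  printf "GPT-4o Mini:      $%.0f to $%.0f\n", 100 * 0.15, 100 * 0.40
}'
# -&gt; $40 vs $100 vs $15-40 on input; the output gap ($0.60 vs $5.00/M) widens it further</code></pre></div>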
<h2 id="eu-compliance-and-self-hosting-the-gdpr-advantage">EU Compliance and Self-Hosting: The GDPR Advantage</h2>
<p>Mistral AI is headquartered in Paris, and API data processed through the Mistral platform stays within the European Union by default — no transatlantic data transfer, no need to negotiate Standard Contractual Clauses with a US hyperscaler, and no reliance on adequacy decisions that can be challenged in court. For European enterprises in finance, healthcare, and legal services, this removes a structural compliance barrier that exists when using OpenAI or Anthropic APIs configured for US-region processing. The EU AI Act, phasing in through 2026, adds another layer of regulatory consideration: models processed through EU-based infrastructure and offered by an EU-based provider are easier to document for AI Act compliance purposes than models routed through third-country jurisdictions. The GDPR advantage is not theoretical — several German fintech and Dutch healthtech companies have adopted Mistral as their primary LLM provider specifically because EU data residency is the default, not an optional add-on requiring a separate enterprise agreement. For teams that need complete data isolation — including from the API provider itself — the Apache 2.0 license enables full on-premises deployment with no external data egress. The combination of EU-headquartered provider, EU-default API routing, and fully self-hostable open weights is unique among competitive-tier models as of May 2026.</p>
<h2 id="self-hosting-requirements-and-apache-20-licensing">Self-Hosting Requirements and Apache 2.0 Licensing</h2>
<p>Self-hosting Mistral Small 4 requires hardware that most teams do not have on-hand: at minimum, 4x NVIDIA HGX H100, or 2x HGX H200, or 1x DGX B200. The BF16 weights total 242 GB, setting the GPU VRAM floor before runtime overhead is added. Cloud rental of a 4x H100 SXM configuration runs approximately $25–32 per hour in 2026, or roughly $18,000–$23,000 per month for continuous operation. Purchasing equivalent hardware outright costs $200K–$300K for the GPU cluster alone, before networking, storage, and operational overhead. The Apache 2.0 license itself imposes no cost — commercial use, fine-tuning, redistribution, and embedding in proprietary products are all permitted without royalties or usage reporting. The license also carries an express patent grant, which removes the patent ambiguity that complicates some open-source AI deployments in corporate legal contexts. The economic case for self-hosting becomes viable when monthly token volume exceeds roughly 40 billion tokens — the point at which API spend at $0.40–0.60/M approaches the $18,000–$23,000 monthly rental cost — or when regulatory requirements mandate air-gapped infrastructure that cannot make outbound API calls at all. For most teams processing fewer than 40 billion tokens per month, the Mistral API at $0.40/M input is the lower-cost option after accounting for GPU rental, DevOps, and reliability engineering. EU-based cloud providers like OVHcloud, Deutsche Telekom Open Telekom Cloud, and Scaleway support the hardware configurations required for self-hosted deployment while maintaining EU data sovereignty end-to-end.</p>
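<p>The 40-billion-token break-even can be reproduced from the rental figures directly. A sketch assuming a 70/30 input/output token mix at the standard API rates:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Break-even monthly volume = GPU rental cost / blended API price per million tokens
awk 'BEGIN {
  blended = 0.7 * 0.40 + 0.3 * 0.60            # $/M at a 70% input / 30% output mix
  printf "blended API rate: $%.2f/M\n", blended
  printf "break-even: %.0f to %.0f billion tokens/month\n", 18000 / blended / 1000, 23000 / blended / 1000
}'
# -&gt; ~$0.46/M blended; break-even near 39-50B tokens/month against $18K-23K rental</code></pre></div>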
<h2 id="who-should-use-mistral-small-4">Who Should Use Mistral Small 4?</h2>
<p>Mistral Small 4 earns a rating of 8.4 out of 10 as the strongest open-weight small model for combined reasoning, vision, and coding tasks in 2026, but the right fit depends on specific requirements rather than benchmark scores alone. Teams that benefit most are those with GDPR or EU AI Act obligations that make US-routed API calls legally risky — for these teams, Mistral Small 4 removes compliance friction that no amount of benchmark performance from OpenAI or Anthropic can resolve. Open-source projects and startups that need to embed an LLM into a commercial product without licensing encumbrance get that from Apache 2.0 with no workarounds required. Teams running mixed workloads — code generation, document analysis, and image understanding in the same product — benefit from consolidating onto a single model endpoint rather than maintaining separate integrations for each capability. Cost-sensitive API consumers processing hundreds of millions of tokens per month will see meaningful savings compared to Claude Haiku 4.5, and comparable or lower cost versus GPT-4o Mini with the added benefit of self-hosting optionality. Teams that should look elsewhere: those needing maximum reasoning depth on complex multi-step mathematical or logical problems, where frontier closed models like Claude Opus 4 or GPT-5 series still hold a clear lead; teams already deeply integrated into the OpenAI or Anthropic ecosystem where switching costs outweigh the pricing and licensing advantages; and teams requiring the most mature third-party tooling ecosystem for monitoring, fine-tuning pipelines, and observability, where OpenAI&rsquo;s ecosystem remains more developed. For everyone else, Mistral Small 4 is the default recommendation for open-weight production deployments in 2026.</p>
<hr>
<h2 id="faq">FAQ</h2>
<p><strong>What are the hardware requirements to self-host Mistral Small 4?</strong></p>
<p>Self-hosting requires a minimum of 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200. The model weights in BF16 format total 242 GB, which sets the baseline GPU VRAM requirement before runtime overhead. Most teams without existing high-end GPU infrastructure will find the Mistral API more cost-effective unless they are processing tens of billions of tokens per month (roughly 40 billion is the break-even against cluster rental) or have regulatory requirements for fully air-gapped deployments.</p>
<p><strong>How does the reasoning_effort parameter affect billing?</strong></p>
<p>The <code>reasoning_effort</code> parameter (<code>low</code>, <code>medium</code>, <code>high</code>) controls how many internal reasoning steps the model expands before producing output. Higher effort levels generate more output tokens during the reasoning phase, which increases output token billing. For classification, routing, and simple generation tasks, <code>low</code> or <code>medium</code> keeps costs down. Reserve <code>high</code> for complex debugging, multi-step code generation, or tasks where answer quality justifies the additional output token cost.</p>
<p><strong>Does Mistral Small 4 handle images natively, or is it an add-on?</strong></p>
<p>Native image input is built into the base model — it is not a separate adapter or a secondary endpoint call. Document analysis, chart reading, and visual question answering are handled within the same context window as text. This replaces the previous Pixtral model for most image understanding tasks and eliminates the need to maintain a separate vision endpoint in production.</p>
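<p>A mixed request carries the image alongside the text in one message. A hedged sketch: the content-array format with an <code>image_url</code> part follows Mistral&rsquo;s existing vision API convention and is assumed to carry over, and the chart URL is illustrative:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"># Text + image in a single request (content-array format assumed from the prior vision API)
curl -s https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-small-2603",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize the trend shown in this chart."},
        {"type": "image_url", "image_url": "https://example.com/q1-revenue.png"}
      ]
    }]
  }'</code></pre></div>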
<p><strong>Is Mistral Small 4 genuinely GDPR-compliant for processing personal data?</strong></p>
<p>Mistral AI is a French company and API traffic is processed within EU infrastructure by default, satisfying the geographic data residency requirement for most GDPR use cases. However, GDPR compliance also requires a signed Data Processing Agreement (DPA) with Mistral AI as your data processor — using EU-located infrastructure alone is not sufficient without a DPA in place. For workloads requiring complete data isolation from any third party, self-hosting under the Apache 2.0 license on EU-region infrastructure is the appropriate path.</p>
<p><strong>How does Mistral Small 4 compare to Qwen 2.5 72B on real-world cost?</strong></p>
<p>Benchmark scores between the two models are comparable on HumanEval, but Mistral Small 4&rsquo;s output verbosity advantage on the AA LCR metric is significant: Mistral achieves similar reasoning scores using roughly 1,600 output characters where Qwen requires 5,800 to 6,100 characters for equivalent results. Since output tokens are billed per token, Qwen&rsquo;s verbosity translates to 3.5x to 4x higher effective output cost per task for long-context reasoning workloads. For short-context tasks where both models are equally concise, the cost difference narrows, and Qwen may have pricing advantages through specific inference providers.</p>
]]></content:encoded></item></channel></rss>