<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Coding-Model on RockB</title><link>https://baeseokjae.github.io/tags/coding-model/</link><description>Recent content in Coding-Model on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 09 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/coding-model/index.xml" rel="self" type="application/rss+xml"/><item><title>Qwen 3.5 Coding Guide: Open-Weight Model That Rivals GPT-5</title><link>https://baeseokjae.github.io/posts/qwen-3-5-coding-guide-2026/</link><pubDate>Sat, 09 May 2026 00:00:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/qwen-3-5-coding-guide-2026/</guid><description>Complete guide to Qwen 3.5 Coder: benchmark performance, model sizes, self-hosting hardware requirements, VS Code integration, and cost-performance comparison with GPT-5.</description><content:encoded><![CDATA[<p>Qwen 3.5 Coder is Alibaba&rsquo;s latest open-weight code generation model family, spanning 0.5B to 72B parameters, and it is the first open-source coding model to come within 3-5% of GPT-5 on production benchmarks while carrying an Apache 2.0 license. For engineering teams burning $5–30 per million tokens on frontier API calls, that gap is closing fast enough to demand a hard look at the numbers.</p>
<h2 id="qwen-35-coder-2026-the-open-weight-model-closing-the-gap-on-gpt-5">Qwen 3.5 Coder 2026: The Open-Weight Model Closing the Gap on GPT-5</h2>
<p>Open-source AI coding model adoption grew 140% in 2025, reaching 2.3 million developers worldwide, and Qwen models alone accumulated 4.7 million downloads from Hugging Face in Q1 2026. That level of adoption is not driven by enthusiasm — it is driven by benchmark results that are forcing enterprises to reassess proprietary API spend. The Qwen 3.5 Coder 72B scores 61.8% on LiveCodeBench 2026, compared to GPT-5&rsquo;s 64.2%, a gap that narrows further on domain-specific tasks like web development and data science pipelines. Alibaba&rsquo;s release strategy is deliberate: the full model family ships under Apache 2.0 with no per-user fees, no usage caps, and no vendor lock-in. The architecture builds on Qwen2.5-Coder&rsquo;s proven transformer base, adding deeper code understanding through expanded training on GitHub repositories, competitive programming datasets, and documentation corpora across 90+ languages. For most engineering teams, the choice between Qwen 3.5 and GPT-5 is no longer a quality question — it is a cost and control question, and Qwen is winning on both dimensions for a growing share of workloads.</p>
<p>The competitive pressure Qwen 3.5 applies to the proprietary model market is structural, not incidental. When a 72B open-weight model scores within three percentage points of the world&rsquo;s best closed model on a live, contamination-resistant benchmark, the premium for closed access becomes very hard to justify at scale. Alibaba has committed to quarterly model updates through 2026, which means the gap will continue closing every three months while API pricing from OpenAI and Anthropic remains tied to per-token economics. For teams running millions of completions per day, this trajectory is decisive.</p>
<h2 id="model-sizes-and-variants-which-qwen-35-should-you-use">Model Sizes and Variants: Which Qwen 3.5 Should You Use?</h2>
<p>Qwen 3.5 Coder ships in seven size tiers — 0.5B, 1.5B, 4B, 7B, 14B, 32B, and 72B parameters — and selecting the right tier is the most consequential infrastructure decision you will make before deployment. The 70% of enterprises testing open-source coding models who cite cost reduction as their primary motivation should start at 7B: it delivers 59.2% on LiveCodeBench, runs on a single consumer GPU with 24GB VRAM, and handles the majority of autocomplete, docstring generation, and test scaffolding tasks without quality loss perceptible to developers. The 14B tier hits the sweet spot for teams that need stronger reasoning on multi-file refactors while staying within a single A100 80GB. The 32B is the right choice for agentic coding workflows where the model must plan, edit, and verify across a large codebase — it fits on two A100s and outperforms GPT-4o on most structured generation tasks. The 72B is reserved for teams that need near-frontier performance on complex algorithmic problems and can provision four A100 80GB GPUs or equivalent cloud hardware.</p>
<p>Edge deployments and developer laptops are served by the 0.5B, 1.5B, and 4B tiers, which are available as GGUF and GPTQ quantized checkpoints. These quantized variants run on Apple Silicon Macs and NVIDIA RTX 3090/4090 cards without meaningful accuracy degradation on standard completion tasks. If your primary use case is inline autocomplete in an editor, the 4B GGUF model delivers response latency under 150ms locally while consuming roughly 6GB of GPU memory. For CI/CD pipeline integration where latency is less critical than throughput, the 14B or 32B tier running on a dedicated inference server is the standard architecture.</p>
<table>
  <thead>
      <tr>
          <th>Model Size</th>
          <th>LiveCodeBench</th>
          <th>VRAM Required</th>
          <th>Primary Use Case</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.5B / 1.5B</td>
          <td>~35-40%</td>
          <td>2–4 GB</td>
          <td>Edge / embedded</td>
      </tr>
      <tr>
          <td>4B</td>
          <td>~45%</td>
          <td>6 GB</td>
          <td>Local autocomplete</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>59.2%</td>
          <td>14 GB</td>
          <td>Solo developer workflows</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>~63%</td>
          <td>28 GB</td>
          <td>Team code review + refactor</td>
      </tr>
      <tr>
          <td>32B</td>
          <td>~67%</td>
          <td>2x A100 80GB</td>
          <td>Agentic pipelines</td>
      </tr>
      <tr>
          <td>72B</td>
          <td>61.8%</td>
          <td>4x A100 80GB</td>
          <td>Near-frontier tasks</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmark-performance-humaneval-livecodebench-and-real-world-coding">Benchmark Performance: HumanEval, LiveCodeBench, and Real-World Coding</h2>
<p>Qwen2.5-Coder scored 83.5% on HumanEval, outperforming GPT-4&rsquo;s 80.2% on the same benchmark — a result that made the model the most downloaded open-source coding checkpoint on Hugging Face for three consecutive months. HumanEval, however, is a solved benchmark: the problems are well-known, and training set contamination is a persistent concern for any model released in 2025 or later. LiveCodeBench is the more credible signal because it uses competitive programming problems published after each model&rsquo;s training cutoff, eliminating the possibility of memorization. On LiveCodeBench 2026, Qwen 3.5 Coder 72B scores 61.8% against GPT-5&rsquo;s 64.2% — a 2.4 percentage point gap that represents an extraordinary result for a freely available model. The 7B variant at 59.2% versus GPT-5&rsquo;s 62.1% is equally significant: a model you can run on a $500 GPU scores within three points of the most capable closed model available.</p>
<p>Real-world coding performance diverges from benchmark results in predictable ways. Qwen 3.5 excels at statically typed languages — TypeScript, Rust, Go, Java — where its training corpus is dense and the compiler provides unambiguous feedback. Performance on dynamic languages like Python and Ruby is strong but slightly less consistent on complex metaprogramming tasks. The 90+ language support is genuine: the model generates idiomatic COBOL, Fortran, and Haskell at a quality level that matches specialized fine-tuned models. For data science workflows involving Pandas, NumPy, and SQL generation, Qwen 3.5 matches or exceeds GPT-4o on accuracy and is significantly more consistent on schema-aware SQL generation against large database schemas.</p>
<h2 id="apache-20-license-what-free-commercial-use-actually-means">Apache 2.0 License: What Free Commercial Use Actually Means</h2>
<p>Apache 2.0 licensing grants unlimited commercial use, redistribution, and modification of the model weights with no per-user fees, no revenue sharing, and no requirement to open-source derivative products. For Qwen 3.5 Coder, this means a startup can fine-tune the 72B model on proprietary code, ship it as a commercial product, charge customers for access to the resulting system, and owe Alibaba nothing. That is a fundamentally different business model than the API-access paradigm where every token you generate contributes to OpenAI&rsquo;s or Anthropic&rsquo;s revenue. The 70% of enterprises testing open-source coding models who cite cost reduction as their primary motivation are responding rationally to this structure: a one-time hardware cost and ongoing electricity expense replaces an open-ended variable API bill that scales with usage.</p>
<p>The practical implications extend beyond cost. Apache 2.0 means your model weights are assets you own, not a service you subscribe to. You can pin to a specific checkpoint, audit every byte of the weights for security compliance, run the model in an air-gapped environment, and avoid the policy changes and deprecation schedules that closed API providers impose unilaterally. For enterprises in healthcare, finance, and government — where data residency and model auditability are regulatory requirements — Apache 2.0 licensing is not just a cost consideration, it is often the only legally compliant path. The license also allows you to build proprietary fine-tunes: teams running Qwen 3.5 on internal codebases consistently report 8-15% accuracy improvements on domain-specific tasks after fine-tuning on as few as 10,000 high-quality examples.</p>
<h2 id="self-hosting-qwen-35-hardware-requirements-and-setup-guide">Self-Hosting Qwen 3.5: Hardware Requirements and Setup Guide</h2>
<p>Running Qwen 3.5 Coder 72B in full precision requires four A100 80GB GPUs — a configuration available as a bare-metal server or as an on-demand cloud instance on AWS, GCP, and Lambda Labs. The 32B model runs on two A100 80GB GPUs and serves most agentic coding workflows without the four-GPU overhead. For teams that need the 72B capability without the four-GPU cost, GPTQ 4-bit quantization reduces memory requirements to approximately 40GB total VRAM, enabling deployment on two A100 40GB instances at a modest accuracy cost. The most cost-effective production deployment pattern for teams with variable load is spot/preemptible instances with a queue-based inference service: you pay on-demand rates only during peak hours and drain the queue on cheaper spot capacity during off-peak windows.</p>
<p>The recommended serving stack is vLLM with PagedAttention for production workloads. vLLM provides an OpenAI-compatible API endpoint out of the box, which means any tool or framework that supports OpenAI&rsquo;s API — including LangChain, LlamaIndex, and most IDE extensions — connects to a self-hosted Qwen 3.5 instance with a single endpoint URL change.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Install vLLM</span>
</span></span><span style="display:flex;"><span>pip install vllm
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Serve Qwen 3.5 Coder 32B with OpenAI-compatible API</span>
</span></span><span style="display:flex;"><span>python -m vllm.entrypoints.openai.api_server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model Qwen/Qwen3.5-Coder-32B-Instruct <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --tensor-parallel-size <span style="color:#ae81ff">2</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --max-model-len <span style="color:#ae81ff">32768</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --host 0.0.0.0 <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8000</span>
</span></span></code></pre></div><p>For the 72B model, set <code>--tensor-parallel-size 4</code>. For quantized GGUF models on consumer hardware, llama.cpp provides equivalent serving capability:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Serve 4-bit quantized 7B model with llama.cpp</span>
</span></span><span style="display:flex;"><span>./llama-server <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --model qwen3.5-coder-7b-q4_k_m.gguf <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --ctx-size <span style="color:#ae81ff">32768</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --n-gpu-layers <span style="color:#ae81ff">999</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  --port <span style="color:#ae81ff">8080</span>
</span></span></code></pre></div><p>Ollama is the lowest-friction path for local development. Running <code>ollama pull qwen3.5-coder:7b</code> downloads the model and starts an OpenAI-compatible server in under five minutes on any system with 16GB of RAM and a supported GPU. For teams wanting a full inference management layer, Hugging Face&rsquo;s Text Generation Inference (TGI) and NVIDIA&rsquo;s Triton Inference Server are production-grade alternatives with metrics endpoints, batching controls, and Kubernetes operator support.</p>
<h2 id="integrating-qwen-35-with-vs-code-cursor-and-windsurf">Integrating Qwen 3.5 with VS Code, Cursor, and Windsurf</h2>
<p>All three major AI-native editors — VS Code, Cursor, and Windsurf — support Qwen 3.5 Coder through OpenAI-compatible API configuration, which is the critical architectural advantage of the vLLM and llama.cpp serving approaches. Qwen 3.5 Coder integrates with VS Code via the Continue extension, which is the most popular open-source AI coding assistant for VS Code with over 2 million installs. Once your Qwen instance is running on vLLM at <code>http://localhost:8000</code>, configuring Continue takes under two minutes.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span><span style="color:#75715e">// .continue/config.json
</span></span></span><span style="display:flex;"><span><span style="color:#75715e"></span>{
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;models&#34;</span>: [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;title&#34;</span>: <span style="color:#e6db74">&#34;Qwen 3.5 Coder 32B&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;provider&#34;</span>: <span style="color:#e6db74">&#34;openai&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;Qwen/Qwen3.5-Coder-32B-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;apiBase&#34;</span>: <span style="color:#e6db74">&#34;http://localhost:8000/v1&#34;</span>,
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">&#34;apiKey&#34;</span>: <span style="color:#e6db74">&#34;not-required&#34;</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>  ],
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">&#34;tabAutocompleteModel&#34;</span>: {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;title&#34;</span>: <span style="color:#e6db74">&#34;Qwen 3.5 Coder 7B (autocomplete)&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;provider&#34;</span>: <span style="color:#e6db74">&#34;openai&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;Qwen/Qwen3.5-Coder-7B-Instruct&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;apiBase&#34;</span>: <span style="color:#e6db74">&#34;http://localhost:8080/v1&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">&#34;apiKey&#34;</span>: <span style="color:#e6db74">&#34;not-required&#34;</span>
</span></span><span style="display:flex;"><span>  }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>Cursor supports custom model endpoints under Settings &gt; Models &gt; Add Model. Point the base URL to your vLLM instance and set the model name to match the loaded checkpoint. Cursor&rsquo;s agent mode — where the model reads, edits, and runs code autonomously — works with Qwen 3.5 32B and 72B at quality levels close to GPT-4o for most refactoring and debugging tasks. Windsurf&rsquo;s Cascade feature similarly accepts custom OpenAI-compatible endpoints, making Qwen 3.5 a drop-in replacement for the default model in agentic flows.</p>
<p>The 128K context window in Qwen2.5-Coder (carried forward into the 3.5 series) is essential for these editor integrations: it allows the model to hold entire codebases in context during multi-file edits, which is the primary quality bottleneck for agentic coding workflows. Editor integrations that previously required GPT-4 Turbo&rsquo;s 128K window for large-file tasks now work equally well with a self-hosted Qwen 3.5 instance at zero marginal cost per completion.</p>
<h2 id="qwen-35-vs-gpt-5-vs-claude-opus-4-the-cost-performance-trade-off">Qwen 3.5 vs GPT-5 vs Claude Opus 4: The Cost-Performance Trade-off</h2>
<p>GPT-5 costs between $5 and $30 per million tokens depending on the tier — a range that translates to $500–$3,000 per day for a team making 100 million token requests daily. Qwen 3.5 Coder 72B running on four A100 80GB GPUs on Lambda Labs costs approximately $12/hour for on-demand capacity, or roughly $290/day for continuous 24-hour operation at full throughput. At 100 million tokens per day throughput, self-hosting Qwen 3.5 72B costs under $1/million tokens all-in — a 5-30x cost reduction against GPT-5 API pricing, with a 3-5% performance gap on benchmark tasks. For the 7B model on a single A100, the economics are even more dramatic: roughly $2/hour yields token costs well under $0.10/million for modest throughput workloads.</p>
<p>Claude Opus 4 occupies a different position: it offers superior reasoning on complex multi-step problems and excels at nuanced code review and architectural planning, but it carries API pricing comparable to GPT-5 and cannot be self-hosted. Claude Code — Anthropic&rsquo;s official CLI — is tightly integrated with Claude models and does not support external model endpoints, which means teams wanting the Claude Code workflow are locked into Claude API pricing. Qwen 3.5 can be integrated via any OpenAI-compatible client, giving teams full flexibility to build Claude Code-style agentic workflows without Anthropic&rsquo;s per-token pricing. The practical decision framework is straightforward: use GPT-5 or Claude Opus 4 for tasks where the 3-5% quality gap is decisive (complex algorithmic design, critical security code review), and use Qwen 3.5 for the 80-90% of coding tasks where the gap is imperceptible and cost is the dominant variable.</p>
<p>The total cost of ownership calculation must also include fine-tuning. Qwen 3.5&rsquo;s Apache 2.0 license enables fine-tuning on proprietary code at no additional licensing cost. A team that invests 40 GPU-hours in fine-tuning Qwen 3.5 14B on their internal codebase routinely closes the benchmark gap with frontier models on their specific domain — at which point the cost-performance calculus becomes unambiguous.</p>
<h2 id="who-should-use-qwen-35-coder">Who Should Use Qwen 3.5 Coder?</h2>
<p>The 2.3 million developers who adopted open-source coding models in 2025 are not a monolithic group, and Qwen 3.5 is not the right choice for every use case. It is, however, the right choice for a specific and large set of teams: those running high-volume coding automation where API costs are material, those with data residency or air-gap requirements that preclude cloud API usage, those who want to fine-tune on proprietary code without licensing complications, and those building products where the model is a core component rather than a third-party service. Startups building AI coding assistants, developer tools, or code search products should default to Qwen 3.5 as their foundation model — the Apache 2.0 license is the only commercially viable option when your product monetizes the model&rsquo;s output directly.</p>
<p>Enterprise engineering teams running internal developer productivity tools are the second major adopter group. If your team makes more than 50 million token requests per month — roughly 25 developers using an AI coding assistant heavily — the economics of self-hosting Qwen 3.5 on dedicated GPU infrastructure are favorable within 3-6 months of hardware cost payback. Teams under that threshold may find managed inference APIs for Qwen 3.5 (available through Together AI, Fireworks AI, and Replicate) offer the best balance of cost and operational simplicity. Individual developers who want a fully private, local coding assistant on a modern GPU laptop should use the Qwen 3.5 Coder 7B or 14B GGUF quantized models via Ollama — they are the most capable models available for purely local execution as of mid-2026.</p>
<hr>
<h2 id="frequently-asked-questions">Frequently Asked Questions</h2>
<p><strong>Q: Can I use Qwen 3.5 Coder commercially without paying any licensing fees?</strong>
Yes. Qwen 3.5 Coder is released under Apache 2.0, which grants unlimited commercial use, redistribution, and modification of the model weights. You can build and sell products powered by Qwen 3.5 without paying any licensing fees to Alibaba. You are responsible only for your infrastructure costs.</p>
<p><strong>Q: What is the minimum hardware to run Qwen 3.5 Coder locally for personal development?</strong>
The 7B GGUF quantized model runs on any system with a GPU carrying 14GB VRAM — an NVIDIA RTX 3090, 4090, or any recent Apple Silicon Mac with 16GB unified memory qualifies. The 4B model runs on 8GB VRAM. Both are available via Ollama with a single pull command and deliver sub-200ms autocomplete latency on modern hardware.</p>
<p><strong>Q: How does Qwen 3.5 perform on languages other than Python and JavaScript?</strong>
Qwen 3.5 Coder supports 90+ programming languages with genuine training coverage. Performance on Rust, Go, TypeScript, Java, and C++ is comparable to its Python performance on structured tasks. For specialized languages like COBOL, Fortran, and Haskell, Qwen 3.5 is the strongest open-weight option available and generates idiomatic code in these languages at a level that matches most proprietary models.</p>
<p><strong>Q: Can Qwen 3.5 Coder be fine-tuned on private codebases, and does Apache 2.0 allow this?</strong>
Yes on both counts. Apache 2.0 explicitly permits fine-tuning, and derivative models — including fine-tuned checkpoints trained on proprietary code — do not need to be open-sourced or shared. Standard fine-tuning approaches (LoRA, QLoRA, full fine-tuning) all work with Qwen 3.5 checkpoints. Teams routinely report 8-15% accuracy gains on domain-specific tasks after fine-tuning on 10,000+ high-quality examples.</p>
<p><strong>Q: How does Qwen 3.5 integrate with existing CI/CD pipelines and code review systems?</strong>
Qwen 3.5 exposes an OpenAI-compatible API when served via vLLM or llama.cpp, which means any tooling that supports OpenAI&rsquo;s API — GitHub Actions, GitLab CI, Jenkins plugins, LangChain-based review bots — connects to it with a single endpoint configuration change. For automated code review, the 32B or 72B model running as a persistent inference service handles concurrent PR review requests at production throughput without the latency variability of cloud API calls during peak hours.</p>
]]></content:encoded></item></channel></rss>