Qwen 3 32B Local Setup Guide 2026: Run on a 24GB GPU

Qwen3 32B fits on a single 24GB GPU using Q4_K_M quantization: it takes roughly 19.8GB of VRAM, leaving ~4GB free for the KV cache. Install Ollama, run ollama pull qwen3:32b, and you have a frontier-class model running entirely on your hardware in under 10 minutes.

What Is Qwen3 32B and Why Run It Locally?

Qwen3 32B is the largest dense (non-MoE) model in Alibaba’s Qwen3 family, released in April 2025. Unlike the 235B MoE variant, which demands multiple high-end GPUs, the 32B fits comfortably on consumer hardware at the right quantization level. The model scores competitively with Claude Sonnet 4.5 on coding benchmarks when run locally on an RTX 5070 at Q4 quantization (~40 tokens/sec), making it the most capable model that a single 24GB GPU can fully accelerate.

At FP16 precision the weights take ~64GB of VRAM, far beyond a single consumer card. At Q4_K_M quantization that drops to ~19.8GB, which fits a 24GB card with headroom to spare. Running the model locally eliminates per-token API costs, keeps sensitive data on your machine, and removes rate-limit friction from high-throughput workloads. For developers who send thousands of requests per day, the break-even against cloud API pricing is typically under two months of GPU electricity costs. The 131K-token context window is fully supported locally, though longer contexts reduce throughput by 10–20% per doubling. ...
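The memory figures quoted above can be checked with some quick arithmetic. The sketch below is only a rough estimate: it assumes a nominal 32 billion parameters and 1 GB = 1e9 bytes, and it ignores the KV cache and runtime overhead. The helper names are hypothetical, and the effective bits per weight for Q4_K_M is derived from the 19.8GB figure in the excerpt rather than taken from any specification.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Effective bits per weight implied by a quoted model size."""
    return size_gb * 8 / params_billion


PARAMS_B = 32.0      # assumption: nominal parameter count, in billions
GPU_VRAM_GB = 24.0   # card size discussed in the excerpt
Q4_SIZE_GB = 19.8    # Q4_K_M footprint quoted in the excerpt

fp16_gb = weights_gb(PARAMS_B, 16)                        # 16 bits per weight
q4_bits = implied_bits_per_weight(Q4_SIZE_GB, PARAMS_B)   # derived, not official
headroom_gb = GPU_VRAM_GB - Q4_SIZE_GB                    # left for the KV cache

print(f"FP16 weights: ~{fp16_gb:.0f} GB")
print(f"Q4_K_M effective bits per weight: ~{q4_bits:.2f}")
print(f"Headroom on a 24GB card: ~{headroom_gb:.1f} GB")
```

Under these assumptions the FP16 estimate matches the ~64GB figure, and the quoted Q4_K_M size works out to a little under 5 bits per weight, which is why roughly 4GB remains for the KV cache on a 24GB card.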

May 8, 2026 · 14 min · baeseokjae
Modal vs Replicate 2026: Best Serverless ML Deployment for Developers

Modal and Replicate are the two most-cited serverless ML deployment platforms in 2026, but they solve completely different problems. If you are an ML engineer building custom pipelines, Modal is the answer. If you are a full-stack developer who wants to call open-source models via a REST API in under an hour, Replicate is the answer. This guide cuts through the marketing to give you the data you need: cold start benchmarks, GPU throughput numbers, per-second pricing breakdowns, and a clear decision framework for which platform belongs in your stack. ...

May 8, 2026 · 13 min · baeseokjae
vLLM vs Ollama for Production LLM Serving in 2026

vLLM vs Ollama for Production LLM Serving in 2026: The Honest Comparison

Choosing between vLLM and Ollama for serving LLMs in production is not a matter of which tool is “better”; it is a matter of which tool solves the problem you actually have. vLLM has 18.4 million Docker pulls and 2.79 million weekly PyPI downloads, driven by teams running high-throughput inference APIs on GPU clusters. Ollama has 126 million Docker pulls and 169,569 GitHub stars, driven by developers running models locally on laptops and workstations. They overlap in capability but diverge sharply in architecture, performance characteristics, and production fitness. This guide compares them directly, with benchmarks, cost data, and a decision framework, so you can pick the right tool for your actual workload, not the one with more GitHub stars. ...

April 21, 2026 · 18 min · baeseokjae