<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Ray Serve Autoscaling on RockB</title><link>https://baeseokjae.github.io/tags/ray-serve-autoscaling/</link><description>Recent content in Ray Serve Autoscaling on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 10 Apr 2026 14:13:00 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/ray-serve-autoscaling/index.xml" rel="self" type="application/rss+xml"/><item><title>Local AI Model Serving Frameworks 2026: vLLM vs TGI vs Ray Serve Compared</title><link>https://baeseokjae.github.io/posts/local-ai-model-serving-frameworks-2026/</link><pubDate>Fri, 10 Apr 2026 14:13:00 +0000</pubDate><guid>https://baeseokjae.github.io/posts/local-ai-model-serving-frameworks-2026/</guid><description>vLLM leads high-concurrency APIs, SGLang excels in multi-turn chat, Ray Serve adds enterprise orchestration, and TGI is in maintenance mode as of 2026.</description><content:encoded><![CDATA[<p>In 2026, <strong>vLLM is the production standard</strong> for local AI model serving, delivering 14–24× higher throughput than naive HuggingFace Transformers serving. SGLang edges ahead on pure batch inference benchmarks, Ray Serve adds enterprise-grade orchestration on top of vLLM, and TGI entered maintenance mode in December 2025—making the framework landscape clearer than ever for developers choosing where to invest.</p>
<hr>
<h2 id="why-does-local-ai-model-serving-matter-more-than-ever-in-2026">Why Does Local AI Model Serving Matter More Than Ever in 2026?</h2>
<p>The on-premise LLM serving platforms market reached <strong>$3.81 billion in 2026</strong>, up from $3.08 billion in 2025, and is projected to hit <strong>$9.03 billion by 2030</strong> at a CAGR of 24.1% (The Business Research Company, 2026). Two forces are driving this growth:</p>
<ol>
<li><strong>Data-privacy regulations</strong> — GDPR, the EU AI Act, and emerging US state-level laws are pushing enterprises to keep inference workloads on-premise rather than sending sensitive data to cloud providers.</li>
<li><strong>Cost optimization</strong> — GPU spot instances on major clouds have become volatile; organizations with on-premise A100/H100 clusters find fully amortized inference far cheaper at scale.</li>
</ol>
<p>The result: teams that previously outsourced inference to OpenAI or Anthropic are standing up internal serving infrastructure, and choosing the right framework has become a strategic engineering decision.</p>
<hr>
<h2 id="what-are-the-main-local-ai-model-serving-frameworks-in-2026">What Are the Main Local AI Model Serving Frameworks in 2026?</h2>
<p>The landscape has consolidated around four frameworks, each with a distinct strength:</p>
<table>
  <thead>
      <tr>
          <th>Framework</th>
          <th>Primary Strength</th>
          <th>Status in 2026</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>vLLM</strong></td>
          <td>High-concurrency API serving</td>
          <td>Production standard</td>
      </tr>
      <tr>
          <td><strong>SGLang</strong></td>
          <td>Multi-turn chat / agentic workloads</td>
          <td>Fastest growing</td>
      </tr>
      <tr>
          <td><strong>Ray Serve</strong></td>
          <td>Enterprise orchestration, multi-model</td>
          <td>Mature, complementary to vLLM</td>
      </tr>
      <tr>
          <td><strong>TGI (Text Generation Inference)</strong></td>
          <td>Hugging Face ecosystem integration</td>
          <td>Maintenance mode</td>
      </tr>
      <tr>
          <td><strong>Triton + TensorRT-LLM</strong></td>
          <td>Maximum NVIDIA-optimized throughput</td>
          <td>Enterprise / complex setup</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="how-does-vllm-achieve-its-industry-leading-throughput">How Does vLLM Achieve Its Industry-Leading Throughput?</h2>
<h3 id="pagedattention-the-core-innovation">PagedAttention: The Core Innovation</h3>
<p>vLLM&rsquo;s <strong>PagedAttention</strong> mechanism manages the KV (key-value) cache similarly to how operating system virtual memory manages RAM pages. Rather than pre-allocating a contiguous block of GPU memory per request—which wastes 60–80% of reserved VRAM through internal fragmentation—PagedAttention stores KV cache in non-contiguous physical blocks and maps them through a virtual page table.</p>
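<p>The bookkeeping can be sketched in a few lines of Python. This is a toy illustration of the block-table idea, not vLLM&rsquo;s actual implementation; the 16-token block size matches vLLM&rsquo;s default, but everything else is simplified:</p>
<pre><code class="language-python">BLOCK_SIZE = 16  # tokens per KV block (vLLM's default block size is also 16)

class PagedKVCache:
    """Toy paged KV cache: logical blocks per sequence map to arbitrary
    physical blocks, so no contiguous GPU region is ever reserved."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block ids
        self.block_table = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        # Allocate a new physical block only when the sequence crosses a
        # block boundary; memory is claimed on demand, not up front.
        table = self.block_table.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())  # any free block, non-contiguous
        return table[-1]

    def release(self, seq_id):
        self.free.extend(self.block_table.pop(seq_id, []))

cache = PagedKVCache(num_physical_blocks=8)
for pos in range(40):  # a 40-token sequence
    cache.append_token("req-1", pos)
print(len(cache.block_table["req-1"]))  # 3 blocks for 40 tokens
</code></pre>
<p>The point of the sketch: memory is claimed one block at a time as a sequence grows, so a 40-token request holds 3 blocks rather than a worst-case contiguous reservation for the full context window, and releasing a finished request returns its blocks to the pool for other sequences.</p>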
<p>The practical result:</p>
<ul>
<li><strong>85–92% GPU utilization</strong> under high concurrency (Prem AI benchmarking, March 2026)</li>
<li><strong>2–4× higher tokens/second</strong> than other optimized servers such as TGI (and an order of magnitude more than naive HuggingFace Transformers serving)</li>
<li>Support for significantly larger batch sizes on the same hardware</li>
</ul>
<h3 id="dynamic-multi-lora-serving">Dynamic Multi-LoRA Serving</h3>
<p>A major 2026 differentiator: vLLM supports <strong>dynamic multi-LoRA serving</strong>, allowing a single server process to switch between dozens of fine-tuned LoRA adapters at request time without reloading the base model. This makes vLLM the go-to choice for platforms that need to serve different personas or domain-tuned variants of a model from a single GPU cluster.</p>
<h3 id="openai-compatible-api">OpenAI-Compatible API</h3>
<p>vLLM exposes a fully OpenAI-compatible REST API (<code>/v1/completions</code>, <code>/v1/chat/completions</code>, <code>/v1/embeddings</code>), meaning existing applications written against the OpenAI SDK can be redirected to a local vLLM endpoint by pointing the client at a new base URL (in the official Python SDK, the <code>OPENAI_BASE_URL</code> environment variable).</p>
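<p>A stdlib-only sketch of what that redirect amounts to. The model name and the localhost port (vLLM&rsquo;s default) are illustrative; the same request shape works against either endpoint:</p>
<pre><code class="language-python">import json
import os

def chat_request(base_url, model, messages, max_tokens=128):
    """Build the URL and JSON body of an OpenAI-format chat completion.
    The same payload works against api.openai.com or a local vLLM server."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages,
                       "max_tokens": max_tokens})
    return url, body

# Redirection is just a different base URL; nothing else in the app changes.
base = os.environ.get("OPENAI_BASE_URL", "http://localhost:8000")
url, body = chat_request(base, "meta-llama/Llama-3.1-8B-Instruct",
                         [{"role": "user", "content": "Hello"}])
</code></pre>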
<hr>
<h2 id="is-tgi-still-worth-using-in-2026">Is TGI Still Worth Using in 2026?</h2>
<h3 id="tgis-maintenance-mode-announcement">TGI&rsquo;s Maintenance Mode Announcement</h3>
<p>In <strong>December 2025</strong>, Hugging Face announced that TGI (Text Generation Inference) was entering <strong>maintenance mode</strong>. The Hugging Face team now officially recommends <strong>vLLM or SGLang</strong> for new production deployments. Existing TGI deployments will continue to receive critical security patches but no new feature development.</p>
<p>This is a significant inflection point. Teams that built their serving stack on TGI need a migration plan.</p>
<h3 id="when-tgi-still-makes-sense">When TGI Still Makes Sense</h3>
<p>Despite maintenance mode, TGI retains a narrow set of use cases where migration cost outweighs switching benefit:</p>
<ul>
<li><strong>Hugging Face Inference Endpoints</strong> — If your team uses HF&rsquo;s managed cloud inference product, TGI is still the backend and you get its HF ecosystem integration (automatic model download, gated model authentication) for free.</li>
<li><strong>Existing stable deployments</strong> — If you are running TGI serving a non-critical model and it is not hitting throughput bottlenecks, the operational risk of migration may not justify immediate action.</li>
</ul>
<h3 id="migration-path-from-tgi-to-vllm">Migration Path from TGI to vLLM</h3>
<p>The API surface is compatible: both expose OpenAI-format endpoints and accept <code>model</code>, <code>messages</code>, <code>max_tokens</code>, and <code>temperature</code> parameters in the same structure. The main migration steps are:</p>
<ol>
<li>Replace the Docker image (<code>ghcr.io/huggingface/text-generation-inference</code> → <code>vllm/vllm-openai</code>)</li>
<li>Update engine arguments (<code>--model-id</code> → <code>--model</code>, <code>--num-shard</code> → <code>--tensor-parallel-size</code>)</li>
<li>Update authentication headers if using HF gated models (vLLM uses <code>HUGGING_FACE_HUB_TOKEN</code>)</li>
<li>Validate throughput under load—most teams see a 30–60% throughput improvement post-migration</li>
</ol>
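<p>The flag renames in step 2 can be mechanized. This is a hypothetical helper, not an official tool; it covers only the two renames listed above, and anything it passes through unchanged should be checked against the vLLM docs:</p>
<pre><code class="language-python"># Hypothetical TGI-to-vLLM migration helper covering the flag renames above.
TGI_TO_VLLM_FLAGS = {
    "--model-id": "--model",
    "--num-shard": "--tensor-parallel-size",
}

def translate_args(tgi_args):
    """Rewrite a TGI launcher argv into a vLLM argv, flag by flag;
    unknown arguments pass through unchanged."""
    return [TGI_TO_VLLM_FLAGS.get(arg, arg) for arg in tgi_args]

args = translate_args(["--model-id", "meta-llama/Llama-3.1-8B",
                       "--num-shard", "2"])
print(args)  # ['--model', 'meta-llama/Llama-3.1-8B', '--tensor-parallel-size', '2']
</code></pre>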
<hr>
<h2 id="how-does-sglang-compare-to-vllm-for-multi-turn-workloads">How Does SGLang Compare to vLLM for Multi-Turn Workloads?</h2>
<h3 id="radixattention-prefix-caching-at-scale">RadixAttention: Prefix Caching at Scale</h3>
<p>SGLang&rsquo;s headline innovation is <strong>RadixAttention</strong>, a cache management system that stores KV cache entries in a radix tree indexed by token prefix hashes. When a new request shares a common prefix with a previous request—as is common in multi-turn conversations and agentic chains of thought—SGLang can reuse the cached KV values instead of recomputing them.</p>
<p>The measured result: <strong>85–95% cache hit rates</strong> on multi-turn chat workloads, which directly translates to reduced latency for follow-up turns in a conversation.</p>
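<p>The reuse pattern is easy to make concrete with a toy prefix cache. A flat set of prefix hashes stands in for SGLang&rsquo;s radix tree (the real structure also handles eviction and partial block matches); the point is only to show why a shared system prompt turns follow-up turns into near-total cache hits:</p>
<pre><code class="language-python">class PrefixCache:
    """Toy stand-in for RadixAttention: remembers every token prefix seen
    so far and reports how much of a new request is already cached."""

    def __init__(self):
        self.known = set()  # hashed token prefixes with cached KV values

    def lookup_and_insert(self, tokens):
        hit = 0
        for i in range(1, len(tokens) + 1):
            if hash(tuple(tokens[:i])) in self.known:
                hit = i  # longest cached prefix found so far
        for i in range(1, len(tokens) + 1):
            self.known.add(hash(tuple(tokens[:i])))
        return hit

cache = PrefixCache()
system_prompt = list(range(100))  # a shared 100-token system prompt
cache.lookup_and_insert(system_prompt + [500, 501])   # turn 1: cold cache
hits = cache.lookup_and_insert(system_prompt + [600])  # turn 2
print(hits)  # 100 -- only the one new token needs fresh KV computation
</code></pre>
<p>In the second turn, 100 of 101 tokens are served from cache; the 85–95% hit rates above come from exactly this shared-prefix structure in real conversations.</p>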
<h3 id="benchmark-numbers-sglang-vs-vllm">Benchmark Numbers: SGLang vs vLLM</h3>
<p>On H100 GPU hardware (Prem AI benchmarking, March 2026):</p>
<table>
  <thead>
      <tr>
          <th>Workload</th>
          <th>SGLang</th>
          <th>vLLM</th>
          <th>Delta</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Batch inference (tokens/sec)</td>
          <td>16,215</td>
          <td>12,553</td>
          <td>+29% SGLang</td>
      </tr>
      <tr>
          <td>Multi-turn chat (tokens/sec)</td>
          <td>~14,800</td>
          <td>~11,200</td>
          <td>+32% SGLang</td>
      </tr>
      <tr>
          <td>Single-request latency</td>
          <td>Comparable</td>
          <td>Comparable</td>
          <td>Tie</td>
      </tr>
      <tr>
          <td>GPU utilization (high concurrency)</td>
          <td>88–93%</td>
          <td>85–92%</td>
          <td>Similar</td>
      </tr>
  </tbody>
</table>
<p>SGLang&rsquo;s advantage is most pronounced on <strong>batch inference and multi-turn workloads</strong>. For single-request latency-optimized scenarios (e.g., interactive coding assistants with no conversation history), vLLM remains competitive.</p>
<h3 id="when-to-choose-sglang">When to Choose SGLang</h3>
<ul>
<li><strong>Agentic pipelines</strong> — LLM agents that make multiple model calls per user action benefit enormously from prefix caching; the system prompt and conversation history are reused across calls.</li>
<li><strong>Chatbot platforms</strong> — Long conversation threads with consistent system prompts are exactly the workload RadixAttention was designed for.</li>
<li><strong>Batch inference jobs</strong> — Offline batch scoring of large document sets with shared prefixes.</li>
</ul>
<hr>
<h2 id="what-does-ray-serve-add-to-the-equation">What Does Ray Serve Add to the Equation?</h2>
<h3 id="ray-serve-as-an-orchestration-layer">Ray Serve as an Orchestration Layer</h3>
<p>Ray Serve is not a replacement for vLLM—it is an <strong>orchestration layer</strong> that runs vLLM (or other backends) as deployment replicas and adds production-grade infrastructure concerns:</p>
<ul>
<li><strong>Autoscaling</strong> — Scale replicas up/down based on request queue depth, target latency, or custom metrics. vLLM alone does not autoscale; Ray Serve adds replica-level autoscaling and, when run on Kubernetes via KubeRay, can trigger cluster-level node scaling as well.</li>
<li><strong>Multi-model serving</strong> — Route traffic across multiple models from a single entry point. A Ray Serve deployment can host <code>llama-3.1-70b</code> for complex queries and <code>llama-3.2-3b</code> for simple classification tasks behind a unified endpoint.</li>
<li><strong>Advanced routing</strong> — Implement A/B testing, canary rollouts, or semantic routing (route to different models based on query classification) without modifying client code.</li>
<li><strong>Zero-downtime model swaps</strong> — Rolling update replicas while keeping the endpoint live.</li>
</ul>
<h3 id="ray-serve--vllm-compatibility">Ray Serve + vLLM Compatibility</h3>
<p>Ray Serve 2.54+ exposes an OpenAI-compatible LLM serving API that accepts the same <code>vllm serve</code> engine arguments. The compatibility layer means:</p>
<ol>
<li>Start with <code>vllm serve</code> locally for development</li>
<li>Deploy to Ray Serve in production with no application code changes</li>
<li>Add autoscaling configuration declaratively in <code>serve_config.yaml</code></li>
</ol>
<p>This makes Ray Serve the natural next step for teams whose vLLM deployment outgrows single-node or single-process constraints.</p>
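<p>A minimal autoscaling config might look like the following. Field names follow Ray Serve&rsquo;s LLM config schema as of recent releases, but the model id, replica bounds, and request target are illustrative values; check the Ray Serve docs for your version before deploying:</p>
<pre><code class="language-yaml"># Illustrative serve_config.yaml -- values are examples, not recommendations.
applications:
  - name: llm-app
    route_prefix: /
    import_path: ray.serve.llm:build_openai_app
    args:
      llm_configs:
        - model_loading_config:
            model_id: meta-llama/Llama-3.1-8B-Instruct
          engine_kwargs:
            tensor_parallel_size: 1
          deployment_config:
            autoscaling_config:
              min_replicas: 1
              max_replicas: 4
              target_ongoing_requests: 16
</code></pre>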
<hr>
<h2 id="how-does-tensorrt-llm-fit-into-the-2026-landscape">How Does TensorRT-LLM Fit into the 2026 Landscape?</h2>
<h3 id="maximum-performance-maximum-complexity">Maximum Performance, Maximum Complexity</h3>
<p>NVIDIA&rsquo;s <strong>TensorRT-LLM</strong> (typically deployed via the Triton Inference Server) offers the highest raw throughput of any framework on NVIDIA hardware—but at a cost: <strong>setup complexity that is an order of magnitude higher</strong> than vLLM or SGLang.</p>
<p>TensorRT-LLM requires:</p>
<ul>
<li>Compiling model weights into TensorRT engine files (a process that can take hours for large models)</li>
<li>NVIDIA-specific GPU hardware (no AMD/CPU fallback)</li>
<li>Familiarity with Triton model repository structure and configuration files</li>
<li>Separate tooling for quantization (INT4/INT8/FP8 optimization)</li>
</ul>
<p>The payoff is genuine: TensorRT-LLM routinely achieves 20–40% better tokens/sec than vLLM on equivalent NVIDIA hardware for FP16 workloads, and significantly more with FP8 quantization.</p>
<h3 id="when-tensorrt-llm-is-worth-the-overhead">When TensorRT-LLM Is Worth the Overhead</h3>
<ul>
<li><strong>Enterprise multi-model inference pipelines</strong> that have a dedicated MLOps team to manage the build-and-deploy lifecycle</li>
<li><strong>High-volume production APIs</strong> where every percentage point of throughput improvement translates to meaningful cost savings at scale</li>
<li><strong>NVIDIA DGX or HGX clusters</strong> where NVIDIA support contracts and tooling are already part of the infrastructure investment</li>
</ul>
<hr>
<h2 id="which-framework-should-you-choose-a-decision-framework-for-2026">Which Framework Should You Choose? A Decision Framework for 2026</h2>
<table>
  <thead>
      <tr>
          <th>Requirement</th>
          <th>Best Framework</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>High-concurrency REST API (OpenAI drop-in)</td>
          <td><strong>vLLM</strong></td>
      </tr>
      <tr>
          <td>Multi-turn chat / agentic LLM pipelines</td>
          <td><strong>SGLang</strong></td>
      </tr>
      <tr>
          <td>Enterprise autoscaling, multi-model routing</td>
          <td><strong>Ray Serve + vLLM</strong></td>
      </tr>
      <tr>
          <td>Maximum NVIDIA-optimized throughput</td>
          <td><strong>TensorRT-LLM + Triton</strong></td>
      </tr>
      <tr>
          <td>HF Inference Endpoints (managed)</td>
          <td><strong>TGI</strong> (until migrated)</td>
      </tr>
      <tr>
          <td>Batch offline inference at scale</td>
          <td><strong>SGLang</strong></td>
      </tr>
      <tr>
          <td>Simplest possible local dev setup</td>
          <td><strong>vLLM</strong> (<code>pip install vllm; vllm serve model-id</code>)</td>
      </tr>
  </tbody>
</table>
<h3 id="the-pragmatic-2026-decision-tree">The Pragmatic 2026 Decision Tree</h3>
<ol>
<li><strong>Are you already on HF Inference Endpoints?</strong> → Stay on TGI for now, plan migration to vLLM within 12 months.</li>
<li><strong>Are you building a chatbot or agentic pipeline?</strong> → Evaluate SGLang; RadixAttention prefix caching will save you GPU hours.</li>
<li><strong>Do you need horizontal scaling across multiple nodes or models?</strong> → Start with vLLM, front it with Ray Serve.</li>
<li><strong>Do you have NVIDIA enterprise hardware and an MLOps team?</strong> → Benchmark TensorRT-LLM; the performance gains may justify the complexity.</li>
<li><strong>Everything else</strong> → vLLM is the correct default choice.</li>
</ol>
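<p>The decision tree above, restated as a function with the questions as boolean flags (purely a restatement of the five steps, nothing new):</p>
<pre><code class="language-python">def pick_framework(on_hf_endpoints=False, chat_or_agentic=False,
                   multi_node_or_multi_model=False, nvidia_enterprise=False):
    """Walk the pragmatic 2026 decision tree top to bottom."""
    if on_hf_endpoints:
        return "TGI (plan vLLM migration within 12 months)"
    if chat_or_agentic:
        return "SGLang"
    if multi_node_or_multi_model:
        return "Ray Serve + vLLM"
    if nvidia_enterprise:
        return "TensorRT-LLM (benchmark first)"
    return "vLLM"  # the correct default

print(pick_framework())  # vLLM
</code></pre>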
<hr>
<h2 id="what-performance-should-you-expect-in-practice">What Performance Should You Expect in Practice?</h2>
<h3 id="hardware-baselines-h100-sxm5-april-2026">Hardware Baselines (H100 SXM5, April 2026)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Framework</th>
          <th>Throughput (tokens/sec)</th>
          <th>GPU Util</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Llama-3.1-70B (FP16)</td>
          <td>vLLM</td>
          <td>12,553</td>
          <td>89%</td>
      </tr>
      <tr>
          <td>Llama-3.1-70B (FP16)</td>
          <td>SGLang</td>
          <td>16,215</td>
          <td>91%</td>
      </tr>
      <tr>
          <td>Llama-3.1-70B (FP8)</td>
          <td>TensorRT-LLM</td>
          <td>~18,500</td>
          <td>95%</td>
      </tr>
      <tr>
          <td>Llama-3.1-8B (FP16)</td>
          <td>vLLM</td>
          <td>47,200</td>
          <td>86%</td>
      </tr>
      <tr>
          <td>Llama-3.1-8B (FP16)</td>
          <td>SGLang</td>
          <td>52,800</td>
          <td>90%</td>
      </tr>
  </tbody>
</table>
<p><em>Sources: Prem AI benchmarking March 2026; TensorRT-LLM figure is author estimate based on published FP8 uplift ratios.</em></p>
<h3 id="latency-characteristics">Latency Characteristics</h3>
<p>For interactive applications, <strong>time-to-first-token (TTFT)</strong> matters as much as throughput. Both vLLM and SGLang achieve sub-100ms TTFT for 8B models on H100 hardware at moderate concurrency. TensorRT-LLM is typically 10–20% faster on TTFT thanks to kernel-level optimizations, a real but incremental edge.</p>
<hr>
<h2 id="what-are-the-future-trends-in-local-ai-model-serving">What Are the Future Trends in Local AI Model Serving?</h2>
<h3 id="speculative-decoding-goes-mainstream">Speculative Decoding Goes Mainstream</h3>
<p>Both vLLM and SGLang have integrated <strong>speculative decoding</strong> support in 2026. A small draft model proposes a run of tokens, and the large target model verifies the whole run in a single forward pass, keeping the longest prefix it agrees with. Because verification only accepts tokens the target model would have produced anyway, decoding speeds up 2–3× on typical text generation tasks with no accuracy loss.</p>
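<p>A toy simulation of the draft-and-verify loop, with stand-in integer &ldquo;models&rdquo; (a real system uses two neural networks and probabilistic acceptance; greedy exact-match verification is the simplest variant):</p>
<pre><code class="language-python">def speculative_step(pos, draft, target, k=4):
    """One decode step: the draft proposes k tokens; the target checks all
    of them in one batched pass and keeps the longest agreeing prefix,
    substituting its own token at the first mismatch."""
    proposed = [draft(pos + i) for i in range(k)]
    verified = [target(pos + i) for i in range(k)]  # one pass, not k passes
    accepted = []
    for p, c in zip(proposed, verified):
        if p == c:
            accepted.append(p)
        else:
            accepted.append(c)  # the target's token wins at the first miss
            break
    return accepted

# Stand-in "models": the draft agrees with the target except at every 5th
# position, so most steps accept all k tokens for the price of one pass.
target = lambda pos: pos * 2
draft = lambda pos: pos * 2 if pos % 5 != 0 else -1

print(speculative_step(1, draft, target))  # [2, 4, 6, 8] -- all 4 accepted
print(speculative_step(5, draft, target))  # [10] -- miss at pos 5, keep 1
</code></pre>
<p>The accepted sequence is always exactly what the target model would have produced alone, which is why the speedup comes with no accuracy loss.</p>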
<h3 id="multi-modal-serving">Multi-Modal Serving</h3>
<p>All major frameworks now support <strong>vision-language models</strong> (VLMs): vLLM, SGLang, and Ray Serve can serve Llama 4, Qwen2-VL, and similar multimodal checkpoints with the same OpenAI-compatible API. The <code>/v1/chat/completions</code> endpoint accepts image inputs via the messages array, enabling drop-in multimodal inference.</p>
<h3 id="edge-deployment-frameworks">Edge Deployment Frameworks</h3>
<p>A separate category is emerging for <strong>edge inference</strong>: frameworks like <strong>llama.cpp</strong>, <strong>Ollama</strong>, and <strong>LMStudio</strong> target developer laptops and edge hardware (Jetson, M-series Macs) rather than data-center GPUs. These are not replacements for vLLM in production server contexts but are increasingly important for local development workflows and privacy-critical on-device inference scenarios.</p>
<hr>
<h2 id="faq">FAQ</h2>
<h3 id="is-tgi-dead-in-2026">Is TGI dead in 2026?</h3>
<p>Not dead, but officially in maintenance mode. Hugging Face announced in December 2025 that TGI will no longer receive new features. Security patches will continue, and HF Inference Endpoints still run on TGI. For new production deployments, Hugging Face recommends migrating to vLLM or SGLang.</p>
<h3 id="can-i-run-vllm-on-amd-gpus">Can I run vLLM on AMD GPUs?</h3>
<p>Yes. vLLM has shipped AMD ROCm support for several releases now, and it has matured significantly in 2025–2026. Performance on AMD MI300X is competitive with NVIDIA A100 for FP16 workloads. TensorRT-LLM is NVIDIA-only; SGLang also supports ROCm on select configurations.</p>
<h3 id="how-does-ray-serve-differ-from-kubernetes-with-vllm">How does Ray Serve differ from Kubernetes with vLLM?</h3>
<p>Kubernetes handles container scheduling and node-level autoscaling; Ray Serve operates at the application layer within a Ray cluster and handles request routing, replica management, and model-level autoscaling. They are complementary: many production setups run Ray clusters on Kubernetes. Ray Serve gives you finer-grained control over model serving logic without writing custom Kubernetes operators.</p>
<h3 id="what-is-radixattention-and-why-does-it-matter">What is RadixAttention and why does it matter?</h3>
<p>RadixAttention is SGLang&rsquo;s KV cache management system that stores cache entries indexed by token prefix hashes in a radix tree structure. When new requests share a common prefix with previous requests (system prompts, conversation history, few-shot examples), the cached KV values are reused instead of recomputed. This achieves 85–95% cache hit rates on multi-turn workloads, directly reducing GPU computation and latency for follow-up turns.</p>
<h3 id="how-much-does-it-cost-to-run-vllm-vs-a-cloud-api-like-openai">How much does it cost to run vLLM vs a cloud API like OpenAI?</h3>
<p>The break-even calculation depends heavily on GPU amortization and utilization. At 80%+ GPU utilization on H100 hardware, on-premise vLLM serving Llama-3.1-70B typically costs $0.15–0.35 per million output tokens fully loaded (hardware, power, ops). GPT-4o is priced at $10/million output tokens (April 2026). For high-volume workloads, on-premise vLLM delivers 30–60× cost reduction, which is the primary driver of the market&rsquo;s 24.1% CAGR growth through 2030.</p>
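<p>The arithmetic behind those figures, as a back-of-envelope helper. The ~$12/hr fully amortized node cost is an assumption for illustration; the throughput number is the vLLM 70B figure from the benchmark table, and you should plug in your own inputs:</p>
<pre><code class="language-python">def cost_per_million_tokens(hourly_cost_usd, tokens_per_sec, utilization):
    """Fully-loaded dollars per 1M output tokens for a self-hosted server."""
    tokens_per_hour = tokens_per_sec * utilization * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Assumed inputs: ~$12/hr amortized H100 node (hardware, power, ops),
# 12,553 tok/s (vLLM, Llama-3.1-70B), 80% average utilization.
local = cost_per_million_tokens(12.0, 12_553, 0.80)
gpt4o = 10.0  # $/1M output tokens, as quoted above
print(round(local, 2), round(gpt4o / local, 1))  # 0.33 30.1
</code></pre>
<p>Under these assumptions the self-hosted cost lands at the upper end of the $0.15–0.35 range, roughly a 30× reduction; lower hardware costs or higher utilization push toward the 60× end.</p>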
]]></content:encoded></item></channel></rss>