<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Openai-Compatible on RockB</title><link>https://baeseokjae.github.io/tags/openai-compatible/</link><description>Recent content in Openai-Compatible on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 02 Jun 2026 14:42:11 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/openai-compatible/index.xml" rel="self" type="application/rss+xml"/><item><title>Ollama API Guide: Run Local LLMs with REST API and OpenAI-Compatible SDK</title><link>https://baeseokjae.github.io/posts/ollama-api-local-llm-integration-developer-guide-2026/</link><pubDate>Tue, 02 Jun 2026 14:42:11 +0000</pubDate><guid>https://baeseokjae.github.io/posts/ollama-api-local-llm-integration-developer-guide-2026/</guid><description>Complete Ollama API guide: REST endpoints, OpenAI-compatible /v1/ drop-in, Python SDK, embeddings, RAG pipelines, and Docker production deployment.</description><content:encoded><![CDATA[<p>Ollama is an open-source local LLM runtime that exposes a REST API on <code>http://localhost:11434</code>, letting you run Llama 4, Qwen3, DeepSeek R1, Gemma 4, and 4,500+ other models entirely on your machine — with zero per-token cost and no data leaving your network. The OpenAI-compatible <code>/v1/</code> layer means most existing SDK code works after a one-line <code>base_url</code> change.</p>
<h2 id="why-local-llms-went-mainstream-in-2026">Why Local LLMs Went Mainstream in 2026</h2>
<p>Local LLM adoption crossed a meaningful threshold in 2026, driven by economics, privacy regulation, and dramatically improved model quality in small footprints. Ollama surpassed 170,000 GitHub stars — the most starred local LLM runtime project on the platform — and monthly downloads grew from 100K in Q1 2023 to 52 million in Q1 2026, a 520x increase in three years. The stat that matters most for developer decision-making: 42% of developers now run at least some LLM workloads entirely on local machines, up from single digits in 2023. The economic case is straightforward — a team of five developers can spend $3,000–$30,000 in cloud LLM API costs over a three-month development cycle before shipping a single production feature. Local inference eliminates that cost entirely during the iteration phase. HuggingFace now hosts 135,000 GGUF-formatted models optimized for local inference, up from just 200 three years ago, giving developers access to a deep catalog. For regulated industries — healthcare, finance, government — local deployment isn&rsquo;t just economical, it&rsquo;s frequently mandatory: patient data, financial records, and classified documents cannot traverse cloud APIs. Ollama handles this by design.</p>
<h3 id="what-changed-between-2023-and-2026">What Changed Between 2023 and 2026</h3>
<p>The 2023 local LLM experience involved constant friction: manual GGUF downloads, complex llama.cpp invocations, no API compatibility, and GPU configuration that required reading three separate blog posts. Ollama&rsquo;s contribution was packaging all of this into a single binary with a clean HTTP API. In 2026, the API surface has stabilized around two layers: the native Ollama REST API for full model lifecycle control, and the OpenAI-compatible <code>/v1/</code> endpoints for dropping into existing code with zero changes. Hardware improvements matter too — Apple Silicon M3/M4 machines run 13B parameter models at 40+ tokens per second without a GPU, making high-quality inference accessible on standard developer laptops.</p>
<h2 id="what-is-ollama--architecture-and-how-the-server-works">What Is Ollama — Architecture and How the Server Works</h2>
<p>Ollama is a Go-based server that wraps llama.cpp inference, model management, and an HTTP API into a single self-contained binary. When you start Ollama, it binds to port 11434 and manages models stored in <code>~/.ollama/models</code> (macOS/Linux) or <code>C:\Users\{user}\.ollama\models</code> (Windows). The server handles concurrent requests by queuing them against a single model instance — only one model loads at a time by default, with GPU VRAM being the binding constraint. The architecture is intentionally simple: no separate database process, no configuration files required to start, no authentication layer for local use. Models are stored in a content-addressed format derived from the Modelfile, which is a Dockerfile-like specification that declares the base model, system prompt, temperature, and other parameters. The REST API exposes two namespaces — <code>/api/</code> for native Ollama operations and <code>/v1/</code> for OpenAI-compatible operations. The underlying inference engine is llama.cpp, which means Ollama inherits its quantization support (Q4_K_M, Q8_0, FP16, etc.) and hardware acceleration backends: CUDA for NVIDIA GPUs, Metal for Apple Silicon, ROCm for AMD GPUs, and CPU fallback for any machine. Ollama typically achieves 15–20% faster inference than LocalAI for equivalent LLM workloads due to tighter integration with llama.cpp&rsquo;s optimization passes.</p>
<h2 id="installation-on-macos-linux-and-windows">Installation on macOS, Linux, and Windows</h2>
<p>Ollama installs to a running server in under a minute on all three platforms. Each installs the <code>ollama</code> CLI and starts the background server automatically.</p>
<p><strong>macOS:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>brew install ollama
</span></span><span style="display:flex;"><span><span style="color:#75715e"># or download from https://ollama.com/download/mac</span>
</span></span><span style="display:flex;"><span>ollama serve  <span style="color:#75715e"># starts server; or it starts automatically via the macOS app</span>
</span></span></code></pre></div><p><strong>Linux (one-line install):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -fsSL https://ollama.com/install.sh | sh
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Server starts as a systemd service: systemctl status ollama</span>
</span></span></code></pre></div><p><strong>Windows:</strong>
Download the installer from <code>https://ollama.com/download/windows</code> — it installs a system tray app that starts the server at login.</p>
<p><strong>Verify the server is running:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Returns: &#34;Ollama is running&#34;</span>
</span></span></code></pre></div><p><strong>Pull your first model:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>ollama pull llama3.2          <span style="color:#75715e"># 2B parameter, ~1.3GB</span>
</span></span><span style="display:flex;"><span>ollama pull qwen3:7b          <span style="color:#75715e"># 7B parameter, ~4.7GB</span>
</span></span><span style="display:flex;"><span>ollama pull deepseek-r1:8b    <span style="color:#75715e"># 8B, strong reasoning</span>
</span></span><span style="display:flex;"><span>ollama pull nomic-embed-text  <span style="color:#75715e"># embedding model</span>
</span></span></code></pre></div><p>Model sizes to know: 7B Q4_K_M ≈ 4.5GB RAM/VRAM, 13B Q4_K_M ≈ 8GB, 70B Q4_K_M ≈ 40GB. For machines with 16GB RAM, 7B–13B models are the practical sweet spot.</p>
<h2 id="the-native-ollama-rest-api--every-endpoint-explained">The Native Ollama REST API — Every Endpoint Explained</h2>
<p>The native Ollama REST API lives under <code>/api/</code> and provides full programmatic control over text generation, multi-turn chat, embeddings, and model lifecycle management — all over plain HTTP with JSON request and response bodies. The server binds to <code>http://localhost:11434</code> by default and requires no authentication for local use. There are seven primary endpoints in the native API: <code>/api/generate</code> for single-turn completions, <code>/api/chat</code> for multi-turn conversations, <code>/api/embed</code> for vector embeddings, <code>/api/tags</code> to list installed models, <code>/api/pull</code> to download models, <code>/api/delete</code> to remove models, and <code>/api/show</code> to inspect model metadata. Streaming is enabled by default — set <code>&quot;stream&quot;: false</code> to receive a single JSON response instead of newline-delimited chunks. Every endpoint that runs inference accepts an <code>options</code> object where you can override model parameters: <code>temperature</code>, <code>top_p</code>, <code>num_ctx</code> (context length), <code>num_predict</code> (max output tokens), and <code>stop</code> sequences. All parameter changes are per-request; there is no persistent session state on the server side. Understanding which endpoint to use for which workload is the first step to building reliable Ollama-backed applications.</p>
<h3 id="generate-apigenerate">Generate: /api/generate</h3>
<p>The generate endpoint runs single-turn completions against a raw prompt.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/generate <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;llama3.2&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;prompt&#34;: &#34;Explain Docker multi-stage builds in one paragraph.&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;stream&#34;: false,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;options&#34;: {
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      &#34;temperature&#34;: 0.7,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      &#34;top_p&#34;: 0.9,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      &#34;num_ctx&#34;: 4096
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    }
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p>Key parameters in <code>options</code>:</p>
<ul>
<li><code>temperature</code> (0.0–2.0): randomness; 0.1 for factual tasks, 0.7–1.0 for creative</li>
<li><code>top_p</code> (0.0–1.0): nucleus sampling; 0.9 is a sensible default</li>
<li><code>num_ctx</code>: context window size in tokens; defaults to 2048, increase for long documents</li>
<li><code>num_predict</code>: max tokens to generate; -1 for unlimited</li>
<li><code>stop</code>: array of stop sequences</li>
</ul>
<p><strong>With streaming (default):</strong> Set <code>&quot;stream&quot;: true</code> (or omit it) and the response comes as newline-delimited JSON objects, each containing a <code>response</code> field, ending with a final object where <code>&quot;done&quot;: true</code>.</p>
<h3 id="chat-apichat">Chat: /api/chat</h3>
<p>The chat endpoint manages multi-turn conversations with a messages array, matching OpenAI&rsquo;s message format.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/chat <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;llama3.2&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      {&#34;role&#34;: &#34;system&#34;, &#34;content&#34;: &#34;You are a senior Python developer.&#34;},
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      {&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;What is the difference between __str__ and __repr__?&#34;}
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    ],
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;stream&#34;: false
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p>The response includes a <code>message</code> object with <code>role: &quot;assistant&quot;</code> and <code>content</code>. For multi-turn, append the assistant response to the messages array and send again — Ollama is stateless; you own the conversation history.</p>
<h3 id="embeddings-apiembed">Embeddings: /api/embed</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/embed <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;nomic-embed-text&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;input&#34;: &#34;Ollama makes local LLM inference simple.&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><p>Returns <code>{&quot;embeddings&quot;: [[0.123, -0.456, ...]]}</code>. The <code>input</code> field accepts a string or array of strings for batch embedding. <code>nomic-embed-text</code> produces 768-dimensional vectors; <code>mxbai-embed-large</code> produces 1024-dimensional.</p>
<h2 id="openai-compatible-api-at-v1--drop-in-replacement-for-existing-apps">OpenAI-Compatible API at /v1/ — Drop-In Replacement for Existing Apps</h2>
<p>The <code>/v1/</code> namespace is where Ollama becomes immediately practical for teams with existing OpenAI SDK integrations. Supported endpoints mirror the OpenAI API surface: <code>/v1/chat/completions</code>, <code>/v1/completions</code>, <code>/v1/embeddings</code>, and <code>/v1/models</code>. Authentication accepts any non-empty string as the API key — use <code>&quot;ollama&quot;</code> or any placeholder. The critical migration path: if you have code calling <code>api.openai.com</code>, changing <code>base_url</code> to <code>http://localhost:11434/v1</code> is the entire migration for the core chat and embedding workflows. No other code changes needed. This compatibility layer was the key architectural decision that drove Ollama&rsquo;s adoption past competitors — developers don&rsquo;t need to learn a new API surface to use it. Supported chat features include system messages, multi-turn history, <code>temperature</code>/<code>top_p</code>/<code>max_tokens</code> parameters, streaming via SSE, and function/tool calling on models that support it (Llama 3.1+, Qwen2.5+). Unsupported OpenAI features include: logprobs, fine-tuning endpoints, assistants API, files API, and image generation. For those features, you either don&rsquo;t need them locally or they require model-specific handling.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span><span style="color:#75715e"># Same curl you&#39;d use against OpenAI, just different base URL</span>
</span></span><span style="display:flex;"><span>curl http://localhost:11434/v1/chat/completions <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Content-Type: application/json&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -H <span style="color:#e6db74">&#34;Authorization: Bearer ollama&#34;</span> <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;model&#34;: &#34;llama3.2&#34;,
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    &#34;messages&#34;: [{&#34;role&#34;: &#34;user&#34;, &#34;content&#34;: &#34;Write a Python hello world&#34;}]
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  }&#39;</span>
</span></span></code></pre></div><h2 id="python-integration-native-ollama-library-vs-openai-sdk">Python Integration: Native ollama Library vs OpenAI SDK</h2>
<p>Python is the dominant language for LLM integration work, and Ollama supports two distinct SDK paths: the native <code>ollama</code> Python library (which mirrors the REST API directly), and the <code>openai</code> Python SDK pointed at Ollama&rsquo;s <code>/v1/</code> compatibility layer. The native <code>ollama</code> library, installed via <code>pip install ollama</code>, wraps every <code>/api/</code> endpoint with typed Python functions — <code>ollama.generate()</code>, <code>ollama.chat()</code>, <code>ollama.embed()</code>, <code>ollama.pull()</code>, <code>ollama.list()</code> — and handles streaming, async, and error handling out of the box. The <code>openai</code> SDK path requires no new library if you already have it installed — just pass <code>base_url='http://localhost:11434/v1'</code> and any string as <code>api_key</code>, and all existing code that calls <code>client.chat.completions.create()</code> or <code>client.embeddings.create()</code> works unchanged. The right choice is practical: if you&rsquo;re starting fresh or want the full Ollama feature set (model management, Modelfile inspection, etc.), use the native library. If you have existing OpenAI SDK code and want a zero-change migration path for local development, use the OpenAI SDK with a custom base URL. Both approaches produce identical inference results.</p>
<h3 id="native-ollama-library">Native ollama Library</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>pip install ollama
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> ollama
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Synchronous generate</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>generate(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, prompt<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;What is GGUF format?&#39;</span>)
</span></span><span style="display:flex;"><span>print(response[<span style="color:#e6db74">&#39;response&#39;</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Chat with history</span>
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;system&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: <span style="color:#e6db74">&#39;You are a helpful coding assistant.&#39;</span>},
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;user&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: <span style="color:#e6db74">&#39;Explain Python generators.&#39;</span>}
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>chat(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, messages<span style="color:#f92672">=</span>messages)
</span></span><span style="display:flex;"><span>print(response[<span style="color:#e6db74">&#39;message&#39;</span>][<span style="color:#e6db74">&#39;content&#39;</span>])
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Streaming</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> ollama<span style="color:#f92672">.</span>generate(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, prompt<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Count to 5.&#39;</span>, stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>):
</span></span><span style="display:flex;"><span>    print(chunk[<span style="color:#e6db74">&#39;response&#39;</span>], end<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;&#39;</span>, flush<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Embeddings</span>
</span></span><span style="display:flex;"><span>result <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>embed(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;nomic-embed-text&#39;</span>, input<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Hello world&#39;</span>)
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> result[<span style="color:#e6db74">&#39;embeddings&#39;</span>][<span style="color:#ae81ff">0</span>]
</span></span></code></pre></div><h3 id="openai-sdk-pointed-at-ollama">OpenAI SDK Pointed at Ollama</h3>
<p>For existing code — or when you want the same SDK interface to swap between Ollama and OpenAI based on environment:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;http://localhost:11434/v1&#39;</span>,
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;ollama&#39;</span>  <span style="color:#75715e"># any string works</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>,
</span></span><span style="display:flex;"><span>    messages<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;user&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: <span style="color:#e6db74">&#39;What is RAG?&#39;</span>}]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Embeddings</span>
</span></span><span style="display:flex;"><span>embedding <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>embeddings<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;nomic-embed-text&#39;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;Document text to embed&#39;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>vector <span style="color:#f92672">=</span> embedding<span style="color:#f92672">.</span>data[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>embedding
</span></span></code></pre></div><p><strong>Env-based switching pattern</strong> — useful for CI (use cloud) vs local dev (use Ollama):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> os
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">if</span> os<span style="color:#f92672">.</span>getenv(<span style="color:#e6db74">&#39;USE_LOCAL_LLM&#39;</span>, <span style="color:#e6db74">&#39;false&#39;</span>)<span style="color:#f92672">.</span>lower() <span style="color:#f92672">==</span> <span style="color:#e6db74">&#39;true&#39;</span>:
</span></span><span style="display:flex;"><span>    client <span style="color:#f92672">=</span> OpenAI(base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;http://localhost:11434/v1&#39;</span>, api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;ollama&#39;</span>)
</span></span><span style="display:flex;"><span>    model <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;llama3.2&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">else</span>:
</span></span><span style="display:flex;"><span>    client <span style="color:#f92672">=</span> OpenAI(api_key<span style="color:#f92672">=</span>os<span style="color:#f92672">.</span>environ[<span style="color:#e6db74">&#39;OPENAI_API_KEY&#39;</span>])
</span></span><span style="display:flex;"><span>    model <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;gpt-4o-mini&#39;</span>
</span></span></code></pre></div><h2 id="streaming-responses-embeddings-and-multi-turn-chat-with-code">Streaming Responses, Embeddings, and Multi-Turn Chat with Code</h2>
<p>Streaming is the default behavior for every generation endpoint in Ollama and is the correct choice for user-facing applications — it makes responses feel fast even when model throughput is 15–20 tokens per second, because the first token appears in under a second rather than the user waiting 10–30 seconds for the full response. Ollama streaming works by sending newline-delimited JSON objects as the model generates each token. In the native API, each chunk contains a <code>response</code> field (for <code>/api/generate</code>) or a <code>message.content</code> field (for <code>/api/chat</code>), plus a <code>done: false</code> flag. The final chunk has <code>done: true</code> and includes performance metadata: <code>eval_count</code> (tokens generated), <code>eval_duration</code> (nanoseconds), <code>prompt_eval_count</code>, and <code>total_duration</code>. For the OpenAI-compatible <code>/v1/</code> endpoint, streaming follows the SSE (Server-Sent Events) format with <code>data: {&quot;choices&quot;: [...]}</code> lines ending with <code>data: [DONE]</code>. Multi-turn chat requires you to maintain the message history client-side — Ollama is stateless; each request must include the full conversation history from the start. Embeddings do not stream; they return synchronously as a single JSON response.</p>
<h3 id="streaming-with-native-api">Streaming with Native API</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> ollama
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">stream_response</span>(prompt: str, model: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;llama3.2&#39;</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> ollama<span style="color:#f92672">.</span>generate(model<span style="color:#f92672">=</span>model, prompt<span style="color:#f92672">=</span>prompt, stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>):
</span></span><span style="display:flex;"><span>        print(chunk[<span style="color:#e6db74">&#39;response&#39;</span>], end<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;&#39;</span>, flush<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> chunk<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;done&#39;</span>):
</span></span><span style="display:flex;"><span>            print()  <span style="color:#75715e"># newline at end</span>
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Tokens: </span><span style="color:#e6db74">{</span>chunk<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;eval_count&#39;</span>, <span style="color:#ae81ff">0</span>)<span style="color:#e6db74">}</span><span style="color:#e6db74">, &#34;</span>
</span></span><span style="display:flex;"><span>                  <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Speed: </span><span style="color:#e6db74">{</span>chunk<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;eval_count&#39;</span>, <span style="color:#ae81ff">0</span>) <span style="color:#f92672">/</span> chunk<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;eval_duration&#39;</span>, <span style="color:#ae81ff">1</span>) <span style="color:#f92672">*</span> <span style="color:#ae81ff">1e9</span><span style="color:#e6db74">:</span><span style="color:#e6db74">.1f</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> t/s&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>stream_response(<span style="color:#e6db74">&#34;Explain async/await in Python in 3 sentences.&#34;</span>)
</span></span></code></pre></div><h3 id="multi-turn-chat-state-management">Multi-Turn Chat State Management</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> ollama
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> dataclasses <span style="color:#f92672">import</span> dataclass, field
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> typing <span style="color:#f92672">import</span> List
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@dataclass</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ChatSession</span>:
</span></span><span style="display:flex;"><span>    model: str
</span></span><span style="display:flex;"><span>    system_prompt: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    history: List[dict] <span style="color:#f92672">=</span> field(default_factory<span style="color:#f92672">=</span>list)
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">__post_init__</span>(self):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> self<span style="color:#f92672">.</span>system_prompt:
</span></span><span style="display:flex;"><span>            self<span style="color:#f92672">.</span>history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;system&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: self<span style="color:#f92672">.</span>system_prompt})
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">send</span>(self, user_message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;user&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: user_message})
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>chat(model<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>model, messages<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>history)
</span></span><span style="display:flex;"><span>        assistant_msg <span style="color:#f92672">=</span> response[<span style="color:#e6db74">&#39;message&#39;</span>][<span style="color:#e6db74">&#39;content&#39;</span>]
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#39;role&#39;</span>: <span style="color:#e6db74">&#39;assistant&#39;</span>, <span style="color:#e6db74">&#39;content&#39;</span>: assistant_msg})
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> assistant_msg
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>session <span style="color:#f92672">=</span> ChatSession(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, system_prompt<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;You are a senior Go developer.&#39;</span>)
</span></span><span style="display:flex;"><span>print(session<span style="color:#f92672">.</span>send(<span style="color:#e6db74">&#34;What is a goroutine?&#34;</span>))
</span></span><span style="display:flex;"><span>print(session<span style="color:#f92672">.</span>send(<span style="color:#e6db74">&#34;How does that compare to Python threads?&#34;</span>))
</span></span></code></pre></div><h3 id="async-streaming-for-web-applications">Async Streaming for Web Applications</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> ollama
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">async_stream</span>(prompt: str):
</span></span><span style="display:flex;"><span>    async_client <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>AsyncClient()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">async</span> <span style="color:#66d9ef">for</span> chunk <span style="color:#f92672">in</span> <span style="color:#66d9ef">await</span> async_client<span style="color:#f92672">.</span>generate(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, prompt<span style="color:#f92672">=</span>prompt, stream<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>    ):
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">yield</span> chunk[<span style="color:#e6db74">&#39;response&#39;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># FastAPI integration</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> fastapi <span style="color:#f92672">import</span> FastAPI
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> fastapi.responses <span style="color:#f92672">import</span> StreamingResponse
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>app <span style="color:#f92672">=</span> FastAPI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">@app.get</span>(<span style="color:#e6db74">&#34;/generate&#34;</span>)
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">generate</span>(prompt: str):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> StreamingResponse(
</span></span><span style="display:flex;"><span>        async_stream(prompt),
</span></span><span style="display:flex;"><span>        media_type<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;text/event-stream&#34;</span>
</span></span><span style="display:flex;"><span>    )
</span></span></code></pre></div><h2 id="model-management-via-api-pull-list-copy-delete-inspect">Model Management via API: Pull, List, Copy, Delete, Inspect</h2>
<p>The Ollama REST API exposes full model lifecycle management through dedicated endpoints, making the CLI entirely optional for teams building automated deployment pipelines, model management dashboards, or CI workflows that need to ensure specific models are available before running tests. The five management endpoints are: <code>GET /api/tags</code> (list installed models with size, digest, and modification timestamp), <code>POST /api/pull</code> (download a model from the Ollama library with optional streaming progress), <code>DELETE /api/delete</code> (remove an installed model), <code>POST /api/copy</code> (duplicate a model under a new name — useful for custom Modelfile variants), and <code>POST /api/show</code> (return the full Modelfile, parameter defaults, template, and quantization metadata for an installed model). Pull operations stream progress as newline-delimited JSON objects with <code>status</code>, <code>completed</code>, and <code>total</code> bytes fields — you can build a progress bar directly from the stream. All management operations are synchronous from the caller&rsquo;s perspective except pull, which can take several minutes for large models. The Ollama library on the public registry contains 4,500+ models as of May 2026, including Llama 4, Qwen3, DeepSeek R1, Gemma 4, Mistral, and the full range of embedding models.</p>
<h3 id="list-installed-models">List Installed Models</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/tags
</span></span></code></pre></div><p>Returns an array of model objects with <code>name</code>, <code>size</code>, <code>digest</code>, and <code>modified_at</code> fields.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> requests
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">list_models</span>():
</span></span><span style="display:flex;"><span>    resp <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>get(<span style="color:#e6db74">&#39;http://localhost:11434/api/tags&#39;</span>)
</span></span><span style="display:flex;"><span>    models <span style="color:#f92672">=</span> resp<span style="color:#f92672">.</span>json()[<span style="color:#e6db74">&#39;models&#39;</span>]
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> m <span style="color:#f92672">in</span> models:
</span></span><span style="display:flex;"><span>        size_gb <span style="color:#f92672">=</span> m[<span style="color:#e6db74">&#39;size&#39;</span>] <span style="color:#f92672">/</span> (<span style="color:#ae81ff">1024</span><span style="color:#f92672">**</span><span style="color:#ae81ff">3</span>)
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>m[<span style="color:#e6db74">&#39;name&#39;</span>]<span style="color:#e6db74">:</span><span style="color:#e6db74">&lt;30</span><span style="color:#e6db74">}</span><span style="color:#e6db74"> </span><span style="color:#e6db74">{</span>size_gb<span style="color:#e6db74">:</span><span style="color:#e6db74">.1f</span><span style="color:#e6db74">}</span><span style="color:#e6db74">GB&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>list_models()
</span></span></code></pre></div><h3 id="pull-a-model">Pull a Model</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/pull <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;qwen3:7b&#34;, &#34;stream&#34;: false}&#39;</span>
</span></span></code></pre></div><p>With progress streaming (stream: true), each JSON line includes a <code>status</code> and optionally <code>completed</code>/<code>total</code> bytes.</p>
<h3 id="delete-a-model">Delete a Model</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl -X DELETE http://localhost:11434/api/delete <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;llama3.2&#34;}&#39;</span>
</span></span></code></pre></div><h3 id="inspect-model-details">Inspect Model Details</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/show <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;model&#34;: &#34;llama3.2&#34;}&#39;</span>
</span></span></code></pre></div><p>Returns the full Modelfile, parameters, template, and model metadata including context length, quantization type, and parameter count.</p>
<h3 id="copy-a-model">Copy a Model</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>curl http://localhost:11434/api/copy <span style="color:#ae81ff">\
</span></span></span><span style="display:flex;"><span><span style="color:#ae81ff"></span>  -d <span style="color:#e6db74">&#39;{&#34;source&#34;: &#34;llama3.2&#34;, &#34;destination&#34;: &#34;my-custom-llama&#34;}&#39;</span>
</span></span></code></pre></div><p>Useful for creating named variants with custom system prompts via Modelfiles without downloading additional weights.</p>
<h2 id="building-a-local-rag-pipeline-with-ollama--chromadb">Building a Local RAG Pipeline with Ollama + ChromaDB</h2>
<p>A local RAG pipeline using Ollama for both embeddings and generation, with ChromaDB as the vector store, achieves production-quality retrieval with zero external API costs. Properly configured Ollama deployments achieve 40–60% cost savings compared to cloud APIs while maintaining comparable performance on RAG tasks. The stack: <code>nomic-embed-text</code> for document embeddings (768 dimensions, 8K context), <code>llama3.2</code> or <code>qwen3:7b</code> for generation, ChromaDB for vector storage and retrieval. The architecture is entirely local — documents, embeddings, and query results never leave the machine.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>pip install chromadb ollama
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> ollama
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> chromadb
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> chromadb.utils <span style="color:#f92672">import</span> embedding_functions
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Custom embedding function using Ollama</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">OllamaEmbeddingFunction</span>(embedding_functions<span style="color:#f92672">.</span>EmbeddingFunction):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self, model: str <span style="color:#f92672">=</span> <span style="color:#e6db74">&#39;nomic-embed-text&#39;</span>):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>model <span style="color:#f92672">=</span> model
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __call__(self, input: list[str]) <span style="color:#f92672">-&gt;</span> list[list[float]]:
</span></span><span style="display:flex;"><span>        result <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>embed(model<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>model, input<span style="color:#f92672">=</span>input)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> result[<span style="color:#e6db74">&#39;embeddings&#39;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Initialize ChromaDB</span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> chromadb<span style="color:#f92672">.</span>Client()
</span></span><span style="display:flex;"><span>embed_fn <span style="color:#f92672">=</span> OllamaEmbeddingFunction()
</span></span><span style="display:flex;"><span>collection <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>create_collection(
</span></span><span style="display:flex;"><span>    name<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;docs&#39;</span>,
</span></span><span style="display:flex;"><span>    embedding_function<span style="color:#f92672">=</span>embed_fn
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Index documents</span>
</span></span><span style="display:flex;"><span>docs <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Ollama is an open-source local LLM runtime.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;RAG combines retrieval with generation for accurate answers.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;ChromaDB is an open-source vector database for embeddings.&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;Python async/await enables non-blocking concurrent execution.&#34;</span>,
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>collection<span style="color:#f92672">.</span>add(
</span></span><span style="display:flex;"><span>    documents<span style="color:#f92672">=</span>docs,
</span></span><span style="display:flex;"><span>    ids<span style="color:#f92672">=</span>[<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;doc_</span><span style="color:#e6db74">{</span>i<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span> <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(len(docs))]
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">rag_query</span>(question: str, n_results: int <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    results <span style="color:#f92672">=</span> collection<span style="color:#f92672">.</span>query(query_texts<span style="color:#f92672">=</span>[question], n_results<span style="color:#f92672">=</span>n_results)
</span></span><span style="display:flex;"><span>    context <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">&#34;</span><span style="color:#f92672">.</span>join(results[<span style="color:#e6db74">&#39;documents&#39;</span>][<span style="color:#ae81ff">0</span>])
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    prompt <span style="color:#f92672">=</span> <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;&#34;&#34;Answer the question based ONLY on the context below.
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Context:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74"></span><span style="color:#e6db74">{</span>context<span style="color:#e6db74">}</span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Question: </span><span style="color:#e6db74">{</span>question<span style="color:#e6db74">}</span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Answer:&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> ollama<span style="color:#f92672">.</span>generate(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;llama3.2&#39;</span>, prompt<span style="color:#f92672">=</span>prompt)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response[<span style="color:#e6db74">&#39;response&#39;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(rag_query(<span style="color:#e6db74">&#34;What is Ollama?&#34;</span>))
</span></span><span style="display:flex;"><span>print(rag_query(<span style="color:#e6db74">&#34;How does RAG work?&#34;</span>))
</span></span></code></pre></div><p>For production RAG, replace ChromaDB&rsquo;s in-memory store with its persistent client (<code>chromadb.PersistentClient(path=&quot;./chroma_db&quot;)</code>) and add chunking for large documents using <code>langchain_text_splitters.RecursiveCharacterTextSplitter</code>.</p>
<h2 id="production-deployment-docker-compose-gpu-config-and-multi-model-serving">Production Deployment: Docker Compose, GPU Config, and Multi-Model Serving</h2>
<p>Running Ollama in Docker makes the setup reproducible across team machines, staging environments, and CI pipelines, and is the recommended approach for shared team infrastructure where developers need access to the same models without each person managing a local install. The official Docker image is <code>ollama/ollama:latest</code> — it bundles the Ollama binary, CUDA libraries for NVIDIA GPU support, and a working entrypoint. For NVIDIA GPU access, the NVIDIA Container Toolkit must be installed on the host, after which the Docker Compose <code>deploy.resources.reservations.devices</code> block passes GPU access into the container. Apple Silicon GPU access is only available via the native macOS binary, not Docker. The key environment variables for production tuning are <code>OLLAMA_NUM_PARALLEL</code> (concurrent inference slots — each slot uses its own VRAM allocation), <code>OLLAMA_MAX_LOADED_MODELS</code> (how many models stay hot in memory), and <code>OLLAMA_KEEP_ALIVE</code> (how long an idle model stays loaded before being evicted). For high-availability setups, multiple Ollama instances behind an nginx or HAProxy load balancer is a proven pattern — each instance manages its own model state independently, and the load balancer distributes requests across the pool.</p>
<h3 id="docker-compose-for-team-development">Docker Compose for Team Development</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#75715e"># docker-compose.yml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">version</span>: <span style="color:#e6db74">&#39;3.8&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">services</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ollama</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">image</span>: <span style="color:#ae81ff">ollama/ollama:latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">ports</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#e6db74">&#34;11434:11434&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">ollama_models:/root/.ollama</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">environment</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">OLLAMA_HOST=0.0.0.0</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">OLLAMA_NUM_PARALLEL=2     </span> <span style="color:#75715e"># concurrent requests</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">OLLAMA_MAX_LOADED_MODELS=2</span> <span style="color:#75715e"># models in memory simultaneously</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">restart</span>: <span style="color:#ae81ff">unless-stopped</span>
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># For NVIDIA GPU:</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">deploy</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">resources</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">reservations</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">devices</span>:
</span></span><span style="display:flex;"><span>            - <span style="color:#f92672">driver</span>: <span style="color:#ae81ff">nvidia</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">count</span>: <span style="color:#ae81ff">all</span>
</span></span><span style="display:flex;"><span>              <span style="color:#f92672">capabilities</span>: [<span style="color:#ae81ff">gpu]</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>  <span style="color:#75715e"># Pull models on startup</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ollama-init</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">image</span>: <span style="color:#ae81ff">curlimages/curl:latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">depends_on</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">ollama</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">command</span>: &gt;<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      sh -c &#34;sleep 5 &amp;&amp;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             curl -s http://ollama:11434/api/pull -d &#39;{\&#34;model\&#34;: \&#34;llama3.2\&#34;}&#39; &amp;&amp;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             curl -s http://ollama:11434/api/pull -d &#39;{\&#34;model\&#34;: \&#34;nomic-embed-text\&#34;}&#39;&#34;</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">restart</span>: <span style="color:#e6db74">&#34;no&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">volumes</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ollama_models</span>:
</span></span></code></pre></div><div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker compose up -d
</span></span></code></pre></div><h3 id="gpu-configuration">GPU Configuration</h3>
<p><strong>NVIDIA:</strong> Install the NVIDIA Container Toolkit, then the Docker Compose above works as-is. Verify with:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>docker exec ollama nvidia-smi
</span></span></code></pre></div><p><strong>Apple Silicon:</strong> GPU acceleration is automatic when running the native macOS binary — Metal is detected without configuration. Docker on macOS does not have Metal access; run the native binary for GPU use.</p>
<p><strong>NVIDIA without Docker:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>export CUDA_VISIBLE_DEVICES<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>  <span style="color:#75715e"># use GPU 0</span>
</span></span><span style="display:flex;"><span>ollama serve
</span></span></code></pre></div><p><strong>CPU-only fallback:</strong> All operations work without GPU — just slower. <code>OLLAMA_INTEL_GPU=1</code> enables Intel Arc GPU support on Linux.</p>
<h3 id="environment-variables-for-production-tuning">Environment Variables for Production Tuning</h3>
<table>
  <thead>
      <tr>
          <th>Variable</th>
          <th>Default</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>OLLAMA_HOST</code></td>
          <td><code>127.0.0.1:11434</code></td>
          <td>Bind address; use <code>0.0.0.0</code> for network access</td>
      </tr>
      <tr>
          <td><code>OLLAMA_NUM_PARALLEL</code></td>
          <td>1</td>
          <td>Concurrent request slots</td>
      </tr>
      <tr>
          <td><code>OLLAMA_MAX_LOADED_MODELS</code></td>
          <td>1</td>
          <td>Models to keep in VRAM simultaneously</td>
      </tr>
      <tr>
          <td><code>OLLAMA_KEEP_ALIVE</code></td>
          <td><code>5m</code></td>
          <td>How long to keep model loaded after last request</td>
      </tr>
      <tr>
          <td><code>OLLAMA_MODELS</code></td>
          <td><code>~/.ollama/models</code></td>
          <td>Custom model storage path</td>
      </tr>
      <tr>
          <td><code>OLLAMA_DEBUG</code></td>
          <td><code>false</code></td>
          <td>Verbose logging</td>
      </tr>
  </tbody>
</table>
<p><strong>Key tuning decision:</strong> <code>OLLAMA_KEEP_ALIVE=&quot;-1&quot;</code> keeps the model permanently loaded — eliminates cold-start latency (5–30 seconds) at the cost of dedicated VRAM.</p>
<h3 id="serving-multiple-models">Serving Multiple Models</h3>
<p>With <code>OLLAMA_MAX_LOADED_MODELS=2</code> and sufficient VRAM, Ollama serves a generation model and embedding model simultaneously — the common RAG pattern. Requests to a model that isn&rsquo;t loaded trigger an automatic load (evicting the LRU model if at capacity).</p>
<p>For high-concurrency production deployments, run multiple Ollama instances on different ports and load-balance with nginx:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-nginx" data-lang="nginx"><span style="display:flex;"><span><span style="color:#66d9ef">upstream</span> <span style="color:#e6db74">ollama_cluster</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">server</span> localhost:<span style="color:#ae81ff">11434</span>;
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">server</span> localhost:<span style="color:#ae81ff">11435</span>;
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">server</span> localhost:<span style="color:#ae81ff">11436</span>;
</span></span><span style="display:flex;"><span>}
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">server</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">listen</span> <span style="color:#ae81ff">8080</span>;
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">location</span> <span style="color:#e6db74">/</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">proxy_pass</span> <span style="color:#e6db74">http://ollama_cluster</span>;
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">proxy_read_timeout</span> <span style="color:#e6db74">300s</span>;
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p><strong>Q: Does Ollama work without a GPU?</strong>
Yes. All models run on CPU with automatic fallback. A 7B model runs at 5–15 tokens/second on a modern CPU — usable for development and batch processing, though slower than GPU. Apple Silicon Macs are a special case: the M-series chips use unified memory for both CPU and GPU, achieving 30–60 tokens/second on 7B models without a discrete GPU.</p>
<p><strong>Q: How do I use Ollama with LangChain?</strong>
Install <code>langchain-ollama</code> and use <code>ChatOllama</code> and <code>OllamaEmbeddings</code>:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> langchain_ollama <span style="color:#f92672">import</span> ChatOllama, OllamaEmbeddings
</span></span><span style="display:flex;"><span>llm <span style="color:#f92672">=</span> ChatOllama(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;llama3.2&#34;</span>, temperature<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
</span></span><span style="display:flex;"><span>embeddings <span style="color:#f92672">=</span> OllamaEmbeddings(model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;nomic-embed-text&#34;</span>)
</span></span></code></pre></div><p>This plugs directly into any LangChain chain, agent, or retriever.</p>
<p><strong>Q: What&rsquo;s the difference between /api/generate and /api/chat?</strong>
<code>/api/generate</code> takes a single <code>prompt</code> string and is stateless — the model sees exactly what you send. <code>/api/chat</code> takes a <code>messages</code> array (system/user/assistant roles) and is designed for multi-turn conversations. Under the hood, <code>/api/chat</code> formats the messages into the model&rsquo;s chat template automatically. Use <code>/api/chat</code> for conversational interfaces and <code>/api/generate</code> for completion tasks where you want full control over the prompt.</p>
<p><strong>Q: Can I run multiple models at the same time?</strong>
Yes, with <code>OLLAMA_MAX_LOADED_MODELS</code> set to 2 or higher. Each loaded model occupies VRAM independently. Requests round-robin or queue per-model. The practical limit is VRAM: two 7B Q4_K_M models need ~9GB VRAM total.</p>
<p><strong>Q: How do I add a custom system prompt permanently?</strong>
Create a Modelfile:</p>



<div class="goat svg-container ">
  
    <svg
      xmlns="http://www.w3.org/2000/svg"
      font-family="Menlo,Lucida Console,monospace"
      
        viewBox="0 0 560 57"
      >
      <g transform='translate(8,16)'>
<text text-anchor='middle' x='0' y='4' fill='currentColor' style='font-size:1em'>F</text>
<text text-anchor='middle' x='0' y='20' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='0' y='36' fill='currentColor' style='font-size:1em'>P</text>
<text text-anchor='middle' x='8' y='4' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='8' y='20' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='8' y='36' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='16' y='4' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='16' y='20' fill='currentColor' style='font-size:1em'>S</text>
<text text-anchor='middle' x='16' y='36' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='24' y='4' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='24' y='20' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='24' y='36' fill='currentColor' style='font-size:1em'>A</text>
<text text-anchor='middle' x='32' y='20' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='32' y='36' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='40' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='40' y='20' fill='currentColor' style='font-size:1em'>M</text>
<text text-anchor='middle' x='40' y='36' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='48' y='4' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='48' y='36' fill='currentColor' style='font-size:1em'>T</text>
<text text-anchor='middle' x='56' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='56' y='20' fill='currentColor' style='font-size:1em'>"</text>
<text text-anchor='middle' x='56' y='36' fill='currentColor' style='font-size:1em'>E</text>
<text text-anchor='middle' x='64' y='4' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='64' y='20' fill='currentColor' style='font-size:1em'>Y</text>
<text text-anchor='middle' x='64' y='36' fill='currentColor' style='font-size:1em'>R</text>
<text text-anchor='middle' x='72' y='4' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='72' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='80' y='4' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='80' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='80' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='88' y='4' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='88' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='96' y='4' fill='currentColor' style='font-size:1em'>2</text>
<text text-anchor='middle' x='96' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='96' y='36' fill='currentColor' style='font-size:1em'>m</text>
<text text-anchor='middle' x='104' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='104' y='36' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='112' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='112' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='120' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='128' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='128' y='36' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='136' y='36' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='144' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='144' y='36' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='152' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='152' y='36' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='160' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='160' y='36' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='168' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='176' y='20' fill='currentColor' style='font-size:1em'>o</text>
<text text-anchor='middle' x='176' y='36' fill='currentColor' style='font-size:1em'>0</text>
<text text-anchor='middle' x='184' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='184' y='36' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='192' y='36' fill='currentColor' style='font-size:1em'>3</text>
<text text-anchor='middle' x='200' y='20' fill='currentColor' style='font-size:1em'>D</text>
<text text-anchor='middle' x='208' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='216' y='20' fill='currentColor' style='font-size:1em'>v</text>
<text text-anchor='middle' x='224' y='20' fill='currentColor' style='font-size:1em'>O</text>
<text text-anchor='middle' x='232' y='20' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='240' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='256' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='264' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='272' y='20' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='280' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='288' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='296' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='304' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='312' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='328' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='336' y='20' fill='currentColor' style='font-size:1em'>p</text>
<text text-anchor='middle' x='344' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='352' y='20' fill='currentColor' style='font-size:1em'>c</text>
<text text-anchor='middle' x='360' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='368' y='20' fill='currentColor' style='font-size:1em'>a</text>
<text text-anchor='middle' x='376' y='20' fill='currentColor' style='font-size:1em'>l</text>
<text text-anchor='middle' x='384' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='392' y='20' fill='currentColor' style='font-size:1em'>z</text>
<text text-anchor='middle' x='400' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='408' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='416' y='20' fill='currentColor' style='font-size:1em'>g</text>
<text text-anchor='middle' x='432' y='20' fill='currentColor' style='font-size:1em'>i</text>
<text text-anchor='middle' x='440' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='456' y='20' fill='currentColor' style='font-size:1em'>K</text>
<text text-anchor='middle' x='464' y='20' fill='currentColor' style='font-size:1em'>u</text>
<text text-anchor='middle' x='472' y='20' fill='currentColor' style='font-size:1em'>b</text>
<text text-anchor='middle' x='480' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='488' y='20' fill='currentColor' style='font-size:1em'>r</text>
<text text-anchor='middle' x='496' y='20' fill='currentColor' style='font-size:1em'>n</text>
<text text-anchor='middle' x='504' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='512' y='20' fill='currentColor' style='font-size:1em'>t</text>
<text text-anchor='middle' x='520' y='20' fill='currentColor' style='font-size:1em'>e</text>
<text text-anchor='middle' x='528' y='20' fill='currentColor' style='font-size:1em'>s</text>
<text text-anchor='middle' x='536' y='20' fill='currentColor' style='font-size:1em'>.</text>
<text text-anchor='middle' x='544' y='20' fill='currentColor' style='font-size:1em'>"</text>
</g>

    </svg>
  
</div>
<p>Then: <code>ollama create my-devops-model -f Modelfile</code>. The new model shows up in <code>/api/tags</code> and accepts API calls like any other model.</p>
]]></content:encoded></item></channel></rss>