<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Api on RockB</title><link>https://baeseokjae.github.io/tags/api/</link><description>Recent content in Api on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 21 Apr 2026 00:11:38 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/api/index.xml" rel="self" type="application/rss+xml"/><item><title>OpenAI Responses API Tutorial 2026: Build Stateful AI Apps in Python</title><link>https://baeseokjae.github.io/posts/openai-responses-api-tutorial-2026/</link><pubDate>Tue, 21 Apr 2026 00:11:38 +0000</pubDate><guid>https://baeseokjae.github.io/posts/openai-responses-api-tutorial-2026/</guid><description>Complete OpenAI Responses API tutorial 2026: stateful conversations, built-in tools, function calling, and migration from Chat Completions.</description><content:encoded><![CDATA[<p>The OpenAI Responses API is the new primary interface for building stateful, agentic AI applications — replacing the Assistants API (being sunset H1 2026) and extending beyond what Chat Completions can do. This tutorial walks through everything from your first API call to building multi-step agents with built-in tools like web search and file retrieval.</p>
<h2 id="what-is-the-openai-responses-api">What Is the OpenAI Responses API?</h2>
<p>The OpenAI Responses API is a stateful, tool-native interface for building AI agents and multi-turn applications — launched in March 2025 as OpenAI&rsquo;s replacement for the Assistants API and a significant evolution beyond Chat Completions. Unlike Chat Completions, which is stateless (every request requires you to resend the full conversation history), Responses API maintains conversation state server-side using <code>previous_response_id</code>. A 10-turn conversation with Chat Completions resends your entire history on turn 10, making it up to 5x more expensive for long dialogues. Responses API sends only the new message each turn — the server already holds context. Built-in tools (web search at $25–50/1K queries, file search at $2.50/1K queries) are first-class citizens rather than custom function definitions, and reasoning tokens from o3 and o4-mini are preserved between turns instead of being discarded. OpenAI has moved all example code in the openai-python repository to Responses API patterns — it is where the platform is going.</p>
<h3 id="key-architecture-concepts">Key Architecture Concepts</h3>
<p>The Responses API is built around three core primitives that differ from Chat Completions:</p>
<ul>
<li><strong>Response objects</strong> — Each API call returns a Response object with an <code>id</code> field. Pass this as <code>previous_response_id</code> in the next call to chain turns without resending history.</li>
<li><strong>Built-in tools</strong> — <code>web_search_preview</code>, <code>file_search</code>, and <code>computer_use_preview</code> are activated by including them in the <code>tools</code> array. No custom server infrastructure required.</li>
<li><strong>Semantic streaming events</strong> — Instead of raw token deltas, streaming emits structured events like <code>response.output_item.added</code>, <code>response.content_part.added</code>, and <code>response.done</code>.</li>
</ul>
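<p>To make the event model concrete, here is a sketch of consuming a semantic stream (the live call runs only when <code>OPENAI_API_KEY</code> is set; <code>collect_text</code> is a hypothetical reducer, and only <code>response.output_text.delta</code> events are assumed to carry text fragments):</p>

```python
import os

def collect_text(events):
    # Hypothetical reducer over (event_type, payload) pairs: only
    # 'response.output_text.delta' events contribute text fragments;
    # lifecycle events like 'response.output_item.added' are ignored.
    return "".join(payload for etype, payload in events
                   if etype == "response.output_text.delta")

# Live streaming sketch -- guarded so the file imports without a key.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    stream = client.responses.create(model="gpt-4o", input="Say hello.", stream=True)
    events = [(event.type, getattr(event, "delta", "")) for event in stream]
    print(collect_text(events))
```

<p>Dispatching on event type rather than raw deltas is what lets a UI render tool-call progress and text separately from one stream.</p>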
<h2 id="chat-completions-vs-responses-api-vs-assistants-api">Chat Completions vs Responses API vs Assistants API</h2>
<p>The Responses API occupies a distinct position: it is more capable than Chat Completions for stateful and agentic workflows, while being simpler and cheaper than the Assistants API that it is replacing. Understanding which to use requires knowing what each one manages for you versus what you manage yourself. Chat Completions gives you maximum control (you own all state, all persistence, all tool execution loops) at the cost of client-side complexity. Responses API moves state management and tool orchestration server-side while keeping the request/response model familiar. Assistants API managed Threads, Runs, and Files as persistent objects — a full lifecycle that developers found overly complex for most use cases. OpenAI is converging on Responses API as the primary stateful API.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>Chat Completions</th>
          <th>Responses API</th>
          <th>Assistants API</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>State management</td>
          <td>Client-side</td>
          <td>Server-side</td>
          <td>Server-side (Threads)</td>
      </tr>
      <tr>
          <td>Built-in tools</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes (Code Interpreter, etc.)</td>
      </tr>
      <tr>
          <td>Reasoning token preservation</td>
          <td>No</td>
          <td>Yes</td>
          <td>No</td>
      </tr>
      <tr>
          <td>Pricing overhead</td>
          <td>Lowest</td>
          <td>Medium</td>
          <td>Highest</td>
      </tr>
      <tr>
          <td>Streaming events</td>
          <td>Raw token deltas</td>
          <td>Semantic events</td>
          <td>SSE stream</td>
      </tr>
      <tr>
          <td>Status</td>
          <td>Active</td>
          <td>Active (primary)</td>
          <td>Sunset H1 2026</td>
      </tr>
      <tr>
          <td>Multi-provider support</td>
          <td>Wide</td>
          <td>Open Responses spec</td>
          <td>OpenAI only</td>
      </tr>
  </tbody>
</table>
<p>The migration path from Assistants to Responses is the most urgent — H1 2026 sunset means any Threads/Runs code needs to be ported now.</p>
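<p>A minimal sketch of that port, assuming the old thread history has been exported as role/text pairs (the <code>thread_to_input</code> helper is hypothetical; the Responses API accepts a list of role/content messages as <code>input</code>):</p>

```python
import os

def thread_to_input(thread_messages):
    # Hypothetical one-time migration helper: flatten exported
    # Assistants thread messages into a Responses API input list.
    # Each exported message is assumed to be {"role": ..., "text": ...}.
    return [{"role": m["role"], "content": m["text"]} for m in thread_messages]

# Guarded live call: replay the exported thread once, then chain
# subsequent turns with previous_response_id instead of resending it.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI
    client = OpenAI()
    history = [
        {"role": "user", "text": "Summarize my last order."},
        {"role": "assistant", "text": "Two widgets, ordered May 3."},
    ]
    response = client.responses.create(
        model="gpt-4o",
        input=thread_to_input(history)
        + [{"role": "user", "content": "Ship it to Berlin."}],
    )
    print(response.id)  # store this; future turns need only the new message
```

<p>After the one-time replay, the Threads/Runs lifecycle disappears entirely: the stored <code>response.id</code> is the only state the client keeps.</p>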
<h2 id="getting-started-your-first-responses-api-call">Getting Started: Your First Responses API Call</h2>
<p>Making your first Responses API call requires the <code>openai</code> Python package (version ≥ 1.66.0 for full Responses support) and an API key. The shape of the request is close to Chat Completions but uses a different method (<code>client.responses.create</code>) and a different response schema. The critical difference from Chat Completions is the <code>input</code> parameter, which replaces <code>messages</code>; the <code>model</code> field accepts GPT-4o, o3, and o4-mini identifiers alike. The response is a <code>Response</code> object with an <code>id</code> field that enables state chaining, <code>output</code> containing the model&rsquo;s reply, and usage statistics. You do not need to configure threads, assistants, or vector stores before making your first call — just the model and the input.</p>
<p><strong>Install and authenticate:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bash" data-lang="bash"><span style="display:flex;"><span>pip install <span style="color:#e6db74">&#34;openai&gt;=1.66.0&#34;</span>
</span></span><span style="display:flex;"><span>export OPENAI_API_KEY<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;sk-...&#34;</span>
</span></span></code></pre></div><p><strong>Your first call (Python):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain the difference between Responses API and Chat Completions in one paragraph.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Response ID: </span><span style="color:#e6db74">{</span>response<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)  <span style="color:#75715e"># Save this for multi-turn</span>
</span></span></code></pre></div><p><strong>JavaScript/TypeScript equivalent:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-javascript" data-lang="javascript"><span style="display:flex;"><span><span style="color:#66d9ef">import</span> <span style="color:#a6e22e">OpenAI</span> <span style="color:#a6e22e">from</span> <span style="color:#e6db74">&#34;openai&#34;</span>;
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">client</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">new</span> <span style="color:#a6e22e">OpenAI</span>();
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">const</span> <span style="color:#a6e22e">response</span> <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> <span style="color:#a6e22e">client</span>.<span style="color:#a6e22e">responses</span>.<span style="color:#a6e22e">create</span>({
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">model</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">input</span><span style="color:#f92672">:</span> <span style="color:#e6db74">&#34;Explain the difference between Responses API and Chat Completions.&#34;</span>
</span></span><span style="display:flex;"><span>});
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#a6e22e">response</span>.<span style="color:#a6e22e">output</span>[<span style="color:#ae81ff">0</span>].<span style="color:#a6e22e">content</span>[<span style="color:#ae81ff">0</span>].<span style="color:#a6e22e">text</span>);
</span></span><span style="display:flex;"><span><span style="color:#a6e22e">console</span>.<span style="color:#a6e22e">log</span>(<span style="color:#e6db74">`Response ID: </span><span style="color:#e6db74">${</span><span style="color:#a6e22e">response</span>.<span style="color:#a6e22e">id</span><span style="color:#e6db74">}</span><span style="color:#e6db74">`</span>);
</span></span></code></pre></div><p>The response object structure is different from <code>ChatCompletion</code> — <code>output</code> is a list of items, each with a <code>content</code> list. For a plain text request the reply text is at <code>response.output[0].content[0].text</code>; when tools or reasoning models are involved, the first item may be a tool call rather than the message, so scan <code>output</code> for the item whose <code>type</code> is <code>message</code>.</p>
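<p>A small defensive extraction helper is worth keeping around, since the first item is not guaranteed to be the message (a sketch; <code>first_output_text</code> is a hypothetical name, demonstrated here on mock objects):</p>

```python
from types import SimpleNamespace

def first_output_text(output):
    # Return the first text content from a Responses `output` list,
    # skipping non-message items such as tool calls or reasoning blocks.
    for item in output:
        if getattr(item, "type", None) == "message":
            for part in item.content:
                if getattr(part, "text", None) is not None:
                    return part.text
    return None

# Mimic a response whose first item is a tool call, not the message.
fake_output = [
    SimpleNamespace(type="web_search_call", status="completed"),
    SimpleNamespace(type="message", content=[SimpleNamespace(text="Hello!")]),
]
print(first_output_text(fake_output))  # → Hello!
```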
<h2 id="server-side-state-management-with-previous_response_id">Server-Side State Management with previous_response_id</h2>
<p>Server-side state management via <code>previous_response_id</code> is the most significant capability that Responses API adds over Chat Completions. When you pass a <code>previous_response_id</code> to a new request, the OpenAI server reconstructs the conversation context internally — you only send the new user message, not the full history. This eliminates the most expensive part of long conversations: re-tokenizing and re-encoding historical messages on every turn. For a 10-turn conversation with 500 tokens per turn, Chat Completions sends approximately 5,000 tokens on turn 10 (full history) while Responses API sends roughly 500 tokens (just the new input). At scale across thousands of daily active users, this is not a marginal difference. Reasoning tokens from o3 and o4-mini are also preserved — the model&rsquo;s internal chain-of-thought from turn 3 informs turn 7, producing more coherent agentic behavior than Chat Completions where that reasoning context is lost.</p>
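<p>The arithmetic behind that claim can be sketched with a toy cost model (a simplification that assumes equal-sized turns and ignores reply tokens):</p>

```python
def tokens_sent(turns, tokens_per_turn, stateful):
    # Cumulative input tokens transmitted over a conversation.
    # Chat Completions (stateful=False) resends the whole history each
    # turn, so turn k costs k * tokens_per_turn: an arithmetic series.
    # Responses API (stateful=True) sends only the new message per turn.
    if stateful:
        return turns * tokens_per_turn
    return tokens_per_turn * turns * (turns + 1) // 2

print(tokens_sent(10, 500, stateful=False))  # 27500 cumulative; turn 10 alone sends 5000
print(tokens_sent(10, 500, stateful=True))   # 5000 cumulative; turn 10 alone sends 500
```

<p>The gap widens quadratically with conversation length, which is why the savings matter most for long-running sessions.</p>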
<p><strong>Multi-turn conversation example:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 1</span>
</span></span><span style="display:flex;"><span>response_1 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;I&#39;m building a Python web scraper. Where should I start?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_1<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 2 — only send new message, server holds context</span>
</span></span><span style="display:flex;"><span>response_2 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response_1<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Which HTTP library would you recommend for async scraping?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_2<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Turn 3 — chain continues</span>
</span></span><span style="display:flex;"><span>response_3 <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response_2<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Show me a basic example using that library.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">&#34;Assistant:&#34;</span>, response_3<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><p>Store <code>response.id</code> in your database alongside the user session. When the user returns, load their latest <code>response_id</code> and pass it as <code>previous_response_id</code> — the conversation resumes with full context.</p>
<h3 id="managing-state-in-production">Managing State in Production</h3>
<p>For production applications, treat <code>previous_response_id</code> like a foreign key in your session table:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> sqlite3
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>db <span style="color:#f92672">=</span> sqlite3<span style="color:#f92672">.</span>connect(<span style="color:#e6db74">&#34;sessions.db&#34;</span>)
</span></span><span style="display:flex;"><span>db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;CREATE TABLE IF NOT EXISTS sessions (user_id TEXT, last_response_id TEXT)&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(user_id: str, message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    row <span style="color:#f92672">=</span> db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;SELECT last_response_id FROM sessions WHERE user_id=?&#34;</span>, (user_id,))<span style="color:#f92672">.</span>fetchone()
</span></span><span style="display:flex;"><span>    prev_id <span style="color:#f92672">=</span> row[<span style="color:#ae81ff">0</span>] <span style="color:#66d9ef">if</span> row <span style="color:#66d9ef">else</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>        previous_response_id<span style="color:#f92672">=</span>prev_id
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    new_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>    db<span style="color:#f92672">.</span>execute(<span style="color:#e6db74">&#34;INSERT OR REPLACE INTO sessions VALUES (?, ?)&#34;</span>, (user_id, new_id))
</span></span><span style="display:flex;"><span>    db<span style="color:#f92672">.</span>commit()
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span></code></pre></div><h2 id="built-in-tools-web-search-file-search-and-computer-use">Built-in Tools: Web Search, File Search, and Computer Use</h2>
<p>Built-in tools in the Responses API replace custom infrastructure that developers previously had to build and maintain themselves. Web search (<code>web_search_preview</code>) lets the model query the live web and return cited results without you managing a search API key or result parsing logic. File search (<code>file_search</code>) enables semantic retrieval over uploaded documents using OpenAI-hosted vector stores — at $2.50 per 1,000 queries with the first gigabyte of storage free and $0.10/GB/day after that. Computer use (<code>computer_use_preview</code>) allows the model to control a browser or desktop environment, opening the door to automation workflows that were previously limited to specialized tools. These tools are activated by listing them in the <code>tools</code> array of your request — no separate SDK, no custom endpoints. The model decides when to invoke them based on the user&rsquo;s input, executes them server-side, and returns the enriched response in a single API call.</p>
<p><strong>Web search tool:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What are the latest OpenAI API pricing changes in 2026?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Response includes citations</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;message&#34;</span>:
</span></span><span style="display:flex;"><span>        print(item<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;web_search_call&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Searched: </span><span style="color:#e6db74">{</span>item<span style="color:#f92672">.</span>query<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>File search with vector store:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Upload files first</span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">&#34;docs/api_reference.pdf&#34;</span>, <span style="color:#e6db74">&#34;rb&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>    file <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(file<span style="color:#f92672">=</span>f, purpose<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;assistants&#34;</span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Create vector store</span>
</span></span><span style="display:flex;"><span>vs <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>vector_stores<span style="color:#f92672">.</span>create(name<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;API Docs&#34;</span>)
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>vector_stores<span style="color:#f92672">.</span>files<span style="color:#f92672">.</span>create(vector_store_id<span style="color:#f92672">=</span>vs<span style="color:#f92672">.</span>id, file_id<span style="color:#f92672">=</span>file<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Query with file search</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;file_search&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;vector_store_ids&#34;</span>: [vs<span style="color:#f92672">.</span>id]
</span></span><span style="display:flex;"><span>    }],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What are the rate limits for the Responses API?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span><span style="color:#75715e"># The first output item may be the file_search_call, not the message</span>
</span></span><span style="display:flex;"><span>message <span style="color:#f92672">=</span> next(item <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;message&#34;</span>)
</span></span><span style="display:flex;"><span>print(message<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><p><strong>Tool pricing summary:</strong></p>
<table>
  <thead>
      <tr>
          <th>Tool</th>
          <th>Cost</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><code>web_search_preview</code></td>
          <td>$25–50 per 1,000 queries</td>
      </tr>
      <tr>
          <td><code>file_search</code></td>
          <td>$2.50 per 1,000 queries + $0.10/GB/day storage (first GB free)</td>
      </tr>
      <tr>
          <td><code>computer_use_preview</code></td>
          <td>Billed at model token rates + compute</td>
      </tr>
  </tbody>
</table>
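<p>A back-of-the-envelope estimator makes the table concrete (a sketch; the $30/1K web search rate is an assumed midpoint of the published $25–50 range, and a 30-day month is assumed for storage):</p>

```python
def monthly_tool_cost(web_queries=0, file_queries=0, storage_gb=0.0,
                      web_rate=30.0):
    # Rough monthly USD estimate for built-in tool usage.
    # Assumptions: web search at `web_rate` per 1,000 queries, file
    # search at $2.50 per 1,000 queries, storage at $0.10/GB/day with
    # the first GB free. Model token costs are NOT included.
    web = web_queries / 1000 * web_rate
    files = file_queries / 1000 * 2.50
    storage = max(storage_gb - 1.0, 0.0) * 0.10 * 30
    return round(web + files + storage, 2)

# 10k web searches ($300) + 50k file searches ($125) + 4 billable GB ($12)
print(monthly_tool_cost(web_queries=10_000, file_queries=50_000, storage_gb=5))  # → 437.0
```

<p>Running this against your expected query volume before enabling web search is worth it — at $25–50/1K, search-heavy agents can out-cost their token spend.</p>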
<h2 id="function-calling-with-the-responses-api">Function Calling with the Responses API</h2>
<p>Function calling in the Responses API follows the same five-step loop as Chat Completions, but integrates cleanly with server-side state so you do not need to manually reconstruct conversation history after each tool execution. The loop is: define tools → send request → model returns <code>function_call</code> items in <code>output</code> → execute functions locally → send results back with <code>previous_response_id</code> → model generates final response. Strict mode (<code>strict: true</code>) uses constrained decoding at token generation time to guarantee 100% schema compliance — critical for production agents where a malformed JSON response would break your execution logic. Parallel tool calls allow the model to request multiple function executions in a single response; you run all of them simultaneously and return all results in one follow-up request.</p>
<p><strong>Five-step function calling loop:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 1: Define tools</span>
</span></span><span style="display:flex;"><span>tools <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;get_weather&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Get current weather for a city&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;city&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;units&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;enum&#34;</span>: [<span style="color:#e6db74">&#34;celsius&#34;</span>, <span style="color:#e6db74">&#34;fahrenheit&#34;</span>]}
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;city&#34;</span>, <span style="color:#e6db74">&#34;units&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>  <span style="color:#75715e"># Step 1b: Enable strict mode</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 2: Send request</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>tools,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;What&#39;s the weather in Tokyo and Berlin?&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 3: Check for tool calls</span>
</span></span><span style="display:flex;"><span>tool_calls <span style="color:#f92672">=</span> [item <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 4: Execute functions</span>
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">for</span> tc <span style="color:#f92672">in</span> tool_calls:
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(tc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Your actual implementation</span>
</span></span><span style="display:flex;"><span>    weather_data <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;temperature&#34;</span>: <span style="color:#ae81ff">18</span>, <span style="color:#e6db74">&#34;condition&#34;</span>: <span style="color:#e6db74">&#34;partly cloudy&#34;</span>}
</span></span><span style="display:flex;"><span>    results<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;call_id&#34;</span>: tc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(weather_data)
</span></span><span style="display:flex;"><span>    })
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Step 5: Send results, get final response</span>
</span></span><span style="display:flex;"><span>final <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>response<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span>results
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>print(final<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text)
</span></span></code></pre></div><h3 id="parallel-tool-calls">Parallel Tool Calls</h3>
<p>When the model needs multiple data points, it can request them all at once. Execute in parallel and return all results together:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> asyncio
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">async</span> <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">execute_tool</span>(tc):
</span></span><span style="display:flex;"><span>    args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(tc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>    <span style="color:#75715e"># Async execution of each tool call</span>
</span></span><span style="display:flex;"><span>    result <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> fetch_data(args)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;call_id&#34;</span>: tc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(result)
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>tool_calls <span style="color:#f92672">=</span> [item <span style="color:#66d9ef">for</span> item <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>results <span style="color:#f92672">=</span> <span style="color:#66d9ef">await</span> asyncio<span style="color:#f92672">.</span>gather(<span style="color:#f92672">*</span>[execute_tool(tc) <span style="color:#66d9ef">for</span> tc <span style="color:#f92672">in</span> tool_calls])
</span></span></code></pre></div><p>For dependent operations (tool B requires tool A&rsquo;s output), set <code>parallel_tool_calls: False</code>, or use o3/o4-mini, which naturally sequence calls based on their reasoning.</p>
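One way to drive a dependent sequence is a loop that forces at most one tool call per response and feeds each result back before requesting the next. A minimal sketch — <code>run_sequential</code> and the <code>execute</code> dispatcher are our own illustrative names, not part of the SDK:

```python
import json


def run_sequential(client, tools, user_input, execute, model="gpt-4o"):
    """Resolve dependent tool calls one at a time.

    `execute(name, args)` is a caller-supplied dispatcher (hypothetical helper).
    """
    response = client.responses.create(
        model=model,
        tools=tools,
        input=user_input,
        parallel_tool_calls=False,  # model emits at most one tool call per turn
    )
    while True:
        calls = [item for item in response.output if item.type == "function_call"]
        if not calls:
            return response  # no more tool calls: final answer is ready
        tc = calls[0]
        result = execute(tc.name, json.loads(tc.arguments))
        response = client.responses.create(
            model=model,
            tools=tools,
            previous_response_id=response.id,  # chain state server-side
            parallel_tool_calls=False,
            input=[{
                "type": "function_call_output",
                "call_id": tc.call_id,
                "output": json.dumps(result),
            }],
        )
```

Because each iteration chains on `previous_response_id`, the model sees tool A's output before deciding whether (and how) to call tool B.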
<h2 id="strict-mode-and-schema-enforcement-for-production">Strict Mode and Schema Enforcement for Production</h2>
<p>Strict mode in the Responses API&rsquo;s function calling achieves 100% schema compliance by applying constrained decoding at the token generation level — the model cannot produce a token that would violate your JSON schema. This is fundamentally different from prompt-level instructions (&ldquo;always return valid JSON&rdquo;) which can fail under adversarial inputs or long context. For production agents processing thousands of tool call cycles, even a 0.1% JSON parse failure rate creates operational overhead: error logging, retry logic, fallback handling, user-facing error states. Strict mode eliminates this class of failure entirely at generation time. The requirement is that your schema uses only supported types (<code>string</code>, <code>number</code>, <code>boolean</code>, <code>object</code>, <code>array</code>, <code>null</code>), sets <code>additionalProperties: false</code> on all objects, and marks all properties as <code>required</code>. These constraints are strict mode&rsquo;s trade-off: less flexible schemas in exchange for guaranteed compliance.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>tool_schema <span style="color:#f92672">=</span> {
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;create_ticket&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Create a support ticket in the system&#34;</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;title&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;priority&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;enum&#34;</span>: [<span style="color:#e6db74">&#34;low&#34;</span>, <span style="color:#e6db74">&#34;medium&#34;</span>, <span style="color:#e6db74">&#34;high&#34;</span>, <span style="color:#e6db74">&#34;critical&#34;</span>]},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;assignee_id&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: [<span style="color:#e6db74">&#34;string&#34;</span>, <span style="color:#e6db74">&#34;null&#34;</span>]},
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;tags&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;array&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;items&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>}
</span></span><span style="display:flex;"><span>            }
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;title&#34;</span>, <span style="color:#e6db74">&#34;priority&#34;</span>, <span style="color:#e6db74">&#34;assignee_id&#34;</span>, <span style="color:#e6db74">&#34;tags&#34;</span>],
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>    },
</span></span><span style="display:flex;"><span>    <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p>With <code>strict: True</code>, if the model cannot fit a value into your schema, it will use <code>null</code> for nullable fields rather than hallucinating invalid values.</p>
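On the consuming side this means argument parsing needs no defensive <code>.get()</code> fallbacks — every required key is guaranteed present, and "no value" is an explicit <code>null</code>. A small sketch (the arguments string is illustrative, shaped like what a strict-mode model could emit for <code>create_ticket</code>):

```python
import json

# Illustrative strict-mode output: all required keys present,
# nullable assignee_id explicitly null instead of a made-up ID.
raw_arguments = (
    '{"title": "Login page returns 500", "priority": "high", '
    '"assignee_id": null, "tags": ["auth", "regression"]}'
)

args = json.loads(raw_arguments)

# Strict mode guarantees the key set matches the schema exactly.
assert set(args) == {"title", "priority", "assignee_id", "tags"}

# An explicit null signals "unassigned" -- route accordingly.
assignee = args["assignee_id"] or "triage-queue"
print(assignee)  # triage-queue
```

The `triage-queue` fallback is application logic, not API behavior — the point is that nullability is a schema-level contract you can branch on safely.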
<h2 id="streaming-with-semantic-events">Streaming with Semantic Events</h2>
<p>Streaming in the Responses API uses structured semantic events rather than the raw <code>choices[0].delta.content</code> tokens you get from Chat Completions. This matters for building reactive UIs and agent orchestration loops: you know exactly when a tool call starts, when content is being added, and when the response is complete — without parsing partial JSON or managing your own buffer state. Semantic events include <code>response.output_item.added</code> (new output item starting), <code>response.content_part.added</code> (new content part), <code>response.output_text.delta</code> (token-by-token text), <code>response.function_call_arguments.delta</code> (streaming tool call arguments), and <code>response.completed</code> (full response complete with final object). This is a meaningful ergonomic improvement for streaming agents because tool call arguments arrive incrementally — you can start validation or UI feedback before the full JSON is assembled.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">with</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>stream(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Search for the latest news on OpenAI Responses API&#34;</span>
</span></span><span style="display:flex;"><span>) <span style="color:#66d9ef">as</span> stream:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">for</span> event <span style="color:#f92672">in</span> stream:
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.output_text.delta&#34;</span>:
</span></span><span style="display:flex;"><span>            print(event<span style="color:#f92672">.</span>delta, end<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;&#34;</span>, flush<span style="color:#f92672">=</span><span style="color:#66d9ef">True</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.output_item.added&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">if</span> event<span style="color:#f92672">.</span>item<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;web_search_call&#34;</span>:
</span></span><span style="display:flex;"><span>                print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n</span><span style="color:#e6db74">[Searching: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>item<span style="color:#f92672">.</span>query<span style="color:#e6db74">}</span><span style="color:#e6db74">]&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">elif</span> event<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;response.completed&#34;</span>:
</span></span><span style="display:flex;"><span>            print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#ae81ff">\n\n</span><span style="color:#e6db74">Final response ID: </span><span style="color:#e6db74">{</span>event<span style="color:#f92672">.</span>response<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><h2 id="cost-architecture-when-to-use-which-api">Cost Architecture: When to Use Which API</h2>
<p>The Responses API sits between Chat Completions (lowest cost) and Assistants API (highest overhead) in terms of cost structure. For short, single-turn interactions, Chat Completions is still cheaper — there is no state storage overhead and no per-query tool pricing. For conversations longer than 3–4 turns, Responses API often wins because you stop paying to re-tokenize history: a 10-turn conversation with 500 tokens of context per turn costs roughly 5,000 input tokens on Chat Completions turn 10 vs roughly 500 tokens on Responses API. The break-even point depends on your average conversation length and token costs for your chosen model. Built-in tools add per-use costs but replace infrastructure you would otherwise build: a self-hosted web search integration requires API keys, result parsing, prompt injection into context, and maintenance. At $25–50/1K queries, <code>web_search_preview</code> is often cheaper than developer time for low-to-medium volume applications.</p>
<table>
  <thead>
      <tr>
          <th>Scenario</th>
          <th>Recommended API</th>
          <th>Reason</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single-turn completions, high volume</td>
          <td>Chat Completions</td>
          <td>No state overhead</td>
      </tr>
      <tr>
          <td>Multi-turn chat (3+ turns)</td>
          <td>Responses API</td>
          <td>Avoids history resend cost</td>
      </tr>
      <tr>
          <td>Document Q&amp;A with file retrieval</td>
          <td>Responses API + file_search</td>
          <td>Built-in vector store</td>
      </tr>
      <tr>
          <td>Web-augmented research agents</td>
          <td>Responses API + web_search</td>
          <td>No custom search infra</td>
      </tr>
      <tr>
          <td>Legacy Assistants code</td>
          <td>Migrate to Responses</td>
          <td>Assistants sunset H1 2026</td>
      </tr>
      <tr>
          <td>Multi-provider portability</td>
          <td>Responses API (Open Responses spec)</td>
          <td>Works on Ollama, vLLM, etc.</td>
      </tr>
  </tbody>
</table>
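The break-even arithmetic from the cost discussion above is easy to sketch: a stateless API resends everything sent so far on each turn, while a stateful one sends only the delta (input tokens only; per-token pricing varies by model):

```python
def stateless_input_tokens(turns, tokens_per_turn):
    # Chat Completions: turn k resends the whole k-message history.
    return [k * tokens_per_turn for k in range(1, turns + 1)]


def stateful_input_tokens(turns, tokens_per_turn):
    # Responses API: only the new message each turn; the server holds context.
    return [tokens_per_turn] * turns


stateless = stateless_input_tokens(10, 500)
stateful = stateful_input_tokens(10, 500)

print(stateless[-1], stateful[-1])    # 5000 500   -> the turn-10 gap from the text
print(sum(stateless), sum(stateful))  # 27500 5000 -> whole-dialogue totals
```

The per-turn gap grows linearly, so the cumulative savings over a full dialogue are larger than the single-turn comparison suggests.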
<h2 id="the-open-responses-specification">The Open Responses Specification</h2>
<p>The Open Responses specification is a multi-provider API standard backed by OpenAI, Nvidia, Vercel, OpenRouter, Hugging Face, LM Studio, Ollama, and vLLM — defining a shared interface for stateful AI responses that any compatible server can implement. This matters for developers building on the Responses API because it means your code is not locked to OpenAI infrastructure. Ollama added Open Responses support in v0.13.3 (non-stateful flavor for local models), and vLLM ships a fully compatible server for self-hosted deployments. Azure OpenAI also supports the Responses API through its own hosted endpoint. The specification defines the request/response schema, streaming event format, and tool calling protocol — the same <code>previous_response_id</code> chaining, same <code>tools</code> array format, same semantic streaming events. Write once, run on OpenAI, Azure, local Ollama, or any vLLM deployment.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Point to any Open Responses-compatible server</span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI(
</span></span><span style="display:flex;"><span>    api_key<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;ollama&#34;</span>,  <span style="color:#75715e"># or your local API key</span>
</span></span><span style="display:flex;"><span>    base_url<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;http://localhost:11434/v1&#34;</span>  <span style="color:#75715e"># local Ollama; the SDK appends /responses</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Same code works — just the endpoint changes</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;llama3.2&#34;</span>,
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;Explain stateful conversation management.&#34;</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><h2 id="migrating-from-chat-completions-to-responses-api">Migrating from Chat Completions to Responses API</h2>
<p>Migrating from Chat Completions to Responses API is the most straightforward upgrade path because the model IDs are identical, the tool definition format is compatible, and you can migrate incrementally — route new features to Responses API while leaving existing Chat Completions code untouched. The surface-level change is <code>client.chat.completions.create()</code> → <code>client.responses.create()</code>, <code>messages</code> → <code>input</code>, and manually managed history → <code>previous_response_id</code>. For streaming, swap <code>for chunk in stream</code> token handling for semantic event processing. The deeper change is architectural: you stop owning conversation state in your database and delegate it to OpenAI&rsquo;s server, keeping only the <code>response_id</code> as a foreign key.</p>
<p><strong>Before (Chat Completions):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>history <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: message})
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>chat<span style="color:#f92672">.</span>completions<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        messages<span style="color:#f92672">=</span>history  <span style="color:#75715e"># Full history every time</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    reply <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>choices[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content
</span></span><span style="display:flex;"><span>    history<span style="color:#f92672">.</span>append({<span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;assistant&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>: reply})
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> reply
</span></span></code></pre></div><p><strong>After (Responses API):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span>last_response_id <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">chat</span>(message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">global</span> last_response_id
</span></span><span style="display:flex;"><span>    response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>        model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>        input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>        previous_response_id<span style="color:#f92672">=</span>last_response_id  <span style="color:#75715e"># Just the ID</span>
</span></span><span style="display:flex;"><span>    )
</span></span><span style="display:flex;"><span>    last_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span></code></pre></div><h2 id="migrating-from-assistants-api-before-h1-2026-sunset">Migrating from Assistants API Before H1 2026 Sunset</h2>
<p>The Assistants API is being sunset in H1 2026, which means any production code using Threads, Runs, Messages, or Assistants objects needs to be ported to Responses API before that date. The migration is not a one-to-one mapping — the conceptual model is different — but the capabilities are equivalent or improved. Threads (persistent conversation containers) map to <code>previous_response_id</code> chains. Runs (execution units with polling) are replaced by single synchronous or streaming Responses API calls. Messages objects (structured conversation history) are replaced by the <code>output</code> array in each Response. Assistants (reusable agent configurations with tools and system prompts) map to per-request <code>instructions</code> and <code>tools</code> parameters, or can be encapsulated in a Python class. The main operational change: you no longer poll for Run completion — Responses API calls block until complete (or stream incrementally).</p>
<p><strong>Assistants API pattern (to replace):</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># OLD: Assistants API (sunset H1 2026)</span>
</span></span><span style="display:flex;"><span>thread <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>create()
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>create(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id, role<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;user&#34;</span>, content<span style="color:#f92672">=</span>message)
</span></span><span style="display:flex;"><span>run <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>runs<span style="color:#f92672">.</span>create_and_poll(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id, assistant_id<span style="color:#f92672">=</span>assistant_id)
</span></span><span style="display:flex;"><span>messages <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>threads<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>list(thread_id<span style="color:#f92672">=</span>thread<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>reply <span style="color:#f92672">=</span> messages<span style="color:#f92672">.</span>data[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text<span style="color:#f92672">.</span>value
</span></span></code></pre></div><p><strong>Responses API equivalent:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># NEW: Responses API</span>
</span></span><span style="display:flex;"><span>response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>    instructions<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;You are a helpful assistant specializing in Python development.&#34;</span>,
</span></span><span style="display:flex;"><span>    tools<span style="color:#f92672">=</span>[{<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;file_search&#34;</span>, <span style="color:#e6db74">&#34;vector_store_ids&#34;</span>: [vs_id]}],
</span></span><span style="display:flex;"><span>    input<span style="color:#f92672">=</span>message,
</span></span><span style="display:flex;"><span>    previous_response_id<span style="color:#f92672">=</span>prev_response_id  <span style="color:#75715e"># replaces Thread</span>
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>reply <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>output[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span><span style="display:flex;"><span>prev_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id  <span style="color:#75715e"># store for next turn</span>
</span></span></code></pre></div><h2 id="building-a-complete-agent-end-to-end-tutorial">Building a Complete Agent: End-to-End Tutorial</h2>
<p>A complete Responses API agent combines server-side state, built-in tools, and function calling into a workflow that handles multi-step reasoning without manual orchestration loops. The following agent answers research questions by searching the web, retrieving relevant files, and synthesizing a cited response — all in a single Responses API call that handles tool execution internally when using built-in tools, or across two calls when using custom functions.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> json
</span></span><span style="display:flex;"><span><span style="color:#f92672">from</span> openai <span style="color:#f92672">import</span> OpenAI
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> OpenAI()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Agent configuration</span>
</span></span><span style="display:flex;"><span>TOOLS <span style="color:#f92672">=</span> [
</span></span><span style="display:flex;"><span>    {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;web_search_preview&#34;</span>},
</span></span><span style="display:flex;"><span>    {
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;name&#34;</span>: <span style="color:#e6db74">&#34;save_to_notes&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;description&#34;</span>: <span style="color:#e6db74">&#34;Save a research finding to the user&#39;s notes&#34;</span>,
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;parameters&#34;</span>: {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;object&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;properties&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;title&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;content&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>},
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;tags&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;array&#34;</span>, <span style="color:#e6db74">&#34;items&#34;</span>: {<span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;string&#34;</span>}}
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;required&#34;</span>: [<span style="color:#e6db74">&#34;title&#34;</span>, <span style="color:#e6db74">&#34;content&#34;</span>, <span style="color:#e6db74">&#34;tags&#34;</span>],
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;additionalProperties&#34;</span>: <span style="color:#66d9ef">False</span>
</span></span><span style="display:flex;"><span>        },
</span></span><span style="display:flex;"><span>        <span style="color:#e6db74">&#34;strict&#34;</span>: <span style="color:#66d9ef">True</span>
</span></span><span style="display:flex;"><span>    }
</span></span><span style="display:flex;"><span>]
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>SYSTEM_PROMPT <span style="color:#f92672">=</span> <span style="color:#e6db74">&#34;&#34;&#34;You are a research assistant. When asked a question:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">1. Search the web for current information
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">2. Synthesize findings with citations
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">3. If the user asks to save findings, use the save_to_notes function
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">Always cite your sources.&#34;&#34;&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ResearchAgent</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> __init__(self):
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>notes <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>last_response_id <span style="color:#f92672">=</span> <span style="color:#66d9ef">None</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">def</span> <span style="color:#a6e22e">run</span>(self, user_message: str) <span style="color:#f92672">-&gt;</span> str:
</span></span><span style="display:flex;"><span>        response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>            model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>            instructions<span style="color:#f92672">=</span>SYSTEM_PROMPT,
</span></span><span style="display:flex;"><span>            tools<span style="color:#f92672">=</span>TOOLS,
</span></span><span style="display:flex;"><span>            input<span style="color:#f92672">=</span>user_message,
</span></span><span style="display:flex;"><span>            previous_response_id<span style="color:#f92672">=</span>self<span style="color:#f92672">.</span>last_response_id
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Handle function calls (built-in tools execute automatically)</span>
</span></span><span style="display:flex;"><span>        function_calls <span style="color:#f92672">=</span> [i <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> response<span style="color:#f92672">.</span>output <span style="color:#66d9ef">if</span> i<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;function_call&#34;</span>]
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> function_calls:
</span></span><span style="display:flex;"><span>            results <span style="color:#f92672">=</span> []
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">for</span> fc <span style="color:#f92672">in</span> function_calls:
</span></span><span style="display:flex;"><span>                args <span style="color:#f92672">=</span> json<span style="color:#f92672">.</span>loads(fc<span style="color:#f92672">.</span>arguments)
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">if</span> fc<span style="color:#f92672">.</span>name <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;save_to_notes&#34;</span>:
</span></span><span style="display:flex;"><span>                    self<span style="color:#f92672">.</span>notes<span style="color:#f92672">.</span>append(args)
</span></span><span style="display:flex;"><span>                    result <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;saved&#34;</span>: <span style="color:#66d9ef">True</span>, <span style="color:#e6db74">&#34;note_count&#34;</span>: len(self<span style="color:#f92672">.</span>notes)}
</span></span><span style="display:flex;"><span>                <span style="color:#66d9ef">else</span>:  <span style="color:#75715e"># keep result bound if the model calls an unknown function</span>
</span></span><span style="display:flex;"><span>                    result <span style="color:#f92672">=</span> {<span style="color:#e6db74">&#34;error&#34;</span>: <span style="color:#e6db74">&#34;unknown function: &#34;</span> <span style="color:#f92672">+</span> fc<span style="color:#f92672">.</span>name}
</span></span><span style="display:flex;"><span>                results<span style="color:#f92672">.</span>append({
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;type&#34;</span>: <span style="color:#e6db74">&#34;function_call_output&#34;</span>,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;call_id&#34;</span>: fc<span style="color:#f92672">.</span>call_id,
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;output&#34;</span>: json<span style="color:#f92672">.</span>dumps(result)
</span></span><span style="display:flex;"><span>                })
</span></span><span style="display:flex;"><span>            <span style="color:#75715e"># Get final response after function execution</span>
</span></span><span style="display:flex;"><span>            response <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>responses<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>                model<span style="color:#f92672">=</span><span style="color:#e6db74">&#34;gpt-4o&#34;</span>,
</span></span><span style="display:flex;"><span>                previous_response_id<span style="color:#f92672">=</span>response<span style="color:#f92672">.</span>id,
</span></span><span style="display:flex;"><span>                input<span style="color:#f92672">=</span>results
</span></span><span style="display:flex;"><span>            )
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>        self<span style="color:#f92672">.</span>last_response_id <span style="color:#f92672">=</span> response<span style="color:#f92672">.</span>id
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">return</span> response<span style="color:#f92672">.</span>output_text  <span style="color:#75715e"># aggregates message text; output[0] may be a web_search call item</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># Usage</span>
</span></span><span style="display:flex;"><span>agent <span style="color:#f92672">=</span> ResearchAgent()
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;What are the key features of the OpenAI Responses API launched in 2025?&#34;</span>))
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;Save those findings to my notes with the tag &#39;openai-api&#39;&#34;</span>))
</span></span><span style="display:flex;"><span>print(agent<span style="color:#f92672">.</span>run(<span style="color:#e6db74">&#34;What questions do I still have based on what we&#39;ve discussed?&#34;</span>))
</span></span></code></pre></div><hr>
<h2 id="faq">FAQ</h2>
<p>The OpenAI Responses API introduces a fundamentally different programming model from Chat Completions and the Assistants API (being sunset in H1 2026). The most common questions from developers migrating existing applications center on state management, cost implications, and tool compatibility. The answers below cover <code>previous_response_id</code> chaining, the Assistants API sunset timeline, multi-provider portability via the Open Responses specification, cost savings on long conversations, and how custom function calling interacts with built-in tools. Each answer is self-contained and reflects Responses API behavior as of April 2026.</p>
<h3 id="what-is-the-difference-between-openai-responses-api-and-chat-completions">What is the difference between OpenAI Responses API and Chat Completions?</h3>
<p>The key difference is state management. Chat Completions is stateless — you send the full conversation history on every request and manage persistence yourself. Responses API maintains conversation state server-side via <code>previous_response_id</code>, so each turn only sends the new message. Responses API also includes built-in tools (web search, file search) that Chat Completions lacks, and preserves reasoning tokens between turns for o3 and o4-mini models.</p>
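<p>A minimal sketch of that difference — the code below only builds the request payloads each style would send (nothing is transmitted; the model name, history contents, and response id are illustrative):</p>

```python
# Chat Completions: the client owns history and resends all of it every turn.
history = [{"role": "system", "content": "You are helpful."}]

def chat_completions_payload(user_msg):
    # Real code would also append the assistant reply; user turns suffice here.
    history.append({"role": "user", "content": user_msg})
    return {"model": "gpt-4o", "messages": list(history)}  # full history each time

# Responses API: the server owns history; the client sends only the new turn.
def responses_payload(user_msg, previous_response_id=None):
    payload = {"model": "gpt-4o", "input": user_msg}
    if previous_response_id:
        payload["previous_response_id"] = previous_response_id
    return payload

p1 = chat_completions_payload("Hello")
p2 = chat_completions_payload("Tell me more")
print(len(p1["messages"]), len(p2["messages"]))  # grows every turn: 2 3
print(responses_payload("Tell me more", "resp_abc123"))  # one message plus an id
```

<p>The <code>messages</code> list grows with every turn, while the Responses payload stays the same size no matter how long the conversation runs.</p>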
<h3 id="when-will-the-assistants-api-be-sunset">When will the Assistants API be sunset?</h3>
<p>OpenAI has announced the Assistants API will be sunset in H1 2026. This means any production code using Threads, Runs, Messages, or the Assistants beta endpoints needs to be migrated to the Responses API before that deadline. The migration is well-documented and the Responses API provides all equivalent capabilities — stateful conversations, file retrieval, and tool use.</p>
<h3 id="is-the-openai-responses-api-available-on-azure-openai">Is the OpenAI Responses API available on Azure OpenAI?</h3>
<p>Yes. Azure OpenAI supports the Responses API through its hosted endpoint. Additionally, the Open Responses specification backed by Nvidia, Vercel, OpenRouter, and others enables the same API surface on Ollama (v0.13.3+), vLLM, and other compatible servers. The <code>base_url</code> parameter in the OpenAI Python client lets you point to any compatible server.</p>
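<p>As a configuration sketch (assuming a local Ollama instance exposing its OpenAI-compatible endpoint on the default port, with a model already pulled), repointing the client looks like this — the <code>api_key</code> value is a placeholder the client requires but a local server does not validate:</p>

```python
from openai import OpenAI

# Hypothetical local setup: Ollama's OpenAI-compatible server on its default port.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # placeholder; required by the client, unused locally
)
```

<p>The same pattern applies to any Open Responses-compatible server: only <code>base_url</code> (and possibly the model name) changes, not your application code.</p>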
<h3 id="how-does-previous_response_id-save-money-on-long-conversations">How does <code>previous_response_id</code> save money on long conversations?</h3>
<p>In a 10-turn conversation with Chat Completions, turn 10 sends the entire 9-turn history plus the new message — potentially thousands of tokens of input. With Responses API, turn 10 only sends the new message (a few hundred tokens) because the server already holds the full context. OpenAI estimates Chat Completions can be up to 5x more expensive for long conversations due to this history re-tokenization cost.</p>
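<p>A toy calculation of the tokens the client transmits over 10 turns, under assumed per-turn sizes (~200 tokens of new user text, ~150 tokens per assistant reply — both hypothetical):</p>

```python
# Hypothetical per-turn sizes; real counts depend on your prompts and replies.
new_tokens, reply_tokens, turns = 200, 150, 10

# Chat Completions: turn t resends all (t-1) prior user+assistant exchanges
# plus the new message.
chat_input = sum(
    (t - 1) * (new_tokens + reply_tokens) + new_tokens
    for t in range(1, turns + 1)
)

# Responses API: each turn transmits only the new message.
responses_input = new_tokens * turns

print(chat_input, responses_input)  # 17750 2000
```

<p>Under these toy numbers the client resends nearly 9x more input text over 10 turns; the actual billing difference depends on caching and how stored context is priced, hence the hedged "up to 5x" figure above.</p>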
<h3 id="can-i-use-both-function-calling-and-built-in-tools-in-the-same-responses-api-call">Can I use both function calling and built-in tools in the same Responses API call?</h3>
<p>Yes. You can include both custom function definitions and built-in tools (like <code>web_search_preview</code> or <code>file_search</code>) in the same <code>tools</code> array. The model will call whichever tools are relevant to the user&rsquo;s request. Built-in tools execute server-side and their results appear automatically in <code>response.output</code>, while custom function calls require your client to execute them and return results via a follow-up request with <code>previous_response_id</code>.</p>
]]></content:encoded></item></channel></rss>