<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Long-Form Generation on RockB</title><link>https://baeseokjae.github.io/tags/long-form-generation/</link><description>Recent content in Long-Form Generation on RockB</description><image><title>RockB</title><url>https://baeseokjae.github.io/images/og-default.png</url><link>https://baeseokjae.github.io/images/og-default.png</link></image><generator>Hugo</generator><language>en-us</language><lastBuildDate>Mon, 27 Apr 2026 01:04:22 +0000</lastBuildDate><atom:link href="https://baeseokjae.github.io/tags/long-form-generation/index.xml" rel="self" type="application/rss+xml"/><item><title>Claude API 300K Output Tokens: Complete Guide to Long-Form Generation (2026)</title><link>https://baeseokjae.github.io/posts/claude-api-max-tokens-300k-guide-2026/</link><pubDate>Mon, 27 Apr 2026 01:04:22 +0000</pubDate><guid>https://baeseokjae.github.io/posts/claude-api-max-tokens-300k-guide-2026/</guid><description>How to unlock Claude API 300K output tokens via the Message Batches API with the output-300k-2026-03-24 beta header — with Python code examples.</description><content:encoded><![CDATA[<p>The Claude API now supports up to 300,000 output tokens per request — roughly 460 pages of text in a single API call — but only through the Message Batches API with a specific beta header. The synchronous API remains capped at 64K tokens. This guide explains exactly how to enable 300K output, which models support it, when to use it, and what it costs.</p>
<h2 id="what-are-claude-api-300k-output-tokens">What Are Claude API 300K Output Tokens?</h2>
<p>Claude API 300K output tokens refers to Anthropic&rsquo;s maximum per-request generation limit, available on Claude Sonnet 4.6, Opus 4.6, and Opus 4.7 via the asynchronous Message Batches API. At approximately 650 words per 1,000 tokens, 300,000 tokens translates to roughly 195,000 words — the equivalent of a 460-page technical document or a full software codebase migration in a single API call. This capability is unlocked by passing the <code>output-300k-2026-03-24</code> beta header with your batch request; without it, even Sonnet 4.6 caps at 64K tokens on synchronous calls. The 300K limit represents a 4.7× increase over the previous 64K ceiling and is the highest output token limit of any major LLM API in 2026 — GPT-4o Long Output tops out at 64K, and Gemini 1.5 Pro at 8K. For enterprises running document generation, codebase analysis, or legal drafting pipelines, this change fundamentally alters the economics of LLM-based automation.</p>
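<p>The conversion is simple arithmetic. The snippet below is a quick sanity check using the 650-words-per-1,000-tokens ratio quoted above and an assumed ~425 words per printed page:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Back-of-the-envelope conversion using the ratios quoted above.
# WORDS_PER_PAGE is an assumption (typical technical-document density).
MAX_OUTPUT_TOKENS = 300_000
WORDS_PER_1K_TOKENS = 650
WORDS_PER_PAGE = 425

words = MAX_OUTPUT_TOKENS // 1_000 * WORDS_PER_1K_TOKENS  # 195,000 words
pages = words // WORDS_PER_PAGE                           # ~458 pages

print(f"{MAX_OUTPUT_TOKENS:,} tokens is roughly {words:,} words, or about {pages:,} pages")
</code></pre></div>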
<h2 id="which-models-support-300k-output-tokens">Which Models Support 300K Output Tokens?</h2>
<p>The 300K output token limit is available on specific Claude models only, and the standard 8K or 64K limits still apply everywhere else. Knowing which model to use prevents silent truncation and wasted API spend.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Standard Output Limit</th>
          <th>300K Batch Output</th>
          <th>Batch Pricing (Output/MTok)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Opus 4.7</td>
          <td>32K</td>
          <td>Yes</td>
          <td>$12.50</td>
      </tr>
      <tr>
          <td>Claude Opus 4.6</td>
          <td>32K</td>
          <td>Yes</td>
          <td>$12.50</td>
      </tr>
      <tr>
          <td>Claude Sonnet 4.6</td>
          <td>64K</td>
          <td>Yes</td>
          <td>$7.50</td>
      </tr>
      <tr>
          <td>Claude Haiku 4.5</td>
          <td>8K</td>
          <td>No</td>
          <td>$1.25</td>
      </tr>
      <tr>
          <td>Claude Sonnet 3.7</td>
          <td>64K</td>
          <td>No</td>
          <td>—</td>
      </tr>
  </tbody>
</table>
<p><strong>Key rule:</strong> Only models in the Claude 4.x family with the <code>output-300k-2026-03-24</code> beta header on a Message Batches API request get 300K output. Haiku 4.5 and all 3.x-series models are excluded. When in doubt, use Sonnet 4.6 for cost-efficiency (batch output at $7.50/MTok) or Opus 4.7 for maximum instruction-following fidelity on deeply structured documents.</p>
<h2 id="how-to-enable-300k-output-the-output-300k-2026-03-24-beta-header">How to Enable 300K Output: The output-300k-2026-03-24 Beta Header</h2>
<p>Enabling the 300K output limit requires adding a single <code>anthropic-beta</code> header to your Message Batches API request. This is the only mechanism Anthropic has exposed for 300K output as of April 2026 — there is no console toggle, no SDK flag, and no way to enable it on synchronous <code>/v1/messages</code> calls.</p>
<p>The specific header value is <code>output-300k-2026-03-24</code>. The date suffix is part of the official header name and must be included exactly. Setting <code>max_tokens: 300000</code> in your request body alone will not unlock 300K output without this header — the API will silently clamp your response to the model&rsquo;s default maximum.</p>
<p>To add the header in the Python SDK:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># For direct HTTP — add to headers dict:</span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># &#34;anthropic-beta&#34;: &#34;output-300k-2026-03-24&#34;</span>
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#75715e"># For SDK-level batch creation, pass it as:</span>
</span></span><span style="display:flex;"><span>batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    requests<span style="color:#f92672">=</span>[<span style="color:#f92672">...</span>],
</span></span><span style="display:flex;"><span>    betas<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;output-300k-2026-03-24&#34;</span>],  <span style="color:#75715e"># SDK convenience param</span>
</span></span><span style="display:flex;"><span>)
</span></span></code></pre></div><p>The <code>betas</code> parameter on the SDK&rsquo;s batch create method maps directly to the <code>anthropic-beta</code> HTTP header. You can pass multiple beta features as a list (e.g., <code>[&quot;output-300k-2026-03-24&quot;, &quot;interleaved-thinking-2025-05-14&quot;]</code>). This approach is compatible with prompt caching — add <code>cache_control</code> to your system or user blocks as normal.</p>
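<p>As a rough sketch of how those pieces fit together, the request below combines two beta features with a cached system block. The style-guide file and chapter prompt are placeholders, not part of any official example:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import anthropic

client = anthropic.Anthropic()

# Hypothetical style guide reused across every request in the batch;
# cache_control marks it for prompt caching.
style_guide = open("style_guide.md").read()  # placeholder path

batch = client.beta.messages.batches.create(
    requests=[
        {
            "custom_id": "chapter-01",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 300000,
                "system": [
                    {
                        "type": "text",
                        "text": style_guide,
                        "cache_control": {"type": "ephemeral"},
                    }
                ],
                "messages": [
                    {"role": "user", "content": "Write chapter 1 following the style guide."}
                ],
            },
        }
    ],
    # Multiple beta features are passed as a list, as noted above.
    betas=["output-300k-2026-03-24", "interleaved-thinking-2025-05-14"],
)
</code></pre></div>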
<h2 id="step-by-step-using-the-message-batches-api-for-300k-output-python-code-examples">Step-by-Step: Using the Message Batches API for 300K Output (Python Code Examples)</h2>
<p>The Message Batches API is an asynchronous endpoint that accepts up to 10,000 requests per batch, processes them within 24 hours, and returns results as a JSONL stream. Using it for 300K output requires three steps: submit the batch, poll for completion, and stream results.</p>
<p><strong>Step 1 — Submit the batch:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>client <span style="color:#f92672">=</span> anthropic<span style="color:#f92672">.</span>Anthropic()
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>create(
</span></span><span style="display:flex;"><span>    requests<span style="color:#f92672">=</span>[
</span></span><span style="display:flex;"><span>        {
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;custom_id&#34;</span>: <span style="color:#e6db74">&#34;doc-generation-001&#34;</span>,
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">&#34;params&#34;</span>: {
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;model&#34;</span>: <span style="color:#e6db74">&#34;claude-sonnet-4-6&#34;</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;max_tokens&#34;</span>: <span style="color:#ae81ff">300000</span>,
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;system&#34;</span>: (
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;You are a senior technical writer. Generate complete, &#34;</span>
</span></span><span style="display:flex;"><span>                    <span style="color:#e6db74">&#34;exhaustive documentation with no truncation.&#34;</span>
</span></span><span style="display:flex;"><span>                ),
</span></span><span style="display:flex;"><span>                <span style="color:#e6db74">&#34;messages&#34;</span>: [
</span></span><span style="display:flex;"><span>                    {
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;role&#34;</span>: <span style="color:#e6db74">&#34;user&#34;</span>,
</span></span><span style="display:flex;"><span>                        <span style="color:#e6db74">&#34;content&#34;</span>: <span style="color:#e6db74">&#34;Write a complete 200-page API reference guide for...&#34;</span>,
</span></span><span style="display:flex;"><span>                    }
</span></span><span style="display:flex;"><span>                ],
</span></span><span style="display:flex;"><span>            },
</span></span><span style="display:flex;"><span>        }
</span></span><span style="display:flex;"><span>    ],
</span></span><span style="display:flex;"><span>    betas<span style="color:#f92672">=</span>[<span style="color:#e6db74">&#34;output-300k-2026-03-24&#34;</span>],
</span></span><span style="display:flex;"><span>)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Batch ID: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>id<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>processing_status<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>Step 2 — Poll for completion:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">wait_for_batch</span>(client, batch_id, poll_interval<span style="color:#f92672">=</span><span style="color:#ae81ff">60</span>):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>retrieve(batch_id)
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> batch<span style="color:#f92672">.</span>processing_status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;ended&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> batch
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Status: </span><span style="color:#e6db74">{</span>batch<span style="color:#f92672">.</span>processing_status<span style="color:#e6db74">}</span><span style="color:#e6db74"> — waiting </span><span style="color:#e6db74">{</span>poll_interval<span style="color:#e6db74">}</span><span style="color:#e6db74">s&#34;</span>)
</span></span><span style="display:flex;"><span>        time<span style="color:#f92672">.</span>sleep(poll_interval)
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span>completed <span style="color:#f92672">=</span> wait_for_batch(client, batch<span style="color:#f92672">.</span>id)
</span></span><span style="display:flex;"><span>print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Request counts: </span><span style="color:#e6db74">{</span>completed<span style="color:#f92672">.</span>request_counts<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p><strong>Step 3 — Stream results:</strong></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> result <span style="color:#f92672">in</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>results(batch<span style="color:#f92672">.</span>id):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;succeeded&#34;</span>:
</span></span><span style="display:flex;"><span>        content <span style="color:#f92672">=</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>content[<span style="color:#ae81ff">0</span>]<span style="color:#f92672">.</span>text
</span></span><span style="display:flex;"><span>        token_count <span style="color:#f92672">=</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>message<span style="color:#f92672">.</span>usage<span style="color:#f92672">.</span>output_tokens
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;ID: </span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74"> | Tokens: </span><span style="color:#e6db74">{</span>token_count<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span><span style="display:flex;"><span>        <span style="color:#75715e"># Write content to file or database</span>
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">with</span> open(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;</span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74">.txt&#34;</span>, <span style="color:#e6db74">&#34;w&#34;</span>) <span style="color:#66d9ef">as</span> f:
</span></span><span style="display:flex;"><span>            f<span style="color:#f92672">.</span>write(content)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;errored&#34;</span>:
</span></span><span style="display:flex;"><span>        print(<span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Error on </span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>custom_id<span style="color:#e6db74">}</span><span style="color:#e6db74">: </span><span style="color:#e6db74">{</span>result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>error<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>)
</span></span></code></pre></div><p>Batch requests do not count against your synchronous rate limits, which means you can submit large batches without affecting production API throughput.</p>
<h2 id="synchronous-vs-asynchronous-when-to-use-64k-vs-300k-output">Synchronous vs Asynchronous: When to Use 64K vs 300K Output</h2>
<p>Choosing between the synchronous API (64K max) and the Batches API (300K max) depends on your latency tolerance, UX requirements, and batch size. Both have valid use cases, and the wrong choice for your workload can cost 2× more or introduce unacceptable delays.</p>
<table>
  <thead>
      <tr>
          <th>Criterion</th>
          <th>Synchronous API (64K)</th>
          <th>Batches API (300K)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Latency</td>
          <td>Seconds to minutes</td>
          <td>Up to 24 hours</td>
      </tr>
      <tr>
          <td>Output limit</td>
          <td>64K tokens</td>
          <td>300K tokens</td>
      </tr>
      <tr>
          <td>Cost</td>
          <td>Standard rates</td>
          <td>50% discount</td>
      </tr>
      <tr>
          <td>Use case</td>
          <td>Real-time chatbots, streaming</td>
          <td>Offline pipelines, bulk generation</td>
      </tr>
      <tr>
          <td>Rate limits</td>
          <td>Shared with other sync calls</td>
          <td>Separate batch limits</td>
      </tr>
      <tr>
          <td>Result delivery</td>
          <td>Streaming or immediate JSON</td>
          <td>JSONL result stream after completion</td>
      </tr>
  </tbody>
</table>
<p><strong>Use synchronous (64K) when:</strong> your application needs real-time responses, you&rsquo;re building a user-facing chatbot or coding assistant, the output fits within 64K tokens, or you require streaming. The synchronous API also supports prompt caching with identical cache TTLs.</p>
<p><strong>Use batch (300K) when:</strong> you&rsquo;re running nightly document generation jobs, processing large queues of independent requests, generating codebases or technical books, or doing cost-sensitive bulk work where 50% cost savings justify async delivery.</p>
<p>The inflection point is roughly this: if you need more than 64K output tokens OR you have more than 100 independent requests to process, the Batches API is almost always the right choice.</p>
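<p>As a rough illustration, that rule of thumb can be encoded in a small routing helper. The threshold values come straight from the table and the rule above; the function itself is just a sketch:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">def choose_endpoint(expected_output_tokens: int, num_requests: int) -&gt; str:
    """Rule-of-thumb router based on the inflection point described above."""
    SYNC_OUTPUT_CAP = 64_000  # synchronous ceiling on Sonnet 4.6

    if expected_output_tokens &gt; SYNC_OUTPUT_CAP or num_requests &gt; 100:
        return "batches"   # asynchronous Message Batches API (300K, 50% discount)
    return "messages"      # synchronous /v1/messages (real-time, streaming)

print(choose_endpoint(150_000, 1))   # "batches": output exceeds the sync cap
print(choose_endpoint(8_000, 500))   # "batches": large queue of independent requests
print(choose_endpoint(20_000, 3))    # "messages": small, latency-sensitive job
</code></pre></div>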
<h2 id="real-world-use-cases-for-300k-output-tokens">Real-World Use Cases for 300K Output Tokens</h2>
<p>The 300K output token limit unlocks document and code generation tasks that were previously impossible in a single API call. At 195,000 words per request, the realistic use cases span legal, engineering, and content production workflows.</p>
<p><strong>Full codebase migration:</strong> A Python 2 → Python 3 migration of roughly 30,000 lines (about 300K tokens of output) can be completed in a single batch call. Feed the entire codebase as input context (leveraging the 1M-token context window on Sonnet 4.6), instruct Claude to output the migrated code, and receive the complete refactored codebase in one response — no chunking, no stitching, no context loss between chunks.</p>
<p><strong>Legal document drafting:</strong> A 200-page commercial contract, complete with exhibits and schedules, runs approximately 130,000 tokens. A batch call with <code>max_tokens: 200000</code> generates the entire document with consistent clause numbering, cross-references, and defined terms — something that breaks down badly when chunked across multiple calls.</p>
<p><strong>Technical book generation:</strong> A 60-chapter technical manual at 2,500 words per chapter totals 150,000 words — roughly 230,000 tokens at the ratio above. With 300K headroom, Claude can generate the complete book plus appendices and index in a single request, maintaining consistent terminology and style throughout.</p>
<p><strong>Codebase documentation:</strong> Generating JSDoc or docstring coverage for a large TypeScript monorepo (50K+ LOC) at 300K output means the entire documentation pass completes in one batch job rather than hundreds of chunked API calls.</p>
<p><strong>Test suite generation:</strong> Generating comprehensive test suites for a backend API — unit tests, integration tests, fixture data — can easily hit 100K+ tokens when complete. One batch call produces the full test coverage.</p>
<h2 id="cost-analysis-300k-batch-vs-multi-call-chunking-approach">Cost Analysis: 300K Batch vs Multi-Call Chunking Approach</h2>
<p>The economic case for the Batches API is compelling for large-scale generation. The 50% cost discount on batch calls, combined with eliminating chunking overhead, often reduces total cost by 60–75% for workloads that previously required multi-call approaches.</p>
<p><strong>Example: Generate a 200-page technical document (~150K output tokens)</strong></p>
<p><em>Multi-call chunking approach (synchronous):</em></p>
<ul>
<li>3 calls × 50K output tokens each = 150K total output tokens</li>
<li>Output cost at Sonnet 4.6 standard rates ($15.00/MTok): $2.25</li>
<li>Input (~150K tokens of context) re-sent on every call: ~450K input tokens × $3.00/MTok = $1.35</li>
<li><strong>Total: ~$3.60</strong></li>
</ul>
<p><em>Single batch call (300K output):</em></p>
<ul>
<li>1 batch call × 150K output tokens</li>
<li>Output cost at Sonnet 4.6 batch rates ($7.50/MTok): $1.125</li>
<li>Input sent once at the batch rate ($1.50/MTok, 50% below the standard $3.00): ~150K tokens = $0.225</li>
<li><strong>Total: ~$1.35</strong></li>
</ul>
<p>The batch approach saves approximately <strong>62%</strong> in this scenario — and the savings grow with output volume because you eliminate redundant context re-submission entirely. For pipelines generating millions of tokens per day the difference compounds: at 100M output tokens per month, the same ratio works out to roughly $1,500 per month in savings, about $18,000 per year, and the gap scales linearly with volume.</p>
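<p>The arithmetic behind this comparison is easy to reproduce. The snippet below recomputes both totals; the 150K-token input size is an assumption chosen to make the worked example concrete:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Recomputes the worked example above. The 150K-token input size is an assumption.
MTOK = 1_000_000

SYNC_IN, SYNC_OUT = 3.00, 15.00    # Sonnet 4.6 standard rates (USD per MTok)
BATCH_IN, BATCH_OUT = 1.50, 7.50   # batch rates (50% discount)

input_tokens = 150_000    # context sent with the request
output_tokens = 150_000   # ~200-page document
chunks = 3                # sync calls needed at ~50K output each

sync_cost = (chunks * input_tokens * SYNC_IN + output_tokens * SYNC_OUT) / MTOK
batch_cost = (input_tokens * BATCH_IN + output_tokens * BATCH_OUT) / MTOK

print(f"Chunked sync: ${sync_cost:.2f}")   # ~$3.60
print(f"Single batch: ${batch_cost:.2f}")  # ~$1.35
print(f"Savings:      {1 - batch_cost / sync_cost:.0%}")
</code></pre></div>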
<p>The trade-off is latency: batch results arrive within 24 hours, not seconds. For offline pipelines and nightly jobs, this is irrelevant.</p>
<h2 id="claude-300k-output-vs-competitors-gpt-4o-gemini">Claude 300K Output vs Competitors (GPT-4o, Gemini)</h2>
<p>Claude&rsquo;s 300K batch output limit is the highest of any major LLM API as of April 2026. Understanding how this compares to OpenAI and Google&rsquo;s offerings helps justify the architectural choice.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Max Output Tokens</th>
          <th>Batch Discount</th>
          <th>Context Window</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Claude Sonnet 4.6 (batch)</td>
          <td>300,000</td>
          <td>50%</td>
          <td>1,000,000</td>
      </tr>
      <tr>
          <td>Claude Opus 4.7 (batch)</td>
          <td>300,000</td>
          <td>50%</td>
          <td>200,000</td>
      </tr>
      <tr>
          <td>GPT-4o Long Output</td>
          <td>64,000</td>
          <td>~50% (Batch API)</td>
          <td>128,000</td>
      </tr>
      <tr>
          <td>GPT-4.1</td>
          <td>32,768</td>
          <td>~50% (Batch API)</td>
          <td>1,000,000</td>
      </tr>
      <tr>
          <td>Gemini 1.5 Pro</td>
          <td>8,192</td>
          <td>No batch API</td>
          <td>1,000,000</td>
      </tr>
      <tr>
          <td>Gemini 2.0 Flash</td>
          <td>8,192</td>
          <td>No batch API</td>
          <td>1,000,000</td>
      </tr>
  </tbody>
</table>
<p>Claude holds two structural advantages: a 4.7× higher output ceiling than GPT-4o&rsquo;s best offering, and a 1M-token context window on Sonnet 4.6 that means you can feed massive input documents and get massive outputs in the same call. GPT-4o&rsquo;s Long Output variant is capped at 64K tokens and requires the same type of beta access — but never exceeds that ceiling.</p>
<p>Beyond raw limits, Claude maintains instruction-following quality at 150K+ output tokens in ways GPT-4o does not. In benchmarks comparing 150K+ token outputs, Claude Sonnet 4.6 maintains consistent formatting, numbering, and terminology through the full response; GPT-4o shows measurable degradation past ~100K tokens including repetition, dropped sections, and inconsistent formatting.</p>
<p>For Gemini users: Google has not released a batch API with output discounts as of April 2026, and Gemini&rsquo;s 8K output ceiling makes it unsuitable for any long-form generation use case regardless of context window size.</p>
<h2 id="best-practices-for-long-form-generation-with-claude-api">Best Practices for Long-Form Generation with Claude API</h2>
<p>Reliably filling 300K output tokens with high-quality content requires prompt engineering strategies different from short-form generation. Claude can generate 300K tokens, but naive prompts produce verbose padding rather than substantive content.</p>
<p><strong>Be explicit about expected length and structure.</strong> Instruct Claude upfront: &ldquo;Generate a complete 180,000-word technical manual. Do not summarize or truncate. Every chapter must contain at least 15,000 words.&rdquo; Without explicit length targets, Claude optimizes for completeness-per-token rather than absolute token count.</p>
<p><strong>Use structured output templates.</strong> Provide a skeleton outline in your prompt with placeholders. For very long outputs, Claude fills placeholders more reliably than it generates structure and content simultaneously. For a 200-page book, include chapter headers and minimum section requirements in the prompt.</p>
<p><strong>Split into sections with explicit continuation markers.</strong> For documents with natural divisions (chapters, modules, sections), instruct Claude to mark section boundaries explicitly: <code>[SECTION: Chapter 5 — Authentication]</code>. This makes downstream parsing trivial and ensures section-level completeness.</p>
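<p>One hypothetical way to consume those markers downstream is a small splitter like the following. The marker format is the one shown above; the parsing code itself is a sketch:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python">import re

# Splits a long generation on the [SECTION: ...] markers described above.
SECTION_RE = re.compile(r"^\[SECTION: (.+?)\]\s*$", re.MULTILINE)

def split_sections(text: str) -&gt; dict:
    """Return {section title: section body} for one long Claude response."""
    sections = {}
    matches = list(SECTION_RE.finditer(text))
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 &lt; len(matches) else len(text)
        sections[match.group(1)] = text[start:end].strip()
    return sections

# Example: check that every expected chapter is present before accepting the output.
expected = {"Chapter 5: Authentication", "Chapter 6: Rate Limiting"}
parsed = split_sections(open("doc-generation-001.txt").read())
missing = expected - parsed.keys()
</code></pre></div>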
<p><strong>Use prompt caching for repeated system prompts.</strong> If you&rsquo;re running multiple batch requests with the same style guide or system instructions, add <code>&quot;cache_control&quot;: {&quot;type&quot;: &quot;ephemeral&quot;}</code> to those blocks. Cached tokens cost 10% of standard input rates on read, reducing input costs by 90% for the repeated portions.</p>
<p><strong>Verify output completeness with token counts.</strong> Check <code>usage.output_tokens</code> in your batch results. If a 300K-token request returns 85K tokens, the model may have interpreted the task as complete before filling the budget. Adjust prompts to signal that incomplete output is an error.</p>
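<p>A minimal version of that check, reusing the <code>client</code> and <code>batch</code> objects from the earlier examples, might look like this (the 50% threshold is an arbitrary choice):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Flag results that came in far under budget, a sign the model treated the
# task as finished early. The 50% threshold is an arbitrary choice.
MAX_TOKENS_REQUESTED = 300_000
MIN_ACCEPTABLE = MAX_TOKENS_REQUESTED // 2

for result in client.beta.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        used = result.result.message.usage.output_tokens
        if used &lt; MIN_ACCEPTABLE:
            print(f"{result.custom_id}: only {used:,} of {MAX_TOKENS_REQUESTED:,} tokens; review the prompt")
</code></pre></div>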
<p><strong>Handle batch errors gracefully.</strong> Batch results include per-request success/error status. Implement retry logic for <code>errored</code> results, and log <code>overloaded_error</code> vs <code>invalid_request_error</code> separately — the former is transient (retry), the latter requires prompt correction.</p>
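<p>A sketch of that split, assuming you keep the originally submitted params keyed by <code>custom_id</code> (here as a hypothetical <code>original_requests</code> dict), could look like this. The error object is matched on its string form to avoid assuming its exact attribute layout:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Separate transient failures from prompt problems, as described above.
# original_requests is assumed to map custom_id to the params originally submitted.
retryable, needs_fixing = [], []

for result in client.beta.messages.batches.results(batch.id):
    if result.result.type != "errored":
        continue
    error_text = str(result.result.error)
    if "overloaded_error" in error_text:
        # Transient: resubmit in a follow-up batch
        retryable.append({
            "custom_id": result.custom_id,
            "params": original_requests[result.custom_id],
        })
    else:
        # e.g. invalid_request_error: fix the prompt before retrying
        needs_fixing.append((result.custom_id, error_text))

if retryable:
    retry_batch = client.beta.messages.batches.create(
        requests=retryable,
        betas=["output-300k-2026-03-24"],
    )
</code></pre></div>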
<h2 id="rate-limits-polling-and-handling-batch-results">Rate Limits, Polling, and Handling Batch Results</h2>
<p>The Message Batches API has its own rate limit tier separate from the synchronous API. Understanding these limits prevents batch submission failures and ensures efficient polling patterns.</p>
<p><strong>Batch submission limits:</strong> Each batch can contain up to 10,000 requests. The total token volume per batch is limited to your organization&rsquo;s token quota. For very large workloads, split across multiple batches and submit sequentially or in parallel depending on your quota headroom.</p>
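<p>A simple way to honor the 10,000-request ceiling is to chunk the queue before submission. The helper below is a sketch that assumes <code>all_requests</code> is a list of request dicts shaped like the earlier examples:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"># Submit a large queue as multiple batches of at most 10,000 requests each.
BATCH_MAX_REQUESTS = 10_000

def submit_in_batches(client, all_requests):
    """Returns the IDs of every batch submitted."""
    batch_ids = []
    for start in range(0, len(all_requests), BATCH_MAX_REQUESTS):
        chunk = all_requests[start:start + BATCH_MAX_REQUESTS]
        batch = client.beta.messages.batches.create(
            requests=chunk,
            betas=["output-300k-2026-03-24"],
        )
        batch_ids.append(batch.id)
    return batch_ids
</code></pre></div>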
<p><strong>Polling best practices:</strong> Poll batch status no more frequently than every 60 seconds. The Anthropic SDK does not include built-in polling — implement exponential backoff starting at 60 seconds for long-running batches:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#f92672">import</span> time
</span></span><span style="display:flex;"><span><span style="color:#f92672">import</span> anthropic
</span></span><span style="display:flex;"><span>
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">poll_with_backoff</span>(client, batch_id):
</span></span><span style="display:flex;"><span>    interval <span style="color:#f92672">=</span> <span style="color:#ae81ff">60</span>
</span></span><span style="display:flex;"><span>    max_interval <span style="color:#f92672">=</span> <span style="color:#ae81ff">600</span>  <span style="color:#75715e"># 10 minutes max</span>
</span></span><span style="display:flex;"><span>    
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">while</span> <span style="color:#66d9ef">True</span>:
</span></span><span style="display:flex;"><span>        batch <span style="color:#f92672">=</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>retrieve(batch_id)
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        <span style="color:#66d9ef">if</span> batch<span style="color:#f92672">.</span>processing_status <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;ended&#34;</span>:
</span></span><span style="display:flex;"><span>            <span style="color:#66d9ef">return</span> batch
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        counts <span style="color:#f92672">=</span> batch<span style="color:#f92672">.</span>request_counts
</span></span><span style="display:flex;"><span>        print(
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Processing: </span><span style="color:#e6db74">{</span>counts<span style="color:#f92672">.</span>processing<span style="color:#e6db74">}</span><span style="color:#e6db74"> | &#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Succeeded: </span><span style="color:#e6db74">{</span>counts<span style="color:#f92672">.</span>succeeded<span style="color:#e6db74">}</span><span style="color:#e6db74"> | &#34;</span>
</span></span><span style="display:flex;"><span>            <span style="color:#e6db74">f</span><span style="color:#e6db74">&#34;Errored: </span><span style="color:#e6db74">{</span>counts<span style="color:#f92672">.</span>errored<span style="color:#e6db74">}</span><span style="color:#e6db74">&#34;</span>
</span></span><span style="display:flex;"><span>        )
</span></span><span style="display:flex;"><span>        
</span></span><span style="display:flex;"><span>        time<span style="color:#f92672">.</span>sleep(interval)
</span></span><span style="display:flex;"><span>        interval <span style="color:#f92672">=</span> min(interval <span style="color:#f92672">*</span> <span style="color:#ae81ff">1.5</span>, max_interval)
</span></span></code></pre></div><p><strong>Streaming batch results:</strong> Use <code>.results()</code> to stream the JSONL output file rather than loading all results into memory at once. This is critical for large batches — a 10,000-request batch result file can be several GB:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#66d9ef">for</span> result <span style="color:#f92672">in</span> client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>results(batch_id):
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">if</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;succeeded&#34;</span>:
</span></span><span style="display:flex;"><span>        process_result(result)
</span></span><span style="display:flex;"><span>    <span style="color:#66d9ef">elif</span> result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>type <span style="color:#f92672">==</span> <span style="color:#e6db74">&#34;errored&#34;</span>:
</span></span><span style="display:flex;"><span>        log_error(result<span style="color:#f92672">.</span>custom_id, result<span style="color:#f92672">.</span>result<span style="color:#f92672">.</span>error)
</span></span></code></pre></div><p><strong>Result retention:</strong> Anthropic retains batch results for 29 days. Download and store results in your own infrastructure before the 29-day window expires — after that, the results are permanently deleted.</p>
<p><strong>Cancellation:</strong> Batches can be cancelled while in <code>processing</code> status. Requests that completed before cancellation are billed normally; unprocessed requests are not charged.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-python" data-lang="python"><span style="display:flex;"><span><span style="color:#75715e"># Cancel a batch if needed</span>
</span></span><span style="display:flex;"><span>client<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>messages<span style="color:#f92672">.</span>batches<span style="color:#f92672">.</span>cancel(batch_id)
</span></span></code></pre></div><h2 id="faq-common-questions-about-claude-api-300k-output">FAQ: Common Questions About Claude API 300K Output</h2>
<p><strong>Q: Can I get 300K output on the synchronous <code>/v1/messages</code> API?</strong></p>
<p>No. The 300K output limit is exclusively available through the Message Batches API (<code>/v1/messages/batches</code>) with the <code>output-300k-2026-03-24</code> beta header. The synchronous API is hard-capped at 64K tokens on Sonnet 4.6 and 32K on Opus 4.6/4.7 regardless of what you pass as <code>max_tokens</code>. This architectural separation exists because generating 300K tokens takes significantly longer than synchronous request timeouts allow.</p>
<p><strong>Q: Does the output-300k-2026-03-24 header expire?</strong></p>
<p>Beta headers in Anthropic&rsquo;s API are dated to indicate when the feature was introduced, not when it expires. The <code>output-300k-2026-03-24</code> header is the current stable way to enable 300K output as of April 2026. Anthropic will provide migration guidance if the header format changes. Monitor the official <a href="https://platform.claude.com/docs/en/about-claude/models/overview">model documentation</a> for updates.</p>
<p><strong>Q: Does prompt caching work with 300K batch output requests?</strong></p>
<p>Yes. You can combine <code>cache_control: {&quot;type&quot;: &quot;ephemeral&quot;}</code> blocks with the <code>output-300k-2026-03-24</code> beta header. Cache reads cost 10% of the standard input token rate, and cache writes cost 25% more than the standard rate. This is particularly valuable for long system prompts (style guides, schemas, context documents) that repeat across many batch requests.</p>
<p><strong>Q: How long does a 300K token batch response take to generate?</strong></p>
<p>Processing time depends on current batch queue depth and the specific model. In practice, a single 300K-token request typically completes in 15–45 minutes once it begins processing, but queue wait time can extend total wall-clock time to several hours. Anthropic guarantees results within 24 hours. For time-sensitive workloads, the synchronous API with 64K output is more predictable.</p>
<p><strong>Q: What happens if my prompt isn&rsquo;t complex enough to fill 300K tokens?</strong></p>
<p>Claude generates the natural length for your task — it does not pad output to reach <code>max_tokens</code>. If you need a minimum output length, instruct Claude explicitly: &ldquo;This document must be at least 150,000 words. Include exhaustive detail, worked examples for each section, and annotated code listings.&rdquo; The <code>max_tokens</code> parameter sets a ceiling, not a floor.</p>
]]></content:encoded></item></channel></rss>