Gemini Flash-Lite Batch API cuts your LLM costs in half by processing requests asynchronously — submit a JSONL file, get results back within 24 hours, and pay $0.125/1M input tokens instead of $0.25. For teams running thousands of daily classification, translation, or summarization jobs, this single change can reduce monthly AI spend from hundreds of dollars to tens.
What Is the Gemini Batch API and Why Does It Matter
The Gemini Batch API is Google’s asynchronous processing mode that applies a 50% discount on all paid Gemini models for non-real-time workloads. Instead of sending individual HTTP requests and waiting for each response, you package hundreds or thousands of requests into a JSONL file, submit it as a batch job, and retrieve results once the job completes — typically well under 24 hours. Launched alongside the Gemini 3 family in early 2026, the Batch API targets the large class of AI tasks where latency is irrelevant: overnight content moderation queues, bulk data extraction pipelines, weekly report generation, and offline document analysis. The mechanism is simple: Google processes your batch during off-peak capacity windows, passes the savings directly to you, and guarantees completion within one day. For startups and enterprises alike, this transforms formerly expensive batch pipelines into genuinely affordable infrastructure. At $0.125/1M input tokens with Flash-Lite, $10 of spend covers 80 million input tokens — a threshold that makes previously cost-prohibitive use cases like fine-tuning dataset generation or full-catalog product description rewrites financially viable.
Gemini 3.1 Flash-Lite: The Cost-Efficiency Champion for Batch Workloads
Gemini 3.1 Flash-Lite is Google’s most cost-efficient model in the Gemini 3 series, launched in preview on March 3, 2026 via Google AI Studio and Vertex AI. It delivers 328 tokens per second throughput — 2.5x faster than Gemini 2.5 Flash — with a 1,048,576-token context window and 65,536-token max output. Despite its “lite” branding, Flash-Lite scores 86.9% on GPQA Diamond and holds an Elo score of 1432 on the Arena.ai Leaderboard, placing it above many models positioned as “standard” tier. The “lite” designation means optimized for throughput and cost, not capability reduction — it handles complex reasoning tasks competently while shaving 80% off costs versus prior-generation equivalents. For batch workloads specifically, Flash-Lite’s speed advantage matters less than its price floor: at $0.25/1M input tokens standard, dropping to $0.125/1M with Batch API, it sets the lowest cost-per-token floor in Google’s current model lineup. Teams processing millions of tokens daily through classification or extraction pipelines should default to Flash-Lite unless a specific task demonstrably requires a larger model.
Batch API Pricing Breakdown — Exact Numbers for Flash-Lite in 2026
The Gemini Batch API applies a flat 50% discount to both input and output tokens across all eligible models. For Gemini 3.1 Flash-Lite specifically, this produces the following effective rates:
| Mode | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Standard (real-time) | $0.25 | $1.50 |
| Batch API | $0.125 | $0.75 |
| Batch + Prompt Cache (80% hit rate) | ~$0.05 | $0.75 |
The third row — stacking the batch discount with prompt caching — is where costs become almost negligible. Prompt caching on Gemini applies a further 75% discount on cached input tokens, and the two discounts stack multiplicatively on batched requests. In practice, workloads with a large shared system prompt (common in classification and extraction tasks) hit cache rates above 80%, pulling effective input costs down to roughly $0.05/1M tokens: (0.20 × $0.125) + (0.80 × $0.125 × 0.25). At that rate, one million input tokens costs five cents. Compared to OpenAI’s Batch API (50% off GPT-4o Mini at $0.075/1M input) and Anthropic’s Message Batches API (50% off Claude Haiku 3.5 at $0.40/1M input), Gemini Flash-Lite’s raw batch input rate of $0.125/1M sits in the middle of the pack — but once caching is stacked, its effective input price undercuts both for high-volume async workloads in 2026.
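To sanity-check the stacked rate yourself, here’s the arithmetic as a minimal sketch — plain Python, no API calls, using only the per-token rates quoted above:
# Effective input price per 1M tokens when stacking the batch discount
# with prompt caching (cached tokens cost 25% of the batch rate)
BATCH_INPUT_RATE = 0.125  # $/1M tokens, Flash-Lite Batch API

def effective_input_rate(cache_hit_fraction):
    cached = cache_hit_fraction * BATCH_INPUT_RATE * 0.25
    uncached = (1 - cache_hit_fraction) * BATCH_INPUT_RATE
    return cached + uncached

print(effective_input_rate(0.8))  # 0.05    -> ~$0.05/1M tokens
print(effective_input_rate(1.0))  # 0.03125 -> the 87.5%-off floor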
Step-by-Step: Implementing Gemini Batch API with Python (JSONL Workflow)
The Gemini Batch API uses a JSONL (JSON Lines) format where each line is a self-contained request object: a unique key plus a complete GenerateContentRequest body. This design is intentional: every request in the batch is fully independent, enabling Google to process them in parallel across its infrastructure and return results in a consolidated output file. The workflow follows four discrete phases — build the JSONL input file, upload it via the Files API, create a batch job referencing the uploaded file, and poll until the job reaches a terminal state (succeeded, failed, or cancelled). Gemini’s implementation uses the same google-genai SDK and Files API you already use for real-time calls, minimizing the migration surface. In production, the most common mistake is building monolithic batches of 50,000 requests without pagination logic for results — always design your result parser to handle partial files and retry failed individual requests as follow-up jobs. Here’s a complete implementation covering all four phases (method names follow the google-genai SDK’s batch interface; the preview model name is the one announced above):
Installing Dependencies
pip install google-genai
Building the JSONL Request File
import json

# Gemini's batch JSONL format: each line wraps a GenerateContentRequest
# under a unique "key" (the "custom_id"/"method"/"url" layout belongs to
# OpenAI's Batch API, not Gemini's)
requests = [
    {
        "key": f"req-{i}",
        "request": {
            "contents": [{"role": "user", "parts": [{"text": f"Classify sentiment: {text}"}]}],
            "generationConfig": {"temperature": 0, "maxOutputTokens": 50},
        },
    }
    for i, text in enumerate(texts)  # texts is your list of input strings
]

with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")
Submitting the Batch Job
import os
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Upload the JSONL file via the Files API
uploaded_file = client.files.upload(
    file="batch_requests.jsonl",
    config={"display_name": "sentiment-batch-input", "mime_type": "jsonl"},
)

# Create the batch job referencing the uploaded file
batch_job = client.batches.create(
    model="gemini-3.1-flash-lite-preview",  # preview model name from this article
    src=uploaded_file.name,
    config={"display_name": "sentiment-classification-batch"},
)
print(f"Job created: {batch_job.name}")
print(f"Status: {batch_job.state}")
Polling for Completion and Downloading Results
import time

def wait_for_batch(job_name, poll_interval=60):
    # Poll until the job reaches a terminal state
    while True:
        job = client.batches.get(name=job_name)
        print(f"State: {job.state}")
        if job.state.name in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            return job
        time.sleep(poll_interval)

completed_job = wait_for_batch(batch_job.name)

# Download the consolidated results file
if completed_job.state.name == "JOB_STATE_SUCCEEDED":
    result_bytes = client.files.download(file=completed_job.dest.file_name)
    with open("batch_results.jsonl", "wb") as f:
        f.write(result_bytes)
Parsing Results
results = []
with open("batch_results.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        if "response" in entry:
            text = entry["response"]["candidates"][0]["content"]["parts"][0]["text"]
            results.append({"id": entry["key"], "result": text})
        else:  # request-level failure: keep the error for the retry pass
            results.append({"id": entry["key"], "error": entry.get("error")})
The full cycle from submission to results typically runs in 2–6 hours for jobs within the 50,000-request per-job limit, far faster than the advertised 24-hour turnaround.
Error Handling and Partial Failures
Batch jobs can contain partial failures — individual requests that fail while the overall job succeeds. Always check both job-level and request-level status:
failed = [r for r in results if "error" in r]
succeeded = [r for r in results if "result" in r]
print(f"Success: {len(succeeded)}, Failed: {len(failed)}")
# Retry failed requests as a new batch job
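A minimal retry pass can reuse the request objects from the build step — a sketch, assuming the requests list and the failed list from above are still in scope:
# Re-queue failed requests as a fresh JSONL file for a follow-up job
failed_ids = {r["id"] for r in failed}
with open("batch_retry.jsonl", "w") as f:
    for req in requests:  # the list built in the JSONL step
        if req["key"] in failed_ids:
            f.write(json.dumps(req) + "\n")
# Upload and submit batch_retry.jsonl exactly like the original job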
Real Cost Savings Calculator — From 1K to 1M Daily Requests
Assuming an average request of 500 input tokens and 100 output tokens (typical for classification or extraction), here’s what Gemini 3.1 Flash-Lite costs at different daily volumes, comparing the standard real-time API, the Batch API, and Batch plus an 80% cache hit rate:
| Daily Requests | Standard API ($/day) | Batch API ($/day) | Batch + 80% Cache ($/day) |
|---|---|---|---|
| 1,000 | $0.28 | $0.14 | $0.10 |
| 10,000 | $2.75 | $1.38 | $1.00 |
| 100,000 | $27.50 | $13.75 | $10.00 |
| 1,000,000 | $275.00 | $137.50 | $100.00 |
At the 10,000 requests/day scale, the Google-cited figure of ~$3.68/day assumes a mix of Flash-Lite batch and real-time calls for urgent requests. Pure batch + cache at this volume runs closer to $1.00/day — a ~64% reduction versus standard pricing, with the remaining cost dominated by output tokens, which caching does not discount. The 1M requests/day figure is where enterprise decisions get made: $100/day ($3,000/month) versus $8,250/month is the difference between a viable product and an unsustainable infrastructure cost.
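These figures are easy to reproduce — a minimal sketch of the per-day arithmetic, using the per-token rates from the pricing table above:
# Daily cost in dollars for N requests at 500 input / 100 output tokens
def daily_cost(n_requests, input_rate, output_rate, in_tokens=500, out_tokens=100):
    input_cost = n_requests * in_tokens / 1e6 * input_rate
    output_cost = n_requests * out_tokens / 1e6 * output_rate
    return input_cost + output_cost

print(daily_cost(1_000_000, 0.25, 1.50))   # 275.0 -> standard
print(daily_cost(1_000_000, 0.125, 0.75))  # 137.5 -> batch
print(daily_cost(1_000_000, 0.05, 0.75))   # 100.0 -> batch + 80% cache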
Best Use Cases for Gemini Flash-Lite Batch API
Gemini Flash-Lite Batch API delivers maximum ROI on tasks that are high-volume, latency-tolerant, and structurally repetitive. The 50% discount only applies to async workloads, so the ideal use case is any pipeline where results don’t need to appear within seconds of the user’s action. Classification and tagging at scale is the canonical example: content moderation queues, sentiment analysis of product reviews, ticket routing, and spam detection all process thousands of items where a 4-hour batch turnaround is perfectly acceptable. Translation pipelines — converting product catalogs, documentation, or marketing copy across languages — are another natural fit, especially since the large context window allows translating entire documents in a single request rather than splitting them into chunks. Data extraction from unstructured text (scraping structured fields from PDFs, extracting entities from news articles, parsing addresses from form submissions) also benefits enormously: these tasks are bound by token throughput rather than reasoning depth, and Flash-Lite’s 328 tokens/second combined with batch pricing makes them dramatically cheaper than alternatives. Finally, automated report generation, weekly summaries, and offline analytics jobs are ideal — they run on schedules, not in response to user actions, and the batch pattern maps directly onto cron-triggered pipelines.
When NOT to Use Batch API — Understanding the Latency Tradeoff
The Gemini Batch API’s 50% discount comes with one hard constraint: you cannot rely on it for real-time user interactions. Any task where a human is waiting for a response — chatbots, copilot suggestions, live document editing, API responses that unblock downstream application logic — cannot use Batch API because results arrive in minutes to hours, not milliseconds. The same applies to time-sensitive automation: if you need to send a follow-up email within 30 seconds of a form submission, batch processing breaks the required timing. A useful mental model is to ask “would this job be acceptable to run as an overnight batch in a traditional ETL pipeline?” If yes, Batch API is appropriate. If the task needs sub-second or even sub-minute turnaround, stick with the real-time API. A common production pattern is to route requests based on urgency: maintain two client configurations — one for batch (non-urgent, 50% savings) and one for real-time — and let application logic decide which path each request takes based on whether a user is actively waiting. This hybrid architecture typically achieves 60–70% batch routing, cutting effective API spend by 30–35% compared to all-real-time, without any user-visible latency degradation.
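A minimal sketch of that router — route_request, call_realtime, and batch_queue are hypothetical application-side names, not SDK functions:
from typing import Optional

batch_queue: list[dict] = []  # flushed periodically as a batch job

def call_realtime(request: dict) -> str:
    # placeholder for a real-time client.models.generate_content(...) call
    return "realtime-response"

def route_request(request: dict, user_is_waiting: bool) -> Optional[str]:
    # Real-time only when a human is blocked on the result; batch otherwise
    if user_is_waiting:
        return call_realtime(request)  # standard API, full price
    batch_queue.append(request)  # deferred for the next batch flush
    return None  # caller must tolerate asynchronous results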
Advanced Optimization: Stacking Batch + Prompt Caching for Nearly 90% Savings
Stacking Gemini’s Batch API discount with prompt caching is the highest-leverage cost optimization available in 2026, capable of reducing effective input costs by well over 80% — approaching 90% at high cache hit rates — versus synchronous uncached calls. The mechanism works because prompt caching discounts apply to the cached portion of input tokens — typically the system prompt and any shared context — while the Batch API discount applies to total token usage; the two stack multiplicatively, not additively. Here’s how to implement it: structure your prompts so the system instruction and any shared context (classification taxonomy, few-shot examples, document template) appear first and are identical across requests. Gemini’s cache TTL is 1 hour by default (configurable up to 48 hours), so batches submitted within the TTL window benefit from cache hits. A classification job with a 200-token system prompt and 300-token unique input per request serves at least ~40% of input tokens from cache — and if you pre-warm the cache before submitting the batch and move few-shot examples into the shared prefix, you can drive this to 60–80%. At 80% cache hits: effective input cost = (0.20 × $0.125) + (0.80 × $0.125 × 0.25) = $0.025 + $0.025 = $0.050/1M tokens — an 80% reduction from the standard $0.25 rate. Because the $0.125 batch rate already reflects the 50% discount, that 80% is the total stacked saving on input, approaching the 87.5% ceiling as cache hit rates near 100%. The code change is minimal — add a cachedContent reference to your batch request bodies — but the financial impact at scale is substantial.
from google.genai import types

# Pre-warm an explicit cache before batch submission (reuses the client
# from the submission step; explicit caches have per-model minimum token
# counts, so pad the shared prefix if creation is rejected)
cache = client.caches.create(
    model="gemini-3.1-flash-lite-preview",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a sentiment classifier. Return: POSITIVE, NEGATIVE, or NEUTRAL.",
        ttl="3600s",  # 1 hour, covers a typical batch processing window
    ),
)

# Reference the cache in each line of the batch JSONL file
request_line = {
    "key": "req-0",
    "request": {
        "cachedContent": cache.name,
        "contents": [{"role": "user", "parts": [{"text": text}]}],
    },
}
Gemini Batch API vs OpenAI Batch API vs Anthropic — Full Comparison
All three major AI providers offer batch processing discounts, but the specifics — pricing floor, discount structure, format, and SLA — differ meaningfully. Here’s a complete side-by-side for 2026:
| Feature | Gemini Batch API + Flash-Lite | OpenAI Batch API + GPT-4o Mini | Anthropic Batches + Claude Haiku 3.5 |
|---|---|---|---|
| Standard input price | $0.25/1M | $0.15/1M | $0.80/1M |
| Batch input price | $0.125/1M | $0.075/1M | $0.40/1M |
| Standard output price | $1.50/1M | $0.60/1M | $4.00/1M |
| Batch output price | $0.75/1M | $0.30/1M | $2.00/1M |
| Discount | 50% | 50% | 50% |
| SLA / turnaround | 24 hours | 24 hours | 24 hours |
| Context window | 1M tokens | 128K tokens | 200K tokens |
| Input format | JSONL file | JSONL file | JSON request array (results as JSONL) |
| Prompt caching | Yes (stackable) | Yes | Yes |
| Multimodal input | Yes | Yes | Yes |
For raw price-per-token, GPT-4o Mini Batch wins on both input ($0.075 vs $0.125) and output ($0.30 vs $0.75); Gemini’s case rests on capability per dollar rather than sticker price. Anthropic’s Haiku 3.5 Batch is more expensive on both input and output at an equivalent capability tier. The decisive advantage of Gemini Flash-Lite is the 1M-token context window — neither alternative comes close for long-document batch processing. If your workload involves documents longer than 100K tokens, or requires processing large shared context across requests, Gemini is the only viable batch option. For short-form classification with sub-1K-token requests, GPT-4o Mini Batch is cheaper per token; once stackable caching and the larger context window enter the picture, Gemini Flash-Lite wins on effective cost for most high-volume pipelines.
FAQ
Q: Does Gemini Batch API support all Gemini 3.1 models, or only Flash-Lite?
A: As of April 2026, Batch API is available for all paid Gemini models including Gemini 3.1 Pro, Flash, and Flash-Lite, and the 50% discount applies across all tiers. Flash-Lite is recommended for most batch workloads because classification, extraction, and summarization rarely need a larger model, and its price floor keeps total spend lowest when you’re processing millions of tokens daily.
Q: What happens if my batch job partially fails — do I get charged for failed requests?
A: No. Google only charges for requests that successfully complete token processing. Failed requests (due to safety filters, malformed input, or model errors) are not billed. Always parse your results JSONL for error entries and re-queue failures as a follow-up batch job.
Q: Can I cancel a batch job after submission, and will I be charged?
A: Yes, you can cancel via the API using client.batches.cancel(name=job_name). You will only be charged for requests that completed processing before the cancellation takes effect — requests that hadn’t started or were still queued are not billed.
Q: Is there a minimum or maximum batch size for the Gemini Batch API?
A: There’s no documented minimum. Maximum batch size is 50,000 requests per job as of the 2026 documentation. For larger workloads, split into multiple concurrent jobs — there’s no documented limit on simultaneous batch jobs per project, though rate limits on job creation apply.
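A sketch of the split — chunking the request list from the build step into 50,000-line JSONL files, one per job:
# Split a large request list into batch-sized JSONL files
CHUNK = 50_000  # documented per-job maximum
for n, start in enumerate(range(0, len(requests), CHUNK)):
    with open(f"batch_part_{n}.jsonl", "w") as f:
        for req in requests[start:start + CHUNK]:
            f.write(json.dumps(req) + "\n")
# Submit each file as its own batch job, exactly as shown earlier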
Q: How does Gemini Batch API compare to running a self-hosted open-source model for high-volume workloads?
A: At $0.125/1M input tokens with batch API, Gemini Flash-Lite undercuts the total cost of ownership for most self-hosted alternatives when accounting for compute, storage, engineering time, and reliability SLAs. Running Llama 3.3 70B on H100 instances costs roughly $0.40–0.60/1M tokens at inference-optimized throughput — 3–5x more expensive than Gemini Flash-Lite Batch for equivalent capability tasks. Self-hosting makes economic sense only at extremely high volumes (50M+ tokens/day) where contract pricing becomes available, or for use cases with strict data residency requirements that cloud APIs can’t satisfy.
