Asynchronous batch processing APIs offered by model providers process large volumes of requests at 50% lower cost, in exchange for turnaround times of up to 24 hours.
Batch APIs accept a file of requests, process them asynchronously when compute is available (typically overnight), and return a results file. In exchange for surrendering latency SLAs, you get 50% cost reduction. Ideal for workloads that aren't user-facing: data labelling, content generation, document processing, embedding computation, eval runs.
OpenAI's batch API accepts JSONL files of chat completion requests, each with a custom_id field for correlating results. Limits: 50,000 requests per batch and a 100 MB file size.
```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Prepare batch requests
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarise: {doc[:2000]}"}],
            "max_tokens": 200,
        },
    })

# 2. Write and upload the batch file
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

# 3. Create the batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
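When the batch completes, its output is another JSONL file (retrievable via the batch's output file ID): one result object per line, carrying the custom_id, the response body, and an error field for requests that failed individually. A minimal sketch of parsing that file into per-document summaries, assuming this documented result shape:

```python
import json

def parse_batch_results(results_jsonl: str) -> tuple[dict, dict]:
    """Map custom_id -> summary text, separating per-request errors.

    Assumes the OpenAI batch result format: one JSON object per line
    with `custom_id`, a `response` (containing the chat completion
    body), and an `error` field for requests that failed.
    """
    ok, failed = {}, {}
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        cid = record["custom_id"]
        if record.get("error"):
            failed[cid] = record["error"]
        else:
            body = record["response"]["body"]
            ok[cid] = body["choices"][0]["message"]["content"]
    return ok, failed
```

Splitting successes from failures up front matters because a batch can finish with some requests errored; only the failed custom_ids need re-submission.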
Anthropic offers the same pattern for Claude models with up to 10,000 requests per batch.
```python
import anthropic
import time

client = anthropic.Anthropic()

requests = [
    {
        "custom_id": f"req-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 200,
            "messages": [{"role": "user", "content": f"Classify sentiment: {text}"}],
        },
    }
    for i, text in enumerate(texts)
]

batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")

# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Download results, keeping only requests that succeeded
results = {}
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        results[result.custom_id] = result.result.message.content[0].text
```
Ideal: offline enrichment (labelling, summarisation, embedding), nightly eval runs, content generation pipelines without real-time requirements, A/B test data generation. Avoid: user-facing features, anything with <1 minute latency requirement, streaming responses, or workloads requiring retries on individual request failures.
Split work into batches of 1,000–10,000 requests. Submit at end of business day, retrieve next morning. Store custom_id → original record mapping in your DB. Build idempotent re-submission logic: if batch fails, re-submit only failed custom_ids. Use a job queue (Celery, RQ) to orchestrate submission and retrieval.
Track: batch completion rate (some requests may fail individually), total batch cost, average turnaround time, and partial failure rate. Results files include per-request error codes — process errors separately. Set a fallback: if batch fails to complete within 20 hours, re-submit as synchronous requests.
Batch jobs progress through states: QUEUED → PROCESSING → COMPLETED or FAILED. Monitoring transitions is critical; stuck jobs indicate queue saturation or service outages. Modern batch systems expose metrics like job queue depth, average wait time, and SLA compliance (e.g., "95% of jobs complete within 2 hours").
```python
# Example: Anthropic Batch API job lifecycle
import anthropic
import time

client = anthropic.Anthropic()
batch = client.messages.batches.create(requests=[...])
print(f"Batch {batch.id} created. State: {batch.processing_status}")

# Poll until the batch leaves the in-progress state
while batch.processing_status == "in_progress":
    time.sleep(30)
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Batch {batch.id} state: {batch.processing_status}")

# Retrieve results once processing has ended
if batch.processing_status == "ended":
    for result in client.messages.batches.results(batch.id):
        print(result.result)
```

Batch APIs often charge 50% less per token than real-time APIs, making them ideal for non-urgent bulk processing. Group requests by priority: urgent batches run within 1 hour, standard batches within 24 hours. This two-tier approach maximizes savings while meeting SLAs.
```python
# Prepare batch requests efficiently
requests = []
for doc in documents:
    requests.append({
        "custom_id": doc.id,
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": f"Summarize: {doc.text}"}],
        },
    })

# Submit batch (receives 50% discount vs real-time)
batch = client.messages.batches.create(requests=requests)
```

Batch jobs encounter transient failures (temporary service outages) and permanent failures (invalid input). Implement exponential backoff for retries; don't immediately re-queue a failed job. Failed individual requests within a batch should be logged separately so you can investigate and resubmit specific failures without re-processing successful ones.
| Error Type | Status Code | Retry? | Action |
|---|---|---|---|
| Rate Limit | 429 | Yes (exponential backoff) | Retry after 60s, then 120s, then 240s |
| Service Unavailable | 503 | Yes (exponential backoff) | Retry after 30s, then 60s, up to 5 retries |
| Invalid Input | 400 | No | Log error, skip this request, continue batch |
| Auth Failed | 401 | No | Stop batch, check API key/credentials |
| Timeout | 504 | Yes (exponential backoff) | Retry with increased timeout window |
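The retry policy in the table can be sketched as a small wrapper. This is a hedged example, not a specific library's API: `fn` is assumed to return a `(status_code, payload)` pair, and the `sleep` parameter is injectable so the backoff schedule can be tested without waiting:

```python
import time

RETRYABLE = {429, 503, 504}  # retry with backoff; 400/401 fail fast

def call_with_backoff(fn, max_retries: int = 5,
                      base_delay: float = 30.0, sleep=time.sleep):
    """Retry `fn` on retryable HTTP status codes with exponential backoff.

    Delays double on each attempt (30s, 60s, 120s, ...). Non-retryable
    statuses, or exhausting max_retries, raise immediately.
    """
    for attempt in range(max_retries + 1):
        status, payload = fn()
        if status == 200:
            return payload
        if status not in RETRYABLE or attempt == max_retries:
            raise RuntimeError(f"request failed with status {status}")
        sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` also lets a job queue replace blocking waits with re-enqueueing the job at a future timestamp.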
Batch Job Deduplication: If your batch submission process can be interrupted or retried, use idempotent custom_ids. Generate each ID deterministically from the input (e.g., SHA256 hash of the document) so re-submitting the same batch twice doesn't create duplicates. This pattern is crucial for distributed systems where network failures might cause duplicate submissions.
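A minimal sketch of the deterministic-ID pattern, hashing the document content itself (the `doc` prefix and 16-hex-digit truncation are illustrative choices):

```python
import hashlib

def deterministic_custom_id(document_text: str, prefix: str = "doc") -> str:
    """Derive a stable custom_id from the input itself, so re-submitting
    the same document always produces the same ID (idempotent batches)."""
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

Because the ID is a pure function of the content, a duplicate submission caused by a network retry maps to the same row in your custom_id → record table instead of creating a second one.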
For large-scale batch operations (100k+ requests), split into multiple smaller batches (10k each) to reduce blast radius if one batch encounters an error. Track overall progress across all batches in a master coordination table; this makes it easy to resume if the batch submission service itself crashes mid-operation.
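The splitting step is a one-liner worth getting right; each sub-batch below would be submitted and tracked as its own row in the coordination table:

```python
def split_into_batches(requests: list, batch_size: int = 10_000) -> list[list]:
    """Split a large request list into fixed-size sub-batches to limit
    blast radius; each sub-batch is submitted and resumed independently."""
    return [requests[i:i + batch_size]
            for i in range(0, len(requests), batch_size)]
```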
Batch Scheduling & Resource Allocation: Batch services manage a queue of jobs competing for limited compute resources. Implement fair scheduling: don't let one large job monopolize the queue. Use priority levels (P1 urgent, P2 standard, P3 background) with separate queues; process P1 jobs first, but always reserve some capacity for P2/P3 to prevent starvation. For multi-tenant systems, add resource quotas per tenant: "Team A gets 10 concurrent jobs max, Team B gets 5" prevents one team from consuming all capacity. Monitor queue depth and latency; if queue exceeds 5000 jobs, activate auto-scaling or notify ops to add more workers.
Implement progressive batch execution: don't wait for all results before returning to the user. As each individual result completes, stream it to the client via server-sent events (SSE) or webhook. This improves user experience and enables fault tolerance: if processing stops halfway, the user sees partial results and can retry the batch for the remaining items.
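For the SSE route, each completed result becomes one event frame on the wire; a minimal formatter (the `result` event name and JSON payload shape are assumptions for this sketch):

```python
import json

def sse_event(custom_id: str, payload: dict) -> str:
    """Format one completed batch result as a server-sent event frame:
    `event:`/`data:` lines terminated by a blank line, per the SSE
    wire format, so the client renders results as they finish."""
    data = json.dumps({"custom_id": custom_id, **payload})
    return f"event: result\ndata: {data}\n\n"
```

A streaming endpoint would yield one such frame per finished custom_id; the client tracks which IDs it has seen and retries only the missing remainder.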
Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
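For dashboards, the percentiles mentioned above can be computed with a simple nearest-rank sketch (fine for batch reports; a streaming-quantile estimator would be needed for high-volume live metrics):

```python
def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 with the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no samples")
    s = sorted(latencies_ms)

    def pct(p: float) -> float:
        # Nearest-rank: the value at ceil(p% of n), 1-indexed
        idx = max(0, int(round(p / 100 * len(s))) - 1)
        return s[idx]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
```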
For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.
The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.
Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.
Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.
Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.