Asynchronous batch processing APIs offered by model providers process large volumes of requests at 50% lower cost, in exchange for turnaround times of up to 24 hours.
Batch APIs accept a file of requests, process them asynchronously when compute is available (typically overnight), and return a results file. In exchange for surrendering latency SLAs, you get 50% cost reduction. Ideal for workloads that aren't user-facing: data labelling, content generation, document processing, embedding computation, eval runs.
OpenAI's batch API accepts JSONL files of chat completion requests, each with a custom_id field for correlating results. Limits: 50,000 requests per batch and a 100 MB file size.
```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Prepare batch requests
requests = []
for i, doc in enumerate(documents):
    requests.append({
        "custom_id": f"doc-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Summarise: {doc[:2000]}"}],
            "max_tokens": 200,
        },
    })

# 2. Write and upload the batch file
with open("batch_input.jsonl", "w") as f:
    for r in requests:
        f.write(json.dumps(r) + "\n")
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")

# 3. Create the batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")
```
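When the batch completes, its output is another JSONL file (retrievable via the batch's output file ID): one result object per line, carrying the custom_id, the response body, and an error field for requests that failed individually. A minimal sketch of parsing that file into per-document summaries, assuming this documented result shape:

```python
import json

def parse_batch_results(results_jsonl: str) -> tuple[dict, dict]:
    """Map custom_id -> summary text, separating per-request errors.

    Assumes the OpenAI batch result format: one JSON object per line
    with `custom_id`, a `response` (containing the chat completion
    body), and an `error` field for requests that failed.
    """
    ok, failed = {}, {}
    for line in results_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        cid = record["custom_id"]
        if record.get("error"):
            failed[cid] = record["error"]
        else:
            body = record["response"]["body"]
            ok[cid] = body["choices"][0]["message"]["content"]
    return ok, failed
```

Splitting successes from failures up front matters because a batch can finish with some requests errored; only the failed custom_ids need re-submission.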
Anthropic offers the same pattern for Claude models with up to 10,000 requests per batch.
```python
import anthropic
import time

client = anthropic.Anthropic()

requests = [
    {
        "custom_id": f"req-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 200,
            "messages": [{"role": "user", "content": f"Classify sentiment: {text}"}],
        },
    }
    for i, text in enumerate(texts)
]

batch = client.messages.batches.create(requests=requests)
print(f"Batch ID: {batch.id}")

# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Download results, keeping only requests that succeeded
results = {}
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        results[result.custom_id] = result.result.message.content[0].text
```
Ideal: offline enrichment (labelling, summarisation, embedding), nightly eval runs, content generation pipelines without real-time requirements, A/B test data generation. Avoid: user-facing features, anything with <1 minute latency requirement, streaming responses, or workloads requiring retries on individual request failures.
Split work into batches of 1,000–10,000 requests. Submit at end of business day, retrieve next morning. Store custom_id → original record mapping in your DB. Build idempotent re-submission logic: if batch fails, re-submit only failed custom_ids. Use a job queue (Celery, RQ) to orchestrate submission and retrieval.
Track: batch completion rate (some requests may fail individually), total batch cost, average turnaround time, and partial failure rate. Results files include per-request error codes — process errors separately. Set a fallback: if batch fails to complete within 20 hours, re-submit as synchronous requests.
Batch jobs progress through states: QUEUED → PROCESSING → COMPLETED or FAILED. Monitoring transitions is critical; stuck jobs indicate queue saturation or service outages. Modern batch systems expose metrics like job queue depth, average wait time, and SLA compliance (e.g., "95% of jobs complete within 2 hours").
```python
# Example: Anthropic Batch API job lifecycle
import anthropic
import time

client = anthropic.Anthropic()
batch = client.messages.batches.create(requests=[...])
print(f"Batch {batch.id} created. State: {batch.processing_status}")

# Poll until the batch leaves the in-progress state
while batch.processing_status == "in_progress":
    time.sleep(30)
    batch = client.messages.batches.retrieve(batch.id)
    print(f"Batch {batch.id} state: {batch.processing_status}")

# Retrieve results once processing has ended
if batch.processing_status == "ended":
    for result in client.messages.batches.results(batch.id):
        print(result.result)
```

Batch APIs often charge 50% less per token than real-time APIs, making them ideal for non-urgent bulk processing. Group requests by priority: urgent batches run within 1 hour, standard batches within 24 hours. This two-tier approach maximizes savings while meeting SLAs.
```python
# Prepare batch requests efficiently
requests = []
for doc in documents:
    requests.append({
        "custom_id": doc.id,
        "params": {
            "model": "claude-3-5-sonnet-20241022",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": f"Summarize: {doc.text}"}],
        },
    })

# Submit batch (receives 50% discount vs real-time)
batch = client.messages.batches.create(requests=requests)
```

Batch jobs encounter transient failures (temporary service outages) and permanent failures (invalid input). Implement exponential backoff for retries; don't immediately re-queue a failed job. Failed individual requests within a batch should be logged separately so you can investigate and resubmit specific failures without re-processing successful ones.
| Error Type | Status Code | Retry? | Action |
|---|---|---|---|
| Rate Limit | 429 | Yes (exponential backoff) | Retry after 60s, then 120s, then 240s |
| Service Unavailable | 503 | Yes (exponential backoff) | Retry after 30s, then 60s, up to 5 retries |
| Invalid Input | 400 | No | Log error, skip this request, continue batch |
| Auth Failed | 401 | No | Stop batch, check API key/credentials |
| Timeout | 504 | Yes (exponential backoff) | Retry with increased timeout window |
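The retry policy in the table can be sketched as a small wrapper. This is a hedged example, not a specific library's API: `fn` is assumed to return a `(status_code, payload)` pair, and the `sleep` parameter is injectable so the backoff schedule can be tested without waiting:

```python
import time

RETRYABLE = {429, 503, 504}  # retry with backoff; 400/401 fail fast

def call_with_backoff(fn, max_retries: int = 5,
                      base_delay: float = 30.0, sleep=time.sleep):
    """Retry `fn` on retryable HTTP status codes with exponential backoff.

    Delays double on each attempt (30s, 60s, 120s, ...). Non-retryable
    statuses, or exhausting max_retries, raise immediately.
    """
    for attempt in range(max_retries + 1):
        status, payload = fn()
        if status == 200:
            return payload
        if status not in RETRYABLE or attempt == max_retries:
            raise RuntimeError(f"request failed with status {status}")
        sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` also lets a job queue replace blocking waits with re-enqueueing the job at a future timestamp.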
Batch Job Deduplication: If your batch submission process can be interrupted or retried, use idempotent custom_ids. Generate each ID deterministically from the input (e.g., SHA256 hash of the document) so re-submitting the same batch twice doesn't create duplicates. This pattern is crucial for distributed systems where network failures might cause duplicate submissions.
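A minimal sketch of the deterministic-ID pattern, hashing the document content itself (the `doc` prefix and 16-hex-digit truncation are illustrative choices):

```python
import hashlib

def deterministic_custom_id(document_text: str, prefix: str = "doc") -> str:
    """Derive a stable custom_id from the input itself, so re-submitting
    the same document always produces the same ID (idempotent batches)."""
    digest = hashlib.sha256(document_text.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"
```

Because the ID is a pure function of the content, a duplicate submission caused by a network retry maps to the same row in your custom_id → record table instead of creating a second one.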
For large-scale batch operations (100k+ requests), split into multiple smaller batches (10k each) to reduce blast radius if one batch encounters an error. Track overall progress across all batches in a master coordination table; this makes it easy to resume if the batch submission service itself crashes mid-operation.
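The splitting step is a one-liner worth getting right; each sub-batch below would be submitted and tracked as its own row in the coordination table:

```python
def split_into_batches(requests: list, batch_size: int = 10_000) -> list[list]:
    """Split a large request list into fixed-size sub-batches to limit
    blast radius; each sub-batch is submitted and resumed independently."""
    return [requests[i:i + batch_size]
            for i in range(0, len(requests), batch_size)]
```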
Batch Scheduling & Resource Allocation: Batch services manage a queue of jobs competing for limited compute resources. Implement fair scheduling: don't let one large job monopolize the queue. Use priority levels (P1 urgent, P2 standard, P3 background) with separate queues; process P1 jobs first, but always reserve some capacity for P2/P3 to prevent starvation. For multi-tenant systems, add resource quotas per tenant: "Team A gets 10 concurrent jobs max, Team B gets 5" prevents one team from consuming all capacity. Monitor queue depth and latency; if queue exceeds 5000 jobs, activate auto-scaling or notify ops to add more workers.
Implement progressive batch execution: don't wait for all results before returning to the user. As each individual result completes, stream it to the client via server-sent events (SSE) or webhook. This improves user experience and enables fault tolerance: if processing stops halfway, the user sees partial results and can retry the batch for the remaining items.
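For the SSE route, each completed result becomes one event frame on the wire; a minimal formatter (the `result` event name and JSON payload shape are assumptions for this sketch):

```python
import json

def sse_event(custom_id: str, payload: dict) -> str:
    """Format one completed batch result as a server-sent event frame:
    `event:`/`data:` lines terminated by a blank line, per the SSE
    wire format, so the client renders results as they finish."""
    data = json.dumps({"custom_id": custom_id, **payload})
    return f"event: result\ndata: {data}\n\n"
```

A streaming endpoint would yield one such frame per finished custom_id; the client tracks which IDs it has seen and retries only the missing remainder.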
Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
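For dashboards, the percentiles mentioned above can be computed with a simple nearest-rank sketch (fine for batch reports; a streaming-quantile estimator would be needed for high-volume live metrics):

```python
def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 with the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no samples")
    s = sorted(latencies_ms)

    def pct(p: float) -> float:
        # Nearest-rank: the value at ceil(p% of n), 1-indexed
        idx = max(0, int(round(p / 100 * len(s))) - 1)
        return s[idx]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}
```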
For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.
The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.
Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.
Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.
Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.