Run large volumes of LLM inference offline at half the real-time cost by batching requests through provider batch APIs. Ideal for dataset labelling, embedding generation, and nightly evaluation runs.
Batch execution is for workloads where you have many independent requests and don't need the result immediately; turnaround within minutes to hours, not seconds, is acceptable. Classic use cases: annotating a training dataset, computing embeddings for a document corpus, running nightly evaluations, generating synthetic data.
The tradeoff: you get 50% lower cost and a separate rate-limit quota (doesn't count against your real-time tier), but you accept up to 24-hour turnaround and no streaming. If you need results in <60 seconds, use the real-time API instead.
```python
import json
from pathlib import Path

import openai

client = openai.OpenAI()

# Step 1: Create a JSONL file with your requests
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "max_tokens": 256,
              "messages": [{"role": "user", "content": f"Summarise: {doc}"}]}}
    for i, doc in enumerate(["Document A text...", "Document B text...", "Document C text..."])
]
jsonl_path = Path("/tmp/batch_requests.jsonl")
with open(jsonl_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload the file
batch_file = client.files.create(file=open(jsonl_path, "rb"), purpose="batch")
print(f"Uploaded file: {batch_file.id}")

# Step 3: Create the batch job
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")
```
```python
import json
import time

def wait_for_batch(batch_id: str, poll_interval: int = 60):
    """Poll until the batch reaches a terminal state and return the Batch object."""
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status} | "
              f"Completed: {batch.request_counts.completed}/"
              f"{batch.request_counts.total}")
        if batch.status in ("completed", "failed", "cancelled", "expired"):
            return batch
        time.sleep(poll_interval)

def download_results(batch) -> list[dict]:
    if batch.status != "completed":
        raise RuntimeError(f"Batch finished with status {batch.status!r}")
    content = client.files.content(batch.output_file_id)
    return [json.loads(line) for line in content.text.strip().split("\n")]

# Poll and retrieve
batch = wait_for_batch(batch.id, poll_interval=300)
results = download_results(batch)
for r in results:
    custom_id = r["custom_id"]
    if r["error"] is None:
        text = r["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {text[:80]}...")
    else:
        print(f"{custom_id}: ERROR - {r['error']}")
```
A batch job can finish in the `completed` state even though some individual requests inside it failed. Always check the error file:
```python
def get_errors(batch) -> list[dict]:
    if not batch.error_file_id:
        return []
    content = client.files.content(batch.error_file_id)
    return [json.loads(line) for line in content.text.strip().split("\n")]
```
```python
errors = get_errors(batch)
if errors:
    print(f"{len(errors)} requests failed:")
    for e in errors:
        print(f"  {e['custom_id']}: {e['error']['message']}")

# Re-submit failed requests as a new batch
failed_ids = {e["custom_id"] for e in errors}
retry_requests = [r for r in requests if r["custom_id"] in failed_ids]
```
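Resubmission then repeats the upload-and-create steps on the filtered list. A minimal sketch of rebuilding the JSONL payload (`to_jsonl` is an illustrative helper, not part of the SDK):

```python
import json

def to_jsonl(reqs: list[dict]) -> str:
    """Serialise request dicts back into the one-object-per-line
    JSONL format the Batch API expects as input."""
    return "".join(json.dumps(r) + "\n" for r in reqs)

# A hypothetical retry set with a single failed request:
retry_requests = [
    {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "max_tokens": 256,
              "messages": [{"role": "user", "content": "Summarise: ..."}]}},
]
payload = to_jsonl(retry_requests)
# Write `payload` to a file, then repeat Steps 2-3 (upload + batches.create).
```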
OpenAI Batch API pricing is 50% of the corresponding real-time rates. For a corpus of 100k documents at ~1k tokens each (~100M input tokens), the discount halves the token bill outright.
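A back-of-envelope estimate for that corpus; the per-million-token rate below is an illustrative assumption, not a quoted price:

```python
# Illustrative cost estimate: 100k documents at ~1k input tokens each.
DOCS = 100_000
TOKENS_PER_DOC = 1_000
RATE_PER_M_INPUT = 0.15   # assumed real-time $/1M input tokens (illustrative)
BATCH_DISCOUNT = 0.5      # batch API bills 50% of the real-time rate

input_tokens = DOCS * TOKENS_PER_DOC                        # 100M tokens
realtime_cost = input_tokens / 1_000_000 * RATE_PER_M_INPUT
batch_cost = realtime_cost * BATCH_DISCOUNT
print(f"real-time ${realtime_cost:.2f} vs batch ${batch_cost:.2f} (input side only)")
# With these assumed rates: real-time $15.00 vs batch $7.50
```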
Throughput: the batch API processes requests in parallel, and a 10k-request batch typically completes in 1–4 hours, which is fine for nightly jobs. The separate quota means batch jobs don't interfere with your real-time serving quota.
For models you host yourself (vLLM, TGI), continuous batching is handled automatically: the engine batches concurrent requests at the GPU level. For truly offline batch jobs on self-hosted models:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Summarise: " + doc for doc in documents]  # `documents`: your corpus
outputs = llm.generate(prompts, params)  # batched automatically
for output in outputs:
    print(output.outputs[0].text[:80])
```
The OpenAI Batch API provides a 50% cost reduction compared to synchronous API calls in exchange for accepting an up-to-24-hour completion SLA. This tradeoff is appropriate for workloads that do not require real-time responses: document processing, evaluation dataset scoring, content generation for offline review, and nightly report generation. The cost reduction compounds with volume: for applications running millions of evaluations monthly, the batch discount translates directly into substantial savings, with no added code complexity beyond managing job IDs and polling for results.
| Execution mode | Latency | Cost | Use case |
|---|---|---|---|
| Synchronous API | Seconds | Full price | Real-time user requests |
| OpenAI Batch API | Up to 24h | 50% off | Offline processing, evals |
| Anthropic Batch API | Up to 24h | 50% off | Offline processing, evals |
| Local batching (vLLM) | Minutes–hours | GPU cost only | Private data, max savings |
Batch job result management requires careful design to handle partial failures without reprocessing successful requests. The OpenAI batch output format returns one result per input request, with a custom_id field that maps results back to inputs; requests that failed with API errors appear in a separate error file, while successful results appear in the output file. Applications should persist both the input mapping and the batch job ID so that, when partial failures occur, only the failed items are reprocessed rather than the entire batch.
Asynchronous batch execution decouples request ingestion from processing, enabling efficient handling of variable-rate workloads. Python asyncio patterns (optionally with concurrent.futures.ThreadPoolExecutor for the blocking inference call) allow batching model inference while accepting requests concurrently. A typical implementation maintains a request queue (asyncio.Queue) that fills until batch_size requests accumulate or a timeout expires; a worker coroutine dequeues the batch, processes it, and returns results to the individual requestors. With timeout=100ms and batch_size=32, throughput increases 4–6× compared to single-request processing due to amortized overhead, at the cost of up to 100ms additional latency. Backpressure control is critical: setting queue.maxsize limits memory growth when requests arrive faster than inference can process them. For example, maxsize=1000 with 100-byte request metadata caps queue memory at ~100KB, preventing runaway growth. Kubernetes deployments use HPA (Horizontal Pod Autoscaling) triggered by queue depth to dynamically scale inference workers, maintaining SLA latency while minimizing idle resources.
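A minimal sketch of that queue-and-flush pattern; the `batch_worker`/`submit` names and the uppercasing stand-in for real inference are illustrative:

```python
import asyncio

BATCH_SIZE = 32
TIMEOUT_S = 0.1  # flush a partial batch after 100 ms

async def batch_worker(queue: asyncio.Queue, infer_fn) -> None:
    """Collect up to BATCH_SIZE items (or whatever arrived before the
    timeout), run one batched inference call, and resolve each
    submitter's future with its own result."""
    while True:
        items = [await queue.get()]  # block until the first request
        try:
            while len(items) < BATCH_SIZE:
                items.append(await asyncio.wait_for(queue.get(), TIMEOUT_S))
        except asyncio.TimeoutError:
            pass  # timeout expired: flush what we have
        results = infer_fn([prompt for prompt, _ in items])
        for (_, fut), res in zip(items, results):
            fut.set_result(res)

async def submit(queue: asyncio.Queue, prompt: str):
    """Enqueue one request and wait for its individual result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))  # blocks when the queue is full (backpressure)
    return await fut

async def main():
    queue = asyncio.Queue(maxsize=1000)  # bounded queue: backpressure control
    # Stand-in for a real batched inference call:
    worker = asyncio.create_task(batch_worker(queue, lambda ps: [p.upper() for p in ps]))
    out = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    worker.cancel()
    return out

print(asyncio.run(main()))  # ['A', 'B', 'C']
```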
Batch execution introduces failure modes absent in synchronous execution: individual items in a batch may fail while others succeed, network timeouts may affect entire batches, and inference servers may crash mid-batch. Robust implementations separate retriable errors (temporary network blips, transient OOM) from permanent failures (malformed input, model not found). Exponential backoff with jitter (delay = base × 2^attempt + random(0, jitter)) prevents a thundering herd when a service recovers. For batch failures, partial retry strategies re-process only the failed items instead of the entire batch, reducing wasted compute: if 32 of 1024 items fail, re-sending only those 32 requires ~3% of the original computation. The circuit breaker pattern (fail open after N consecutive failures, or when a monitored error rate crosses a threshold) prevents cascading failures downstream: if the inference service's error rate exceeds 5% for 10 consecutive batches, stop sending requests temporarily and return fallback results. FastAPI with the tenacity library implements this pattern elegantly: @retry(stop=stop_after_attempt(3), wait=wait_exponential(...)) decorators apply directly to batch processing functions.
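The backoff formula and the consecutive-failure breaker can be sketched as follows (thresholds and class names are illustrative, not from any particular library):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0,
                  cap: float = 60.0, jitter: float = 1.0) -> float:
    """delay = base * 2**attempt + random(0, jitter), clamped to a cap."""
    return min(cap, base * (2 ** attempt) + random.uniform(0, jitter))

class CircuitBreaker:
    """Open after `threshold` consecutive failed batches; while open,
    callers should skip the service and return fallback results."""
    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # Any success resets the streak; failures accumulate.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def open(self) -> bool:
        return self.consecutive_failures >= self.threshold
```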
Batch execution enables precise cost attribution and optimization. In cloud environments, compute cost is measured as (instance_cost_per_hour × runtime_hours) / tokens_processed. Batching improves this metric by reducing per-token overhead: a single-token request might hold a GPU for 50ms ($0.0014/token), while a 32-token batch amortizes the same time to $0.00004/token, 35× cheaper. Monitoring key metrics (utilization, i.e. the % of time the GPU is computing versus idle; batch_size distribution; latency percentiles) reveals optimization opportunities. For LLM serving, batch_efficiency = actual_throughput_tokens/s ÷ peak_throughput_tokens/s measures how close to maximum utilization the system operates; vLLM and Ollama achieve 70–85% efficiency through aggressive batching and KV cache reuse. Cost-sensitive applications (long-running batch inference jobs) should optimize for throughput per dollar rather than latency; increasing max_batch_size from 32 to 128 and the timeout from 100ms to 500ms can improve efficiency 3–4× for overnight jobs, acceptable since individual request latency stays under 1 second. Logging cost per request (computed as instance_cost × batch_processing_time / batch_size) enables SLA-aware routing: expensive low-latency-SLA requests route to smaller batches, while cheap requests aggregate into larger ones.
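The per-token amortization works out as below; the hourly instance price is an assumed figure for illustration, and the ratio simply equals the batch size since wall time is constant:

```python
def cost_per_token(instance_cost_per_hour: float,
                   batch_time_s: float, tokens_in_batch: int) -> float:
    """Attribute the instance's wall-clock cost for one batch evenly
    across the tokens it produced."""
    return instance_cost_per_hour / 3600 * batch_time_s / tokens_in_batch

# Assumed GPU instance at $100/hour and 50 ms per forward pass (illustrative):
single = cost_per_token(100.0, 0.05, 1)    # one token bears the full 50 ms
batched = cost_per_token(100.0, 0.05, 32)  # 32 tokens share the same 50 ms
print(round(single / batched))  # 32: amortization scales with batch size
```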
Batch size directly impacts throughput (tokens/second) and latency (milliseconds per batch). For LLM inference on a GPU, the optimal batch size balances GPU utilization against memory constraints. batch_size=1 keeps the GPU at 30–40% utilization (memory-bound, waiting on data rather than compute); batch_size=32 reaches 70–80%; batch_size=256 hits compute saturation, beyond which further increases don't improve tokens/sec while per-batch latency grows linearly (a larger batch takes longer per batch despite the same total throughput). For serving diverse SLAs, batch_size=32 with timeout=100ms balances latency (100ms + <10ms queueing = <110ms) and throughput (~2000 tokens/sec on an A100). Adaptive batching adjusts batch_size based on queue depth and the target latency SLA: if the queue exceeds 50 requests, increase batch_size to 64 for higher throughput; if latency creeps above the SLA, reduce batch_size to prioritize responsiveness. A Kubernetes StatefulSet with per-pod queue monitoring lets HPA scale on queue depth (PromQL: queue_size{pod=~"inference-.*"} > 100 triggers scale-up). This adaptive strategy maintains <150ms p99 latency while achieving 80–85% GPU utilization across load variations.
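The adaptive rule described above, as a small sketch (the thresholds mirror the paragraph; the function name and bounds are mine):

```python
def adapt_batch_size(current: int, queue_depth: int, p99_latency_ms: float,
                     sla_ms: float = 150, min_size: int = 8,
                     max_size: int = 256) -> int:
    """Shrink the batch when p99 latency breaches the SLA; grow it when
    the queue backs up; otherwise leave it unchanged."""
    if p99_latency_ms > sla_ms:
        return max(min_size, current // 2)   # prioritize responsiveness
    if queue_depth > 50:
        return min(max_size, current * 2)    # prioritize throughput
    return current
```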
Real-world batches are heterogeneous: some requests want 10 tokens, others want 1000 (different sequence lengths). Naive batching pads short sequences to the maximum length, wasting compute. Sequence packing (grouping sequences by length and executing same-length batches separately) reduces padding but fragments batches, reducing GPU utilization. An optimized approach buckets requests by length (10–50 tokens, 50–100 tokens, etc.), executes buckets as separate batches, and maintains separate KV caches per bucket. For 1000 concurrent requests across 10 length buckets (~100 per bucket), optimal scheduling assigns requests to balance batch utilization: if bucket 1 has 150 requests and bucket 2 has 50, allocate 3 batches to bucket 1 (50 each) and 1 to bucket 2. Profile per-bucket latency (say bucket 1 achieves 90% utilization while bucket 2 achieves 40%), then either widen underutilized buckets (50–200 tokens instead of 50–100) or consolidate them into shared batches with padding. Advanced systems (vLLM, SGLang) implement dynamic batching that continuously re-orders requests to maximize utilization; reported speedups are 3–5× versus fixed-batch approaches for real-world request distributions.
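The bucket-and-schedule step can be sketched as below (function names and the fixed bucket width are illustrative):

```python
def bucket_by_length(lengths: list[int], width: int = 50) -> dict[int, list[int]]:
    """Group request indices into fixed-width length buckets so that
    padding within any batch is bounded by `width` tokens."""
    buckets: dict[int, list[int]] = {}
    for i, n in enumerate(lengths):
        buckets.setdefault(n // width, []).append(i)
    return buckets

def schedule(buckets: dict[int, list[int]], max_batch: int = 50):
    """Split each bucket into batches of at most max_batch requests."""
    plan = []
    for key in sorted(buckets):
        idxs = buckets[key]
        for start in range(0, len(idxs), max_batch):
            plan.append((key, idxs[start:start + max_batch]))
    return plan

# 150 short requests and 50 longer ones, as in the example above:
lengths = [30] * 150 + [80] * 50
plan = schedule(bucket_by_length(lengths))
print([(k, len(batch)) for k, batch in plan])  # [(0, 50), (0, 50), (0, 50), (1, 50)]
```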