Run large volumes of LLM inference offline at half the real-time cost by batching requests through provider batch APIs. Ideal for dataset labelling, embedding generation, and nightly evaluation runs.
Batch execution is for workloads where you have many independent requests and don't need the result immediately; turnaround within minutes to hours, not seconds, is acceptable. Classic use cases: annotating a training dataset, computing embeddings for a document corpus, running nightly evaluations, generating synthetic data.
The tradeoff: you get 50% lower cost and a separate rate-limit quota (doesn't count against your real-time tier), but you accept up to 24-hour turnaround and no streaming. If you need results in <60 seconds, use the real-time API instead.
```python
import json
from pathlib import Path

import openai

client = openai.OpenAI()

# Step 1: Create a JSONL file with your requests
requests = [
    {"custom_id": f"req-{i}", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "max_tokens": 256,
              "messages": [{"role": "user", "content": f"Summarise: {doc}"}]}}
    for i, doc in enumerate(["Document A text...", "Document B text...", "Document C text..."])
]
jsonl_path = Path("/tmp/batch_requests.jsonl")
with open(jsonl_path, "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload the file
batch_file = client.files.create(file=open(jsonl_path, "rb"), purpose="batch")
print(f"Uploaded file: {batch_file.id}")

# Step 3: Create the batch job
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}, Status: {batch.status}")
```
```python
import json
import time

def wait_for_batch(batch_id: str, poll_interval: int = 60):
    """Poll until the batch reaches a terminal state and return the Batch object."""
    while True:
        batch = client.batches.retrieve(batch_id)
        print(f"Status: {batch.status} | "
              f"Completed: {batch.request_counts.completed}/"
              f"{batch.request_counts.total}")
        if batch.status in ("completed", "failed", "cancelled", "expired"):
            return batch
        time.sleep(poll_interval)

def download_results(batch) -> list[dict]:
    if batch.status != "completed":
        raise RuntimeError(f"Batch finished with status {batch.status!r}")
    content = client.files.content(batch.output_file_id)
    return [json.loads(line) for line in content.text.strip().split("\n")]

# Poll and retrieve
batch = wait_for_batch(batch.id, poll_interval=300)
results = download_results(batch)
for r in results:
    custom_id = r["custom_id"]
    if r["error"] is None:
        text = r["response"]["body"]["choices"][0]["message"]["content"]
        print(f"{custom_id}: {text[:80]}...")
    else:
        print(f"{custom_id}: ERROR - {r['error']}")
```
A batch job can finish in the `completed` state even though some individual requests inside it failed. Always check the error file:
```python
def get_errors(batch) -> list[dict]:
    if not batch.error_file_id:
        return []
    content = client.files.content(batch.error_file_id)
    return [json.loads(line) for line in content.text.strip().split("\n")]
```
```python
errors = get_errors(batch)
if errors:
    print(f"{len(errors)} requests failed:")
    for e in errors:
        print(f"  {e['custom_id']}: {e['error']['message']}")

# Re-submit failed requests as a new batch
failed_ids = {e["custom_id"] for e in errors}
retry_requests = [r for r in requests if r["custom_id"] in failed_ids]
```
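Resubmission then repeats the upload-and-create steps on the filtered list. A minimal sketch of rebuilding the JSONL payload (`to_jsonl` is an illustrative helper, not part of the SDK):

```python
import json

def to_jsonl(reqs: list[dict]) -> str:
    """Serialise request dicts back into the one-object-per-line
    JSONL format the Batch API expects as input."""
    return "".join(json.dumps(r) + "\n" for r in reqs)

# A hypothetical retry set with a single failed request:
retry_requests = [
    {"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions",
     "body": {"model": "gpt-4o-mini", "max_tokens": 256,
              "messages": [{"role": "user", "content": "Summarise: ..."}]}},
]
payload = to_jsonl(retry_requests)
# Write `payload` to a file, then repeat Steps 2-3 (upload + batches.create).
```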
OpenAI Batch API pricing is 50% of the corresponding real-time rates. For a corpus of 100k documents at ~1k tokens each (~100M input tokens), the discount halves the token bill outright.
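A back-of-envelope estimate for that corpus; the per-million-token rate below is an illustrative assumption, not a quoted price:

```python
# Illustrative cost estimate: 100k documents at ~1k input tokens each.
DOCS = 100_000
TOKENS_PER_DOC = 1_000
RATE_PER_M_INPUT = 0.15   # assumed real-time $/1M input tokens (illustrative)
BATCH_DISCOUNT = 0.5      # batch API bills 50% of the real-time rate

input_tokens = DOCS * TOKENS_PER_DOC                        # 100M tokens
realtime_cost = input_tokens / 1_000_000 * RATE_PER_M_INPUT
batch_cost = realtime_cost * BATCH_DISCOUNT
print(f"real-time ${realtime_cost:.2f} vs batch ${batch_cost:.2f} (input side only)")
# With these assumed rates: real-time $15.00 vs batch $7.50
```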
Throughput: the batch API processes requests in parallel, and a 10k-request batch typically completes in 1–4 hours, which is fine for nightly jobs. The separate quota means batch jobs don't interfere with your real-time serving quota.
For models you host yourself (vLLM, TGI), continuous batching is handled automatically: the engine batches concurrent requests at the GPU level. For truly offline batch jobs on self-hosted models:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Summarise: " + doc for doc in documents]  # `documents`: your corpus
outputs = llm.generate(prompts, params)  # batched automatically
for output in outputs:
    print(output.outputs[0].text[:80])
```
The OpenAI Batch API provides a 50% cost reduction compared to synchronous API calls in exchange for accepting an up-to-24-hour completion SLA. This tradeoff is appropriate for workloads that do not require real-time responses: document processing, evaluation dataset scoring, content generation for offline review, and nightly report generation. The cost reduction compounds with volume: for applications running millions of evaluations monthly, the batch discount translates directly into substantial savings, with no added code complexity beyond managing job IDs and polling for results.
| Execution mode | Latency | Cost | Use case |
|---|---|---|---|
| Synchronous API | Seconds | Full price | Real-time user requests |
| OpenAI Batch API | Up to 24h | 50% off | Offline processing, evals |
| Anthropic Batch API | Up to 24h | 50% off | Offline processing, evals |
| Local batching (vLLM) | Minutes–hours | GPU cost only | Private data, max savings |
Batch job result management requires careful design to handle partial failures without reprocessing successful requests. The OpenAI batch output format returns one result per input request, with a custom_id field that maps results back to inputs; requests that failed with API errors appear in a separate error file, while successful results appear in the output file. Applications should persist both the input mapping and the batch job ID so that, when partial failures occur, only the failed items are reprocessed rather than the entire batch.
Asynchronous batch execution decouples request ingestion from processing, enabling efficient handling of variable-rate workloads. Python asyncio patterns (optionally with concurrent.futures.ThreadPoolExecutor for the blocking inference call) allow batching model inference while accepting requests concurrently. A typical implementation maintains a request queue (asyncio.Queue) that fills until batch_size requests accumulate or a timeout expires; a worker coroutine dequeues the batch, processes it, and returns results to the individual requestors. With timeout=100ms and batch_size=32, throughput increases 4–6× compared to single-request processing due to amortized overhead, at the cost of up to 100ms additional latency. Backpressure control is critical: setting queue.maxsize limits memory growth when requests arrive faster than inference can process them. For example, maxsize=1000 with 100-byte request metadata caps queue memory at ~100KB, preventing runaway growth. Kubernetes deployments use HPA (Horizontal Pod Autoscaling) triggered by queue depth to dynamically scale inference workers, maintaining SLA latency while minimizing idle resources.
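A minimal sketch of that queue-and-flush pattern; the `batch_worker`/`submit` names and the uppercasing stand-in for real inference are illustrative:

```python
import asyncio

BATCH_SIZE = 32
TIMEOUT_S = 0.1  # flush a partial batch after 100 ms

async def batch_worker(queue: asyncio.Queue, infer_fn) -> None:
    """Collect up to BATCH_SIZE items (or whatever arrived before the
    timeout), run one batched inference call, and resolve each
    submitter's future with its own result."""
    while True:
        items = [await queue.get()]  # block until the first request
        try:
            while len(items) < BATCH_SIZE:
                items.append(await asyncio.wait_for(queue.get(), TIMEOUT_S))
        except asyncio.TimeoutError:
            pass  # timeout expired: flush what we have
        results = infer_fn([prompt for prompt, _ in items])
        for (_, fut), res in zip(items, results):
            fut.set_result(res)

async def submit(queue: asyncio.Queue, prompt: str):
    """Enqueue one request and wait for its individual result."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))  # blocks when the queue is full (backpressure)
    return await fut

async def main():
    queue = asyncio.Queue(maxsize=1000)  # bounded queue: backpressure control
    # Stand-in for a real batched inference call:
    worker = asyncio.create_task(batch_worker(queue, lambda ps: [p.upper() for p in ps]))
    out = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    worker.cancel()
    return out

print(asyncio.run(main()))  # ['A', 'B', 'C']
```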
Batch execution introduces failure modes absent in synchronous execution: individual items in a batch may fail while others succeed, network timeouts may affect entire batches, and inference servers may crash mid-batch. Robust implementations separate retriable errors (temporary network blips, transient OOM) from permanent failures (malformed input, model not found). Exponential backoff with jitter (delay = base × 2^attempt + random(0, jitter)) prevents a thundering herd when a service recovers. For batch failures, partial retry strategies re-process only the failed items instead of the entire batch, reducing wasted compute: if 32 of 1024 items fail, re-sending only those 32 requires ~3% of the original computation. The circuit breaker pattern (fail open after N consecutive failures, or when a monitored error rate crosses a threshold) prevents cascading failures downstream: if the inference service's error rate exceeds 5% for 10 consecutive batches, stop sending requests temporarily and return fallback results. FastAPI with the tenacity library implements this pattern elegantly: @retry(stop=stop_after_attempt(3), wait=wait_exponential(...)) decorators apply directly to batch processing functions.
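The backoff formula and the consecutive-failure breaker can be sketched as follows (thresholds and class names are illustrative, not from any particular library):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0,
                  cap: float = 60.0, jitter: float = 1.0) -> float:
    """delay = base * 2**attempt + random(0, jitter), clamped to a cap."""
    return min(cap, base * (2 ** attempt) + random.uniform(0, jitter))

class CircuitBreaker:
    """Open after `threshold` consecutive failed batches; while open,
    callers should skip the service and return fallback results."""
    def __init__(self, threshold: int = 10):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, success: bool) -> None:
        # Any success resets the streak; failures accumulate.
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    @property
    def open(self) -> bool:
        return self.consecutive_failures >= self.threshold
```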
Batch execution enables precise cost attribution and optimization. In cloud environments, compute cost is measured as (instance_cost_per_hour × runtime_hours) / tokens_processed. Batching improves this metric by reducing per-token overhead: a single-token request might hold a GPU for 50ms ($0.0014/token), while a 32-token batch amortizes the same time to $0.00004/token, 35× cheaper. Monitoring key metrics (utilization, i.e. the % of time the GPU is computing versus idle; batch_size distribution; latency percentiles) reveals optimization opportunities. For LLM serving, batch_efficiency = actual_throughput_tokens/s ÷ peak_throughput_tokens/s measures how close to maximum utilization the system operates; vLLM and Ollama achieve 70–85% efficiency through aggressive batching and KV cache reuse. Cost-sensitive applications (long-running batch inference jobs) should optimize for throughput per dollar rather than latency; increasing max_batch_size from 32 to 128 and the timeout from 100ms to 500ms can improve efficiency 3–4× for overnight jobs, acceptable since individual request latency stays under 1 second. Logging cost per request (computed as instance_cost × batch_processing_time / batch_size) enables SLA-aware routing: expensive low-latency-SLA requests route to smaller batches, while cheap requests aggregate into larger ones.
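The per-token amortization works out as below; the hourly instance price is an assumed figure for illustration, and the ratio simply equals the batch size since wall time is constant:

```python
def cost_per_token(instance_cost_per_hour: float,
                   batch_time_s: float, tokens_in_batch: int) -> float:
    """Attribute the instance's wall-clock cost for one batch evenly
    across the tokens it produced."""
    return instance_cost_per_hour / 3600 * batch_time_s / tokens_in_batch

# Assumed GPU instance at $100/hour and 50 ms per forward pass (illustrative):
single = cost_per_token(100.0, 0.05, 1)    # one token bears the full 50 ms
batched = cost_per_token(100.0, 0.05, 32)  # 32 tokens share the same 50 ms
print(round(single / batched))  # 32: amortization scales with batch size
```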
Batch size directly impacts throughput (tokens/second) and latency (milliseconds per batch). For LLM inference on a GPU, the optimal batch size balances GPU utilization against memory constraints. batch_size=1 keeps the GPU at 30–40% utilization (memory-bound, waiting on data rather than compute); batch_size=32 reaches 70–80%; batch_size=256 hits compute saturation, beyond which further increases don't improve tokens/sec while per-batch latency grows linearly (a larger batch takes longer per batch despite the same total throughput). For serving diverse SLAs, batch_size=32 with timeout=100ms balances latency (100ms + <10ms queueing = <110ms) and throughput (~2000 tokens/sec on an A100). Adaptive batching adjusts batch_size based on queue depth and the target latency SLA: if the queue exceeds 50 requests, increase batch_size to 64 for higher throughput; if latency creeps above the SLA, reduce batch_size to prioritize responsiveness. A Kubernetes StatefulSet with per-pod queue monitoring lets HPA scale on queue depth (PromQL: queue_size{pod=~"inference-.*"} > 100 triggers scale-up). This adaptive strategy maintains <150ms p99 latency while achieving 80–85% GPU utilization across load variations.
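The adaptive rule described above, as a small sketch (the thresholds mirror the paragraph; the function name and bounds are mine):

```python
def adapt_batch_size(current: int, queue_depth: int, p99_latency_ms: float,
                     sla_ms: float = 150, min_size: int = 8,
                     max_size: int = 256) -> int:
    """Shrink the batch when p99 latency breaches the SLA; grow it when
    the queue backs up; otherwise leave it unchanged."""
    if p99_latency_ms > sla_ms:
        return max(min_size, current // 2)   # prioritize responsiveness
    if queue_depth > 50:
        return min(max_size, current * 2)    # prioritize throughput
    return current
```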
Real-world batches are heterogeneous: some requests want 10 tokens, others want 1000 (different sequence lengths). Naive batching pads short sequences to the maximum length, wasting compute. Sequence packing (grouping sequences by length and executing same-length batches separately) reduces padding but fragments batches, reducing GPU utilization. An optimized approach buckets requests by length (10–50 tokens, 50–100 tokens, etc.), executes buckets as separate batches, and maintains separate KV caches per bucket. For 1000 concurrent requests across 10 length buckets (~100 per bucket), optimal scheduling assigns requests to balance batch utilization: if bucket 1 has 150 requests and bucket 2 has 50, allocate 3 batches to bucket 1 (50 each) and 1 to bucket 2. Profile per-bucket latency (say bucket 1 achieves 90% utilization while bucket 2 achieves 40%), then either widen underutilized buckets (50–200 tokens instead of 50–100) or consolidate them into shared batches with padding. Advanced systems (vLLM, SGLang) implement dynamic batching that continuously re-orders requests to maximize utilization; reported speedups are 3–5× versus fixed-batch approaches for real-world request distributions.
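The bucket-and-schedule step can be sketched as below (function names and the fixed bucket width are illustrative):

```python
def bucket_by_length(lengths: list[int], width: int = 50) -> dict[int, list[int]]:
    """Group request indices into fixed-width length buckets so that
    padding within any batch is bounded by `width` tokens."""
    buckets: dict[int, list[int]] = {}
    for i, n in enumerate(lengths):
        buckets.setdefault(n // width, []).append(i)
    return buckets

def schedule(buckets: dict[int, list[int]], max_batch: int = 50):
    """Split each bucket into batches of at most max_batch requests."""
    plan = []
    for key in sorted(buckets):
        idxs = buckets[key]
        for start in range(0, len(idxs), max_batch):
            plan.append((key, idxs[start:start + max_batch]))
    return plan

# 150 short requests and 50 longer ones, as in the example above:
lengths = [30] * 150 + [80] * 50
plan = schedule(bucket_by_length(lengths))
print([(k, len(batch)) for k, batch in plan])  # [(0, 50), (0, 50), (0, 50), (1, 50)]
```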