Integration Standards

OpenAI Compat API

The de facto REST API standard for LLM inference. vLLM, Ollama, LiteLLM, TGI, and most serving frameworks implement it. Write once, switch providers without code changes.

De facto standard: implemented by all major serving stacks
Drop-in swap: provider-agnostic via a single base_url change
OpenAI SDK: works with any compatible endpoint

SECTION 01

What is the OpenAI-compatible API

OpenAI's Chat Completions API became the de facto standard for LLM inference after GPT-3.5 was released. The core endpoints (/v1/chat/completions, /v1/completions, /v1/embeddings) are now implemented by virtually every LLM serving framework: vLLM, Ollama, TGI, LM Studio, Llamafile, SGLang, and third-party providers (Groq, Together AI, Fireworks AI, Perplexity). If your code uses the OpenAI Python SDK, you can redirect it to any compatible server by changing the base_url — no other code changes needed.

SECTION 02

Core endpoints

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # point to any compatible server
    api_key="not-needed-for-local",       # required by SDK, can be any string
)

# 1. Chat completions
resp = client.chat.completions.create(
    model="llama3",    # model name as registered in the server
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=256,
    stream=False,
)
print(resp.choices[0].message.content)

# 2. Streaming
for chunk in client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
):
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# 3. Embeddings
emb = client.embeddings.create(
    model="nomic-embed-text",
    input=["Hello world", "Goodbye world"],
)
vectors = [e.embedding for e in emb.data]
print(f"Embedding dim: {len(vectors[0])}")
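The returned vectors are plain Python lists, so similarity scoring needs no extra dependencies. A minimal cosine-similarity sketch (the helper name is ours, not part of the SDK):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```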

SECTION 03

Switching providers with base_url

import os

# OpenAI
openai_client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Groq (fast inference, OpenAI-compatible)
groq_client = openai.OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

# Together AI
together_client = openai.OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

# Local Ollama
ollama_client = openai.OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

# Local vLLM
vllm_client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="vllm",
)

# All clients use the same API — swap by changing the client object
def generate(client, model: str, prompt: str) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    ).choices[0].message.content
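Since only the base URL and key differ between providers, the client setup above can be collapsed into a small registry. A sketch under the assumption that keys live in the environment variables named here (the registry and helper are illustrative, not part of any SDK):

```python
import os

# Illustrative provider registry: base_url plus the env var holding the key.
# Any OpenAI-compatible endpoint fits this shape.
PROVIDERS = {
    "openai":   ("https://api.openai.com/v1",      "OPENAI_API_KEY"),
    "groq":     ("https://api.groq.com/openai/v1", "GROQ_API_KEY"),
    "together": ("https://api.together.xyz/v1",    "TOGETHER_API_KEY"),
    "ollama":   ("http://localhost:11434/v1",      None),  # local, no real key
}

def client_config(provider: str) -> dict:
    """Return kwargs for openai.OpenAI() for the chosen provider."""
    base_url, key_var = PROVIDERS[provider]
    api_key = os.environ.get(key_var, "") if key_var else "unused"
    return {"base_url": base_url, "api_key": api_key}

# client = openai.OpenAI(**client_config("ollama"))
print(client_config("ollama")["base_url"])  # http://localhost:11434/v1
```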

SECTION 04

LiteLLM for multi-provider routing

pip install litellm

import litellm

# LiteLLM normalises all providers to the OpenAI format
# Prefix model string with provider: "anthropic/", "groq/", etc.

response = litellm.completion(
    model="gpt-4o-mini",            # OpenAI
    messages=[{"role": "user", "content": "Hi"}],
)
response = litellm.completion(
    model="anthropic/claude-3-haiku-20240307",  # Anthropic
    messages=[{"role": "user", "content": "Hi"}],
)
response = litellm.completion(
    model="groq/llama3-8b-8192",   # Groq
    messages=[{"role": "user", "content": "Hi"}],
)

# LiteLLM proxy: OpenAI-compatible server that routes to any provider
# litellm --model gpt-4o-mini --port 8000
# Then point any OpenAI SDK client at http://localhost:8000/v1
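LiteLLM's routing keys off the provider prefix in the model string. A tiny illustration of that convention (our own sketch, not LiteLLM's internal code):

```python
def split_model_string(model: str) -> tuple[str, str]:
    """Split a LiteLLM-style model string into (provider, model_name).

    A bare name like "gpt-4o-mini" defaults to the "openai" provider.
    This mirrors LiteLLM's prefix convention; it is not LiteLLM's own code.
    """
    if "/" in model:
        provider, name = model.split("/", 1)
        return provider, name
    return "openai", model

print(split_model_string("groq/llama3-8b-8192"))  # ('groq', 'llama3-8b-8192')
print(split_model_string("gpt-4o-mini"))          # ('openai', 'gpt-4o-mini')
```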

SECTION 05

Running locally with Ollama

# Install Ollama: https://ollama.com/download
ollama serve   # starts OpenAI-compatible server on localhost:11434
ollama pull llama3.2  # download a model

# Test with curl
curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

# Same request via the OpenAI SDK
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(resp.choices[0].message.content)

SECTION 06

vLLM OpenAI-compatible server

# Start vLLM with OpenAI-compatible API
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --port 8000 \
    --api-key your-secret-key \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9

# Same server via the OpenAI SDK
client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-key",
)
# Model name = the HuggingFace model ID
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise transformers in 2 sentences."}],
)
print(resp.choices[0].message.content)

SECTION 07

Gotchas

Provider comparison and migration

The OpenAI-compatible API has been adopted, natively or through compatibility layers, by Anthropic, Google (Gemini), Groq, Together AI, Fireworks, Anyscale, and self-hosted options including vLLM and Ollama, creating a broadly interoperable ecosystem. The standard covers chat completions and embeddings but not provider-specific features such as Anthropic's extended thinking, Google's grounding, or OpenAI's reasoning-effort parameter. Applications that need provider-specific features must branch on the active provider; applications that stick to the common subset can switch providers with a single configuration change.
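The conditional-logic pattern described above can be sketched as a request builder that attaches provider-specific parameters only when the active provider supports them (the support table and parameter names are illustrative assumptions, not a definitive feature list):

```python
# Illustrative support table: which extra parameters each provider accepts
PROVIDER_EXTRAS = {
    "openai":    {"reasoning_effort"},
    "anthropic": {"thinking"},
    "google":    {"grounding"},
    "groq":      set(),
}

def build_request(provider: str, model: str, messages: list, **extras) -> dict:
    """Common-subset request, plus only the extras this provider supports."""
    supported = PROVIDER_EXTRAS.get(provider, set())
    request = {"model": model, "messages": messages}
    request.update({k: v for k, v in extras.items() if k in supported})
    return request

req = build_request("groq", "llama3-8b-8192",
                    [{"role": "user", "content": "Hi"}],
                    reasoning_effort="high")
print("reasoning_effort" in req)  # False: silently dropped for this provider
```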

Provider        | base_url                        | Compatible models
OpenAI          | https://api.openai.com/v1       | GPT-4o, GPT-4o-mini
Anthropic       | https://api.anthropic.com/v1    | Claude 3.5 (via SDK compatibility)
Groq            | https://api.groq.com/openai/v1  | Llama 3, Mixtral
Together AI     | https://api.together.xyz/v1     | Llama, Qwen, others
Ollama (local)  | http://localhost:11434/v1       | Any pulled model

Streaming and chunked transfer encoding

The OpenAI-compatible API supports server-sent events (SSE) for token-by-token streaming. Instead of waiting for the full response, clients receive chunks of the form {"choices":[{"delta":{"content":"token"}}]}, each delivered as a `data:`-prefixed SSE event over a chunked HTTP response. This enables responsive UIs where text appears in real time, and early termination (stopping generation mid-stream) saves computation. Implementing streaming correctly requires handling chunked transfer encoding, connection timeout management (keep-alives to prevent premature closure), and client-side event parsing (split on blank lines, strip the `data: ` prefix, parse the JSON payload). Server implementations typically run generation in a background task, yielding from a queue as tokens become available. Protocol nuances matter: clients expect a final chunk carrying {"finish_reason":"stop"} followed by a `data: [DONE]` sentinel to detect completion, and improper formatting (missing newlines, incomplete JSON) causes hangs. Production deployments often add stream-specific observability: time to first token, throughput of subsequent tokens, and graceful cleanup when a client disconnects mid-stream.
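The client-side parsing step can be isolated into a small pure function. A sketch for one SSE line of a chat-completions stream (a real consumer would also buffer partial lines arriving from the network):

```python
import json

def parse_sse_line(line: str):
    """Parse one SSE line from a /v1/chat/completions stream.

    Returns the delta text, "" for non-content lines (comments,
    keep-alives), or None at the terminating "data: [DONE]" sentinel.
    """
    line = line.strip()
    if not line.startswith("data: "):
        return ""  # SSE comments and blank keep-alive lines carry no content
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0].get("delta", {}).get("content") or ""

print(parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}'))  # Hi
print(parse_sse_line("data: [DONE]"))  # None
```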

Function calling and tool use standardization

The OpenAI API standard now includes function calling: models receive a list of available functions and return structured JSON indicating which function to call with what arguments, rather than free-text responses. This standardization enables deterministic downstream orchestration: parse the function choice and arguments, invoke the actual tool, and feed results back into the model for refinement. Compatible implementations must handle: function schema specification (parameter types, descriptions), validation of model-generated arguments against schemas, and error handling when models call non-existent functions or provide invalid arguments. Most compatible implementations use constrained decoding or guided generation to ensure model output strictly conforms to JSON schema, preventing malformed responses that downstream code cannot parse. For LLM applications, function calling shifts from fragile regex/prompt-engineering-based parsing to formally verified contracts, significantly improving reliability in production multi-step agent architectures.
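The validation step can be sketched as a check of model-generated arguments against the declared schema. A minimal illustration covering only required keys; production code would use a full JSON Schema validator, and the tool definition here is hypothetical:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def validate_call(tool: dict, name: str, arguments_json: str) -> dict:
    """Check a model-generated tool call against the schema (required keys only)."""
    fn = tool["function"]
    if name != fn["name"]:
        raise ValueError(f"unknown function: {name}")
    args = json.loads(arguments_json)
    missing = [k for k in fn["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return args

print(validate_call(WEATHER_TOOL, "get_weather", '{"city": "Oslo"}'))
```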

Model versioning and API stability contracts

The OpenAI compatibility standard benefits from transparent versioning: model names like gpt-4-turbo-2024-04-09 encode the model family and release date, letting clients depend on stable behavior while upgrading infrastructure. A compatibility layer supporting multiple models must handle semantic differences between models (context window size, supported functions, tokenization), graceful degradation when a requested model is unavailable (fallback to the nearest compatible version), and migration tooling to update application code when primary models are deprecated. Production systems implement model compatibility matrices, documenting which models support which features (streaming, vision, tool use) and returning explicit errors when users request unsupported combinations (e.g., vision on a text-only fallback model). This reduces surprise production incidents where applications suddenly fail after upstream model changes, and positions compatible implementations as stable infrastructure on which downstream application teams can innovate at their own pace.
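A compatibility matrix reduces to a lookup plus a fail-fast check. A sketch with illustrative model names and feature flags (not a real support table):

```python
# Illustrative compatibility matrix: which features each model supports
MODEL_FEATURES = {
    "gpt-4o":        {"streaming", "vision", "tools"},
    "gpt-4o-mini":   {"streaming", "vision", "tools"},
    "llama3-8b":     {"streaming", "tools"},
    "text-fallback": {"streaming"},
}

def check_request(model: str, needed: set) -> None:
    """Fail fast with an explicit error instead of a surprise at runtime."""
    supported = MODEL_FEATURES.get(model)
    if supported is None:
        raise ValueError(f"unknown model: {model}")
    missing = needed - supported
    if missing:
        raise ValueError(f"{model} does not support: {sorted(missing)}")

check_request("gpt-4o-mini", {"vision"})      # ok, no exception
# check_request("text-fallback", {"vision"})  # would raise ValueError
```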
