API providers vs. managed platforms vs. open-source self-hosted: a decision framework across cost, control, privacy, latency, and team capability. This is the most important architectural choice for any LLM project.
Buy (API providers): Use OpenAI, Anthropic, Google, or other providers via their REST APIs. Pay per token. No infrastructure to manage. Access to frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro). Fast to prototype and ship. Trade-offs: vendor lock-in, and your data leaves your infrastructure.
OSS self-hosted: Run Llama 3, Mistral, Qwen, or other open models on your own GPU infrastructure. Full control over data, model, and deployment. Higher upfront cost (GPUs) but lower marginal cost at scale. Requires ML Ops expertise.
Managed OSS: Services like Together AI, Groq, Fireworks AI, Replicate, and Baseten host open models for you. API-style access to open models without managing GPUs. The middle path: per-token pricing like closed APIs, but with open-model flexibility and often lower prices.
```python
def cost_analysis(monthly_input_tokens: int, monthly_output_tokens: int):
    # Prices below are list prices in $ per 1M tokens at the time of writing
    # OpenAI GPT-4o: $2.50 input / $10.00 output
    gpt4o_cost = (monthly_input_tokens * 2.50 + monthly_output_tokens * 10.00) / 1_000_000
    # Anthropic Claude 3.5 Haiku: $0.80 input / $4.00 output
    haiku_cost = (monthly_input_tokens * 0.80 + monthly_output_tokens * 4.00) / 1_000_000
    # Together AI Llama 3 70B: $0.90 per 1M tokens, input and output
    together_cost = (monthly_input_tokens + monthly_output_tokens) * 0.90 / 1_000_000
    # Self-hosted Llama 3 70B on 2x A100 80GB (~$5/hr on-demand, $2.50/hr reserved),
    # running 720 hrs/month; this cost is fixed regardless of actual usage
    reserved_hours = 720
    gpu_cost = 2 * 2.50 * reserved_hours  # 2 GPUs, reserved pricing

    print(f"Monthly volume: {monthly_input_tokens/1e6:.1f}M input, {monthly_output_tokens/1e6:.1f}M output tokens")
    print(f"GPT-4o: ${gpt4o_cost:,.0f}/month")
    print(f"Claude Haiku: ${haiku_cost:,.0f}/month")
    print(f"Together Llama3: ${together_cost:,.0f}/month")
    print(f"Self-hosted: ${gpu_cost:,.0f}/month (fixed)")

cost_analysis(100_000_000, 20_000_000)  # 100M input, 20M output tokens/month
# GPT-4o: $450/month
# Claude Haiku: $160/month
# Together Llama3: $108/month
# Self-hosted: $3,600/month (fixed; breaks even vs. GPT-4o at roughly 1B tokens/month)
```
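The break-even figure follows directly from the fixed GPU cost and a blended per-token price. A minimal sketch, using the same illustrative prices as above (the function name `break_even_tokens` is ours, not from any library):

```python
def break_even_tokens(fixed_monthly_cost: float, blended_price_per_million: float) -> float:
    """Monthly token volume above which a fixed self-hosted cluster is cheaper
    than a pay-per-token API at the given blended price."""
    return fixed_monthly_cost / blended_price_per_million * 1_000_000

# 2x A100 reserved at $2.50/GPU-hr, 720 hrs/month -> $3,600/month fixed
fixed = 2 * 2.50 * 720

# Blended GPT-4o price for a 100:20 input/output token mix:
# (100 * $2.50 + 20 * $10.00) / 120 = $3.75 per 1M tokens
blended_gpt4o = (100 * 2.50 + 20 * 10.00) / 120

print(f"Break-even vs GPT-4o: {break_even_tokens(fixed, blended_gpt4o) / 1e9:.2f}B tokens/month")
# Break-even vs GPT-4o: 0.96B tokens/month
```

Note that against a cheaper target like Together's Llama 3 pricing, the break-even volume is several times higher, which is why self-hosting only pays off at very large scale.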
Data privacy is often the primary driver of the build-vs-buy decision.
Use the scoring framework below for your project.
The most common migration path as a project matures and scale increases is: API → Managed OSS → Self-hosted. Design for migration from day one.
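One way to design for migration: Together AI and most self-hosted serving stacks (e.g. vLLM) expose OpenAI-compatible chat-completions endpoints, so a thin config layer can make the provider swappable. A minimal sketch; the `base_url` values and model IDs are illustrative assumptions, not guaranteed current:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LLMBackend:
    """One entry per provider; all speak the OpenAI chat-completions wire format."""
    base_url: str
    model: str

# Illustrative endpoints and model names -- verify against each provider's docs.
BACKENDS = {
    "openai":      LLMBackend("https://api.openai.com/v1", "gpt-4o"),
    "together":    LLMBackend("https://api.together.xyz/v1", "meta-llama/Llama-3-70b-chat-hf"),
    "self_hosted": LLMBackend("http://localhost:8000/v1", "meta-llama/Meta-Llama-3-70B-Instruct"),
}

def get_backend(name: str) -> LLMBackend:
    return BACKENDS[name]

# Application code depends only on get_backend(); switching providers becomes
# a config change (one environment variable) rather than a code rewrite.
```

Because every stage of the migration path speaks the same wire format, the move from API to managed OSS to self-hosted changes only this table.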
The build vs. buy decision for LLM infrastructure (model hosting, RAG pipelines, evaluation tooling, monitoring systems) involves balancing control, cost, time-to-value, and maintenance burden. The right choice varies significantly based on organization size, team expertise, data sensitivity requirements, and the strategic importance of AI capabilities to the business.
| Dimension | Build | Buy | When to Build |
|---|---|---|---|
| Time to deploy | Weeks to months | Days | Unique requirements |
| Customization | Full | Limited | Differentiated product |
| Maintenance | Team owns | Vendor owns | Stable, well-understood needs |
| Data privacy | Full control | Depends on vendor | Strict compliance needs |
| Total cost | High upfront, lower at scale | Low upfront, grows with usage | Very high volume |
Generic infrastructure components (vector databases, embedding APIs, LLM serving layers) rarely provide competitive differentiation and are strong candidates for buying. The cost of building and maintaining a production-grade vector database with sub-millisecond query latency, horizontal scaling, and replication is enormous relative to the differentiated value it provides. Teams that "buy" these components using managed services can redirect engineering effort toward the application logic, retrieval quality, and domain-specific fine-tuning that actually differentiates their product.
Model fine-tuning and evaluation infrastructure present a more nuanced build vs. buy calculation. Fine-tuning workflows are highly specific to training data formats, target tasks, and quality metrics that vary significantly across organizations. Generic fine-tuning platforms often require more adaptation effort than building targeted scripts around open-source libraries like HuggingFace Transformers and PEFT. Evaluation infrastructure similarly benefits from domain-specific metric design that generic platforms cannot anticipate. For teams with strong ML engineering capabilities, building evaluation and fine-tuning tooling in-house often delivers better results than adapting generic platforms to specialized requirements.
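As a concrete illustration of why in-house evaluation tooling can beat generic platforms, here is a minimal sketch of a domain-specific metric that a generic platform would not ship out of the box. All names (`citation_coverage`, `run_eval`) and the toy metric itself are ours, purely for illustration:

```python
from typing import Callable

def citation_coverage(answer: str, required_ids: set) -> float:
    """Domain-specific metric: fraction of required source IDs cited in the answer."""
    cited = {rid for rid in required_ids if rid in answer}
    return len(cited) / len(required_ids) if required_ids else 1.0

def run_eval(cases: list, generate: Callable[[str], str]) -> float:
    """Average metric score over a small eval set."""
    scores = [citation_coverage(generate(c["prompt"]), c["required_ids"]) for c in cases]
    return sum(scores) / len(scores)

# Toy usage with a stub in place of a real model call
cases = [{"prompt": "Summarize policy X", "required_ids": {"[doc-1]", "[doc-2]"}}]

def stub(prompt: str) -> str:
    return "Per [doc-1], policy X requires annual review."

print(f"citation coverage: {run_eval(cases, stub):.2f}")  # 0.50
```

The metric encodes a business rule (answers must cite their sources) that only this team knows to check, which is the core argument for building evaluation tooling rather than buying it.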
```python
# Build vs. buy scoring matrix (example scores, 1-10)
components = {
    "Vector DB":      {"differentiation": 1, "complexity": 8, "decision": "Buy (Pinecone/Qdrant)"},
    "LLM API":        {"differentiation": 2, "complexity": 9, "decision": "Buy (OpenAI/Anthropic)"},
    "RAG pipeline":   {"differentiation": 6, "complexity": 5, "decision": "Build (domain-specific)"},
    "Fine-tuning":    {"differentiation": 8, "complexity": 6, "decision": "Build (task-specific)"},
    "Eval framework": {"differentiation": 7, "complexity": 4, "decision": "Build (custom metrics)"},
    "Monitoring":     {"differentiation": 3, "complexity": 6, "decision": "Buy (LangFuse/W&B)"},
}
# Rule of thumb: differentiation > 5 AND complexity < 7 -> Build
```
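The rule of thumb above is trivial to express as code, which makes it easy to sanity-check against the matrix (a sketch; `decide` is our name for it):

```python
def decide(differentiation: int, complexity: int) -> str:
    # Build when the component differentiates the product (score > 5)
    # AND is tractable to build in-house (complexity < 7); otherwise buy.
    return "Build" if differentiation > 5 and complexity < 7 else "Buy"

print(decide(differentiation=1, complexity=8))  # Vector DB -> Buy
print(decide(differentiation=6, complexity=5))  # RAG pipeline -> Build
print(decide(differentiation=8, complexity=6))  # Fine-tuning -> Build
```

Applied to every row of the example matrix, the rule reproduces the listed decisions: only the RAG pipeline, fine-tuning, and eval framework clear both thresholds.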
Vendor lock-in risk is a key consideration in the build vs. buy calculation for LLM infrastructure. Deeply integrating a vendor's proprietary APIs, data formats, and deployment patterns creates switching costs that grow over time as more application code depends on vendor-specific interfaces. Mitigation strategies include building thin abstraction layers over vendor APIs that can be reimplemented against different backends, using open standards like the OpenAI API format that multiple providers support, and maintaining the option to self-host open-weight models as a fallback. The lock-in risk is higher for foundation model providers than for infrastructure components where competitive alternatives exist.
The organizational capability to build and maintain LLM infrastructure is as important as the technical feasibility. A team without dedicated ML engineers, DevOps expertise, and GPU infrastructure management experience will struggle to build and operate production-grade model serving infrastructure regardless of technical approach. Honest capability assessment (staffing levels, skill sets, and engineering bandwidth available for infrastructure maintenance) often determines build vs. buy decisions more than abstract cost calculations. Buying frees scarce engineering capacity for application development; building requires sustained infrastructure investment that competes with feature development priorities.