Llama, Mistral, Gemma, Phi, and Qwen — capability benchmarks, licence constraints, and deployment considerations
Open-weight models offer four core advantages:

- **Control:** you own the weights and can run inference anywhere.
- **Cost:** no per-token API fees; local inference runs at marginal cost.
- **Privacy:** data never leaves your infrastructure.
- **Customization:** fine-tune and deploy proprietary variants.
The tradeoff: support and quality may lag proprietary models. OpenAI's GPT-4 still outperforms open alternatives on many benchmarks. But open models are rapidly catching up, and in many applications they're good enough and more cost-effective.
| Model | Params | Context | MMLU | Licence | Best For |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | 128K | 86% | Meta Llama | Production, reasoning |
| Llama 3.1 8B | 8B | 128K | 76% | Meta Llama | Edge, latency-sensitive |
| Mistral 7B | 7B | 32K | 64% | Apache 2.0 | Commercial, permissive licence |
| Mistral NeMo 12B | 12B | 128K | 73% | Apache 2.0 | Small + powerful |
| Gemma 2 9B | 9B | 8K | 79% | Google Gemma | Quality on small size |
| Phi-3.5 Mini | 3.8B | 128K | 69% | MIT | Ultra-light, edge |
| Qwen 2.5 72B | 72B | 128K | 86% | Tongyi Qianwen | Chinese, multilingual |
| DeepSeek-V3 | 671B | 128K | 88% | DeepSeek | Frontier performance |
- **7B–12B is the sweet spot:** Mistral 7B, Gemma 2 9B, and Llama 3.1 8B offer strong quality at modest compute cost.
- **Long context:** Llama 3.1 and Mistral NeMo support 128K context, useful for retrieval and document processing.
- **Multilingual:** Qwen and DeepSeek excel at non-English tasks.
- **Tiny models:** Phi-3.5 (3.8B) fits on phones and is surprisingly capable.
A quick way to compare candidates is a small harness against vLLM's OpenAI-compatible endpoints:

```python
import time, statistics
from openai import OpenAI

# All three vLLM instances expose an OpenAI-compatible API
clients = {
    "llama3-8b": OpenAI(base_url="http://llama3:8000/v1", api_key="x"),
    "mistral-7b": OpenAI(base_url="http://mistral:8000/v1", api_key="x"),
    "gemma2-9b": OpenAI(base_url="http://gemma2:8000/v1", api_key="x"),
}

EVAL_PROMPTS = [
    {"q": "What is the capital of France?", "a": "paris"},
    {"q": "Write a Python function to reverse a string.", "a": "def"},
    {"q": "Explain gradient descent in one sentence.", "a": "optimiz"},
]

def benchmark_model(name: str, client: OpenAI) -> dict:
    scores, latencies = [], []
    for item in EVAL_PROMPTS:
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": item["q"]}],
            max_tokens=128, temperature=0.0,
        ).choices[0].message.content
        latencies.append(time.perf_counter() - t0)
        scores.append(1 if item["a"].lower() in resp.lower() else 0)
    return {
        "model": name,
        "accuracy": round(statistics.mean(scores), 2),
        "avg_latency_s": round(statistics.mean(latencies), 2),
    }

for name, c in clients.items():
    print(benchmark_model(name, c))
```
Meta Llama licence: Free for research and most commercial use, but services with more than 700 million monthly active users need a separate licence from Meta. You can't imply your model is made or endorsed by Meta. Earlier versions barred using Llama outputs to improve other LLMs; Llama 3.1 relaxed this, permitting models trained on Llama outputs provided their names include "Llama".
Apache 2.0 (Mistral): Fully permissive. Commercial use is unrestricted; you must retain the licence notice, and the licence includes an explicit patent grant. Standard open-source licence.
Google Gemma licence: Permits commercial use, but subject to Google's prohibited-use policy, and Google reserves the right to restrict usage it considers harmful. Read the terms before shipping.
MIT (Phi): Most permissive. Use, modify, and redistribute freely; just retain the copyright notice. Standard MIT terms.
Custom licences (Qwen, DeepSeek): Qwen allows commercial use under the Tongyi Qianwen licence terms, and several Qwen 2.5 sizes ship under Apache 2.0. DeepSeek's model licence permits both academic and commercial use.
| Tool | GPU VRAM | Speed | Ease | Best For |
|---|---|---|---|---|
| Ollama | 4GB (7B) | Good | Easiest | Local macOS/Linux, no setup |
| llama.cpp | 2GB (7B) | Very fast | CLI-only | CPU inference, extreme efficiency |
| LM Studio | 4GB (7B) | Good | Easiest (GUI) | Windows/Mac, interactive |
| vLLM | 16GB+ (70B) | Fastest | Python setup | Production, batch inference |
| Transformers | Varies | Standard | Python | Research, custom logic |
Ollama is the easiest starting point: it handles model downloads, quantization, and caching automatically. `ollama run llama3.2` instantly gives you a chat interface.
- **7B models:** ~8GB of GPU VRAM at 8-bit, ~4GB with 4-bit quantization. CPU inference works but is typically 10–50× slower.
- **70B models:** ~140GB at FP16, ~40GB with 4-bit quantization; either way you need an H100-class card or multiple GPUs.
- **Phi-3.5 (3.8B):** ~2GB quantized; runs on most machines.
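These figures follow from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations (assumed here as a flat 20%). A sketch under those assumptions; `estimate_vram_gb` is an illustrative helper, not a library function:

```python
def estimate_vram_gb(params_b: float, bits: int = 16, overhead: float = 0.2) -> float:
    """Rough VRAM needed to serve a model: weight memory plus a flat
    overhead factor for KV cache and activations."""
    weight_gb = params_b * bits / 8  # params (billions) * bytes/param = GB
    return round(weight_gb * (1 + overhead), 1)

for params, bits in [(7, 4), (7, 16), (70, 4), (70, 16)]:
    print(f"{params}B @ {bits}-bit: ~{estimate_vram_gb(params, bits)} GB")
```

Real usage also scales with context length (the KV cache grows linearly with tokens), so treat these as floor estimates.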
The choice between fine-tuning, RAG, and prompt engineering depends on your use case and compute budget.
Ollama's REST API makes local inference a few lines of Python:

```python
import requests

# Install Ollama: https://ollama.com
# Pull model: ollama pull llama3.2
OLLAMA_URL = "http://localhost:11434"

def ollama_chat(model: str, messages: list[dict],
                stream: bool = False) -> str:
    """Chat with a locally-running Ollama model."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": model, "messages": messages, "stream": stream},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def ollama_list_models() -> list[str]:
    """List locally available models."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

# Ollama also exposes an OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url=f"{OLLAMA_URL}/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is RAG in 2 sentences?"}],
)
print(response.choices[0].message.content)

# Other local options:
# - LM Studio: GUI app, supports GGUF models
# - llama.cpp: C++ inference, minimal dependencies
# - vLLM: production serving, best throughput
```
1. Prepare a dataset (1,000–10,000 examples).
2. Fine-tune with Axolotl or Unsloth using LoRA.
3. Merge the LoRA weights into the base model.
4. Deploy the merged model, or load the LoRA adapter at inference time.
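Step 1 is mostly data plumbing. A sketch of converting raw question/answer pairs into ShareGPT-style JSONL, one common format Axolotl can ingest (the `conversations`/`from`/`value` field names follow the ShareGPT convention; check the Axolotl docs for the exact schema your version expects):

```python
import json

def to_sharegpt(pairs: list[tuple[str, str]]) -> list[dict]:
    """Convert (question, answer) pairs into ShareGPT-style records."""
    return [
        {"conversations": [
            {"from": "human", "value": q},
            {"from": "gpt", "value": a},
        ]}
        for q, a in pairs
    ]

pairs = [
    ("What is LoRA?",
     "LoRA adds small trainable low-rank matrices to a frozen base model."),
    ("Why merge adapters?",
     "Merging folds the LoRA weights back in, so inference needs no extra code."),
]

# Write one JSON record per line (JSONL), as training tools expect.
with open("train.jsonl", "w") as f:
    for record in to_sharegpt(pairs):
        f.write(json.dumps(record) + "\n")
```

Garbage in, garbage out applies with force here: a few thousand clean, task-specific examples beat a much larger noisy set.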