
Open Source LLM Models

Llama, Mistral, Gemma, Phi, and Qwen — capability benchmarks, licence constraints, and deployment considerations

Contents
  1. Why open models
  2. Model landscape
  3. Licence comparison
  4. Running locally
  5. Fine-tuning open models
  6. Tools & ecosystem
  7. References
01 — Advantages

Why Open Models

Open-weight models offer four core advantages:

  • Control: you own the weights and can run inference anywhere
  • Cost: no per-token API fees; local inference runs at marginal compute cost
  • Privacy: data never leaves your infrastructure
  • Customization: fine-tune and ship proprietary variants

The tradeoff: support and quality may lag proprietary models. OpenAI's GPT-4 still outperforms open alternatives on many benchmarks. But open models are rapidly catching up, and in many applications they're good enough and more cost-effective.

💡 Open doesn't mean free: You pay for compute, not tokens. A local Llama 3.3 70B on H100s costs more per inference than Claude API. But if you have bulk usage or strict privacy needs, open models win long-term.
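A rough break-even sketch makes the cost tradeoff concrete (all prices below are illustrative assumptions, not quoted rates):

```python
def monthly_cost_api(tokens_per_month: float, price_per_mtok: float) -> float:
    """API billing: pay per million tokens, no fixed cost."""
    return tokens_per_month / 1e6 * price_per_mtok

def monthly_cost_self_hosted(gpu_hourly: float, hours: float = 730) -> float:
    """Self-hosting: pay for the GPU whether or not it is busy."""
    return gpu_hourly * hours

# Illustrative assumptions: $3 per M tokens via API, $2.50/hr for one
# always-on GPU (roughly one month = 730 hours).
for mtok in (10, 100, 1000, 5000):
    api = monthly_cost_api(mtok * 1e6, 3.0)
    local = monthly_cost_self_hosted(2.50)
    print(f"{mtok:>5} M tokens/mo   API ${api:>8.0f}   self-hosted ${local:>8.0f}")
```

Under these assumptions the always-on GPU dominates at low volume; past a few hundred million tokens per month the fixed cost wins, which is the "bulk usage" case above.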
02 — Market

Model Landscape Comparison

| Model | Params | Context | MMLU | Licence | Best For |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | 128K | 86% | Meta Llama | Production, reasoning |
| Llama 3.1 8B | 8B | 128K | 76% | Meta Llama | Edge, latency-sensitive |
| Mistral 7B | 7B | 32K | 64% | Apache 2.0 | Commercial, permissive licence |
| Mistral NeMo 12B | 12B | 128K | 73% | Apache 2.0 | Small + powerful |
| Gemma 2 9B | 9B | 8K | 79% | Google Gemma | Quality at small size |
| Phi-3.5 Mini | 3.8B | 128K | 69% | MIT | Ultra-light, edge |
| Qwen 2.5 72B | 72B | 128K | 86% | Tongyi Qianwen | Chinese, multilingual |
| DeepSeek-V3 | 671B (37B active) | 128K | 88% | DeepSeek | Frontier performance |

Key Observations

  • 7B-12B is the sweet spot: Mistral 7B, Gemma 2 9B, and Llama 3.1 8B offer strong quality at modest compute cost
  • Long context: Llama 3.1 and Mistral NeMo support 128K context, useful for retrieval and document processing
  • Multilingual: Qwen and DeepSeek excel at non-English tasks
  • Tiny models: Phi-3.5 (3.8B) fits on phones and is surprisingly capable

⚠️ MMLU is one benchmark. Pick models by your actual workload: customer support, code, reasoning, etc. Fine-tuned 7B models often beat larger untuned ones on specific tasks.
Python · Compare open models on a custom benchmark
import time, statistics
from openai import OpenAI

# Each vLLM instance exposes an OpenAI-compatible API
clients = {
    "llama3-8b":    OpenAI(base_url="http://llama3:8000/v1",    api_key="x"),
    "mistral-7b":   OpenAI(base_url="http://mistral:8000/v1",   api_key="x"),
    "gemma2-9b":    OpenAI(base_url="http://gemma2:8000/v1",    api_key="x"),
}

EVAL_PROMPTS = [
    {"q": "What is the capital of France?", "a": "paris"},
    {"q": "Write a Python function to reverse a string.", "a": "def"},
    {"q": "Explain gradient descent in one sentence.", "a": "optimiz"},
]

def benchmark_model(name: str, client: OpenAI) -> dict:
    scores, latencies = [], []
    for item in EVAL_PROMPTS:
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": item["q"]}],
            max_tokens=128, temperature=0.0
        ).choices[0].message.content
        latencies.append(time.perf_counter() - t0)
        scores.append(1 if item["a"].lower() in resp.lower() else 0)
    return {
        "model": name,
        "accuracy": round(statistics.mean(scores), 2),
        "avg_latency_s": round(statistics.mean(latencies), 2)
    }

for name, c in clients.items():
    print(benchmark_model(name, c))
03 — Legal

Licence Comparison

Meta Llama Licence (Llama 2, 3, 3.1, 3.3)

Free for research and commercial use, with one threshold: products exceeding 700 million monthly active users require a separate licence from Meta. Derivative models must include "Llama" in their name, and redistribution requires "Built with Llama" attribution. Llama outputs may not be used to improve other (non-Llama) models; Llama 3.1+ relaxed this, permitting outputs for synthetic data and distillation as long as the resulting models carry the "Llama" prefix.

Apache 2.0 (Mistral, NeMo)

Fully permissive. Unrestricted commercial use. Must retain the licence notice. Includes an express patent grant from contributors. Standard open-source licence.

Google Gemma Licence

Commercial use permitted. The Gemma Terms of Use allow commercial use and redistribution, but attach a prohibited-use policy, and Google reserves the right to restrict usage it considers harmful.

MIT (Phi-3.5)

Most permissive. Use, modify, and redistribute freely; the only obligation is retaining the copyright and licence notice. Standard MIT terms.

Tongyi Qianwen (Qwen) & DeepSeek

Custom licences. The Qwen licence permits commercial use, with a separate licence required above 100 million monthly active users. DeepSeek's model licence permits both academic and commercial use, subject to a use-restriction list.

💡 For commercial deployment: Prefer Apache 2.0 (Mistral, NeMo) or MIT (Phi). Meta Llama 3+ is safe commercially but read the fine print. Avoid non-commercial licences unless you have explicit permission.
04 — Deployment

Running Locally

| Tool | GPU VRAM | Speed | Ease | Best For |
|---|---|---|---|---|
| Ollama | 4GB (7B) | Good | Easiest | Local macOS/Linux, no setup |
| llama.cpp | 2GB (7B) | Very fast | CLI-only | CPU inference, extreme efficiency |
| LM Studio | 4GB (7B) | Good | Easiest (GUI) | Windows/Mac, interactive |
| vLLM | 16GB+ (scales to 70B) | Fastest | Python setup | Production, batch inference |
| Transformers | Varies | Standard | Python | Research, custom logic |

Quick Start: Ollama

# Install Ollama (macOS/Linux): https://ollama.ai

# Pull a model
ollama pull llama2

# Run interactively
ollama run llama2

# Or via the API
curl http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Ollama is the easiest starting point: It handles model downloads, quantization, and caching automatically. ollama run llama2 instantly gives you a chat interface.

Hardware Requirements

7B models: ~14GB of GPU memory at FP16; 8GB with 8-bit quantization, ~4GB with aggressive 4-bit quantization. CPU inference is 10-50× slower but works. 70B models: ~140GB at FP16, ~70GB at 8-bit, ~40GB with 4-bit quantization; expect multiple GPUs, or a single 80GB-class H100 for quantized weights. Phi-3.5 (3.8B): ~2GB quantized, runs on most machines.
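The figures above follow from simple arithmetic: weight memory is parameter count times bytes per parameter, plus runtime overhead. A back-of-the-envelope helper (the 1.2× overhead factor is a rough assumption for KV cache and buffers; real usage depends on context length and batch size):

```python
def vram_gb(params_b: float, bits: int = 16, overhead: float = 1.2) -> float:
    """Rough GPU-memory estimate in GB for a params_b-billion-param model.

    weights = params * (bits / 8) bytes; overhead approximates KV cache
    and runtime buffers. Treats 1B params * 1 byte as 1 GB for simplicity.
    """
    return params_b * (bits / 8) * overhead

for name, p in [("Phi-3.5 Mini", 3.8), ("Mistral 7B", 7.0), ("Llama 3.3 70B", 70.0)]:
    print(f"{name:<14} fp16 ~{vram_gb(p, 16):.0f} GB   4-bit ~{vram_gb(p, 4):.0f} GB")
```

This reproduces the rules of thumb above: a 7B model needs roughly 4GB at 4-bit, and a 70B model roughly 40GB.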

05 — Customization

Fine-Tuning Open Models

The choice between fine-tuning, RAG, and prompt engineering depends on your use case and compute budget.

Decision Framework

📚 Use RAG When:

  • Knowledge is external (docs, databases)
  • Updates shouldn't require retraining
  • Quick implementation needed

✏️ Use Prompt Engineering When:

  • Few examples available
  • No compute budget
  • Style/format tuning only

🎯 Use Fine-Tuning When:

  • Domain-specific knowledge needed
  • Compute available
  • Large training dataset (1000+)

🔄 Use Hybrid When:

  • Fine-tune + RAG for best results
  • Fine-tune for style, RAG for facts
  • Higher implementation cost
Python · Run Llama 3 locally with Ollama (simplest path)
import requests, json

# Install Ollama: https://ollama.com
# Pull model: ollama pull llama3.2

OLLAMA_URL = "http://localhost:11434"

def ollama_chat(model: str, messages: list[dict],
                stream: bool = False) -> str:
    """Chat with a locally-running Ollama model."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": model, "messages": messages, "stream": stream},
        timeout=120
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def ollama_list_models() -> list[str]:
    """List locally available models."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

# OpenAI-compatible API (also supported by Ollama)
from openai import OpenAI
client = OpenAI(base_url=f"{OLLAMA_URL}/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is RAG in 2 sentences?"}]
)
print(response.choices[0].message.content)

# Other local options:
# - LM Studio: GUI app, supports GGUF models
# - llama.cpp: C++ inference, minimal dependencies
# - vLLM: production serving, best throughput

LoRA Fine-Tuning Workflow

Step 1: Prepare a dataset (1,000–10,000 examples). Step 2: Train with Axolotl or Unsloth using LoRA. Step 3: Merge the LoRA weights into the base model. Step 4: Deploy the merged model, or load the adapter at inference time.

💡 LoRA is portable: A LoRA adapter is 10–50MB vs full model size (15GB for 7B). Distribute adapters easily, merge at inference time, or stack multiple LoRAs.
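The merge step and the adapter-size claim can be sketched with plain numpy: LoRA stores two low-rank matrices whose product is added to the frozen weight. Dimensions and init scales below are illustrative; real trainers like Axolotl and Unsloth handle this via the PEFT library.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                        # hidden size, LoRA rank (r << d)

W = rng.normal(size=(d, d))          # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # up-projection, zero init: adapter starts as a no-op

# After training B is no longer zero; simulate a trained adapter here:
B = rng.normal(size=(d, r)) * 0.01

# Option 1: merge once, then serve a plain dense weight
W_merged = W + B @ A

# Option 2: keep the adapter separate and apply it on the fly
x = rng.normal(size=(d,))
y_adapter = W @ x + B @ (A @ x)

assert np.allclose(W_merged @ x, y_adapter)

# Storage: the adapter holds 2*d*r params vs d*d for the full matrix
print(f"adapter params: {2 * d * r:,}  vs full layer: {d * d:,}")
```

The 2·d·r vs d·d ratio (about 3% here) is why adapters weigh tens of megabytes while the base model weighs gigabytes, and why stacking or swapping multiple LoRAs at inference time is cheap.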
06 — Ecosystem

Tools & Frameworks

| Category | Tool | Notes |
|---|---|---|
| Runtime | Ollama | Easy local inference, model management, no code needed; macOS/Linux/Windows |
| Inference | llama.cpp | CPU-first, ultra-efficient; quantized models; C++ backend |
| GUI | LM Studio | Windows/Mac GUI; model downloads, chat interface, easy setup |
| Server | vLLM | Production inference server; batch processing; OpenAI API-compatible |
| Hub | Hugging Face Hub | Model repository with 500K+ open models; standard distribution format |
| UI | Open WebUI | Web chat interface; works with Ollama, vLLM, OpenAI API |
| Format | GGUF | Quantized model format, small and fast; used by llama.cpp and Ollama |
| Apple | MLX | Apple Silicon optimization; fast on M1/M2/M3; Python library |
07 — Further Reading

References

Model Cards & Docs
Inference Tools
Practitioner Writing