Llama, Mistral, Gemma, Phi, and Qwen — capability benchmarks, licence constraints, and deployment considerations
Open-weight models offer four core advantages:

- **Control:** you own the weights and can run inference anywhere.
- **Cost:** no per-token API fees; local inference runs at marginal cost.
- **Privacy:** data never leaves your infrastructure.
- **Customization:** fine-tune and deploy proprietary variants.
The tradeoff: support and quality may lag proprietary models. OpenAI's GPT-4 still outperforms open alternatives on many benchmarks. But open models are rapidly catching up, and in many applications they're good enough and more cost-effective.
| Model | Params | Context | MMLU | Licence | Best For |
|---|---|---|---|---|---|
| Llama 3.3 70B | 70B | 128K | 86% | Meta Llama | Production, reasoning |
| Llama 3.1 8B | 8B | 128K | 76% | Meta Llama | Edge, latency-sensitive |
| Mistral 7B | 7B | 32K | 64% | Apache 2.0 | Commercial, permissive licence |
| Mistral NeMo 12B | 12B | 128K | 73% | Apache 2.0 | Small + powerful |
| Gemma 2 9B | 9B | 8K | 79% | Google Gemma | Quality on small size |
| Phi-3.5 Mini | 3.8B | 128K | 69% | MIT | Ultra-light, edge |
| Qwen 2.5 72B | 72B | 128K | 86% | Tongyi Qianwen | Chinese, multilingual |
| DeepSeek-V3 | 671B | 128K | 88% | DeepSeek | Frontier performance |
- **7B–12B is the sweet spot:** Mistral 7B, Gemma 2 9B, and Llama 3.1 8B offer strong quality at modest compute cost.
- **Long context:** Llama 3.1 and Mistral NeMo support 128K context, useful for retrieval and document processing.
- **Multilingual:** Qwen and DeepSeek excel at non-English tasks.
- **Tiny models:** Phi-3.5 (3.8B) fits on phones and is surprisingly capable.
A quick way to compare candidates is a small harness against vLLM's OpenAI-compatible endpoints:

```python
import time, statistics
from openai import OpenAI

# All three vLLM instances expose an OpenAI-compatible API
clients = {
    "llama3-8b": OpenAI(base_url="http://llama3:8000/v1", api_key="x"),
    "mistral-7b": OpenAI(base_url="http://mistral:8000/v1", api_key="x"),
    "gemma2-9b": OpenAI(base_url="http://gemma2:8000/v1", api_key="x"),
}

EVAL_PROMPTS = [
    {"q": "What is the capital of France?", "a": "paris"},
    {"q": "Write a Python function to reverse a string.", "a": "def"},
    {"q": "Explain gradient descent in one sentence.", "a": "optimiz"},
]

def benchmark_model(name: str, client: OpenAI) -> dict:
    scores, latencies = [], []
    for item in EVAL_PROMPTS:
        t0 = time.perf_counter()
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user", "content": item["q"]}],
            max_tokens=128, temperature=0.0,
        ).choices[0].message.content
        latencies.append(time.perf_counter() - t0)
        scores.append(1 if item["a"].lower() in resp.lower() else 0)
    return {
        "model": name,
        "accuracy": round(statistics.mean(scores), 2),
        "avg_latency_s": round(statistics.mean(latencies), 2),
    }

for name, c in clients.items():
    print(benchmark_model(name, c))
```
Meta Llama licence: Free for research and most commercial use, but services with more than 700 million monthly active users need a separate licence from Meta. You can't imply your model is made or endorsed by Meta. Earlier versions barred using Llama outputs to improve other LLMs; Llama 3.1 relaxed this, permitting models trained on Llama outputs provided their names include "Llama".
Apache 2.0 (Mistral): Fully permissive. Commercial use is unrestricted; you must retain the licence notice, and the licence includes an explicit patent grant. Standard open-source licence.
Google Gemma licence: Permits commercial use, but subject to Google's prohibited-use policy, and Google reserves the right to restrict usage it considers harmful. Read the terms before shipping.
MIT (Phi): Most permissive. Use, modify, and redistribute freely; just retain the copyright notice. Standard MIT terms.
Custom licences (Qwen, DeepSeek): Qwen allows commercial use under the Tongyi Qianwen licence terms, and several Qwen 2.5 sizes ship under Apache 2.0. DeepSeek's model licence permits both academic and commercial use.
| Tool | GPU VRAM | Speed | Ease | Best For |
|---|---|---|---|---|
| Ollama | 4GB (7B) | Good | Easiest | Local macOS/Linux, no setup |
| llama.cpp | 2GB (7B) | Very fast | CLI-only | CPU inference, extreme efficiency |
| LM Studio | 4GB (7B) | Good | Easiest (GUI) | Windows/Mac, interactive |
| vLLM | 16GB+ (70B) | Fastest | Python setup | Production, batch inference |
| Transformers | Varies | Standard | Python | Research, custom logic |
Ollama is the easiest starting point: it handles model downloads, quantization, and caching automatically. `ollama run llama3.2` instantly gives you a chat interface.
- **7B models:** ~8GB of GPU VRAM at 8-bit, ~4GB with 4-bit quantization. CPU inference works but is typically 10–50× slower.
- **70B models:** ~140GB at FP16, ~40GB with 4-bit quantization; either way you need an H100-class card or multiple GPUs.
- **Phi-3.5 (3.8B):** ~2GB quantized; runs on most machines.
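These figures follow from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations (assumed here as a flat 20%). A sketch under those assumptions; `estimate_vram_gb` is an illustrative helper, not a library function:

```python
def estimate_vram_gb(params_b: float, bits: int = 16, overhead: float = 0.2) -> float:
    """Rough VRAM needed to serve a model: weight memory plus a flat
    overhead factor for KV cache and activations."""
    weight_gb = params_b * bits / 8  # params (billions) * bytes/param = GB
    return round(weight_gb * (1 + overhead), 1)

for params, bits in [(7, 4), (7, 16), (70, 4), (70, 16)]:
    print(f"{params}B @ {bits}-bit: ~{estimate_vram_gb(params, bits)} GB")
```

Real usage also scales with context length (the KV cache grows linearly with tokens), so treat these as floor estimates.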
The choice between fine-tuning, RAG, and prompt engineering depends on your use case and compute budget.
Ollama's REST API makes local inference a few lines of Python:

```python
import requests

# Install Ollama: https://ollama.com
# Pull model: ollama pull llama3.2
OLLAMA_URL = "http://localhost:11434"

def ollama_chat(model: str, messages: list[dict],
                stream: bool = False) -> str:
    """Chat with a locally-running Ollama model."""
    resp = requests.post(
        f"{OLLAMA_URL}/api/chat",
        json={"model": model, "messages": messages, "stream": stream},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

def ollama_list_models() -> list[str]:
    """List locally available models."""
    resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
    resp.raise_for_status()
    return [m["name"] for m in resp.json().get("models", [])]

# Ollama also exposes an OpenAI-compatible API:
from openai import OpenAI

client = OpenAI(base_url=f"{OLLAMA_URL}/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "What is RAG in 2 sentences?"}],
)
print(response.choices[0].message.content)

# Other local options:
# - LM Studio: GUI app, supports GGUF models
# - llama.cpp: C++ inference, minimal dependencies
# - vLLM: production serving, best throughput
```
1. Prepare a dataset (1,000–10,000 examples).
2. Fine-tune with Axolotl or Unsloth using LoRA.
3. Merge the LoRA weights into the base model.
4. Deploy the merged model, or load the LoRA adapter at inference time.
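Step 1 is mostly data plumbing. A sketch of converting raw question/answer pairs into ShareGPT-style JSONL, one common format Axolotl can ingest (the `conversations`/`from`/`value` field names follow the ShareGPT convention; check the Axolotl docs for the exact schema your version expects):

```python
import json

def to_sharegpt(pairs: list[tuple[str, str]]) -> list[dict]:
    """Convert (question, answer) pairs into ShareGPT-style records."""
    return [
        {"conversations": [
            {"from": "human", "value": q},
            {"from": "gpt", "value": a},
        ]}
        for q, a in pairs
    ]

pairs = [
    ("What is LoRA?",
     "LoRA adds small trainable low-rank matrices to a frozen base model."),
    ("Why merge adapters?",
     "Merging folds the LoRA weights back in, so inference needs no extra code."),
]

# Write one JSON record per line (JSONL), as training tools expect.
with open("train.jsonl", "w") as f:
    for record in to_sharegpt(pairs):
        f.write(json.dumps(record) + "\n")
```

Garbage in, garbage out applies with force here: a few thousand clean, task-specific examples beat a much larger noisy set.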