Meta's Llama 3.1 family (8B, 70B, and 405B) is the most capable open-weight model series as of 2024: 128K context, multilingual support, function calling, and a permissive commercial licence. The 405B is competitive with GPT-4o on several benchmarks.
Meta's Llama 3.1 (released July 2024) is the most widely used open-weight LLM family. The three sizes cover different deployment scenarios: 8B for edge/consumer GPUs, 70B for server deployment where cost matters, 405B for maximum quality when you need GPT-4-level capability with data sovereignty.
Key improvements over Llama 2: 128K context window (up from 4K), multilingual support (English, German, French, Italian, Portuguese, Hindi, Spanish, Thai), improved instruction following, and official function/tool calling support. The 405B model scores above GPT-4o on MT-Bench and several coding benchmarks.
Licence: Meta's custom licence allows commercial use for products with fewer than 700M monthly active users (effectively unrestricted for most companies). Fine-tuned derivatives must retain the Llama licence.
Llama 3.1 comes in two flavours for each size: base models (pre-trained, for fine-tuning) and Instruct models (fine-tuned for chat/instruction following with RLHF/DPO).
| Model | VRAM (fp16) | VRAM (4-bit) | Tokens/sec (A100) |
|---|---|---|---|
| Llama-3.1-8B-Instruct | 16 GB | 5 GB | ~120 tok/s |
| Llama-3.1-70B-Instruct | 140 GB | 40 GB | ~25 tok/s |
| Llama-3.1-405B-Instruct | 810 GB | 230 GB | ~5 tok/s |
The 70B at 4-bit fits on a single A100 (80 GB), making it the go-to choice for production deployments where you want high quality without multi-node complexity. The 8B at 4-bit runs on any modern consumer GPU (e.g. an RTX 3090) and on Apple Silicon machines such as an M2 MacBook Pro.
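The table's memory figures follow from a simple rule of thumb: weight memory ≈ parameter count × bytes per parameter, plus headroom for the KV cache, activations, and runtime buffers. A rough sketch (the 15% overhead factor is an illustrative assumption, not a measured number):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9

def vram_estimate_gb(n_params: float, bits_per_param: float,
                     overhead: float = 0.15) -> float:
    """Add headroom for KV cache, activations, and runtime buffers.
    The 15% default is an illustrative assumption."""
    return weight_memory_gb(n_params, bits_per_param) * (1 + overhead)

print(weight_memory_gb(8e9, 16))         # 16.0 GB  (8B at fp16)
print(weight_memory_gb(70e9, 16))        # 140.0 GB (70B at fp16)
print(round(vram_estimate_gb(70e9, 4)))  # 40 GB    (70B at 4-bit, with headroom)
```

Long contexts grow the KV cache well past 15%, so treat this as a floor, not a budget.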
```bash
# Install Ollama
curl https://ollama.ai/install.sh | sh

# Pull and run Llama 3.1 8B
ollama run llama3.1

# Or 70B (needs ~40 GB of VRAM, or system RAM for CPU inference)
ollama run llama3.1:70b
```

Ollama serves an OpenAI-compatible API, so the standard `openai` client works against it:

```bash
pip install openai
```

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible endpoint on port 11434
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a helpful Python expert."},
        {"role": "user", "content": "Write a function to parse a CSV without pandas."},
    ],
    temperature=0.1,
)
print(response.choices[0].message.content)
```
If you don't want to self-host, several providers offer Llama 3.1 inference via an OpenAI-compatible API:
```python
import openai

# Groq: fastest hosted inference (200+ tok/s on 70B via custom hardware)
client = openai.OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key="gsk_...",
)
response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[{"role": "user", "content": "Explain attention in one paragraph."}],
)

# Together AI
client = openai.OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key="...",
)
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)
```

Fireworks AI, Replicate, and Hugging Face Inference Endpoints also serve Llama 3.1.
Fine-tuning the 8B on a single consumer GPU is practical with QLoRA, here via Unsloth and TRL:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load with 4-bit quantisation via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

# your_dataset: a Hugging Face Dataset with a "text" column
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=your_dataset,
    dataset_text_field="text",
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("llama31-finetuned")  # saves the LoRA adapter weights
```
Llama 3.1 Instruct models support tool calling using a JSON-based schema similar to OpenAI's format. The model outputs a structured tool call when it determines an external function is needed.
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

tool_call = response.choices[0].message.tool_calls[0]
print(tool_call.function.name)       # get_weather
print(tool_call.function.arguments)  # JSON string, e.g. '{"city": "Tokyo"}'
```
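The code above stops at the model's tool call; in a real loop you execute the function and send the result back as a `tool` role message. A minimal offline sketch of that dispatch step (the `get_weather` stub and the registry are illustrative, not part of any library):

```python
import json

# Illustrative stub: a real implementation would call a weather API.
def get_weather(city: str, unit: str = "celsius") -> dict:
    return {"city": city, "temp": 21, "unit": unit}

TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(name: str, arguments: str, tool_call_id: str = "") -> dict:
    """Execute a model-issued tool call and wrap the result as a `tool`
    role message; OpenAI-compatible APIs expect the call id echoed back."""
    result = TOOL_REGISTRY[name](**json.loads(arguments))
    return {"role": "tool", "tool_call_id": tool_call_id,
            "content": json.dumps(result)}

msg = run_tool_call("get_weather", '{"city": "Tokyo"}', tool_call_id="call_0")
```

Append `msg` to `messages` and call `chat.completions.create` again; the model then answers using the tool result, which is the entire agent loop.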
**Chat template matters:** Llama 3.1 uses a specific chat template with special tokens (`<|begin_of_text|>`, `<|start_header_id|>`, etc.). Always use the tokenizer's `apply_chat_template()` method rather than formatting prompts manually; the wrong template produces garbage output.

**System prompts and safety:** Llama 3.1 Instruct is aligned to refuse clearly harmful requests out of the box. For use cases where the default behaviour is too restrictive, a custom system prompt can steer it considerably; the base (non-Instruct) model has no such alignment at all.

**405B deployment:** At fp16, the 405B needs roughly 8×A100 80 GB. Most teams use 4-bit quantisation (fits in 4×A100 80 GB) or access it via API (Groq, Together). Running it yourself is only worthwhile if you have the hardware and strict data-residency requirements.

**Context length vs quality:** The 128K context window works, but quality degrades on very long inputs (the "lost in the middle" problem). For RAG, still prefer chunking + retrieval over stuffing 100K tokens into context.
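The chunk-and-retrieve approach can be sketched with a minimal fixed-size chunker with overlap (word-based windows are a simplification; production pipelines usually chunk by tokens or by document structure, and the sizes here are arbitrary):

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks for retrieval."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk_words(" ".join(f"w{i}" for i in range(500)))
# 3 chunks of up to 200 words; adjacent chunks share 40 words of overlap
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk.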
For context, the broader Llama 3 family now spans everything from 1B on-device models (Llama 3.2) to the 405B frontier model, with Instruct variants for chat and base models for further fine-tuning:
| Model | Parameters | Context | MMLU | Best Use Case |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | 128K | ~73% | Edge / high-volume |
| Llama 3.1 70B Instruct | 70B | 128K | ~86% | Production workhorse |
| Llama 3.1 405B Instruct | 405B | 128K | ~88% | Frontier tasks |
| Llama 3.2 1B/3B | 1–3B | 128K | ~50–60% | On-device, mobile |
| Llama 3.2 11B/90B Vision | 11–90B | 128K | ~73–86% | Multimodal tasks |
Llama 3's tokenizer uses a 128K vocabulary size, compared to Llama 2's 32K vocabulary. The larger vocabulary improves tokenization efficiency for code, mathematical notation, and non-English languages, encoding the same content in fewer tokens. This directly reduces inference costs and context window consumption for applications processing multilingual or technical content. The vocabulary expansion also improves the model's ability to handle rare proper nouns and technical terms without fragmenting them into multiple subword tokens.
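The fragmentation effect is easy to see with a toy greedy longest-match tokenizer over two invented vocabularies (illustrative only: real tokenizers use learned BPE merges, not a hand-picked vocabulary):

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a fixed vocabulary;
    unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest match first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

small = {"token", "iz", "ation"}               # small vocab fragments the word
large = small | {"tokenization"}               # larger vocab keeps it whole
print(greedy_tokenize("tokenization", small))  # ['token', 'iz', 'ation']
print(greedy_tokenize("tokenization", large))  # ['tokenization']
```

Fewer tokens per word means fewer forward passes per sentence, which is where the inference-cost saving comes from.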
The Llama 3 instruction format uses a specific special token structure for conversation turns that must be followed exactly when deploying base models with custom system prompts or when constructing multi-turn conversation contexts manually. The format uses BOS, system, user, and assistant role markers as special tokens rather than human-readable strings, ensuring they are never confused with user-provided content. Incorrect conversation format — especially missing end-of-turn tokens — is the most common cause of Llama 3 generation quality degradation when moving from API to self-hosted deployments.
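A sketch that makes the token structure visible (in real code, always let `tokenizer.apply_chat_template()` produce this string rather than hand-building it):

```python
def render_llama3_prompt(messages: list[dict]) -> str:
    """Render messages in the Llama 3 special-token format, ending with
    an open assistant header so the model generates the next turn."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                f"{m['content']}<|eot_id|>")
    return out + "<|start_header_id|>assistant<|end_header_id|>\n\n"

prompt = render_llama3_prompt([
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Hi"},
])
```

A missing `<|eot_id|>` at the end of a turn is exactly the "missing end-of-turn token" failure mode described above: the model keeps generating as if the previous speaker never finished.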
Llama 3's training data of 15 trillion tokens represents a 7× scale-up from Llama 2's 2 trillion tokens, with careful curation to improve data quality rather than simply scaling volume. The training mixture includes more code, reasoning-heavy content, and multilingual data than previous iterations. Post-training alignment uses a combination of supervised fine-tuning, rejection sampling fine-tuning, proximal policy optimization, and direct preference optimization, with each stage targeting different quality dimensions including instruction following, safety, and reasoning quality.
Function calling in Llama 3 Instruct models follows a JSON-based tool definition format compatible with the OpenAI tool calling interface, enabling direct substitution in applications built against that standard. The model generates tool calls as structured JSON objects within its response, which application code parses and executes. Multi-turn tool use — where the model makes a tool call, receives the result, and continues reasoning — is supported through the conversation format, enabling agent loops with Llama 3 that require minimal custom code beyond standard chat completion API calls.
Llama 3's safety training uses adversarial red-teaming datasets that include jailbreak attempts, harmful instruction following, and sensitive topic handling. Despite strong safety performance on standard benchmarks, Llama 3 Instruct models are more permissive than Claude or GPT-4 on certain edge cases by design — Meta explicitly calibrated the safety-utility trade-off toward helpfulness for developers who need to build applications with custom safety layers rather than relying entirely on model-level refusals. Organizations deploying Llama 3 should plan to implement domain-specific safety evaluation and guardrail layers appropriate for their use case.
Quantized Llama 3 variants maintain strong quality relative to full-precision at 4-bit and 8-bit precision levels, making them practical choices for memory-constrained deployments. The 70B model in Q4_K_M quantization requires approximately 40GB of VRAM — fitting on 2× A100 40GB GPUs or a single A100 80GB — while retaining performance comparable to the full FP16 model on most tasks. This accessibility has made quantized Llama 3 70B one of the most widely deployed open-weight models for production use cases requiring near-frontier quality at reduced infrastructure cost.