01 — Foundation
Prompting vs fine-tuning: the decision
Fine-tuning updates a model's weights on your data — teaching it behaviour that prompting cannot reliably produce. Use it when prompting has hit its ceiling.
💡 Golden Rule: Always establish a prompted baseline before fine-tuning. Fine-tuning that does not beat prompting is wasted compute.
When to use each approach
| Scenario | Prompting is enough | Fine-tune |
| --- | --- | --- |
| Model understands task | ✓ Use few-shot examples in prompt | |
| Consistent format/tone | ✓ Specify in instructions | |
| Fewer than ~100 examples | ✓ In-context learning works | |
| Prompt too long/expensive | | ✓ Encode in weights |
| Proprietary style/domain | | ✓ Model must learn it |
| Latency critical | | ✓ Smaller FT model beats large prompted one |
| 500+ quality examples | | ✓ Enough data to fine-tune |
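The prompted baseline from the table's left column can be as simple as a few-shot chat prompt. A minimal sketch in OpenAI chat format — the labelled examples and the `build_baseline_messages` helper are illustrative, not from any library:

```python
# Few-shot prompted baseline for a sentiment-classification task.
# The labelled examples below are placeholder data.
FEW_SHOT = [
    ("The service was excellent.", "Positive"),
    ("I waited 2 hours.", "Negative"),
]

def build_baseline_messages(text: str) -> list[dict]:
    """Build a few-shot classification prompt in OpenAI chat format."""
    messages = [{"role": "system",
                 "content": "Classify the sentiment as Positive or Negative."}]
    for example, label in FEW_SHOT:
        messages.append({"role": "user", "content": f"Classify: {example}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Classify: {text}"})
    return messages

msgs = build_baseline_messages("Great food, slow service.")
```

If a handful of in-context examples like these already hit your accuracy target, there is nothing to fine-tune.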
02 — Process
The fine-tuning pipeline
A complete fine-tuning workflow from raw data to deployed model.
The full pipeline
1. Collect data: Examples of the behaviour you want to teach. Start with 100–500 curated examples; quality beats quantity.
2. Format JSONL: Convert to OpenAI chat format or raw text. Each row is one training example.
3. Train (QLoRA): Use parameter-efficient fine-tuning to update <1% of model weights. Fits on consumer GPUs.
4. Eval (LLM judge/RAGAS): Compare the fine-tuned model to the prompted baseline. Measure on a held-out test set.
5. Deploy: Merge LoRA weights into the base model (or keep them separate). Use in production inference.
Key metrics
- Training loss: Should decrease smoothly; stop early if it plateaus.
- Validation accuracy: Compare to prompted baseline on same test set.
- Inference latency: LoRA adds minimal overhead (~2–5%).
- Compute cost: QLoRA on 8B model ≈ 24 GB VRAM, trains in hours.
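The baseline comparison in step 4 can be sketched as a plain accuracy check over the same held-out set. `predict_baseline` and `predict_finetuned` below are placeholder stand-ins for your real inference calls:

```python
# Compare a fine-tuned model to the prompted baseline on the same
# held-out test set. The predictors here are toy stand-ins.
def accuracy(predict, test_set):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(predict(x) == y for x, y in test_set)
    return correct / len(test_set)

test_set = [("I loved it.", "Positive"), ("Awful.", "Negative")]
predict_baseline = lambda x: "Positive"                              # placeholder
predict_finetuned = lambda x: "Positive" if "loved" in x else "Negative"  # placeholder

base_acc = accuracy(predict_baseline, test_set)   # 0.5
ft_acc = accuracy(predict_finetuned, test_set)    # 1.0
```

Whatever metric you choose (accuracy, LLM-judge score, RAGAS), the point is that both models are scored by the same function on the same examples.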
03 — Foundation
Data requirements
The biggest mistake: training on too much noisy data instead of curating small, high-quality datasets.
✓ Quality over quantity: 500 curated examples beat 50,000 noisy ones. Spend time on data, not scale.
Principles
Examples per class
- 200–300 per label/task variant
- Binary classification: 300–500 pairs
- Multi-class (5+ classes): 100 per class min
Data curation
- Remove duplicates
- Filter out mislabeled and ambiguous examples
- Manually review random samples
- Stratify by class/difficulty
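The last two bullets — stratified review of random samples — can be done with a few lines of standard-library Python. The `review_sample` helper is illustrative:

```python
import random

def review_sample(rows, per_class=2, seed=0):
    """Draw a small per-class sample of (text, label) rows for manual review."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append(text)
    rng = random.Random(seed)  # fixed seed so reviews are reproducible
    return {label: rng.sample(texts, min(per_class, len(texts)))
            for label, texts in by_label.items()}

sample = review_sample([
    ("Great.", "Positive"), ("Loved it.", "Positive"), ("Superb.", "Positive"),
    ("Bad.", "Negative"), ("Terrible.", "Negative"),
])
```

Reading even two or three examples per class catches most labelling mistakes before they reach training.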
JSONL format (OpenAI chat)
{"messages": [{"role": "user", "content": "Classify: The service was excellent."}, {"role": "assistant", "content": "Positive"}]}
{"messages": [{"role": "user", "content": "Classify: I waited 2 hours."}, {"role": "assistant", "content": "Negative"}]}
Raw text format
The service was excellent. ### Positive
I waited 2 hours. ### Negative
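Producing the chat-format JSONL rows above, with exact duplicates dropped, can be sketched like this (the `to_jsonl_rows` helper is illustrative):

```python
import json

def to_jsonl_rows(pairs):
    """Convert (text, label) pairs to OpenAI chat-format JSONL rows,
    dropping exact duplicates."""
    seen, rows = set(), []
    for text, label in pairs:
        if (text, label) in seen:
            continue  # dedup before training, not after
        seen.add((text, label))
        rows.append(json.dumps({"messages": [
            {"role": "user", "content": f"Classify: {text}"},
            {"role": "assistant", "content": label},
        ]}))
    return rows

rows = to_jsonl_rows([
    ("The service was excellent.", "Positive"),
    ("The service was excellent.", "Positive"),  # duplicate, dropped
    ("I waited 2 hours.", "Negative"),
])
# Write one JSON object per line: '\n'.join(rows) -> train.jsonl
```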
04 — Technique
LoRA and QLoRA: efficient fine-tuning
Parameter-efficient fine-tuning (PEFT) lets you fine-tune large models on small GPUs by updating only a tiny fraction of weights.
How LoRA works
Instead of updating all model weights (billions), LoRA adds small trainable "adapter" matrices to key layers. After training, merge them into the base model or keep separate.
- Trainable params: Typically 0.1–1% of total model size
- Rank (r): Usually 8, 16, or 32. Higher = more capacity but slower
- Memory: ~20GB for 7B model on single GPU
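The trainable-parameter count follows directly from the adapter shapes: a d×k weight gains two low-rank matrices of shapes d×r and r×k, i.e. r·(d + k) extra parameters. A quick back-of-envelope check (the layer size is an example, not a measurement of any particular model):

```python
# LoRA adds B (d x r) and A (r x k) per adapted weight matrix,
# so a d x k layer gains r * (d + k) trainable parameters.
def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

# Example: one 4096 x 4096 projection with rank r = 16
full = 4096 * 4096                     # frozen base weights in the layer
added = lora_params(4096, 4096, 16)    # trainable adapter weights
fraction = added / full                # 0.78% of that layer
```

Doubling the rank doubles the adapter size, which is why r stays small (8–32) in practice.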
QLoRA: LoRA + quantization
Quantize the base model to 4-bit, then add LoRA adapters on top. This fits 13B models on consumer GPUs.
- Memory: ~8GB VRAM for 13B model
- Speed: Slower than LoRA alone (~30% overhead), but worth it for memory savings
- Accuracy: Minimal quality loss vs full fine-tuning
→ In practice: start with QLoRA; if you have spare GPU memory, try LoRA. Both typically offer a better quality-to-compute tradeoff than full fine-tuning.
05 — Implementation
Working code example
Complete QLoRA fine-tuning script using HuggingFace TRL. Prerequisites:
pip install trl peft transformers bitsandbytes datasets
Full script
# QLoRA fine-tuning with HuggingFace TRL
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA config
lora_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules='all-linear',
    lora_dropout=0.05,
    task_type='CAUSAL_LM'
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # e.g. 0.53% of 8B params

# Load training data (JSONL)
dataset = load_dataset('json', data_files='train.jsonl', split='train')

# Train (recent TRL versions take processing_class=; older ones accept tokenizer=)
trainer = SFTTrainer(
    model=model, processing_class=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir='./ft-output',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10
    )
)
trainer.train()
Next: merge & inference
# Option 1: merge LoRA weights into the current model
model = model.merge_and_unload()
model.save_pretrained('./ft-model')
tokenizer.save_pretrained('./ft-model')

# Option 2: attach the saved adapter to a freshly loaded base model.
# Preferable after QLoRA: merging directly into a 4-bit base can degrade
# quality, so reload the base at full precision before merging.
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, './ft-output')
model = model.merge_and_unload()
06 — Next Steps
What to explore next
These four concept pages go deeper on related topics.
1. Deep dive into LoRA, QLoRA, IA³, and other parameter-efficient techniques. When to use each, limitations, and advanced configurations.
4. Collecting, cleaning, and formatting training data. Synthetic data, annotation strategies, and quality control.
07 — Learn More
References
Documentation
- TRL SFTTrainer (docs): Official documentation for supervised fine-tuning with the TRL library. Configuration, loss functions, and integration with PEFT.
- Anthropic Fine-tuning (docs): Claude fine-tuning API documentation. Model versioning, batch processing, and pricing.
Learning Path
Fine-tuning is powerful but often unnecessary. Here's the decision tree and learning sequence:
Prompting (try first) → RAG (for knowledge) → SFT/LoRA (style/format) → DPO (preference alignment) → RLHF (full alignment)
1. Exhaust prompting and RAG first
Fine-tuning is expensive and slow to iterate. If your problem is about knowledge (the model doesn't know something), use RAG. If it's about format or style, few-shot prompting often works. Fine-tune only when both have a ceiling.
2. Start with LoRA on a small model
Fine-tune Llama 3 8B with LoRA using unsloth or trl. LoRA adds trainable low-rank adapters and reduces GPU memory by 3–5x vs. full fine-tuning. A single A100 handles 8B comfortably.
3. Curate data obsessively
500 high-quality examples consistently beat 50,000 noisy ones. Spend your first week on data curation, not training. Look at every example that influences task-critical behaviours.
4. Evaluate with the same rigour as a model release
Build an eval set before training. Measure task accuracy, format compliance, and regression on general capabilities. A fine-tuned model that's worse at everything else is not a success.