FINE-TUNING

Fine-tuning LLMs

Teach a model new behaviour by updating its weights — when prompting alone isn't enough

  • Prompting → fine-tuning: when to make the switch
  • <1% of weights updated with LoRA/QLoRA
  • 500 curated examples > 50k noisy ones: the data-quality rule
Contents
  1. Prompting vs fine-tuning
  2. The fine-tuning pipeline
  3. Data requirements
  4. LoRA and QLoRA
  5. Working code example
  6. What to explore next
  7. References
01 — Foundation

Prompting vs fine-tuning: the decision

Fine-tuning updates a model's weights on your data — teaching it behaviour that prompting cannot reliably produce. Use it when prompting has hit its ceiling.

💡
Golden Rule: Always establish a prompted baseline before fine-tuning. Fine-tuning that does not beat prompting is wasted compute.

When to use each approach

Scenario                          Recommended approach
Model understands the task        Prompt — use few-shot examples in the prompt
Consistent format/tone needed     Prompt — specify it in the instructions
Fewer than ~100 examples          Prompt — in-context learning works
Prompt too long or expensive      Fine-tune — encode the behaviour in weights
Proprietary style or domain       Fine-tune — the model must learn it
Latency critical                  Fine-tune — a smaller fine-tuned model beats a large prompted one
500+ quality examples available   Fine-tune — enough data to fine-tune
02 — Process

The fine-tuning pipeline

A complete fine-tuning workflow from raw data to deployed model.

The full pipeline

1
Collect data: Examples of the behaviour you want to teach. Start with 100–500 curated examples; quality beats quantity.
2
Format JSONL: Convert to OpenAI chat format or raw text. Each row is one training example.
3
Train (QLoRA): Use parameter-efficient fine-tuning to update <1% of model weights. Fits on consumer GPUs.
4
Eval (LLM judge/RAGAS): Compare fine-tuned model to prompted baseline. Measure on held-out test set.
5
Deploy: Merge LoRA weights into base model (or keep separate). Use in production inference.
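Step 2 of the pipeline can be sketched in a few lines of Python. The helper name and tuple input below are illustrative, not from any specific library:

```python
import json

def to_chat_jsonl(examples, path):
    """Write (user_text, assistant_text) pairs as OpenAI-style chat JSONL,
    one training example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for user_text, assistant_text in examples:
            row = {"messages": [
                {"role": "user", "content": user_text},
                {"role": "assistant", "content": assistant_text},
            ]}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

to_chat_jsonl(
    [("Classify: The service was excellent.", "Positive"),
     ("Classify: I waited 2 hours.", "Negative")],
    "train.jsonl",
)
```

The resulting file can be loaded directly with `load_dataset('json', data_files='train.jsonl')` in the training script later on this page.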

03 — Foundation

Data requirements

The biggest mistake: training on too much noisy data instead of curating small, high-quality datasets.

Quality over quantity: 500 curated examples beat 50,000 noisy ones. Spend time on data, not scale.

Principles

Examples per class

  • 200–300 per label/task variant
  • Binary classification: 300–500 pairs
  • Multi-class (5+ classes): 100 per class min

Data curation

  • Remove duplicates
  • Filter out edge cases first
  • Manually review random samples
  • Stratify by class/difficulty
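The deduplication and stratified-review steps above can be sketched as follows. This is a minimal illustration; the `label` field name and sample size are assumptions, not a prescribed schema:

```python
import json
import random

def curate(rows, sample_per_class=5, seed=0):
    """Deduplicate rows, then draw a stratified random sample for manual review."""
    seen, deduped = set(), []
    for row in rows:
        key = json.dumps(row, sort_keys=True)  # exact-duplicate detection
        if key not in seen:
            seen.add(key)
            deduped.append(row)
    # Group by label so the review sample covers every class
    by_label = {}
    for row in deduped:
        by_label.setdefault(row["label"], []).append(row)
    rng = random.Random(seed)
    review = {lbl: rng.sample(rs, min(sample_per_class, len(rs)))
              for lbl, rs in by_label.items()}
    return deduped, review

rows = [{"text": "great", "label": "pos"}] * 3 + [{"text": "slow", "label": "neg"}]
deduped, review = curate(rows)  # 2 unique rows; review covers both classes
```

Exact-match dedup is only the first pass; near-duplicate detection (e.g. embedding similarity) catches paraphrases that this misses.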

JSONL format (OpenAI chat)

{"messages": [{"role": "user", "content": "Classify: The service was excellent."}, {"role": "assistant", "content": "Positive"}]}
{"messages": [{"role": "user", "content": "Classify: I waited 2 hours."}, {"role": "assistant", "content": "Negative"}]}

Raw text format

The service was excellent. ### Positive
I waited 2 hours. ### Negative
04 — Technique

LoRA and QLoRA: efficient fine-tuning

Parameter-efficient fine-tuning (PEFT) lets you fine-tune large models on small GPUs by updating only a tiny fraction of weights.

How LoRA works

Instead of updating all model weights (billions), LoRA adds small trainable "adapter" matrices to key layers. After training, merge them into the base model or keep separate.
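The low-rank update can be shown in a few lines of NumPy. This is a toy sketch of the mechanism, not a training implementation; the dimensions are arbitrary:

```python
import numpy as np

d, r, alpha = 1024, 16, 32               # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))          # frozen base weight (not trained)
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-initialised

def lora_forward(x):
    # Base output plus scaled low-rank update: Wx + (alpha/r) * B(Ax)
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d * d
lora_params = A.size + B.size            # 2 * r * d
print(f"trainable fraction: {lora_params / full_params:.2%}")
# 3.12% at d=1024; well under 1% at real LLM widths
```

Because B starts at zero, the adapter contributes nothing at initialisation, so the model's behaviour is unchanged until training moves A and B.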

QLoRA: LoRA + quantization

Quantize the base model to 4-bit, then add LoRA adapters on top. This fits 13B models on consumer GPUs.
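The arithmetic behind that claim is a one-liner. This back-of-envelope sketch counts base weights only; real usage adds activations, the KV cache, quantization overhead, and optimizer state for the LoRA parameters:

```python
def qlora_weight_memory_gb(n_params_billion, bits=4):
    """Rough memory for the quantized base weights alone."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

print(qlora_weight_memory_gb(13))            # 13B at 4-bit -> 6.5 GB of weights
print(qlora_weight_memory_gb(13, bits=16))   # vs 26.0 GB at fp16/bf16
```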

In practice: Start with QLoRA. If you have spare GPU memory, try LoRA. Both typically offer a better quality/compute tradeoff than full fine-tuning.
05 — Implementation

Working code example

Complete QLoRA fine-tuning script using HuggingFace TRL. Prerequisites:

pip install trl peft transformers bitsandbytes datasets

Full script

# QLoRA fine-tuning with HuggingFace TRL
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA config
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules='all-linear',
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # e.g. 0.53% of 8B params

# Load training data (JSONL)
dataset = load_dataset('json', data_files='train.jsonl', split='train')

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        output_dir='./ft-output',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()

Next: merge & inference

# Merge LoRA weights into the base model
model = model.merge_and_unload()
model.save_pretrained('./ft-model')
tokenizer.save_pretrained('./ft-model')

# Or load the adapter separately at inference time
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, './ft-output')
model = model.merge_and_unload()
06 — Next Steps

What to explore next

These four concept pages go deeper on related topics.

1

PEFT Methods

Deep dive into LoRA, QLoRA, IA³, and other parameter-efficient techniques. When to use each, limitations, and advanced configurations.

2

Alignment & RLHF

Fine-tune for safety and alignment. Covers RLHF, DPO, ORPO, and other post-training techniques for steering model behaviour.

3

Training Tools

Ecosystem of fine-tuning frameworks: TRL, Unsloth, Axolotl, LLaMA-Factory. Benchmarks and when to use each.

4

Data Preparation

Collecting, cleaning, and formatting training data. Synthetic data, annotation strategies, and quality control.

07 — Learn More

References

Learning Path

Fine-tuning is powerful but often unnecessary. Here's the decision tree and learning sequence:

Prompting (try first) → RAG (for knowledge) → SFT/LoRA (style & format) → DPO (preference alignment) → RLHF (full alignment)
1

Exhaust prompting and RAG first

Fine-tuning is expensive and slow to iterate. If your problem is about knowledge (the model doesn't know something), use RAG. If it's about format or style, few-shot prompting often works. Fine-tune only when both have a ceiling.

2

Start with LoRA on a small model

Fine-tune Llama 3 8B with LoRA using unsloth or trl. LoRA adds trainable low-rank adapters and reduces GPU memory by 3–5x vs. full fine-tuning. A single A100 handles 8B comfortably.

3

Curate data obsessively

500 high-quality examples consistently beat 50,000 noisy ones. Spend your first week on data curation, not training. Look at every example that influences task-critical behaviours.

4

Evaluate with the same rigour as a model release

Build an eval set before training. Measure task accuracy, format compliance, and regression on general capabilities. A fine-tuned model that's worse at everything else is not a success.
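That rigour can start small. Here is a toy format-compliance check of the kind worth running on the held-out set before and after fine-tuning; the metric and label set are illustrative assumptions, not a standard benchmark:

```python
def format_compliance(predictions, allowed=("Positive", "Negative")):
    """Fraction of outputs that are exactly an allowed label — one cheap
    regression check to compare the fine-tuned model against the baseline."""
    return sum(p.strip() in allowed for p in predictions) / len(predictions)

baseline = ["Positive", "It seems positive overall.", "Negative"]
finetuned = ["Positive", "Negative", "Negative"]
print(format_compliance(baseline))   # 2 of 3 exact matches
print(format_compliance(finetuned))  # 1.0
```

Pair a check like this with task accuracy and a general-capability eval so that gains in format don't mask regressions elsewhere.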