01 — Foundation
Prompting vs fine-tuning: the decision
Fine-tuning updates a model's weights on your data — teaching it behaviour that prompting cannot reliably produce. Use it when prompting has hit its ceiling.
💡 Golden Rule: Always establish a prompted baseline before fine-tuning. Fine-tuning that does not beat prompting is wasted compute.
When to use each approach
| Scenario | Prompting is enough | Fine-tune |
| --- | --- | --- |
| Model understands task | ✓ Use few-shot examples in prompt | |
| Consistent format/tone | ✓ Specify in instructions | |
| Fewer than ~100 examples | ✓ In-context learning works | |
| Prompt too long/expensive | | ✓ Encode in weights |
| Proprietary style/domain | | ✓ Model must learn it |
| Latency critical | | ✓ Smaller FT model beats large prompted one |
| 500+ quality examples | | ✓ Enough data to fine-tune |
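The prompted baseline from the table's left column can be as simple as a few-shot chat prompt. A minimal sketch in OpenAI chat format — the labelled examples and the `build_baseline_messages` helper are illustrative, not from any library:

```python
# Few-shot prompted baseline for a sentiment-classification task.
# The labelled examples below are placeholder data.
FEW_SHOT = [
    ("The service was excellent.", "Positive"),
    ("I waited 2 hours.", "Negative"),
]

def build_baseline_messages(text: str) -> list[dict]:
    """Build a few-shot classification prompt in OpenAI chat format."""
    messages = [{"role": "system",
                 "content": "Classify the sentiment as Positive or Negative."}]
    for example, label in FEW_SHOT:
        messages.append({"role": "user", "content": f"Classify: {example}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": f"Classify: {text}"})
    return messages

msgs = build_baseline_messages("Great food, slow service.")
```

If a handful of in-context examples like these already hit your accuracy target, there is nothing to fine-tune.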
02 — Process
The fine-tuning pipeline
A complete fine-tuning workflow from raw data to deployed model.
The full pipeline
1. Collect data: Examples of the behaviour you want to teach. Start with 100–500 curated examples; quality beats quantity.
2. Format JSONL: Convert to OpenAI chat format or raw text. Each row is one training example.
3. Train (QLoRA): Use parameter-efficient fine-tuning to update <1% of model weights. Fits on consumer GPUs.
4. Eval (LLM judge/RAGAS): Compare the fine-tuned model to the prompted baseline. Measure on a held-out test set.
5. Deploy: Merge LoRA weights into the base model (or keep them separate). Use in production inference.
Key metrics
- Training loss: Should decrease smoothly; stop early if it plateaus.
- Validation accuracy: Compare to prompted baseline on same test set.
- Inference latency: LoRA adds minimal overhead (~2–5%).
- Compute cost: QLoRA on 8B model ≈ 24 GB VRAM, trains in hours.
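The baseline comparison in step 4 can be sketched as a plain accuracy check over the same held-out set. `predict_baseline` and `predict_finetuned` below are placeholder stand-ins for your real inference calls:

```python
# Compare a fine-tuned model to the prompted baseline on the same
# held-out test set. The predictors here are toy stand-ins.
def accuracy(predict, test_set):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(predict(x) == y for x, y in test_set)
    return correct / len(test_set)

test_set = [("I loved it.", "Positive"), ("Awful.", "Negative")]
predict_baseline = lambda x: "Positive"                              # placeholder
predict_finetuned = lambda x: "Positive" if "loved" in x else "Negative"  # placeholder

base_acc = accuracy(predict_baseline, test_set)   # 0.5
ft_acc = accuracy(predict_finetuned, test_set)    # 1.0
```

Whatever metric you choose (accuracy, LLM-judge score, RAGAS), the point is that both models are scored by the same function on the same examples.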
03 — Foundation
Data requirements
The biggest mistake: training on too much noisy data instead of curating small, high-quality datasets.
✓ Quality over quantity: 500 curated examples beat 50,000 noisy ones. Spend time on data, not scale.
Principles
Examples per class
- 200–300 per label/task variant
- Binary classification: 300–500 pairs
- Multi-class (5+ classes): 100 per class min
Data curation
- Remove duplicates
- Filter out mislabeled and ambiguous examples
- Manually review random samples
- Stratify by class/difficulty
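The last two bullets — stratified review of random samples — can be done with a few lines of standard-library Python. The `review_sample` helper is illustrative:

```python
import random

def review_sample(rows, per_class=2, seed=0):
    """Draw a small per-class sample of (text, label) rows for manual review."""
    by_label = {}
    for text, label in rows:
        by_label.setdefault(label, []).append(text)
    rng = random.Random(seed)  # fixed seed so reviews are reproducible
    return {label: rng.sample(texts, min(per_class, len(texts)))
            for label, texts in by_label.items()}

sample = review_sample([
    ("Great.", "Positive"), ("Loved it.", "Positive"), ("Superb.", "Positive"),
    ("Bad.", "Negative"), ("Terrible.", "Negative"),
])
```

Reading even two or three examples per class catches most labelling mistakes before they reach training.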
JSONL format (OpenAI chat)
{"messages": [{"role": "user", "content": "Classify: The service was excellent."}, {"role": "assistant", "content": "Positive"}]}
{"messages": [{"role": "user", "content": "Classify: I waited 2 hours."}, {"role": "assistant", "content": "Negative"}]}
Raw text format
The service was excellent. ### Positive
I waited 2 hours. ### Negative
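Producing the chat-format JSONL rows above, with exact duplicates dropped, can be sketched like this (the `to_jsonl_rows` helper is illustrative):

```python
import json

def to_jsonl_rows(pairs):
    """Convert (text, label) pairs to OpenAI chat-format JSONL rows,
    dropping exact duplicates."""
    seen, rows = set(), []
    for text, label in pairs:
        if (text, label) in seen:
            continue  # dedup before training, not after
        seen.add((text, label))
        rows.append(json.dumps({"messages": [
            {"role": "user", "content": f"Classify: {text}"},
            {"role": "assistant", "content": label},
        ]}))
    return rows

rows = to_jsonl_rows([
    ("The service was excellent.", "Positive"),
    ("The service was excellent.", "Positive"),  # duplicate, dropped
    ("I waited 2 hours.", "Negative"),
])
# Write one JSON object per line: '\n'.join(rows) -> train.jsonl
```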
04 — Technique
LoRA and QLoRA: efficient fine-tuning
Parameter-efficient fine-tuning (PEFT) lets you fine-tune large models on small GPUs by updating only a tiny fraction of weights.
How LoRA works
Instead of updating all model weights (billions), LoRA adds small trainable "adapter" matrices to key layers. After training, merge them into the base model or keep separate.
- Trainable params: Typically 0.1–1% of total model size
- Rank (r): Usually 8, 16, or 32. Higher = more capacity but slower
- Memory: ~20GB for 7B model on single GPU
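The trainable-parameter count follows directly from the adapter shapes: a d×k weight gains two low-rank matrices of shapes d×r and r×k, i.e. r·(d + k) extra parameters. A quick back-of-envelope check (the layer size is an example, not a measurement of any particular model):

```python
# LoRA adds B (d x r) and A (r x k) per adapted weight matrix,
# so a d x k layer gains r * (d + k) trainable parameters.
def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

# Example: one 4096 x 4096 projection with rank r = 16
full = 4096 * 4096                     # frozen base weights in the layer
added = lora_params(4096, 4096, 16)    # trainable adapter weights
fraction = added / full                # 0.78% of that layer
```

Doubling the rank doubles the adapter size, which is why r stays small (8–32) in practice.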
QLoRA: LoRA + quantization
Quantize the base model to 4-bit, then add LoRA adapters on top. This fits 13B models on consumer GPUs.
- Memory: ~8GB VRAM for 13B model
- Speed: Slower than LoRA alone (~30% overhead), but worth it for memory savings
- Accuracy: Minimal quality loss vs full fine-tuning
→ In practice: start with QLoRA; if you have spare GPU memory, try LoRA. Both typically offer a better quality-to-compute tradeoff than full fine-tuning.
05 — Implementation
Working code example
Complete QLoRA fine-tuning script using HuggingFace TRL. Prerequisites:
pip install trl peft transformers bitsandbytes datasets
Full script
# QLoRA fine-tuning with HuggingFace TRL
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
import torch

model_id = 'meta-llama/Meta-Llama-3-8B-Instruct'

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA config
lora_cfg = LoraConfig(
    r=16, lora_alpha=32,
    target_modules='all-linear',
    lora_dropout=0.05,
    task_type='CAUSAL_LM'
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # e.g. 0.53% of 8B params

# Load training data (JSONL)
dataset = load_dataset('json', data_files='train.jsonl', split='train')

# Train (recent TRL versions take processing_class=; older ones accept tokenizer=)
trainer = SFTTrainer(
    model=model, processing_class=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        output_dir='./ft-output',
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10
    )
)
trainer.train()
Next: merge & inference
# Option 1: merge LoRA weights into the current model
model = model.merge_and_unload()
model.save_pretrained('./ft-model')
tokenizer.save_pretrained('./ft-model')

# Option 2: attach the saved adapter to a freshly loaded base model.
# Preferable after QLoRA: merging directly into a 4-bit base can degrade
# quality, so reload the base at full precision before merging.
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(model_id)
model = PeftModel.from_pretrained(base_model, './ft-output')
model = model.merge_and_unload()
06 — Next Steps
What to explore next
These four concept pages go deeper on related topics.
1. Deep dive into LoRA, QLoRA, IA³, and other parameter-efficient techniques. When to use each, limitations, and advanced configurations.
4. Collecting, cleaning, and formatting training data. Synthetic data, annotation strategies, and quality control.
07 — Learn More
References
Documentation
- TRL SFTTrainer (docs): Official documentation for supervised fine-tuning with the TRL library. Configuration, loss functions, and integration with PEFT.
- Anthropic Fine-tuning (docs): Claude fine-tuning API documentation. Model versioning, batch processing, and pricing.
Learning Path
Fine-tuning is powerful but often unnecessary. Here's the decision tree and learning sequence:
Prompting (try first) → RAG (for knowledge) → SFT/LoRA (style/format) → DPO (preference alignment) → RLHF (full alignment)
1. Exhaust prompting and RAG first
Fine-tuning is expensive and slow to iterate. If your problem is about knowledge (the model doesn't know something), use RAG. If it's about format or style, few-shot prompting often works. Fine-tune only when both have a ceiling.
2. Start with LoRA on a small model
Fine-tune Llama 3 8B with LoRA using unsloth or trl. LoRA adds trainable low-rank adapters and reduces GPU memory by 3–5x vs. full fine-tuning. A single A100 handles 8B comfortably.
3. Curate data obsessively
500 high-quality examples consistently beat 50,000 noisy ones. Spend your first week on data curation, not training. Look at every example that influences task-critical behaviours.
4. Evaluate with the same rigour as a model release
Build an eval set before training. Measure task accuracy, format compliance, and regression on general capabilities. A fine-tuned model that's worse at everything else is not a success.