Training Tools

Unsloth

2–5× faster and 60% less memory than standard QLoRA: hand-written Triton kernels replace HuggingFace's attention and RoPE implementations for dramatically better training throughput on consumer GPUs.


SECTION 01

Why Unsloth is faster

Standard HuggingFace/PyTorch training computes attention and RoPE through autograd: every operation is tracked for the backward pass, with intermediate activations stored in GPU memory. Unsloth replaces these with hand-written Triton kernels that (1) fuse multiple operations into a single kernel (fewer memory round-trips), (2) use tiling to reduce HBM bandwidth, and (3) recompute some activations in the backward pass instead of storing them.
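As a toy illustration of point (1): fusing k elementwise operations cuts global-memory traffic from roughly 2k passes over the tensor to 2. The cost model below is a sketch for intuition only, not Unsloth's actual kernel code:

```python
# Toy cost model for kernel fusion: each unfused elementwise op reads its
# input from HBM and writes its output back, while one fused kernel reads
# the input once, keeps intermediates in registers, and writes once.

def unfused_traffic(n_elements, n_ops):
    return n_ops * 2 * n_elements   # 2 passes (read + write) per op

def fused_traffic(n_elements, n_ops):
    return 2 * n_elements           # independent of how many ops are fused

n = 4096 * 4096   # one activation tensor of a 4096x4096 layer
print(unfused_traffic(n, 3) / fused_traffic(n, 3))  # -> 3.0
```

Three fused ops means one-third the memory traffic, which is why fusion helps most for memory-bound operations like RoPE and layer norms.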

The result: Unsloth's attention kernel is ~2× faster than Flash Attention 2 on single-GPU training, and uses 60% less VRAM than equivalent QLoRA training with standard TRL. This means a model that would require an A100 80GB with standard tools trains on an RTX 4090 24GB with Unsloth.
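A rough back-of-envelope for where that VRAM goes on an 8B QLoRA run; every component size below is an illustrative assumption, not a measured value:

```python
# Rough VRAM budget for QLoRA on an 8B model -- every number here is an
# illustrative assumption, not a measurement.
params = 8e9
weights_gb   = params * 0.5 / 1e9      # 4-bit NF4 base weights: ~0.5 byte/param
lora_params  = 42e6                    # assumed adapter size around rank 16
lora_gb      = lora_params * 2 / 1e9   # bf16 adapter weights
optimizer_gb = lora_params * 8 / 1e9   # Adam: two fp32 moments per trainable param

static_gb = weights_gb + lora_gb + optimizer_gb
print(round(static_gb, 1))  # ~4.4 GB before activations
# Activations scale with batch_size x seq_len; they are the term that
# gradient checkpointing keeps small enough to fit the remaining headroom.
```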

Unsloth is actively maintained and supports Llama 3.x, Qwen 2.5, Mistral, Phi-3, Gemma 2, and DeepSeek. The free tier runs on any NVIDIA GPU; the Pro tier adds multi-GPU and AMD support.

SECTION 02

Installation

# CUDA 12.1+ required
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

# For Google Colab (CUDA 12.2):
# pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# For a local install with CUDA 12.1:
# pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"

# Verify the installation (Python):
import unsloth
print(unsloth.__version__)

# In a Jupyter notebook, check the CUDA toolkit version first:
import subprocess
result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
print(result.stdout)  # should show CUDA 12.x

SECTION 03

Fine-tuning with Unsloth

from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
import torch

# Load model with Unsloth's optimised loader
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,              # auto-detect: bf16 on Ampere+
    load_in_4bit=True,       # QLoRA
)

# Add LoRA adapters — Unsloth's version is drop-in compatible with PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,          # Unsloth recommends 0 for speed
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth's optimised GC
    random_state=42,
)
model.print_trainable_parameters()  # prints the counts itself (returns None)

# Train — same as standard TRL
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="./unsloth-output",
        max_seq_length=2048,
    ),
)
trainer.train()
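One detail worth noting about the config above: the effective batch size is per_device_train_batch_size × gradient_accumulation_steps. A quick sanity check (the dataset size below is a hypothetical round number, not alpaca-cleaned's real size):

```python
import math

# Effective batch size implied by the SFTConfig above, plus a step-count
# sanity check with a hypothetical dataset size.
per_device = 4
grad_accum = 4
effective_batch = per_device * grad_accum    # sequences per optimizer step

n_examples = 50_000                          # hypothetical dataset size
steps_per_epoch = math.ceil(n_examples / effective_batch)
print(effective_batch, steps_per_epoch)      # -> 16 3125
```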

SECTION 04

Exporting to GGUF

# Unsloth has built-in GGUF export — much simpler than manual conversion

# Option 1: Save merged model then export to GGUF
model.save_pretrained_merged("./merged-model", tokenizer,
    save_method="merged_16bit")  # or "merged_4bit_forced", "lora"

# Option 2: Save directly as GGUF (various quantisation levels)
model.save_pretrained_gguf("./gguf-model", tokenizer,
    quantization_method="q4_k_m")   # q4_k_m, q8_0, f16, q5_k_m, etc.

# Option 3: Push directly to HuggingFace Hub as GGUF
model.push_to_hub_gguf(
    "your-username/llama3-finetuned-gguf",
    tokenizer,
    quantization_method=["q4_k_m", "q8_0"],   # upload multiple quants
    token="hf_...",
)

# Then use with Ollama:
# ollama create my-model -f Modelfile
# where Modelfile contains: FROM ./gguf-model/model-Q4_K_M.gguf

SECTION 05

Benchmarks vs standard TRL

Measured on Llama 3.1 8B, single RTX 4090 (24GB), 2048 tokens, batch=4, rank=16:

benchmarks = {
    "method": ["Standard QLoRA (TRL)", "Flash Attention 2 + TRL", "Unsloth"],
    "tokens_per_second": [1420, 2080, 3890],
    "vram_gb":           [22.1, 20.8, 9.4],
    "time_per_epoch_min": [48,   33,   17],
}
# Unsloth: 2.7x faster than standard, 60% less VRAM

# On A100 80GB (Llama 3.1 70B, rank=64):
# Standard QLoRA:  210 tok/s, 72 GB VRAM
# Unsloth:         580 tok/s, 42 GB VRAM (fits in one A100, no multi-GPU needed)

# Speedup varies by:
# - GPU architecture: Ampere (30xx, A100) > Turing (20xx, T4) for Unsloth
# - Sequence length: longer sequences benefit more from Unsloth's attention tiling
# - Batch size: larger batches reduce Unsloth's relative advantage slightly
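The headline figures quoted above can be recomputed directly from the benchmark table:

```python
# Recompute the headline numbers from the table above.
tokens_per_second = [1420, 2080, 3890]   # standard TRL, FA2 + TRL, Unsloth
vram_gb = [22.1, 20.8, 9.4]

speedup = tokens_per_second[2] / tokens_per_second[0]
vram_saving = 1 - vram_gb[2] / vram_gb[0]
print(f"{speedup:.1f}x faster, {vram_saving:.0%} less VRAM")  # -> 2.7x faster, 57% less VRAM
```

57% rounds to the ~60% figure quoted in the summary.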

SECTION 06

Supported models

# As of early 2025, Unsloth supports:
SUPPORTED = [
    "meta-llama/Llama-3.1-{8B,70B}-Instruct",
    "meta-llama/Llama-3.2-{1B,3B}-Instruct",
    "Qwen/Qwen2.5-{7B,14B,32B,72B}-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "microsoft/Phi-3-{mini,medium}-128k-instruct",
    "google/gemma-2-{2b,9b,27b}-it",
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-{7B,32B}",
    "unsloth/mistral-7b-v0.3-bnb-4bit",  # pre-quantised for faster loading
]

# Load any supported model:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen2.5-7B-Instruct-bnb-4bit",   # unsloth's pre-quantised versions
    max_seq_length=8192,
    load_in_4bit=True,
)

# Check if a model is supported:
# https://github.com/unslothai/unsloth/wiki#-supported-models
# If not in the list, use standard TRL instead — Unsloth will error on unsupported architectures
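The brace patterns in the SUPPORTED list above are shorthand, not valid HuggingFace IDs. A small hypothetical helper (not part of Unsloth) can expand them into concrete model names:

```python
import itertools
import re

# Hypothetical helper (not part of Unsloth) that expands the "{8B,70B}"
# shorthand used in the SUPPORTED list into concrete model IDs.
def expand(pattern: str) -> list[str]:
    groups = re.findall(r"\{([^}]*)\}", pattern)   # comma-separated options
    if not groups:
        return [pattern]
    template = re.sub(r"\{[^}]*\}", "{}", pattern)
    choices = [g.split(",") for g in groups]
    return [template.format(*combo) for combo in itertools.product(*choices)]

print(expand("meta-llama/Llama-3.1-{8B,70B}-Instruct"))
# -> ['meta-llama/Llama-3.1-8B-Instruct', 'meta-llama/Llama-3.1-70B-Instruct']
```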

SECTION 07

Gotchas

Unsloth requires CUDA — no MPS (Apple Silicon) or CPU training. Unsloth's kernels are CUDA-specific. If you're on a Mac or a CPU-only machine, use standard TRL + PEFT instead. For Apple Silicon, consider MLX-LM which has native optimisations for M-series chips.

use_gradient_checkpointing="unsloth" not True. Unsloth has its own gradient checkpointing implementation that's faster than PyTorch's. Always pass the string "unsloth" rather than the boolean True — using True enables standard PyTorch gradient checkpointing which is slower and may cause errors with Unsloth's kernels.

lora_dropout=0 is intentional in Unsloth benchmarks. Unsloth's benchmarks are run with dropout disabled for maximum speed. For production fine-tuning where you risk overfitting (small dataset, many epochs), set lora_dropout=0.05 instead. The speed difference is modest.

Unsloth Speed and Memory Optimizations

Unsloth dramatically accelerates fine-tuning of open-weight models through hand-written Triton kernels for the attention and MLP operations that dominate training time. By replacing HuggingFace Transformers' standard PyTorch implementations with optimized kernels, Unsloth achieves 2–5× faster training with 60–80% less VRAM usage, making fine-tuning of large models accessible on consumer-grade hardware.

Optimization               Technique                Speedup           Memory saving
Custom attention kernels   Fused Triton kernels     2–3×              30–40%
Gradient checkpointing     Recompute activations    Slight slowdown   60%+
4-bit model loading        NF4 + double quant       Neutral           75%
LoRA rank optimization     Smart rank selection     Neutral           5–15%
Batch packing              No padding waste         Up to 2×          Neutral

Unsloth's batch packing feature eliminates the token padding waste that occurs when training examples have variable lengths. Standard batching pads short sequences to the length of the longest sequence in the batch, wasting compute and memory on padding tokens that contribute nothing to learning. Batch packing instead concatenates multiple short sequences into a single packed sequence that fills the context window, separated by end-of-sequence tokens. This can double effective throughput for datasets with high length variance.
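A minimal sketch of the packing idea (greedy first-fit with an assumed EOS token id; real packers also build block-diagonal attention masks so packed sequences cannot attend across the EOS boundaries, and handle sequences longer than the context window):

```python
# Greedy first-fit sequence packing -- a simplified sketch, assuming EOS
# token id 2 and ignoring sequences longer than max_len. Real packers also
# mask attention so packed sequences cannot attend across EOS boundaries.
def pack(sequences, max_len=16, eos=2):
    rows, current = [], []
    for seq in sequences:
        if current and len(current) + len(seq) + 1 > max_len:
            rows.append(current)   # current row is full; start a new one
            current = []
        current = current + seq + [eos]
    if current:
        rows.append(current)
    return rows

seqs = [[1] * 5, [1] * 4, [1] * 3, [1] * 9]
print(len(pack(seqs)))  # -> 2 packed rows instead of 4 padded ones
```

With standard batching, these four examples would each be padded to length 9 (36 tokens); packed, they occupy 25 tokens across two rows.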

Unsloth supports direct export to GGUF format after fine-tuning, enabling a seamless pipeline from training to local deployment with llama.cpp or Ollama. The export process applies quantization during conversion, eliminating a separate quantization step. Models exported in Q4_K_M format from Unsloth are immediately runnable with Ollama, making the path from custom fine-tuned model to local deployment significantly shorter than the standard HuggingFace → GPTQ/GGUF conversion pipeline.

Unsloth's native support for the Llama, Mistral, and Qwen model families covers the architectures most commonly fine-tuned for production applications. The custom Triton kernels are model-family-specific, meaning that adding support for a new architecture requires writing new kernels. For models that are not natively supported, Unsloth errors rather than silently falling back, so train those with standard HuggingFace Transformers + PEFT, which still offers 4-bit quantization and memory optimizations but not the kernel speedups. Checking the Unsloth compatibility matrix before selecting a base model for fine-tuning avoids discovering an unsupported architecture mid-project.

Unsloth's chat template support handles the specific conversation format requirements of different model families during supervised fine-tuning. Applying the wrong chat template (using Llama 3's format for a Mistral model, or vice versa) is a silent error: the loss is well-formed, but the model learns to generate responses around the wrong role tokens, causing inference failures that manifest as malformed outputs rather than obvious errors. Unsloth's get_chat_template utility selects the correct template from the model name and applies it consistently during dataset preparation.
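To make the mismatch concrete, here are simplified (not byte-exact) renderings of the two formats. Training with one and serving with the other is exactly the silent failure described above:

```python
# Simplified, not byte-exact, renderings of two chat formats.
def llama3_style(user_msg: str) -> str:
    # Llama 3 wraps each turn in role-header special tokens.
    return ("<|start_header_id|>user<|end_header_id|>\n\n"
            f"{user_msg}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n")

def mistral_style(user_msg: str) -> str:
    # Mistral wraps the user turn in [INST] ... [/INST] instead.
    return f"[INST] {user_msg} [/INST]"

print(llama3_style("Hello"))
print(mistral_style("Hello"))
```

A Mistral model fine-tuned on the Llama 3 rendering would learn to emit header tokens its own tokenizer treats as ordinary text.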

Evaluation during Unsloth fine-tuning uses the standard HuggingFace Trainer evaluation loop, supporting evaluation datasets, metric computation callbacks, and checkpoint saving based on evaluation loss. For generation tasks where token-level loss is a poor proxy for output quality — instruction following, creative writing, code generation — supplementing the default evaluation with a generation quality callback that samples model outputs and scores them against a reference set provides a more meaningful early stopping signal and checkpoint selection criterion.

Unsloth's memory efficiency improvements compound with LoRA to enable fine-tuning configurations that would be impossible with standard tooling. Training a LoRA adapter for a 70B model with Unsloth on a single 80GB A100 is feasible; the same configuration with standard HuggingFace Trainer requires at least 160GB of GPU memory. This 2× memory reduction is not from a single optimization but from the combination of 4-bit base model loading, custom attention kernels that reduce activation memory, gradient checkpointing, and 8-bit optimizer states all working together.
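The arithmetic behind the single-A100 claim can be sketched as follows; every per-component figure is a rough assumption for illustration, not a measurement:

```python
# Why one 80 GB A100 suffices with Unsloth but not with a standard 16-bit
# LoRA setup -- all per-component figures are rough assumptions.
params = 70e9

standard_weights_gb = params * 2 / 1e9    # 16-bit base weights: 140 GB alone
unsloth_weights_gb  = params * 0.5 / 1e9  # 4-bit NF4 base weights: 35 GB
overhead_gb         = 10                  # assumed adapter + optimizer + activations

print(standard_weights_gb > 80)               # -> True (overflows one A100 before training starts)
print(unsloth_weights_gb + overhead_gb < 80)  # -> True (fits with headroom)
```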