YAML-configured fine-tuning wrapper for HuggingFace models. Define your dataset, PEFT method, and training settings in one config file. Great for rapid experimentation without writing training loop code.
Axolotl is an open-source fine-tuning framework that wraps HuggingFace Transformers with a clean YAML configuration interface. Instead of writing training loop code, you define everything (model, dataset, PEFT config, optimiser, scheduler) in a single YAML file and run `accelerate launch -m axolotl.cli.train config.yaml`. It supports LoRA, QLoRA, IA³, prefix tuning, and full fine-tuning; most popular dataset formats; and distributed training via FSDP and DeepSpeed. It's particularly popular for quickly fine-tuning Llama and Mistral models.
```yaml
# config.yaml - fine-tune Llama 3 8B with QLoRA on a custom dataset
base_model: meta-llama/Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

datasets:
  - path: my_dataset.jsonl      # local file or HuggingFace dataset name
    type: alpaca                # dataset format (see below)
dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./fine-tuned-llama3

sequence_len: 4096
sample_packing: true            # pack short samples to fill context window

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true        # apply LoRA to all linear layers

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit       # 8-bit Adam for memory savings
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false          # only compute loss on assistant turns
group_by_length: true
bf16: true

logging_steps: 10
save_strategy: epoch
```
Axolotl supports multiple dataset formats via the `type` field:

- `alpaca`: `{"instruction": "...", "input": "...", "output": "..."}` fields
- `sharegpt`: `{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}`
- `input_output`: `{"input": "...", "output": "..."}`, a minimal format

```python
# Convert your data to alpaca format
import json

data = [
    {"instruction": "Summarise this article.", "input": "...", "output": "..."},
]
with open("my_dataset.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")
```
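For multi-turn data, the same approach writes sharegpt-style records. A sketch, using the `conversations` structure shown above (the dialogue content and file name are made up):

```python
import json

# Multi-turn chat data destined for sharegpt format: each record is a list of
# "conversations" entries, alternating human and gpt turns.
dialogues = [
    [
        ("human", "What is LoRA?"),
        ("gpt", "A parameter-efficient fine-tuning method."),
        ("human", "Why use it?"),
        ("gpt", "It trains far fewer parameters than full fine-tuning."),
    ],
]

with open("my_chat_dataset.jsonl", "w") as f:
    for turns in dialogues:
        record = {"conversations": [{"from": who, "value": text} for who, text in turns]}
        f.write(json.dumps(record) + "\n")
```

Point `datasets[].path` at this file and set `type: sharegpt` to use it.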
```bash
# Install
pip install "axolotl[flash-attn,deepspeed]"

# Download a starter config
wget https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/llama-3/qlora.yaml

# Edit the config (model, dataset, output_dir), then train:
accelerate launch -m axolotl.cli.train qlora.yaml
```
```bash
# Monitor with W&B (add to config):
#   wandb_project: my-finetune
#   wandb_run_id: run-001

# After training, merge LoRA weights into base model:
python -m axolotl.cli.merge_lora qlora.yaml \
    --lora_model_dir ./fine-tuned-llama3/checkpoint-500
# Merged model saved to ./fine-tuned-llama3/merged
```
```yaml
# Add to config for multi-GPU with FSDP:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

# Or use DeepSpeed ZeRO-3:
deepspeed: deepspeed_configs/zero3.json
```

```bash
# Run on 4 GPUs:
accelerate launch --num_processes 4 -m axolotl.cli.train config.yaml
```
Load the trained adapter (or the merged model) for inference:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-llama3/checkpoint-500")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

# Or use the merged model (no PeftModel needed):
# model = AutoModelForCausalLM.from_pretrained("./fine-tuned-llama3/merged", ...)

messages = [{"role": "user", "content": "Summarise this article: ..."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
A few gotchas:

- `sample_packing` requires the `attention_mask` to be built correctly; disable it if you see unexpected outputs.
- If you set `train_on_inputs: true` by mistake, the model learns to predict the instruction too, which typically hurts performance.
- The saved adapter loads with plain `PeftModel.from_pretrained()`: no Axolotl dependency at inference time.

Axolotl's YAML configuration system covers all aspects of the fine-tuning pipeline in a single file, from data loading and model loading through training hyperparameters and evaluation settings. Understanding the interaction between key configuration fields prevents common training failures: choosing `gradient_accumulation_steps` × `micro_batch_size` (× number of GPUs) to produce the intended effective batch size, keeping `sequence_len` within the model's maximum context, and configuring the appropriate special tokens for the dataset format in use.
| Config field | Purpose | Common values |
|---|---|---|
| `adapter` | PEFT method | `qlora`, `lora`, unset (full fine-tune) |
| `sequence_len` | Max token length per example | 2048, 4096, 8192 |
| `micro_batch_size` | Per-GPU batch size | 1–8 (depends on VRAM) |
| `gradient_accumulation_steps` | Effective batch multiplier | 4–32 |
| `lora_r` | LoRA rank | 8–128 |
| `num_epochs` | Training epochs | 1–5 for instruction tuning |
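The batch-size interaction is plain arithmetic; a quick sketch using the values from the example config and the 4-GPU launch above:

```python
# Effective batch size = per-GPU micro batch x gradient accumulation x number of GPUs.
micro_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 4

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32 samples contribute to each optimizer step
```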
Axolotl's multi-pack sequence packing efficiently fills each training sequence to the configured `sequence_len` by concatenating multiple short examples separated by EOS tokens. This increases GPU utilization by eliminating padding waste: without packing, short instruction-following examples padded to 2048 tokens waste 80–95% of each sequence's compute. Packing requires setting the correct EOS token to prevent cross-example attention leakage, which Axolotl handles automatically when the tokenizer's EOS token is configured correctly.
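The core idea can be illustrated with a toy greedy packer (this is a simplification, not Axolotl's actual implementation; the token IDs and EOS value are made up, and a real packer also builds per-segment attention masks and position IDs):

```python
EOS = 0  # hypothetical EOS token id

def pack_examples(examples, sequence_len):
    """Greedily concatenate tokenized examples, each terminated by EOS,
    into sequences of at most sequence_len tokens."""
    packed, current = [], []
    for tokens in examples:
        tokens = tokens + [EOS]
        if len(current) + len(tokens) > sequence_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_examples(examples, sequence_len=8))
# [[1, 2, 3, 0, 4, 5, 0], [6, 7, 8, 9, 0, 10, 0]]
```

Four padded sequences collapse into two full ones, which is exactly where the compute savings come from.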
Axolotl's configuration flexibility enables sophisticated multi-stage fine-tuning strategies where each stage optimizes for a different objective. A common pattern: stage 1 performs supervised fine-tuning (SFT) on high-quality curated data to teach task-specific behavior, stage 2 applies preference learning (DPO or ORPO) on a smaller preference dataset to align outputs with quality criteria, and stage 3 performs continued pretraining on domain-specific unlabeled text to refresh general knowledge without catastrophic forgetting. Each stage uses different datasets, batch sizes, and learning rate schedules; in practice each stage gets its own YAML config, pointing at the previous stage's merged output as its `base_model` and specifying its own data, loss function, and optimizer settings. Practitioners report that this pipeline delivers 10–30% better downstream performance than single-stage SFT alone, especially when the preference data emphasizes domains where the base model performs poorly.
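To make stage 2 concrete, the DPO objective compares how much the policy prefers a chosen response over a rejected one, relative to a frozen reference model. A minimal pure-Python sketch of the per-pair loss (the log-probability values are made up for illustration):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probs of the chosen
    and rejected responses under the policy and the reference model."""
    # Implicit reward margin: how much more the policy favours the chosen
    # response (relative to the reference) than the rejected one.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy widens the margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy has moved toward the chosen response -> positive margin, low loss.
print(dpo_loss(-12.0, -20.0, -14.0, -15.0))
```

With a zero margin the loss is log 2, the "no preference learned yet" baseline; `beta` controls how strongly deviations from the reference are rewarded.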
Real-world fine-tuning datasets are heterogeneous: some data might be code, some conversational, some structured. Axolotl supports mixing multiple datasets with per-dataset loss weights, sample mixing strategies, and dataset-specific prompt templates. This allows training a single model simultaneously on math problems (with chain-of-thought prompting), customer support queries (with few-shot examples), and code snippets (with marker-based formatting). The configuration lets users specify which datasets to sample from, their relative frequencies, and which chat template to apply to each. During training, Axolotl constructs batches by interleaving samples from the different sources according to the configured weights. This approach is more sample-efficient than naively concatenating datasets, because the model learns to adapt its output format to the input context rather than memorizing a single task structure.
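The mixing mechanics amount to weighted sampling across sources. A toy illustration (not Axolotl's actual sampler; the source names and data are made up):

```python
import random

def interleave(sources, weights, num_samples, seed=0):
    """Draw num_samples examples: pick a source per draw with the given
    relative weights, then take that source's next example (cycling)."""
    rng = random.Random(seed)
    positions = {name: 0 for name in sources}
    batch = []
    for _ in range(num_samples):
        name = rng.choices(list(sources), weights=weights, k=1)[0]
        data = sources[name]
        batch.append((name, data[positions[name] % len(data)]))
        positions[name] += 1
    return batch

sources = {
    "math": ["prob1", "prob2"],
    "support": ["q1", "q2", "q3"],
    "code": ["snip1"],
}
mixed = interleave(sources, weights=[3, 2, 1], num_samples=6)
print(mixed)
```

Raising a source's weight makes its examples appear proportionally more often without duplicating the underlying dataset on disk.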
Axolotl provides first-class support for parameter-efficient fine-tuning via LoRA (Low-Rank Adaptation), reducing trainable parameters from billions to millions. LoRA works by freezing the original weights W and training a low-rank perturbation: the layer's output becomes Wx + BAx, where A and B are small learned matrices of rank r. This reduces memory and computational cost significantly: a 7B-parameter model fine-tuned with LoRA (rank 8 on the attention projections) has only ~4.2M trainable parameters instead of 7B. Axolotl's configuration supports specifying which layers to apply LoRA to, the rank and scaling (lora_r, lora_alpha), and post-training merging of LoRA weights back into the base model. For production, LoRA enables cost-effective task specialization: train ten specialized LoRA adapters for different domains on a single base model, then load the appropriate adapter at inference time, a workflow that traditional full fine-tuning makes prohibitive.
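The parameter count is easy to verify: each adapted d_out × d_in weight gains A (r × d_in) and B (d_out × r), i.e. r × (d_in + d_out) new parameters. A sketch for the setup above, assuming standard Llama-7B-class shapes (32 layers, hidden size 4096) with rank 8 on the q and v projections:

```python
def lora_param_count(layers, d_in, d_out, rank, matrices_per_layer):
    # Each adapted W (d_out x d_in) adds A (rank x d_in) + B (d_out x rank).
    per_matrix = rank * (d_in + d_out)
    return layers * matrices_per_layer * per_matrix

total = lora_param_count(layers=32, d_in=4096, d_out=4096, rank=8, matrices_per_layer=2)
print(f"{total:,}")  # 4,194,304 -> roughly 0.06% of a 7B model
```

Note that `lora_target_linear: true`, as in the config above, adapts all linear layers (MLP projections included), so its count is substantially higher than this attention-only estimate.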