YAML-configured fine-tuning wrapper for HuggingFace models. Define your dataset, PEFT method, and training settings in one config file. Great for rapid experimentation without writing training loop code.
Axolotl is an open-source fine-tuning framework that wraps HuggingFace Transformers with a clean YAML configuration interface. Instead of writing training loop code, you define everything (model, dataset, PEFT config, optimiser, scheduler) in a single YAML file and run `accelerate launch -m axolotl.cli.train config.yaml`. It supports LoRA, QLoRA, IA³, prefix tuning, and full fine-tuning; most popular dataset formats; and distributed training via FSDP and DeepSpeed. It's particularly popular for quickly fine-tuning Llama and Mistral models.
```yaml
# config.yaml - fine-tune Llama 3 8B with QLoRA on a custom dataset
base_model: meta-llama/Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
strict: false

datasets:
  - path: my_dataset.jsonl      # local file or HuggingFace dataset name
    type: alpaca                # dataset format (see below)
dataset_prepared_path: ./prepared_data
val_set_size: 0.05
output_dir: ./fine-tuned-llama3

sequence_len: 4096
sample_packing: true            # pack short samples to fill context window

adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true        # apply LoRA to all linear layers

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_bnb_8bit       # 8-bit Adam for memory savings
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false          # only compute loss on assistant turns
group_by_length: true
bf16: true

logging_steps: 10
save_strategy: epoch
```
Axolotl supports multiple dataset formats via the `type` field:

- `alpaca`: `{"instruction": "...", "input": "...", "output": "..."}` fields
- `sharegpt`: `{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}`
- `input_output`: `{"input": "...", "output": "..."}`, a minimal format

```python
# Convert your data to alpaca format
import json

data = [
    {"instruction": "Summarise this article.", "input": "...", "output": "..."},
]
with open("my_dataset.jsonl", "w") as f:
    for item in data:
        f.write(json.dumps(item) + "\n")
```
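For multi-turn data, the same approach writes sharegpt-style records. A sketch, using the `conversations` structure shown above (the dialogue content and file name are made up):

```python
import json

# Multi-turn chat data destined for sharegpt format: each record is a list of
# "conversations" entries, alternating human and gpt turns.
dialogues = [
    [
        ("human", "What is LoRA?"),
        ("gpt", "A parameter-efficient fine-tuning method."),
        ("human", "Why use it?"),
        ("gpt", "It trains far fewer parameters than full fine-tuning."),
    ],
]

with open("my_chat_dataset.jsonl", "w") as f:
    for turns in dialogues:
        record = {"conversations": [{"from": who, "value": text} for who, text in turns]}
        f.write(json.dumps(record) + "\n")
```

Point `datasets[].path` at this file and set `type: sharegpt` to use it.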
```bash
# Install
pip install "axolotl[flash-attn,deepspeed]"

# Download a starter config
wget https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/examples/llama-3/qlora.yaml

# Edit the config (model, dataset, output_dir), then train:
accelerate launch -m axolotl.cli.train qlora.yaml
```
```bash
# Monitor with W&B (add to config):
#   wandb_project: my-finetune
#   wandb_run_id: run-001

# After training, merge LoRA weights into base model:
python -m axolotl.cli.merge_lora qlora.yaml \
    --lora_model_dir ./fine-tuned-llama3/checkpoint-500
# Merged model saved to ./fine-tuned-llama3/merged
```
```yaml
# Add to config for multi-GPU with FSDP:
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

# Or use DeepSpeed ZeRO-3:
deepspeed: deepspeed_configs/zero3.json
```

```bash
# Run on 4 GPUs:
accelerate launch --num_processes 4 -m axolotl.cli.train config.yaml
```
Load the trained adapter (or the merged model) for inference:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./fine-tuned-llama3/checkpoint-500")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-8B-Instruct")

# Or use the merged model (no PeftModel needed):
# model = AutoModelForCausalLM.from_pretrained("./fine-tuned-llama3/merged", ...)

messages = [{"role": "user", "content": "Summarise this article: ..."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
A few gotchas:

- `sample_packing` requires the `attention_mask` to be built correctly; disable it if you see unexpected outputs.
- If you set `train_on_inputs: true` by mistake, the model learns to predict the instruction too, which typically hurts performance.
- The saved adapter loads with plain `PeftModel.from_pretrained()`: no Axolotl dependency at inference time.

Axolotl's YAML configuration system covers all aspects of the fine-tuning pipeline in a single file, from data loading and model loading through training hyperparameters and evaluation settings. Understanding the interaction between key configuration fields prevents common training failures: choosing `gradient_accumulation_steps` × `micro_batch_size` (× number of GPUs) to produce the intended effective batch size, keeping `sequence_len` within the model's maximum context, and configuring the appropriate special tokens for the dataset format in use.
| Config field | Purpose | Common values |
|---|---|---|
| `adapter` | PEFT method | `qlora`, `lora`, unset (full fine-tune) |
| `sequence_len` | Max token length per example | 2048, 4096, 8192 |
| `micro_batch_size` | Per-GPU batch size | 1–8 (depends on VRAM) |
| `gradient_accumulation_steps` | Effective batch multiplier | 4–32 |
| `lora_r` | LoRA rank | 8–128 |
| `num_epochs` | Training epochs | 1–5 for instruction tuning |
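The batch-size interaction is plain arithmetic; a quick sketch using the values from the example config and the 4-GPU launch above:

```python
# Effective batch size = per-GPU micro batch x gradient accumulation x number of GPUs.
micro_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 4

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 32 samples contribute to each optimizer step
```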
Axolotl's multi-pack sequence packing efficiently fills each training sequence to the configured `sequence_len` by concatenating multiple short examples separated by EOS tokens. This increases GPU utilization by eliminating padding waste: without packing, short instruction-following examples padded to 2048 tokens waste 80–95% of each sequence's compute. Packing requires setting the correct EOS token to prevent cross-example attention leakage, which Axolotl handles automatically when the tokenizer's EOS token is configured correctly.
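The core idea can be illustrated with a toy greedy packer (this is a simplification, not Axolotl's actual implementation; the token IDs and EOS value are made up, and a real packer also builds per-segment attention masks and position IDs):

```python
EOS = 0  # hypothetical EOS token id

def pack_examples(examples, sequence_len):
    """Greedily concatenate tokenized examples, each terminated by EOS,
    into sequences of at most sequence_len tokens."""
    packed, current = [], []
    for tokens in examples:
        tokens = tokens + [EOS]
        if len(current) + len(tokens) > sequence_len:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)
    return packed

examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_examples(examples, sequence_len=8))
# [[1, 2, 3, 0, 4, 5, 0], [6, 7, 8, 9, 0, 10, 0]]
```

Four padded sequences collapse into two full ones, which is exactly where the compute savings come from.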
Axolotl's configuration flexibility enables sophisticated multi-stage fine-tuning strategies where each stage optimizes for a different objective. A common pattern: stage 1 performs supervised fine-tuning (SFT) on high-quality curated data to teach task-specific behavior, stage 2 applies preference learning (DPO or ORPO) on a smaller preference dataset to align outputs with quality criteria, and stage 3 performs continued pretraining on domain-specific unlabeled text to refresh general knowledge without catastrophic forgetting. Each stage uses different datasets, batch sizes, and learning rate schedules; in practice each stage gets its own YAML config, pointing at the previous stage's merged output as its `base_model` and specifying its own data, loss function, and optimizer settings. Practitioners report that this pipeline delivers 10–30% better downstream performance than single-stage SFT alone, especially when the preference data emphasizes domains where the base model performs poorly.
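To make stage 2 concrete, the DPO objective compares how much the policy prefers a chosen response over a rejected one, relative to a frozen reference model. A minimal pure-Python sketch of the per-pair loss (the log-probability values are made up for illustration):

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed log-probs of the chosen
    and rejected responses under the policy and the reference model."""
    # Implicit reward margin: how much more the policy favours the chosen
    # response (relative to the reference) than the rejected one.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)): shrinks as the policy widens the margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy has moved toward the chosen response -> positive margin, low loss.
print(dpo_loss(-12.0, -20.0, -14.0, -15.0))
```

With a zero margin the loss is log 2, the "no preference learned yet" baseline; `beta` controls how strongly deviations from the reference are rewarded.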
Real-world fine-tuning datasets are heterogeneous: some data might be code, some conversational, some structured. Axolotl supports mixing multiple datasets with per-dataset loss weights, sample mixing strategies, and dataset-specific prompt templates. This allows training a single model simultaneously on math problems (with chain-of-thought prompting), customer support queries (with few-shot examples), and code snippets (with marker-based formatting). The configuration lets users specify which datasets to sample from, their relative frequencies, and which chat template to apply to each. During training, Axolotl constructs batches by interleaving samples from the different sources according to the configured weights. This approach is more sample-efficient than naively concatenating datasets, because the model learns to adapt its output format to the input context rather than memorizing a single task structure.
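The mixing mechanics amount to weighted sampling across sources. A toy illustration (not Axolotl's actual sampler; the source names and data are made up):

```python
import random

def interleave(sources, weights, num_samples, seed=0):
    """Draw num_samples examples: pick a source per draw with the given
    relative weights, then take that source's next example (cycling)."""
    rng = random.Random(seed)
    positions = {name: 0 for name in sources}
    batch = []
    for _ in range(num_samples):
        name = rng.choices(list(sources), weights=weights, k=1)[0]
        data = sources[name]
        batch.append((name, data[positions[name] % len(data)]))
        positions[name] += 1
    return batch

sources = {
    "math": ["prob1", "prob2"],
    "support": ["q1", "q2", "q3"],
    "code": ["snip1"],
}
mixed = interleave(sources, weights=[3, 2, 1], num_samples=6)
print(mixed)
```

Raising a source's weight makes its examples appear proportionally more often without duplicating the underlying dataset on disk.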
Axolotl provides first-class support for parameter-efficient fine-tuning via LoRA (Low-Rank Adaptation), reducing trainable parameters from billions to millions. LoRA works by freezing the original weights W and training a low-rank perturbation: the layer's output becomes Wx + BAx, where A and B are small learned matrices of rank r. This reduces memory and computational cost significantly: a 7B-parameter model fine-tuned with LoRA (rank 8 on the attention projections) has only ~4.2M trainable parameters instead of 7B. Axolotl's configuration supports specifying which layers to apply LoRA to, the rank and scaling (lora_r, lora_alpha), and post-training merging of LoRA weights back into the base model. For production, LoRA enables cost-effective task specialization: train ten specialized LoRA adapters for different domains on a single base model, then load the appropriate adapter at inference time, a workflow that traditional full fine-tuning makes prohibitive.
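The parameter count is easy to verify: each adapted d_out × d_in weight gains A (r × d_in) and B (d_out × r), i.e. r × (d_in + d_out) new parameters. A sketch for the setup above, assuming standard Llama-7B-class shapes (32 layers, hidden size 4096) with rank 8 on the q and v projections:

```python
def lora_param_count(layers, d_in, d_out, rank, matrices_per_layer):
    # Each adapted W (d_out x d_in) adds A (rank x d_in) + B (d_out x rank).
    per_matrix = rank * (d_in + d_out)
    return layers * matrices_per_layer * per_matrix

total = lora_param_count(layers=32, d_in=4096, d_out=4096, rank=8, matrices_per_layer=2)
print(f"{total:,}")  # 4,194,304 -> roughly 0.06% of a 7B model
```

Note that `lora_target_linear: true`, as in the config above, adapts all linear layers (MLP projections included), so its count is substantially higher than this attention-only estimate.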