System Design

Distillation

Training a small, fast student model to mimic a large, expensive teacher model — compressing quality into a deployable model that's 10–100× cheaper to run.

Cost reduction
10–100×
Quality retention
85–95%
Data needed
10K–100K examples

SECTION 01

What Is Distillation?

Distillation (Hinton et al., 2015) trains a small student model to match the outputs of a large teacher model. For LLMs, the most practical form is behavioural cloning: generate high-quality (query, response) pairs from the teacher, then fine-tune the student on those pairs. The student learns the teacher's reasoning style without having the teacher's scale.

SECTION 02

Teacher Data Generation

Use the teacher (GPT-4o, Claude Opus) to generate responses to your task-specific queries. Key considerations: diverse query coverage, rejection sampling (keep only teacher responses that pass quality checks), and sufficient volume (10K minimum, 100K for strong results).

from openai import OpenAI
import json, tqdm
client = OpenAI()
def generate_teacher_data(queries: list[str], output_file: str):
    dataset = []
    for q in tqdm.tqdm(queries):
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": q}],
            temperature=0.3,
        )
        answer = resp.choices[0].message.content
        if passes_quality_check(answer):  # length, format, coherence
            dataset.append({"prompt": q, "completion": answer})
    with open(output_file, "w") as f:
        for ex in dataset:
            f.write(json.dumps(ex) + "\n")
    print(f"Saved {len(dataset)} examples")
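
The snippet above leaves `passes_quality_check` undefined. A minimal sketch of such a rejection-sampling filter (the thresholds and refusal phrases here are illustrative assumptions, not a fixed recipe):

```python
def passes_quality_check(answer: str,
                         min_chars: int = 50,
                         max_chars: int = 8000) -> bool:
    """Cheap rejection-sampling filter for teacher outputs.

    Thresholds and heuristics are illustrative; tune them per task.
    """
    if answer is None:
        return False
    text = answer.strip()
    # Length bounds: reject empty/truncated or runaway responses.
    if not (min_chars <= len(text) <= max_chars):
        return False
    # Refusal heuristic: drop apologetic non-answers.
    refusals = ("i'm sorry", "i cannot", "as an ai")
    if text.lower().startswith(refusals):
        return False
    # Degeneration heuristic: reject heavy verbatim line repetition.
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.5:
        return False
    return True
```

In practice you would add task-specific checks (format validation, an LLM-as-judge pass) on top of these cheap heuristics.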
SECTION 03

Training the Student

Fine-tune a small model (Llama 3.2 3B, Phi-3 mini) on the teacher data using standard SFT. Use LoRA for parameter efficiency. Learning rate 2e-5, 3 epochs, cosine schedule. Eval on a held-out 10% of the teacher-generated data + your golden test set.

from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
training_args = SFTConfig(              # the hyperparameters from above
    learning_rate=2e-5,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    max_seq_length=2048,
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,              # teacher-generated examples from Section 02
    peft_config=lora_config,
)
trainer.train()
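
`dataset` above is assumed to be the teacher data written in Section 02. A minimal pure-Python loader that also carves out the held-out 10% eval split mentioned earlier (in practice you would wrap the resulting lists in a `datasets.Dataset` before passing them to `SFTTrainer`):

```python
import json
import random

def load_and_split(jsonl_path: str, eval_fraction: float = 0.1, seed: int = 0):
    """Load teacher-generated examples and carve out a held-out eval split."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)            # fixed seed -> reproducible split
    rng.shuffle(examples)
    n_eval = max(1, int(len(examples) * eval_fraction))
    return examples[n_eval:], examples[:n_eval]   # (train, eval)
```

The eval split here measures how well the student mimics the teacher; your golden test set measures whether that mimicry is actually good enough for the task.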
SECTION 04

Quality vs Cost Tradeoffs

A distilled 7B model typically reaches 85–90% of a 70B teacher's quality at 10× lower inference cost. For narrow tasks (customer support, code completion in a specific framework), distillation often reaches 95%+ because the domain is well-represented in the training data. General-purpose distillation is harder — use task-specific distillation when possible.
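
To see when the one-time cost of distillation (teacher data generation plus fine-tuning) pays for itself, a back-of-the-envelope sketch helps (all dollar figures below are illustrative assumptions, not real pricing):

```python
def breakeven_queries(teacher_cost_per_query: float,
                      student_cost_per_query: float,
                      one_time_distillation_cost: float) -> float:
    """Number of queries before distillation pays for itself."""
    saving = teacher_cost_per_query - student_cost_per_query
    if saving <= 0:
        raise ValueError("student must be cheaper per query than teacher")
    return one_time_distillation_cost / saving

# Illustrative numbers: $0.01/query teacher, $0.001/query student (10x cheaper),
# $2,000 total to generate 50K teacher examples and fine-tune the student.
n = breakeven_queries(0.01, 0.001, 2000.0)   # ~222K queries to break even
```

At high volumes (millions of queries per month) the break-even point arrives within days, which is why distillation targets the highest-volume tasks first.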

SECTION 05

Speculative Decoding Link

Distilled small models are excellent draft models for speculative decoding with the original large teacher. The student generates candidate tokens cheaply; the teacher verifies and accepts/rejects. This gives teacher-quality output at near-student-speed — the best of both worlds.
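
The draft/verify loop can be sketched with toy deterministic models (greedy verification only; real speculative decoding samples from adjusted distributions and verifies all k drafts in a single teacher forward pass; `draft_next` and `target_next` stand in for real model calls):

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=12):
    """Greedy speculative-decoding sketch.

    draft_next / target_next: callables mapping a token list to the next token.
    Output is identical to plain greedy decoding with the target model,
    but the draft lets the target verify k tokens per "pass".
    May overshoot max_new by up to k-1 tokens.
    """
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft proposes k candidate tokens cheaply.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies: keep the longest agreeing prefix, then
        #    substitute the target's own token at the first mismatch.
        for t in proposal:
            expected = target_next(out)
            if t == expected:
                out.append(t)            # accepted draft token
            else:
                out.append(expected)     # correction from the target
                break
    return out[len(prompt):]
```

The key property, visible even in this toy version, is that output quality is exactly the target's; the draft only changes how many tokens are confirmed per verification step.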

SECTION 06

Production Recipe

1. Identify your highest-volume task (e.g. classification, extraction, summarisation).
2. Generate 50K teacher examples with diversity sampling.
3. Fine-tune a 3B–7B student model with LoRA.
4. A/B test student vs teacher on 5% of live traffic.
5. If quality delta < 5%, fully migrate to student.
6. Use teacher as fallback for low-confidence student outputs.
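
Step 6 of the recipe can be sketched as a simple router (the function names and the 0.8 threshold are illustrative; a real system would derive confidence from token log-probabilities or a calibrated classifier):

```python
def route_query(query, student_fn, teacher_fn, threshold=0.8):
    """Teacher fallback for low-confidence student outputs.

    student_fn returns (answer, confidence in [0, 1]); teacher_fn returns
    an answer. Returns (answer, which_model_served_it).
    """
    answer, confidence = student_fn(query)
    if confidence >= threshold:
        return answer, "student"
    return teacher_fn(query), "teacher"   # escalate uncertain queries
```

Logging which model served each query gives you the escalation rate for free, which is the main cost lever to tune after migration.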

SECTION 07

Temperature and Knowledge Transfer

Temperature scaling controls the softness of the teacher's output distribution. High temperature (T > 1.0) flattens the teacher's softmax, preserving more information about the relative ordering of the wrong answers (the "dark knowledge"); low temperature sharpens it. The optimal temperature is task-dependent: NLP setups often use T=4-8, while computer vision commonly uses T=1-3.

import torch.nn.functional as F
def knowledge_distillation_loss(
    student_logits, teacher_logits, labels, T=4.0, alpha=0.5
):
    # Soft target loss: teacher guidance
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T ** 2)
    
    # Hard target loss: ground truth
    hard_loss = F.cross_entropy(student_logits, labels)
    
    return alpha * soft_loss + (1 - alpha) * hard_loss
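
The effect of T on the soft targets is easy to see with a plain softmax (framework-free sketch; the logits are made up for illustration):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits / T; higher T flattens the distribution."""
    scaled = [x / T for x in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [8.0, 2.0, 1.0]                   # teacher strongly prefers class 0
hard = softmax_with_temperature(logits, T=1.0)
soft = softmax_with_temperature(logits, T=4.0)
# At T=4 the winner's probability drops, and the relative ordering of the
# wrong answers (class 1 over class 2) carries visible probability mass.
```

This is exactly the signal the soft-target loss above transfers to the student and that a hard one-hot label throws away.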

Distillation is particularly effective for large language models where the student can be 5-100x smaller. In practice: a 3B parameter student trained on a 70B teacher can match 60-70% of teacher quality on many benchmarks. The choice of teacher matters: ensemble teachers, domain-specific teachers, or fine-tuned teachers all produce better students than smaller general-purpose teachers.

# Ensemble teacher distillation. load_model / StudentModel / train_loader are
# placeholder names for your own loading code, student architecture, and data.
import torch

teacher_models = [
    load_model("teacher_v1"),
    load_model("teacher_v2"),
    load_model("teacher_v3"),
]

@torch.no_grad()  # teachers are frozen; no gradients needed
def ensemble_teacher_logits(inputs):
    logits_list = [m(inputs) for m in teacher_models]
    return torch.stack(logits_list).mean(dim=0)

student = StudentModel(hidden_size=512)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
for batch in train_loader:
    teacher_logits = ensemble_teacher_logits(batch['input'])
    student_logits = student(batch['input'])
    loss = knowledge_distillation_loss(
        student_logits, teacher_logits, batch['labels'], T=8
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Configuration         Student Size     Quality vs Teacher   Inference Speed
Light Distillation    10% of teacher   70-75%               10-20x faster
Medium Distillation   25% of teacher   85-90%               4-8x faster
Heavy Distillation    50% of teacher   95-98%               2-3x faster

SECTION 08

Frameworks and Compression in Practice

Open-source distillation frameworks: Hugging Face Transformers provides distillation examples for BERT, Llama, and other models. DistilBERT shrinks BERT by 40% while retaining about 97% of its language-understanding performance. TinyLlama, a compact 1.1B model (pretrained from scratch rather than distilled), is a popular student and draft-model choice at this scale. These models demonstrate that distillation is not just academically interesting: it's a practical tool for production.

Distillation + quantization: combine distillation with INT8 quantization for 10-50x overall speedups. A distilled 350M model quantized to INT8 can run inference at 5-10x the speed of a full-size teacher, with a typical further quality drop of 2-5%. This is the strategy behind mobile-optimized models: distill down from a large teacher, quantize aggressively, and deploy on phones. The final model is 50-100MB instead of several GB.
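
The core of INT8 quantization is a per-tensor scale that maps floats to integers in [-127, 127]. A minimal symmetric-quantization sketch (real deployments use per-channel scales and calibration data rather than this toy version):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 codes."""
    return [scale * v for v in q]

weights = [0.9, -0.5, 0.03, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)   # each entry within one scale step of the original
```

Each weight now needs 1 byte plus a shared scale instead of 4 bytes, which is where the 50-100MB mobile footprint comes from.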

Deployment and Production Monitoring

Distilled models require slightly different deployment strategies than standard models. The key metric is distillation fidelity: how well does the student match the teacher on held-out data? In production, track both student accuracy and the student-teacher divergence. If divergence increases (student starts disagreeing with teacher), it indicates model drift or distribution shift.
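
Student-teacher divergence tracking can be sketched as a rolling KL monitor over sampled production traffic (the 0.5 alert threshold is an illustrative assumption; calibrate it against the divergence observed on held-out data at deploy time):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

class DivergenceMonitor:
    """Tracks student-teacher divergence on sampled production queries."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.history = []

    def record(self, student_probs, teacher_probs):
        # Divergence of the student from the teacher on one sampled query.
        d = kl_divergence(teacher_probs, student_probs)
        self.history.append(d)
        return d

    def drift_alert(self, window=100):
        # Alert when the rolling mean divergence exceeds the threshold.
        recent = self.history[-window:]
        return bool(recent) and sum(recent) / len(recent) > self.threshold
```

A rising rolling mean is the early-warning signal for distribution shift: the student is being asked questions its teacher data never covered.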

Continuous distillation: retrain the student on the latest data while keeping the teacher fixed. This avoids the chicken-and-egg problem of updating both simultaneously. Deploy new students every week or month. A/B test students against the baseline teacher to confirm quality gains before full rollout.

Online distillation: as production queries arrive, use them to refine the student. This is especially valuable for personalization tasks where the teacher (expensive model) can give soft targets for queries it sees. Combine batch distillation (for breadth) with online distillation (for personalized quality).

SECTION 09

Choosing the Right Teacher and Convergence

Not all teachers are equally effective. A large but poorly-trained teacher may not help a student. An ensemble of diverse teachers often outperforms a single large teacher—they provide complementary knowledge. Fine-tune the teacher on your specific domain (e.g., medical domain) before distilling: domain-specific teachers teach domain-specific patterns that generalist teachers miss.

Convergence analysis: distillation doesn't always converge smoothly. If the student architecture is too small for the problem, it hits a quality ceiling below the teacher. If the student learns too fast (high learning rate), it may overfit to the teacher's mistakes. Hyperparameter tuning is essential: learning rate, temperature, alpha weight. Use a validation set to monitor student-teacher divergence and stop early if the student starts diverging.

Distillation in production systems: companies like Google, Meta, and Microsoft use distillation for model compression. Google's MobileBERT is distilled for mobile deployment; Hugging Face's multilingual DistilBERT handles 104 languages. These are production models used in real applications, and they show that distillation is a core technique for making large models deployable.

Distillation variants: response-based distillation (match final outputs), feature-based distillation (match intermediate representations), and relation-based distillation (match relationships between samples). Combine them: response-based for overall quality, feature-based for interpretability, relation-based for robustness. Multi-objective distillation often beats single-objective.
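
The three signals can be combined into one weighted objective. A framework-free sketch on toy vectors (the weights and the dictionary layout are illustrative assumptions):

```python
import math

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def pairwise_distances(batch):
    """Relation signal: distances between every pair of samples in a batch."""
    return [math.dist(batch[i], batch[j])
            for i in range(len(batch)) for j in range(i + 1, len(batch))]

def multi_objective_loss(student, teacher, w_resp=1.0, w_feat=0.5, w_rel=0.5):
    """Weighted sum of the three distillation signals; weights are illustrative."""
    resp = kl(teacher["probs"], student["probs"])            # response-based
    feat = mse(teacher["features"], student["features"])     # feature-based
    rel = mse(pairwise_distances(teacher["batch_features"]),
              pairwise_distances(student["batch_features"]))  # relation-based
    return w_resp * resp + w_feat * feat + w_rel * rel
```

In a real trainer each term would be a differentiable tensor operation, but the structure (three signals, one weighted sum) is exactly this.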

Distillation is proven in production: it trades a modest quality loss for a large speedup. Always validate that the student meets your quality bar before deploying, and reach for distillation whenever you are balancing quality against latency and cost.

Distillation for LLMs: distilling GPT-4 or Claude into smaller models is an active frontier. Companies build distilled versions using synthetic data from larger models, with promising results: distilled 7B models recover 30-40 percent of the larger model's quality on broad benchmarks, and far more on narrow tasks. Expect more publicly released distilled models as the value of efficient inference grows with scale.

Key takeaway: the benefits of distillation compound over time. An initial student may be only marginally cheaper to run, but as traffic grows and the pipeline matures (better teacher data, better filters, regular retraining), the cost and latency savings become dramatic. Build the pipeline once, measure continuously, and optimize based on data.

Getting started with distillation: pick a proven teacher model in your domain and measure baseline performance. Implement simple knowledge distillation with reasonable hyperparameters, then compare student and teacher on your test set. If quality is acceptable, deploy, monitor the student in production, and iterate based on feedback. Distillation is not magic, but when implemented with domain knowledge and careful testing it works reliably: a mature, widely adopted technique for deploying models efficiently at scale.