DATA ENGINEERING

Synthetic Data for LLMs

Generating training data with models — self-instruct, distillation, rejection sampling, and when synthetic beats real

The generation pipeline: self-instruct → distillation → filtering
The golden rule: quality > quantity
The cost argument: 10–100× cheaper than human labels
Contents
  1. Why synthetic data?
  2. Self-Instruct
  3. Distillation pipelines
  4. Rejection sampling
  5. Magpie & self-play
  6. Domain-specific data
  7. Quality & collapse
01 — Motivation

Why Synthetic Data?

Human annotation is slow (weeks), expensive ($1–10/sample), and hard to scale to rare capabilities. Synthetic data takes a different approach: generate examples with a strong model (GPT-4, Claude), filter for quality, then train a weaker model on the filtered outputs.

This is the distillation insight: a small model trained on GPT-4 outputs can match GPT-4 on narrow tasks. Alpaca, Vicuna, and WizardLM all use this approach, achieving competitive quality with minimal human labeling.

⚠️ Model collapse risk: Training on synthetic data from a model trained on synthetic data degrades quality over rounds. Hallucinations propagate, distribution shift increases. Always validate against real-world held-out data before declaring success.

The Core Risks

02 — Method

Self-Instruct and Instruction Tuning Data

Self-Instruct (Wang et al. 2022): Use the model itself to generate (instruction, input, output) triples starting from a small seed set of 175 human-written examples. The pipeline is simple but effective:

The Self-Instruct Pipeline

  1. Seed tasks: Start with ~175 manually-written instruction examples
  2. Generate: Use few-shot prompting to generate new instruction-output pairs
  3. Filter: Remove duplicates and low-quality examples
  4. Outputs: Generate responses for each kept instruction
  5. Fine-tune: Train the model on the full dataset
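
The filtering step above can be sketched in a few lines. The original paper scores candidates with ROUGE-L against the existing pool; here a token-level Jaccard similarity stands in so the example stays dependency-free (the threshold value is illustrative):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two instructions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def filter_novel(candidates: list[str], pool: list[str], threshold: float = 0.7) -> list[str]:
    """Keep only candidates sufficiently different from the pool (and each other)."""
    kept = []
    for cand in candidates:
        if all(jaccard(cand, existing) < threshold for existing in pool + kept):
            kept.append(cand)
    return kept

seed = ["Classify the sentiment of this review: {review}"]
new = [
    "Classify the sentiment of this review: {review}",  # duplicate of seed, dropped
    "Write a haiku about autumn leaves",                # novel, kept
]
print(filter_novel(new, seed))  # ['Write a haiku about autumn leaves']
```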

Alpaca (Stanford): Applied Self-Instruct to GPT-3.5, generating 52K instruction-following examples. Fine-tuned LLaMA-7B on these outputs and achieved surprisingly good instruction-following capability.

Evol-Instruct (WizardLM): Iteratively rewrite instructions to be more complex and diverse using evolving prompts. Add constraints, deepen reasoning, concretize details, increase reasoning steps. This creates progressively harder examples that train the model on harder tasks.
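
One Evol-Instruct step amounts to wrapping an instruction in a rewrite prompt and sending it back to the generator. The operation names and prompt wording below are paraphrases for illustration, not the exact WizardLM templates:

```python
import random

# Illustrative Evol-Instruct rewrite operations (paraphrased, not the
# exact WizardLM prompt templates).
EVOL_OPS = {
    "constraints": "Add one more constraint or requirement to this instruction:",
    "deepen": "Rewrite this instruction so it requires deeper reasoning:",
    "concretize": "Replace general concepts in this instruction with specific ones:",
    "steps": "Rewrite this instruction so solving it needs more reasoning steps:",
}

def build_evol_prompt(instruction: str, op=None) -> str:
    """Pick an evolution operation and wrap the instruction in its prompt."""
    op = op or random.choice(list(EVOL_OPS))
    return f"{EVOL_OPS[op]}\n\n{instruction}"

print(build_evol_prompt("Write a function that sorts a list.", op="constraints"))
```

The output of each round is fed to the generator model, and the rewritten instruction replaces (or augments) the original in the next round.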

Self-Instruct Generation Prompt

SYSTEM: You are an instruction generator. Given examples of (instruction, output) pairs, generate 20 new diverse instructions on different topics. Each instruction should be:
- A task a language model could complete
- Different from existing instructions
- Between 1-3 sentences
- Varied in format (questions, imperatives, fill-in-the-blank)

EXISTING EXAMPLES:
1. "Classify the sentiment of this review: {review}"
2. "Write a Python function that {task}"
3. "Translate this sentence to French: {text}"

Generate 20 new instructions:

Instruction Dataset Comparison

Dataset       | Size | Generator        | Method               | License
Alpaca        | 52K  | text-davinci-003 | Self-Instruct        | Non-commercial
WizardLM-Evol | 250K | GPT-4            | Evol-Instruct        | Non-commercial
Orca          | 5M   | GPT-4 + GPT-3.5  | Explanation traces   | Non-commercial
Magpie        | 1M+  | Llama-3 (self)   | Pre-query generation | Apache 2.0
Capybara      | 16K  | GPT-4            | Multi-turn, diverse  | Apache 2.0
03 — Technique

Distillation Pipelines

Knowledge distillation for LLMs: Train a small student model to mimic a large teacher model's outputs. Hard distillation keeps only final answers (Alpaca-style). Soft distillation uses token probability distributions (better but requires white-box access). Speculative distillation generates diverse outputs, filters with a reward model, and trains on the best ones.
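
The soft-distillation objective is a temperature-scaled KL divergence between teacher and student token distributions. A minimal sketch in pure Python for one token position (real pipelines compute this over torch/JAX tensors; the temperature value is a common but arbitrary choice):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at one token position, temperature-scaled."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero loss; divergence gives a positive loss.
print(soft_distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(soft_distill_loss([3.0, 0.0, 0.0], [0.0, 3.0, 0.0]) > 0)  # True
```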

Distillation Types

Distillation Pipeline with Filtering

import openai
from tqdm import tqdm

def generate_training_pair(instruction: str, teacher="gpt-4o") -> dict:
    response = openai.chat.completions.create(
        model=teacher,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Be thorough and accurate."},
            {"role": "user", "content": instruction},
        ],
        temperature=0.7,
    )
    return {"instruction": instruction, "output": response.choices[0].message.content}

def quality_filter(pair: dict, min_length=50, max_length=2000) -> bool:
    output = pair["output"]
    if len(output) < min_length or len(output) > max_length:
        return False
    if output.startswith("I cannot") or output.startswith("I'm sorry"):
        return False  # filter refusals
    return True

dataset = [generate_training_pair(instr) for instr in tqdm(instructions)]
filtered = [p for p in dataset if quality_filter(p)]
print(f"Kept {len(filtered)}/{len(dataset)} pairs ({100*len(filtered)/len(dataset):.0f}%)")
04 — Filtering

Rejection Sampling and Filtering

Rejection Sampling Fine-Tuning (RFT): Generate N outputs per prompt, keep only those that pass a correctness check, train on kept outputs. For math and code, you can verify correctness cheaply (run unit tests, check answer keys). Use this to filter synthetic data.
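
The RFT loop for a verifiable task reduces to: sample N completions, keep the ones whose extracted answer matches the key. A sketch with a stubbed model call (`sample_fn` and the answer-extraction regex are illustrative stand-ins, not a fixed interface):

```python
import re
import random

def extract_answer(text: str):
    """Pull a final integer answer out of a completion (illustrative regex)."""
    m = re.search(r"answer is\s*(-?\d+)", text.lower())
    return m.group(1) if m else None

def rejection_sample(prompt: str, answer_key: str, sample_fn, n: int = 8) -> list:
    """Return only completions whose final answer matches the key."""
    kept = []
    for _ in range(n):
        completion = sample_fn(prompt)
        if extract_answer(completion) == answer_key:
            kept.append(completion)
    return kept

# Stub generator that is right about half the time:
def noisy_model(prompt):
    return f"... so the answer is {random.choice(['4', '5'])}"

kept = rejection_sample("What is 2 + 2?", "4", noisy_model, n=16)
print(f"kept {len(kept)}/16 samples")
```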

Rejection sampling is the gold standard for code and math tasks. DeepSeek-R1 and similar reasoning models bootstrapped their training data this way.

Filtering Methods by Task Type

Task                  | Verification method      | Rejection rate | Quality gain
Math                  | Check against answer key | 60–80%         | High
Code                  | Run unit tests           | 40–70%         | High
Instruction following | LLM judge                | 20–40%         | Medium
Creative writing      | Reward model             | 20–50%         | Medium
Factual QA            | Retrieval + NLI check    | 30–60%         | Medium
High rejection rates are expected and necessary for quality.

Deduplication

Always deduplicate synthetic data, removing both exact matches and near-duplicates:

Duplicates hurt more than they help — they waste training capacity and create false confidence in patterns.
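
A minimal near-duplicate filter based on character n-gram Jaccard similarity. Production pipelines typically use MinHash/LSH (e.g. the datasketch library) to avoid the quadratic comparison done here; the threshold and n-gram size are illustrative:

```python
def ngrams(text: str, n: int = 5) -> set:
    """Character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def dedup(examples: list, threshold: float = 0.8) -> list:
    """Drop examples whose n-gram Jaccard similarity to a kept example exceeds threshold."""
    kept, kept_grams = [], []
    for ex in examples:
        g = ngrams(ex)
        if all(len(g & h) / len(g | h) < threshold for h in kept_grams):
            kept.append(ex)
            kept_grams.append(g)
    return kept

data = [
    "Explain how photosynthesis works.",
    "Explain how photosynthesis works!",  # near-duplicate, dropped
    "Summarize the French Revolution.",
]
print(dedup(data))  # drops the second item
```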

05 — Automation

Magpie and Self-Play Methods

Magpie (Xu et al. 2024): Instead of generating responses to human instructions, prompt the model at the pre-query position and let it generate both the instruction AND the response. No seed data is needed. This works because instruction-tuned models, given only the chat template up to the start of the user turn, continue with a user-like query.

Self-play: Two model instances play adversarial roles (teacher and student, examiner and examinee) to generate challenging data without human involvement.

Constitutional AI data generation: Model critiques its own response using a set of principles, then rewrites to better follow them. Generates (critique, revision) pairs for training.
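
The critique-revision loop can be sketched as two chained model calls. Here `llm` is a stand-in callable for any completion API, and the principle text is illustrative:

```python
PRINCIPLE = "The response should be helpful, honest, and harmless."

def generate_cai_pair(prompt: str, response: str, llm) -> dict:
    """Produce a (critique, revision) training pair for one response."""
    critique = llm(
        f"Critique this response against the principle: {PRINCIPLE}\n\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
    revision = llm(
        f"Rewrite the response to address the critique.\n\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return {"prompt": prompt, "critique": critique, "revision": revision}

# Usage with a stubbed model:
pair = generate_cai_pair("How do I pick a lock?", "Here's how...", lambda p: "(model output)")
print(sorted(pair))  # ['critique', 'prompt', 'revision']
```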

Magpie-Style Pre-Query Generation

# Give model the assistant prefix to trigger instruction generation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Empty human turn — model generates what a user would ask
]

# With pre-fill trick (Anthropic):
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": ""}],  # empty triggers pre-query generation
    system="You are a helpful assistant.",
    max_tokens=200,
)

Why Magpie Works

06 — Application

Domain-Specific Synthetic Data

General instruction data is widely available (Alpaca, WizardLM). The real value is domain-specific data you can generate for your use case. Collect domain documents, generate Q&A pairs and reasoning chains, filter and deduplicate, then fine-tune.

Domain Adaptation Pipeline

  1. Collect: Domain documents (papers, contracts, filings, code, support tickets)
  2. Generate: Q&A pairs and reasoning chains from documents
  3. Filter: Remove hallucinations and off-topic content
  4. Deduplicate: Remove near-duplicates
  5. Fine-tune: Train model on domain data
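
Step 2 of the pipeline above starts by splitting documents into chunks and building a generation prompt per chunk. A sketch where the chunk size and prompt wording are illustrative choices:

```python
def chunk_document(text: str, max_chars: int = 1500) -> list:
    """Split on paragraph boundaries, packing paragraphs up to max_chars per chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def qa_prompt(chunk: str) -> str:
    """Prompt asking for grounded QA pairs; 'answerable from the passage' discourages hallucination."""
    return (
        "Write 3 question-answer pairs that are fully answerable "
        f"from this passage alone:\n\n{chunk}"
    )

doc = "First paragraph.\n\nSecond paragraph."
print(len(chunk_document(doc, max_chars=10)))  # 2
```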

Domain Strategies by Sector

Domain           | Source material  | Generation method        | Verification
Medical          | PubMed abstracts | Extract + rephrase QA    | Expert review sample
Code             | GitHub repos     | Problem → solution pairs | Unit tests
Customer support | Support tickets  | Rephrase + resolve       | CSAT proxy
Legal            | Court rulings    | Clause extraction + QA   | Lawyer review sample
Finance          | Earnings reports | Analyst-style summaries  | Factual check
🔒 PII scrubbing: Never include PII (names, emails, account numbers) in synthetic training data, even if it appears in source documents. Scrub before generation, not after. This protects privacy and prevents model memorization of sensitive data.
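
A minimal pre-generation scrubbing pass. These regexes only catch obvious emails, phone numbers, and long digit runs; a real pipeline should use a dedicated tool (e.g. Microsoft Presidio) since names cannot be caught by regex at all:

```python
import re

# Obvious-pattern PII scrubbing (illustrative; not a complete solution).
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{12,19}\b"), "[ACCOUNT]"),  # card/account-length digit runs
]

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with placeholder tokens before generation."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(scrub_pii("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```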
07 — Validation

Quality Evaluation and Model Collapse Prevention

Quality metrics for synthetic data: perplexity on held-out real data (it should not increase), task accuracy on real benchmarks, and human evaluation of a 1–5% sample. Model collapse is detected by measuring n-gram diversity and topic coverage; each successive generation round should maintain or increase diversity.
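
One standard diversity measure is distinct-n: the fraction of n-grams in the corpus that are unique. A falling distinct-n across generation rounds is an early warning of collapse. A minimal sketch:

```python
def distinct_n(texts: list, n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across a corpus."""
    grams = []
    for t in texts:
        tokens = t.lower().split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

diverse = ["the cat sat", "a dog ran fast", "birds fly south"]
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
print(distinct_n(diverse))     # 1.0  (every bigram unique)
print(distinct_n(repetitive))  # ~0.33 (2 unique bigrams out of 6)
```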

Prevention Strategies

🔄 Data Mixing

  • Mix 20–80% synthetic with real data
  • Never train purely synthetic for multiple rounds
  • Real data anchors quality

🎨 Diversity Forcing

  • Require diversity in topics, formats, lengths, difficulty
  • Low diversity accelerates collapse
  • Measure n-gram coverage in generation

✔️ Round-Trip Consistency

  • Generate → verify with different model → keep consistent
  • Cross-model verification catches hallucinations
  • Reduces error propagation

📈 Iterative Refinement

  • Generate → fine-tune → use for harder examples
  • Each round focuses on model's failure modes
  • Proportional gains diminish after 3–4 rounds
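
The data-mixing strategy above can be sketched as capping the synthetic fraction of each training round so real data keeps anchoring the distribution. The fraction and helper name are illustrative:

```python
import random

def mix_datasets(real: list, synthetic: list, synth_frac: float = 0.5, seed: int = 0) -> list:
    """Sample synthetic examples so they make up at most synth_frac of the mix."""
    rng = random.Random(seed)
    max_synth = int(len(real) * synth_frac / (1 - synth_frac))
    chosen = rng.sample(synthetic, min(max_synth, len(synthetic)))
    mixed = real + chosen
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(real=list(range(100)), synthetic=list(range(1000, 2000)), synth_frac=0.5)
print(len(mixed))  # 200: 100 real + 100 synthetic
```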

Quality Metrics

Tools and Frameworks

  • distilabel (data generation): by Argilla; synthetic data pipelines for instruction tuning.
  • Magpie (data generation): pre-query generation without seed data.
  • LLM Foundry (framework): Composer/MosaicML; end-to-end training stack.
  • Axolotl (training): fine-tuning framework with synthetic data support.
  • TRL (training): Hugging Face's Transformers Reinforcement Learning library.
  • HuggingFace Datasets (data): streaming and batching for large datasets.
  • OpenAI Batch API (API): cheap, asynchronous generation at scale.
  • LangChain (framework): synthetic data generation pipelines.
08 — Further Reading

References
