DATA ENGINEERING

Synthetic Data for LLMs

Generating training data with models — self-instruct, distillation, rejection sampling, and when synthetic beats real

The generation pipeline: self-instruct → distillation → filtering
The golden rule: quality > quantity
The cost argument: 10–100× cheaper than human labels
Contents
  1. Why synthetic data?
  2. Self-Instruct
  3. Distillation pipelines
  4. Rejection sampling
  5. Magpie & self-play
  6. Domain-specific data
  7. Quality & collapse
01 — Motivation

Why Synthetic Data?

Human annotation is slow (weeks), expensive ($1–10/sample), and hard to scale to rare capabilities. Synthetic data takes a different approach: generate examples with a strong model (GPT-4, Claude), filter for quality, then train a weaker model on the filtered outputs.

This is the distillation insight: a small model trained on GPT-4 outputs can match GPT-4 on narrow tasks. Alpaca, Vicuna, and WizardLM all use this approach, achieving competitive quality with minimal human labeling.

⚠️ Model collapse risk: Training on synthetic data from a model trained on synthetic data degrades quality over rounds. Hallucinations propagate, distribution shift increases. Always validate against real-world held-out data before declaring success.

The Core Risks

02 — Method

Self-Instruct and Instruction Tuning Data

Self-Instruct (Wang et al. 2022): Use the model itself to generate (instruction, input, output) triples starting from a small seed set of 175 human-written examples. The pipeline is simple but effective:

The Self-Instruct Pipeline

  1. Seed tasks: Start with ~175 manually-written instruction examples
  2. Generate: Use few-shot prompting to generate new instruction-output pairs
  3. Filter: Remove duplicates and low-quality examples
  4. Outputs: Generate responses for each kept instruction
  5. Fine-tune: Train the model on the full dataset
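
The filtering step above can be sketched in a few lines. The original paper scores candidates with ROUGE-L against the existing pool; here a token-level Jaccard similarity stands in so the example stays dependency-free (the threshold value is illustrative):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two instructions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def filter_novel(candidates: list[str], pool: list[str], threshold: float = 0.7) -> list[str]:
    """Keep only candidates sufficiently different from the pool (and each other)."""
    kept = []
    for cand in candidates:
        if all(jaccard(cand, existing) < threshold for existing in pool + kept):
            kept.append(cand)
    return kept

seed = ["Classify the sentiment of this review: {review}"]
new = [
    "Classify the sentiment of this review: {review}",  # duplicate of seed, dropped
    "Write a haiku about autumn leaves",                # novel, kept
]
print(filter_novel(new, seed))  # ['Write a haiku about autumn leaves']
```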

Alpaca (Stanford): Applied Self-Instruct to GPT-3.5, generating 52K instruction-following examples. Fine-tuned LLaMA-7B on these outputs and achieved surprisingly good instruction-following capability.

Evol-Instruct (WizardLM): Iteratively rewrite instructions to be more complex and diverse using evolving prompts. Add constraints, deepen reasoning, concretize details, increase reasoning steps. This creates progressively harder examples that train the model on harder tasks.
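
One Evol-Instruct step amounts to wrapping an instruction in a rewrite prompt and sending it back to the generator. The operation names and prompt wording below are paraphrases for illustration, not the exact WizardLM templates:

```python
import random

# Illustrative Evol-Instruct rewrite operations (paraphrased, not the
# exact WizardLM prompt templates).
EVOL_OPS = {
    "constraints": "Add one more constraint or requirement to this instruction:",
    "deepen": "Rewrite this instruction so it requires deeper reasoning:",
    "concretize": "Replace general concepts in this instruction with specific ones:",
    "steps": "Rewrite this instruction so solving it needs more reasoning steps:",
}

def build_evol_prompt(instruction: str, op=None) -> str:
    """Pick an evolution operation and wrap the instruction in its prompt."""
    op = op or random.choice(list(EVOL_OPS))
    return f"{EVOL_OPS[op]}\n\n{instruction}"

print(build_evol_prompt("Write a function that sorts a list.", op="constraints"))
```

The output of each round is fed to the generator model, and the rewritten instruction replaces (or augments) the original in the next round.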

Self-Instruct Generation Prompt

SYSTEM: You are an instruction generator. Given examples of (instruction, output) pairs, generate 20 new diverse instructions on different topics. Each instruction should be:
- A task a language model could complete
- Different from existing instructions
- Between 1-3 sentences
- Varied in format (questions, imperatives, fill-in-the-blank)

EXISTING EXAMPLES:
1. "Classify the sentiment of this review: {review}"
2. "Write a Python function that {task}"
3. "Translate this sentence to French: {text}"

Generate 20 new instructions:

Instruction Dataset Comparison

Dataset       | Size | Generator        | Method               | License
Alpaca        | 52K  | text-davinci-003 | Self-Instruct        | Non-commercial
WizardLM-Evol | 250K | GPT-4            | Evol-Instruct        | Non-commercial
Orca          | 5M   | GPT-4 + GPT-3.5  | Explanation traces   | Non-commercial
Magpie        | 1M+  | Llama-3 (self)   | Pre-query generation | Apache 2.0
Capybara      | 16K  | GPT-4            | Multi-turn, diverse  | Apache 2.0
03 — Technique

Distillation Pipelines

Knowledge distillation for LLMs: Train a small student model to mimic a large teacher model's outputs. Hard distillation keeps only final answers (Alpaca-style). Soft distillation uses token probability distributions (better but requires white-box access). Speculative distillation generates diverse outputs, filters with a reward model, and trains on the best ones.
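
The soft-distillation objective is a temperature-scaled KL divergence between teacher and student token distributions. A minimal sketch in pure Python for one token position (real pipelines compute this over torch/JAX tensors; the temperature value is a common but arbitrary choice):

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def soft_distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) at one token position, temperature-scaled."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits give zero loss; divergence gives a positive loss.
print(soft_distill_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(soft_distill_loss([3.0, 0.0, 0.0], [0.0, 3.0, 0.0]) > 0)  # True
```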

Distillation Types

Distillation Pipeline with Filtering

import openai
from tqdm import tqdm

def generate_training_pair(instruction: str, teacher="gpt-4o") -> dict:
    response = openai.chat.completions.create(
        model=teacher,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Be thorough and accurate."},
            {"role": "user", "content": instruction},
        ],
        temperature=0.7,
    )
    return {"instruction": instruction, "output": response.choices[0].message.content}

def quality_filter(pair: dict, min_length=50, max_length=2000) -> bool:
    output = pair["output"]
    if len(output) < min_length or len(output) > max_length:
        return False
    if output.startswith("I cannot") or output.startswith("I'm sorry"):
        return False  # filter refusals
    return True

dataset = [generate_training_pair(instr) for instr in tqdm(instructions)]
filtered = [p for p in dataset if quality_filter(p)]
print(f"Kept {len(filtered)}/{len(dataset)} pairs ({100*len(filtered)/len(dataset):.0f}%)")
04 — Filtering

Rejection Sampling and Filtering

Rejection Sampling Fine-Tuning (RFT): Generate N outputs per prompt, keep only those that pass a correctness check, train on kept outputs. For math and code, you can verify correctness cheaply (run unit tests, check answer keys). Use this to filter synthetic data.
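
The RFT loop for a verifiable task reduces to: sample N completions, keep the ones whose extracted answer matches the key. A sketch with a stubbed model call (`sample_fn` and the answer-extraction regex are illustrative stand-ins, not a fixed interface):

```python
import re
import random

def extract_answer(text: str):
    """Pull a final integer answer out of a completion (illustrative regex)."""
    m = re.search(r"answer is\s*(-?\d+)", text.lower())
    return m.group(1) if m else None

def rejection_sample(prompt: str, answer_key: str, sample_fn, n: int = 8) -> list:
    """Return only completions whose final answer matches the key."""
    kept = []
    for _ in range(n):
        completion = sample_fn(prompt)
        if extract_answer(completion) == answer_key:
            kept.append(completion)
    return kept

# Stub generator that is right about half the time:
def noisy_model(prompt):
    return f"... so the answer is {random.choice(['4', '5'])}"

kept = rejection_sample("What is 2 + 2?", "4", noisy_model, n=16)
print(f"kept {len(kept)}/16 samples")
```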

Rejection sampling is the gold standard for code and math tasks. DeepSeek-R1 and similar reasoning models bootstrapped their training data this way.

Filtering Methods by Task Type

Task                  | Verification method      | Rejection rate | Quality gain
Math                  | Check against answer key | 60–80%         | High
Code                  | Run unit tests           | 40–70%         | High
Instruction following | LLM judge                | 20–40%         | Medium
Creative writing      | Reward model             | 20–50%         | Medium
Factual QA            | Retrieval + NLI check    | 30–60%         | Medium
High rejection rates are expected and necessary for quality.

Deduplication

Always deduplicate synthetic data, removing both exact matches and near-duplicates:

Duplicates hurt more than they help — they waste training capacity and create false confidence in patterns.
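
A minimal near-duplicate filter based on character n-gram Jaccard similarity. Production pipelines typically use MinHash/LSH (e.g. the datasketch library) to avoid the quadratic comparison done here; the threshold and n-gram size are illustrative:

```python
def ngrams(text: str, n: int = 5) -> set:
    """Character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def dedup(examples: list, threshold: float = 0.8) -> list:
    """Drop examples whose n-gram Jaccard similarity to a kept example exceeds threshold."""
    kept, kept_grams = [], []
    for ex in examples:
        g = ngrams(ex)
        if all(len(g & h) / len(g | h) < threshold for h in kept_grams):
            kept.append(ex)
            kept_grams.append(g)
    return kept

data = [
    "Explain how photosynthesis works.",
    "Explain how photosynthesis works!",  # near-duplicate, dropped
    "Summarize the French Revolution.",
]
print(dedup(data))  # drops the second item
```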

05 — Automation

Magpie and Self-Play Methods

Magpie (Xu et al. 2024): Instead of generating responses to human instructions, prompt the model at the pre-query position and let it generate both the instruction AND the response. No seed data is needed. This works because instruction-tuned models, given only the chat template up to the start of the user turn, continue with a user-like query.

Self-play: Two model instances play adversarial roles (teacher and student, examiner and examinee) to generate challenging data without human involvement.

Constitutional AI data generation: Model critiques its own response using a set of principles, then rewrites to better follow them. Generates (critique, revision) pairs for training.
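
The critique-revision loop can be sketched as two chained model calls. Here `llm` is a stand-in callable for any completion API, and the principle text is illustrative:

```python
PRINCIPLE = "The response should be helpful, honest, and harmless."

def generate_cai_pair(prompt: str, response: str, llm) -> dict:
    """Produce a (critique, revision) training pair for one response."""
    critique = llm(
        f"Critique this response against the principle: {PRINCIPLE}\n\n"
        f"Prompt: {prompt}\nResponse: {response}"
    )
    revision = llm(
        f"Rewrite the response to address the critique.\n\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return {"prompt": prompt, "critique": critique, "revision": revision}

# Usage with a stubbed model:
pair = generate_cai_pair("How do I pick a lock?", "Here's how...", lambda p: "(model output)")
print(sorted(pair))  # ['critique', 'prompt', 'revision']
```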

Magpie-Style Pre-Query Generation

# Give model the assistant prefix to trigger instruction generation
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # Empty human turn — model generates what a user would ask
]

# With pre-fill trick (Anthropic):
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    messages=[{"role": "user", "content": ""}],  # empty triggers pre-query generation
    system="You are a helpful assistant.",
    max_tokens=200,
)

Why Magpie Works

06 — Application

Domain-Specific Synthetic Data

General instruction data is widely available (Alpaca, WizardLM). The real value is domain-specific data you can generate for your use case. Collect domain documents, generate Q&A pairs and reasoning chains, filter and deduplicate, then fine-tune.

Domain Adaptation Pipeline

  1. Collect: Domain documents (papers, contracts, filings, code, support tickets)
  2. Generate: Q&A pairs and reasoning chains from documents
  3. Filter: Remove hallucinations and off-topic content
  4. Deduplicate: Remove near-duplicates
  5. Fine-tune: Train model on domain data
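
Step 2 of the pipeline above starts by splitting documents into chunks and building a generation prompt per chunk. A sketch where the chunk size and prompt wording are illustrative choices:

```python
def chunk_document(text: str, max_chars: int = 1500) -> list:
    """Split on paragraph boundaries, packing paragraphs up to max_chars per chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def qa_prompt(chunk: str) -> str:
    """Prompt asking for grounded QA pairs; 'answerable from the passage' discourages hallucination."""
    return (
        "Write 3 question-answer pairs that are fully answerable "
        f"from this passage alone:\n\n{chunk}"
    )

doc = "First paragraph.\n\nSecond paragraph."
print(len(chunk_document(doc, max_chars=10)))  # 2
```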

Domain Strategies by Sector

Domain           | Source material  | Generation method        | Verification
Medical          | PubMed abstracts | Extract + rephrase QA    | Expert review sample
Code             | GitHub repos     | Problem → solution pairs | Unit tests
Customer support | Support tickets  | Rephrase + resolve       | CSAT proxy
Legal            | Court rulings    | Clause extraction + QA   | Lawyer review sample
Finance          | Earnings reports | Analyst-style summaries  | Factual check
🔒 PII scrubbing: Never include PII (names, emails, account numbers) in synthetic training data, even if it appears in source documents. Scrub before generation, not after. This protects privacy and prevents model memorization of sensitive data.
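
A minimal pre-generation scrubbing pass. These regexes only catch obvious emails, phone numbers, and long digit runs; a real pipeline should use a dedicated tool (e.g. Microsoft Presidio) since names cannot be caught by regex at all:

```python
import re

# Obvious-pattern PII scrubbing (illustrative; not a complete solution).
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b\d{12,19}\b"), "[ACCOUNT]"),  # card/account-length digit runs
]

def scrub_pii(text: str) -> str:
    """Replace matched PII spans with placeholder tokens before generation."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

print(scrub_pii("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].
```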
07 — Validation

Quality Evaluation and Model Collapse Prevention

Quality metrics for synthetic data: perplexity on held-out real data (it should not increase), task accuracy on real benchmarks, and human evaluation of a 1–5% sample. Model collapse is detected by measuring n-gram diversity and topic coverage; each successive generation round should maintain or increase diversity.
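
One standard diversity measure is distinct-n: the fraction of n-grams in the corpus that are unique. A falling distinct-n across generation rounds is an early warning of collapse. A minimal sketch:

```python
def distinct_n(texts: list, n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across a corpus."""
    grams = []
    for t in texts:
        tokens = t.lower().split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

diverse = ["the cat sat", "a dog ran fast", "birds fly south"]
repetitive = ["the cat sat", "the cat sat", "the cat sat"]
print(distinct_n(diverse))     # 1.0  (every bigram unique)
print(distinct_n(repetitive))  # ~0.33 (2 unique bigrams out of 6)
```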

Prevention Strategies

🔄 Data Mixing

  • Mix 20–80% synthetic with real data
  • Never train purely synthetic for multiple rounds
  • Real data anchors quality

🎨 Diversity Forcing

  • Require diversity in topics, formats, lengths, difficulty
  • Low diversity accelerates collapse
  • Measure n-gram coverage in generation

✔️ Round-Trip Consistency

  • Generate → verify with different model → keep consistent
  • Cross-model verification catches hallucinations
  • Reduces error propagation

📈 Iterative Refinement

  • Generate → fine-tune → use for harder examples
  • Each round focuses on model's failure modes
  • Proportional gains diminish after 3–4 rounds
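
The data-mixing strategy above can be sketched as capping the synthetic fraction of each training round so real data keeps anchoring the distribution. The fraction and helper name are illustrative:

```python
import random

def mix_datasets(real: list, synthetic: list, synth_frac: float = 0.5, seed: int = 0) -> list:
    """Sample synthetic examples so they make up at most synth_frac of the mix."""
    rng = random.Random(seed)
    max_synth = int(len(real) * synth_frac / (1 - synth_frac))
    chosen = rng.sample(synthetic, min(max_synth, len(synthetic)))
    mixed = real + chosen
    rng.shuffle(mixed)
    return mixed

mixed = mix_datasets(real=list(range(100)), synthetic=list(range(1000, 2000)), synth_frac=0.5)
print(len(mixed))  # 200: 100 real + 100 synthetic
```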

Quality Metrics

Tools and Frameworks

  • distilabel (data generation): by Argilla; synthetic data pipelines for instruction tuning.
  • Magpie (data generation): pre-query generation without seed data.
  • LLM Foundry (framework): Composer/MosaicML; end-to-end training stack.
  • Axolotl (training): fine-tuning framework with synthetic data support.
  • TRL (training): Hugging Face's Transformers Reinforcement Learning library.
  • HuggingFace Datasets (data): streaming and batching for large datasets.
  • OpenAI Batch API (API): cheap, asynchronous generation at scale.
  • LangChain (framework): synthetic data generation pipelines.
08 — Further Reading

References
