01 — Motivation
Why Synthetic Data?
Human annotation is slow (weeks), expensive ($1–10/sample), and hard to scale to rare capabilities. Synthetic data takes a different approach: generate examples with a strong model (GPT-4, Claude), filter for quality, then train a weaker model on the filtered outputs.
This is the distillation insight: a small model trained on GPT-4 outputs can match GPT-4 on narrow tasks. Alpaca, Vicuna, and WizardLM all use this approach, achieving competitive quality with minimal human labeling.
⚠️ Model collapse risk: Training on synthetic data from a model that was itself trained on synthetic data degrades quality over successive rounds. Hallucinations propagate, and distribution shift increases. Always validate against real-world held-out data before declaring success.
The Core Risks
- Model collapse: Successive generations on synthetic data reduce diversity and increase errors
- Hallucination propagation: Errors in the generator are amplified in downstream models
- Distribution shift: Synthetic data distribution diverges from real human data over time
- Bias amplification: Whatever biases exist in the generator model get amplified
02 — Method
Self-Instruct and Instruction Tuning Data
Self-Instruct (Wang et al. 2022): Use the model itself to generate (instruction, input, output) triples starting from a small seed set of 175 human-written examples. The pipeline is simple but effective:
The Self-Instruct Pipeline
- Seed tasks: Start with ~175 manually-written instruction examples
- Generate: Use few-shot prompting to generate new instruction-output pairs
- Filter: Remove duplicates and low-quality examples
- Generate outputs: Produce a response for each kept instruction
- Fine-tune: Train the model on the full dataset
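The filter step above can be sketched as a similarity gate: Self-Instruct keeps a generated instruction only if its ROUGE-L overlap with every existing instruction is below a threshold. A minimal stand-in using stdlib `difflib` (the 0.7 threshold and the example strings are illustrative, not from the paper):

```python
from difflib import SequenceMatcher

def is_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    """Keep a generated instruction only if it is not too similar to anything
    already in the pool. Self-Instruct uses ROUGE-L; SequenceMatcher.ratio()
    is a cheap stdlib stand-in for this sketch."""
    return all(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() < threshold
        for existing in pool
    )

pool = ["Classify the sentiment of this review.",
        "Translate this sentence to French."]
for cand in ["Classify the sentiment of this review!",  # near-duplicate
             "Write a haiku about autumn."]:            # novel
    if is_novel(cand, pool):
        pool.append(cand)

print(pool)  # the near-duplicate is dropped, the haiku task is kept
```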
Alpaca (Stanford): Applied Self-Instruct with text-davinci-003 to generate 52K instruction-following examples, then fine-tuned LLaMA-7B on them, achieving surprisingly good instruction-following capability.
Evol-Instruct (WizardLM): Iteratively rewrite instructions to be more complex and diverse using evolution prompts: add constraints, deepen the question, concretize details, increase reasoning steps. Each evolution round produces progressively harder examples, training the model on increasingly complex tasks.
Self-Instruct Generation Prompt
SYSTEM: You are an instruction generator. Given examples of (instruction, output) pairs,
generate 20 new diverse instructions on different topics. Each instruction should be:
- A task a language model could complete
- Different from existing instructions
- Between 1-3 sentences
- Varied in format (questions, imperatives, fill-in-the-blank)
EXISTING EXAMPLES:
1. "Classify the sentiment of this review: {review}"
2. "Write a Python function that {task}"
3. "Translate this sentence to French: {text}"
Generate 20 new instructions:
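The generator's numbered output then has to be parsed back into individual instructions before filtering. A small sketch, assuming the numbered-list format the prompt above requests:

```python
import re

def parse_numbered_instructions(raw: str) -> list[str]:
    """Split a numbered list like '1. Do X\n2. Do Y' into items,
    stripping the numbering and any surrounding quotes."""
    items = re.findall(r"^\s*\d+\.\s*(.+)$", raw, flags=re.MULTILINE)
    return [item.strip().strip('"') for item in items]

raw = '1. "Summarize this article in one sentence."\n2. Name three uses for a paperclip.'
print(parse_numbered_instructions(raw))
# → ['Summarize this article in one sentence.', 'Name three uses for a paperclip.']
```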
Instruction Dataset Comparison
| Dataset | Size | Generator | Method | License |
| --- | --- | --- | --- | --- |
| Alpaca | 52K | text-davinci-003 | Self-Instruct | Non-commercial |
| WizardLM-Evol | 250K | GPT-4 | Evol-Instruct | Non-commercial |
| Orca | 5M | GPT-4 + GPT-3.5 | Explanation traces | Non-commercial |
| Magpie | 1M+ | Llama-3 (self) | Pre-query generation | Apache 2.0 |
| Capybara | 16K | GPT-4 | Multi-turn, diverse | Apache 2.0 |
03 — Technique
Distillation Pipelines
Knowledge distillation for LLMs: Train a small student model to mimic a large teacher model's outputs. Hard distillation keeps only final answers (Alpaca-style). Soft distillation uses token probability distributions (better but requires white-box access). Speculative distillation generates diverse outputs, filters with a reward model, and trains on the best ones.
Distillation Types
- Hard distillation: Student trains on teacher's final answers. Simple, but loses intermediate reasoning steps.
- Soft distillation: Student trains on teacher's token probability distributions. Better signal but requires white-box access.
- Speculative distillation: Teacher generates diverse outputs, filter with reward model, train on best ones. Higher quality data.
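The soft variant's training signal is the per-position KL divergence between the teacher's and student's token distributions. A dependency-free toy sketch (real pipelines compute this from logits in a training framework; the 3-token vocabulary is illustrative):

```python
import math

def kl_divergence(teacher_probs: list[float], student_probs: list[float]) -> float:
    """KL(teacher || student) over one token position's vocabulary distribution.
    Soft distillation minimizes this; hard distillation is the special case
    where teacher_probs is one-hot on the sampled token."""
    return sum(t * math.log(t / s)
               for t, s in zip(teacher_probs, student_probs) if t > 0)

teacher = [0.7, 0.2, 0.1]          # teacher's distribution over a toy 3-token vocab
student_matched = [0.7, 0.2, 0.1]  # perfectly matched student: zero loss
student_off = [0.1, 0.2, 0.7]      # mismatched student: positive loss

print(round(kl_divergence(teacher, student_matched), 4))  # → 0.0
print(kl_divergence(teacher, student_off) > 0)            # → True
```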
Distillation Pipeline with Filtering
import openai
from tqdm import tqdm

def generate_training_pair(instruction: str, teacher: str = "gpt-4o") -> dict:
    """Query the teacher model for one (instruction, output) training pair."""
    response = openai.chat.completions.create(
        model=teacher,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Be thorough and accurate."},
            {"role": "user", "content": instruction},
        ],
        temperature=0.7,
    )
    return {"instruction": instruction, "output": response.choices[0].message.content}

def quality_filter(pair: dict, min_length: int = 50, max_length: int = 2000) -> bool:
    """Reject outputs that are too short, too long, or refusals."""
    output = pair["output"]
    if len(output) < min_length or len(output) > max_length:
        return False
    if output.startswith("I cannot") or output.startswith("I'm sorry"):
        return False  # filter refusals
    return True

# `instructions` is the list of prompts produced in the generation step
dataset = [generate_training_pair(instr) for instr in tqdm(instructions)]
filtered = [p for p in dataset if quality_filter(p)]
print(f"Kept {len(filtered)}/{len(dataset)} pairs ({100*len(filtered)/len(dataset):.0f}%)")
04 — Filtering
Rejection Sampling and Filtering
Rejection Sampling Fine-Tuning (RFT): Generate N outputs per prompt, keep only those that pass a correctness check, train on kept outputs. For math and code, you can verify correctness cheaply (run unit tests, check answer keys). Use this to filter synthetic data.
Rejection sampling is the gold standard for code and math tasks. DeepSeek-R1 and similar reasoning models bootstrapped their training data this way.
Filtering Methods by Task Type
| Task | Verification method | Rejection rate | Quality gain |
| --- | --- | --- | --- |
| Math | Check against answer key | 60–80% | High |
| Code | Run unit tests | 40–70% | High |
| Instruction following | LLM judge | 20–40% | Medium |
| Creative writing | Reward model | 20–50% | Medium |
| Factual QA | Retrieval + NLI check | 30–60% | Medium |
✓ For code and math, rejection sampling with execution-based verification is the highest-quality synthetic data pipeline available. High rejection rates are expected and necessary for quality.
Deduplication
Always deduplicate synthetic data:
- Exact-match dedup: Hash outputs, remove identical ones
- Near-dedup: MinHash LSH for fuzzy matching
- Embedding-based clustering: Cluster similar examples, keep one per cluster
Duplicates hurt more than they help — they waste training capacity and create false confidence in patterns.
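The first two dedup steps might look like this toy sketch, with word-shingle Jaccard similarity standing in for MinHash LSH (production pipelines use LSH, e.g. via the `datasketch` library, to avoid the quadratic pairwise comparison; the 0.8 threshold is illustrative):

```python
import hashlib
import re

def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Word k-grams of a text, lowercased and stripped of punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def dedup(examples: list[str], near_threshold: float = 0.8) -> list[str]:
    kept, seen_hashes, seen_shingles = [], set(), []
    for ex in examples:
        h = hashlib.sha256(ex.encode()).hexdigest()
        if h in seen_hashes:                       # exact duplicate
            continue
        sh = shingles(ex)
        if any(jaccard(sh, prev) >= near_threshold for prev in seen_shingles):
            continue                               # near duplicate
        seen_hashes.add(h)
        seen_shingles.append(sh)
        kept.append(ex)
    return kept

data = ["The cat sat on the mat.",
        "The cat sat on the mat.",        # exact dup: removed
        "The cat sat on the mat today.",  # near dup: removed
        "Photosynthesis converts light into energy."]
print(len(dedup(data)))  # → 2
```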
05 — Automation
Magpie and Self-Play Methods
Magpie (Xu et al. 2024): Instead of generating responses to human instructions, give the model only its chat template up to the start of the user turn and let it generate both the instruction AND the response. No seed data is needed. It works because instruction-tuned models, conditioned on the system prompt and an empty user turn, complete the template with user-like queries.
Self-play: Two model instances play adversarial roles (teacher and student, examiner and examinee) to generate challenging data without human involvement.
Constitutional AI data generation: Model critiques its own response using a set of principles, then rewrites to better follow them. Generates (critique, revision) pairs for training.
Magpie-Style Pre-Query Generation
# Magpie feeds an open-weight instruction-tuned model its own chat template,
# truncated right after the user-turn header, so the model completes the user
# turn, i.e., generates an instruction. Sketch with the Llama-3 template:
pre_query = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are a helpful assistant.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
)
# 1. Sample a completion of pre_query (stop at "<|eot_id|>") -> an instruction.
# 2. Append the instruction plus the assistant-turn header and sample again
#    -> the response. Repeat at scale to build the dataset.
Why Magpie Works
- No seed data: Start generation from an empty state
- Automatic diversity: Model naturally generates varied instructions
- Scale: Generate millions of examples cheaply
- Self-consistency: Each instruction-response pair comes from the same model, high internal consistency
06 — Application
Domain-Specific Synthetic Data
General instruction data is widely available (Alpaca, WizardLM). The real value is domain-specific data you can generate for your use case. Collect domain documents, generate Q&A pairs and reasoning chains, filter and deduplicate, then fine-tune.
Domain Adaptation Pipeline
- Collect: Domain documents (papers, contracts, filings, code, support tickets)
- Generate: Q&A pairs and reasoning chains from documents
- Filter: Remove hallucinations and off-topic content
- Deduplicate: Remove near-duplicates
- Fine-tune: Train model on domain data
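Step 2 of the pipeline above can be sketched as a grounded prompt builder. The instruction wording is an illustrative assumption; constraining answers to the excerpt is what makes the later hallucination filter tractable:

```python
def qa_generation_prompt(document_chunk: str, n_pairs: int = 3) -> str:
    """Build a grounded Q&A-generation prompt for one document chunk.
    Requiring answers to be supported by the chunk reduces hallucination
    and makes filtering (checking answers against the source) possible."""
    return (
        f"Read the following document excerpt and write {n_pairs} question-answer "
        "pairs. Every answer must be fully supported by the excerpt; do not use "
        "outside knowledge. Format each pair as 'Q: ...' and 'A: ...'.\n\n"
        f"EXCERPT:\n{document_chunk}"
    )

prompt = qa_generation_prompt("The 10-K filing reports revenue of $4.2B for FY2023.")
print(prompt)
```

The returned string would be sent to the generator model for each chunk of the domain corpus.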
Domain Strategies by Sector
| Domain | Source material | Generation method | Verification |
| --- | --- | --- | --- |
| Medical | PubMed abstracts | Extract + rephrase QA | Expert review sample |
| Code | GitHub repos | Problem → solution pairs | Unit tests |
| Customer support | Support tickets | Rephrase + resolve | CSAT proxy |
| Legal | Court rulings | Clause extraction + QA | Lawyer review sample |
| Finance | Earnings reports | Analyst-style summaries | Factual check |
🔒 PII scrubbing: Never include PII (names, emails, account numbers) in synthetic training data, even if it appears in source documents. Scrub before generation, not after. This protects privacy and prevents model memorization of sensitive data.
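A minimal sketch of regex-based scrubbing. The patterns are illustrative only; a production pipeline would add NER-based detection (e.g. with a tool like Microsoft Presidio) since regexes miss names and free-form identifiers:

```python
import re

# Illustrative patterns; real pipelines use NER + validation, not regex alone
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s-]?)?\(?\d{3}\)?[\s-]?\d{3}[\s-]?\d{4}\b"),
    "ACCOUNT": re.compile(r"\b\d{8,16}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace common PII patterns with typed placeholders before generation."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or 555-123-4567 re: account 12345678901234."))
# → Contact [EMAIL] or [PHONE] re: account [ACCOUNT].
```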
07 — Validation
Quality Evaluation and Model Collapse Prevention
Quality metrics for synthetic data: perplexity on held-out real data (it should not increase after training), task accuracy on real benchmarks, and human evaluation of a 1–5% sample. Detect model collapse by measuring n-gram diversity and topic coverage: each successive round should maintain or increase diversity.
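The n-gram diversity signal can be computed as distinct-n (unique n-grams over total n-grams); a minimal sketch with toy corpora:

```python
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams / total n-grams across a corpus.
    Falling distinct-n across generation rounds is an early collapse signal."""
    ngrams = []
    for text in texts:
        words = text.lower().split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

diverse = ["the cat sat", "a dog ran fast", "birds fly south"]
collapsed = ["the cat sat", "the cat sat", "the cat sat"]
print(distinct_n(diverse, 2) > distinct_n(collapsed, 2))  # → True
```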
Prevention Strategies
🔄 Data Mixing
- Mix 20–80% synthetic with real data
- Never train purely synthetic for multiple rounds
- Real data anchors quality
🎨 Diversity Forcing
- Require diversity in topics, formats, lengths, difficulty
- Low diversity accelerates collapse
- Measure n-gram coverage in generation
✔️ Round-Trip Consistency
- Generate → verify with different model → keep consistent
- Cross-model verification catches hallucinations
- Reduces error propagation
📈 Iterative Refinement
- Generate → fine-tune → use for harder examples
- Each round focuses on model's failure modes
- Proportional gains diminish after 3–4 rounds
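The data-mixing strategy from the first card can be sketched as follows; the cap formula and the 50% fraction are illustrative, and `synthetic_frac` is assumed to be below 1:

```python
import random

def mix_datasets(real: list, synthetic: list, synthetic_frac: float = 0.5,
                 seed: int = 0) -> list:
    """Build a training mix with a capped synthetic fraction (must be < 1).
    Keeping all real data and sampling synthetic only up to the cap
    anchors the mix to the real distribution."""
    rng = random.Random(seed)
    n_synth = min(len(synthetic),
                  int(len(real) * synthetic_frac / (1 - synthetic_frac)))
    mix = real + rng.sample(synthetic, n_synth)
    rng.shuffle(mix)
    return mix

real = [f"real_{i}" for i in range(100)]
synthetic = [f"synth_{i}" for i in range(1000)]
mix = mix_datasets(real, synthetic, synthetic_frac=0.5)
print(len(mix))  # → 200 (100 real + 100 synthetic)
```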
Quality Metrics
- Perplexity: Measure on held-out real data. Should not increase after training on synthetic.
- Benchmark accuracy: Evaluate on real, public benchmarks. Synthetic should not degrade performance.
- Human spot-check: Sample 1–5% of synthetic data and have humans evaluate quality.
- Diversity metrics: Track n-gram coverage and topic diversity across generation rounds.
08 — Further Reading
References
Academic Papers
- Wang, Y. et al. (2022). Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
- Xu, Z. et al. (2024). Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. arXiv:2406.08464.
- Xu, C. et al. (2023). WizardLM: Empowering Large Language Models to Follow Complex Instructions (introduces Evol-Instruct). arXiv:2304.12244.
- Yuan, Z. et al. (2023). Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (introduces rejection sampling fine-tuning). arXiv:2308.01825.