Using LLMs to generate synthetic training data — instruction-response pairs, preference data, domain-specific examples — to augment or replace scarce human-annotated datasets.
Human annotation is expensive, slow, and hard to scale. GPT-4o can generate 10,000 instruction-response pairs in hours at a fraction of human annotation cost. Synthetic data works especially well for: narrow domain fine-tuning, format/style standardisation, capability expansion (teaching the model new tasks), and RLAIF preference pair generation.
Start with a small set of seed examples (20–50 high-quality human-written samples). Use an LLM to generate variations: paraphrased questions, related questions, analogous scenarios. The seed examples anchor quality; the LLM generates diversity.
from openai import OpenAI
import json

client = OpenAI()

def generate_variations(seed_examples: list[dict], n_per_seed: int = 10) -> list[dict]:
    generated = []
    for seed in seed_examples:
        prompt = (
            f"Here is an example Q&A pair:\n"
            f"Q: {seed['question']}\nA: {seed['answer']}\n\n"
            f"Generate {n_per_seed} different questions on the same topic with answers. "
            f'Return a JSON object: {{"items": [{{"question": str, "answer": str}}]}}'
        )
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},  # guarantees parseable JSON
        )
        try:
            variations = json.loads(resp.choices[0].message.content)
            # json_object mode returns an object, not a bare array: unwrap "items"
            generated.extend(variations["items"] if isinstance(variations, dict) else variations)
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip malformed responses rather than fail the whole run
    return generated
Self-Instruct (Wang et al., 2022) bootstraps an instruction dataset from a small seed set using only the model itself. The model generates a new instruction, an input (if applicable), and the corresponding output; a filtering step then removes near-duplicates and invalid examples. This method produced Alpaca's 52K-example instruction dataset from just 175 seed tasks.
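The filtering step can be sketched with a simple word-overlap check — a lightweight stand-in for the ROUGE-L similarity threshold Self-Instruct uses to reject near-duplicate instructions (`overlap_similarity` and the helper names are illustrative; only the 0.7 threshold follows the paper):

```python
def overlap_similarity(a: str, b: str) -> float:
    """Jaccard word overlap — a cheap stand-in for ROUGE-L similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def filter_new_instructions(pool: list[str], candidates: list[str],
                            threshold: float = 0.7) -> list[str]:
    """Keep only candidates sufficiently different from the existing pool
    (and from each other), growing the pool greedily."""
    kept = []
    for cand in candidates:
        if all(overlap_similarity(cand, existing) < threshold
               for existing in pool + kept):
            kept.append(cand)
    return kept

pool = ["Summarize the following article in one sentence."]
candidates = [
    "Summarize the following article in one sentence.",  # duplicate: rejected
    "Translate the sentence below into French.",         # novel: kept
]
print(filter_new_instructions(pool, candidates))
# → ['Translate the sentence below into French.']
```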
For narrow domains (medical, legal, code), generate domain-specific examples with appropriate framing. Provide domain context in the system prompt.
def generate_domain_data(domain: str, task: str, n: int = 100) -> list[dict]:
    system_prompt = f"You are an expert in {domain}. Generate realistic {task} examples."
    examples = []
    for _ in range(n // 5):  # batches of 5
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": (
                    f"Generate 5 {task} examples as a JSON object: "
                    f'{{"examples": [{{"instruction": str, "response": str}}]}}'
                )},
            ],
            response_format={"type": "json_object"},
        )
        batch = json.loads(resp.choices[0].message.content)
        examples.extend(batch.get("examples", []))
    return examples
Temperature variation: generate at temperature 0.7–1.0 for variety, not 0.0. Persona variation: generate the same task from different expert perspectives. Difficulty scaling: explicitly request easy/medium/hard examples in equal proportion. Format variation: mix Q&A, instruction-following, and multi-turn dialogue formats. Monitor diversity with embedding clustering — if all examples cluster tightly, increase temperature.
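The embedding-clustering check can be implemented as mean pairwise cosine similarity over a generated batch — a minimal sketch, assuming the vectors come from some embedding model (the toy vectors below stand in for real text embeddings):

```python
import itertools
import math

def mean_pairwise_cosine(vectors: list[list[float]]) -> float:
    """Average cosine similarity across all pairs; values near 1.0 mean the
    batch has collapsed into one tight cluster — raise the temperature."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)
    sims = [cos(u, v) for u, v in itertools.combinations(vectors, 2)]
    return sum(sims) / len(sims)

tight_cluster = [[1.0, 0.01], [1.0, 0.02], [1.0, 0.03]]
spread_out = [[1.0, 0.0], [0.0, 1.0], [0.7, -0.7]]
assert mean_pairwise_cosine(tight_cluster) > 0.99   # low diversity: warn
assert mean_pairwise_cosine(spread_out) < 0.5       # healthy spread
```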
10,000 high-quality synthetic examples outperform 100,000 low-quality ones. Apply quality filters: minimum response length, no truncation, passes a relevance check ('is this response actually answering the question?'), and no repetition of the instruction in the response. Deduplicate with MinHash after generation — LLMs often produce near-identical examples.
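The filters above can be expressed as plain predicates; the deduplication here uses exact shingle Jaccard as a small-scale stand-in for MinHash, which approximates the same similarity cheaply at scale (thresholds are illustrative assumptions):

```python
def passes_quality_filters(instruction: str, response: str,
                           min_len: int = 20) -> bool:
    """Rule-based quality checks from the checklist above (assumed thresholds)."""
    if len(response) < min_len:
        return False                                   # too short to be useful
    if not response.rstrip().endswith((".", "!", "?", "`", ")")):
        return False                                   # likely truncated mid-sentence
    if instruction.strip().lower() in response.lower():
        return False                                   # parrots the instruction back
    return True

def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def dedupe(responses: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy near-duplicate removal via shingle Jaccard similarity."""
    kept, kept_shingles = [], []
    for r in responses:
        s = shingles(r)
        if any(len(s & t) / len(s | t) >= threshold
               for t in kept_shingles if s | t):
            continue  # too similar to something already kept
        kept.append(r)
        kept_shingles.append(s)
    return kept
```

At real scale, swapping `dedupe` for `datasketch`'s MinHashLSH keeps the pass near-linear instead of quadratic.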
Synthetic data quality requires rigorous evaluation. Key metrics include diversity (BLEU/ROUGE overlap across generated samples — lower inter-sample overlap means higher diversity), validity (parsing and format compliance), and utility (downstream task performance). Mixing roughly 30–40% synthetic data with human examples typically maintains model quality while reducing annotation cost.
# Synthetic data quality evaluation
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

def evaluate_synthetic_quality(synthetic_texts: list[str], human_texts: list[str]) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    # Diversity: BLEU of one half of the synthetic set against the other.
    # High BLEU here means heavy overlap between samples, i.e. low diversity.
    diversity_bleu = corpus_bleu(
        [[s.split()] for s in synthetic_texts[:50]],   # references: token lists
        [s.split() for s in synthetic_texts[50:100]],  # hypotheses must be tokenized too
    )
    # Quality: ROUGE overlap between paired human and synthetic examples
    rouge_scores = [
        scorer.score(human, syn)
        for human, syn in zip(human_texts[:100], synthetic_texts[:100])
    ]
    return {"diversity_bleu": diversity_bleu, "quality_rouge": rouge_scores}
| Generation Method | Diversity | Quality | Cost per 1K samples |
|---|---|---|---|
| Rule-based Templates | Low | High | $0.10 |
| LLM Instruction-Following | High | Medium | $2.50 |
| Seed-Based Perturbation | Medium | High | $0.50 |
| GAN-based | High | Variable | Compute |
Production systems generate synthetic data through parallel batch processing, using distributed frameworks such as Ray or Kubernetes jobs to scale generation to millions or billions of samples. A key cost optimization is to use cheaper models for bulk generation and stronger models for filtering and validation, which can yield a 10–50x cost reduction over a naive single-model pipeline.

Three controls matter most at scale. Quality filters remove low-quality outputs through automated scoring, cutting the downstream annotation burden by 30–50%. Diversity metrics (embedding-based or BLEU-based) detect mode collapse, where the pipeline produces repetitive samples despite varied prompts. Sampling parameters steer creativity: temperature around 0.3 yields focused, deterministic outputs suited to instruction following, while temperature around 1.0 provides the variation needed for dialogue and story generation. Two emerging patterns compound the savings: pre-training on synthetic data before fine-tuning on human data can reduce annotation requirements by 50–70%, and hybrid approaches that expand human-written seeds with a model achieve cost/quality ratios 10–100x better than pure manual annotation. For domain-specific synthesis (medical, legal, financial), incorporating domain constraints and terminology databases curbs hallucination. Evaluation must combine automated scoring (perplexity, classification accuracy) with human spot checks (e.g. a 5% sample review) to detect systematic failures.
Cost-benefit analysis of synthetic data generation reveals striking economics. Human annotation costs roughly $30–50 per 1,000 samples for general tasks and $100+ for specialized domains (medical, legal), while synthetic generation via open-source models costs $0.50–1.00 per 1,000 samples in compute and API fees. A hybrid pipeline — generate mostly synthetic data, have experts review a ~10% sample — lands around $5–10 per 1,000 samples, an 80–90% cost reduction that largely preserves quality. Quality degradation from purely synthetic data (no human review) typically appears at the 50K+ sample scale: models trained on synthetic-only data show a 2–5% accuracy drop versus hybrid approaches. When full model quality is the priority, the empirical sweet spot is a mix of 30–40% synthetic with 60–70% human data, which still cuts costs 40–50%. Techniques that improve synthetic quality include temperature control (lower is more conservative, higher more diverse), diversity sampling (select varied prompts rather than sampling at random), and constraint-based generation (enforce format and topic requirements). A common filtering strategy: train a classifier on human data to score synthetic samples and keep the top half by score. Self-improvement loops compound these gains across the development lifecycle: train an initial model on synthetic data, use it to surface hard examples for human annotation, and retrain.
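A tiny cost model makes the hybrid arithmetic concrete. This is a hypothetical helper — `blended_cost_per_1k` and its defaults are assumptions plugged in from the per-1K estimates above, not a standard formula:

```python
def blended_cost_per_1k(synthetic_frac: float,
                        human_cost: float = 40.0,      # $/1K, general-task annotation
                        synthetic_cost: float = 0.75,  # $/1K, open-source generation
                        review_frac: float = 0.10) -> float:
    """Cost per 1K samples for a synthetic/human mix, where experts
    review a fraction of the synthetic portion at human rates."""
    human_part = (1 - synthetic_frac) * human_cost
    synthetic_part = synthetic_frac * (synthetic_cost + review_frac * human_cost)
    return human_part + synthetic_part

# Mostly-synthetic pipeline with 10% expert review:
cost = blended_cost_per_1k(synthetic_frac=0.9)
assert 5 <= cost <= 10  # within the $5–10 hybrid range cited above
```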
Domain-specific synthetic data requires careful engineering to avoid garbage output. Medical generation must not hallucinate diagnoses or treatments: use templates grounded in real medical knowledge, validate against medical ontologies, and have clinicians review a 5–10% sample. Legal document generation needs careful handling of citations and precedents; few-shot prompting with real examples improves quality significantly. Financial data for backtesting must respect market microstructure: correlations, volatility clustering, bid-ask spreads. Synthetic code must be executable: filter with execution feedback, run syntax checks, and provide domain-specific libraries in context. E-commerce product descriptions must stay internally consistent — the same product should carry the same attributes across variations. These safeguards add 10–20% overhead but prevent quality disasters where generated data teaches the model incorrect behavior. Validation frameworks check at three levels: syntax (code compiles, JSON parses), semantics (mentioned entities exist, statements are logically consistent), and domain (medical facts correct, code follows expected patterns). Monitoring production systems trained on synthetic data shows that failures often emerge in edge cases the generation process never covered, so ensure diverse prompt coverage and expand it as new patterns are discovered.
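The three validation levels can be sketched for a synthetic code-generation sample. `validate_code_sample` and its field names are illustrative assumptions; the domain-level check here uses Python's built-in `compile` as the "does it actually compile" gate:

```python
def validate_code_sample(sample: dict) -> list[str]:
    """Multi-level validation: syntax (well-formed sample), semantics
    (internally consistent fields), domain (the code itself compiles).
    Returns a list of errors; an empty list means the sample passes."""
    errors = []
    # Level 1: syntax — required fields must be present and non-empty
    for field in ("instruction", "code"):
        if not sample.get(field):
            errors.append(f"missing field: {field}")
    # Level 2: semantic — internal consistency of the metadata
    if sample.get("language") not in (None, "python"):
        errors.append("unsupported language tag")
    # Level 3: domain — synthetic code must at least compile
    if sample.get("code"):
        try:
            compile(sample["code"], "<synthetic>", "exec")
        except SyntaxError as e:
            errors.append(f"code does not compile: {e.msg}")
    return errors

good = {"instruction": "Add two numbers", "code": "def add(a, b):\n    return a + b"}
bad = {"instruction": "Broken", "code": "def add(a, b: return"}
print(validate_code_sample(good))  # → []
print(validate_code_sample(bad))   # non-empty: the syntax error is caught
```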
Scaling synthetic data generation beyond single-machine processing requires distributed batch systems. A Ray pipeline might generate 10K samples in parallel across 100 workers (100 samples each), aggregate the results, and validate quality; a Kubernetes deployment maintains a GPU-enabled worker pool, queues generation requests through Kafka or Redis, and scales workers on queue depth. Prompt diversity matters at scale: avoid mode collapse by varying prompts systematically (styles, domains, perspectives), rotating through prompt templates, and including adversarial prompts for edge cases and boundary conditions. Run quality filters as a cascade — fast checks first (format, parsing), expensive checks later (embedding-based diversity) — so low-quality samples are discarded cheaply. The cost-quality Pareto frontier is wide: minimally filtered high-temperature generation costs around $0.10 per 1K samples with low utility, heavily filtered low-temperature generation around $5 per 1K with high utility, and a middle point near $0.50–1.00 per 1K often achieves ~95% of the utility at a tenth of the cost. Iterative refinement closes the loop: start with cheap, diverse generation, collect human labels on a subset, train a quality classifier, and use it to filter future batches. Feedback loops extend this further — the model trained on synthetic data makes predictions, low-confidence examples go to humans for labeling, and those labels improve the next generation round. As a rough guide to scale-up: 100K samples fit on a single GPU in about a day, 1M samples need 10–100 GPU-days, and 10M samples require a distributed cluster running for roughly a week.
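The fan-out/aggregate pattern can be sketched with the standard library's thread pool; a real deployment would swap the stub worker for `@ray.remote` tasks making actual LLM calls (the worker and its payload here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_batch(worker_id: int, n: int) -> list[dict]:
    """Stub standing in for one worker's LLM calls."""
    return [{"worker": worker_id, "sample_id": f"{worker_id}-{i}"} for i in range(n)]

def parallel_generate(n_workers: int, samples_per_worker: int) -> list[dict]:
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(generate_batch, w, samples_per_worker)
                   for w in range(n_workers)]
        # Aggregate per-worker batches; a fast format filter would run here,
        # before the expensive embedding-based diversity check downstream
        return [s for f in futures for s in f.result()]

samples = parallel_generate(n_workers=10, samples_per_worker=5)
print(len(samples))  # → 50
```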
| Method | Data needed | Output diversity | Quality risk |
|---|---|---|---|
| Seed expansion (Self-Instruct) | Seed examples only | Medium | Style drift from seeds |
| Persona-based generation | Persona library | High | Persona inconsistency |
| Document-grounded | Source documents | Low (constrained) | Hallucination in answers |
| Evol-Instruct (difficulty) | Existing instructions | Medium | Overcomplicated prompts |