Data Engineering

Self-Instruct / Evol

Data generation techniques (Self-Instruct, WizardLM Evol-Instruct) that use an LLM to iteratively create and complexify instruction datasets for fine-tuning.

Self-Instruct seed: 175 examples → 52K
Evol-Instruct: in-breadth + in-depth
Output quality: GPT-4 judged

SECTION 01

Self-Instruct Algorithm

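Self-Instruct (Wang et al. 2022) bootstraps instruction data from a pool of 175 human-written seed instructions by repeating four steps: (1) Instruction generation: sample 8 instructions from the pool as in-context examples and prompt the model to write new instructions. (2) Classification identification: decide whether each new instruction is a classification task, which determines how its instances are generated. (3) Instance generation: produce the input (where the task needs one) and the output for each instruction. (4) Filtering: discard a new instruction if its ROUGE-L similarity with any existing instruction is 0.7 or higher, then add the surviving examples to the pool. Iterate until the dataset reaches the desired size.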
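The loop above can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: it covers only instruction generation and novelty filtering (steps 1 and 4), `generate` is a hypothetical callable standing in for the LLM call, and `rouge_l` is a simplified whitespace-token ROUGE-L without stemming.

```python
import random
from typing import Callable


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(a: str, b: str) -> float:
    """Simplified ROUGE-L F-measure on lowercased whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def self_instruct(
    seeds: list[str],
    generate: Callable[[list[str]], list[str]],  # hypothetical LLM wrapper
    target_size: int,
    max_steps: int = 100,
    threshold: float = 0.7,
    seed: int = 0,
) -> list[str]:
    """Grow the instruction pool until it reaches target_size or max_steps."""
    rng = random.Random(seed)
    pool = list(seeds)
    for _ in range(max_steps):
        if len(pool) >= target_size:
            break
        # Step 1: sample up to 8 pool instructions as in-context examples.
        demos = rng.sample(pool, min(8, len(pool)))
        for candidate in generate(demos):
            # Step 4: keep a candidate only if it is novel against the whole pool.
            if all(rouge_l(candidate, existing) < threshold for existing in pool):
                pool.append(candidate)
    return pool
```

Passing the generator in as a callable keeps the loop testable offline; in a real run it would wrap a chat-completion request that formats `demos` into a few-shot prompt.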

SECTION 02

Evol-Instruct

WizardLM's Evol-Instruct (Xu et al. 2023) builds on the Self-Instruct idea by evolving existing instructions into more complex ones. There are two evolution types. In-depth evolution makes an instruction harder: add constraints, deepen complexity, add reasoning steps, or concretise general concepts. In-breadth evolution generates a completely new instruction on a related topic, which widens topic coverage. The result is a dataset with graduated difficulty that mixes simple and complex instructions.
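One way to picture the two evolution types is as a small prompt selector. The sketch below is illustrative only: the directive wording, the operation names, and the `breadth_prob` parameter are assumptions made for this example, not the prompts used by WizardLM.

```python
import random

# Hypothetical directive texts for the in-depth operations named above.
IN_DEPTH_OPS = {
    "add_constraints": "Add one more constraint or requirement to the instruction.",
    "deepen": "Increase the depth of what the instruction asks about.",
    "add_reasoning": "Rewrite the instruction so that it requires multi-step reasoning.",
    "concretise": "Replace general concepts in the instruction with more specific ones.",
}
IN_BREADTH_OP = (
    "Write a brand-new instruction on a related topic. "
    "It should be of similar length and difficulty."
)


def build_evol_prompt(
    instruction: str,
    rng: random.Random,
    breadth_prob: float = 0.2,  # assumed mix of breadth vs depth
) -> tuple[str, str]:
    """Pick one evolution operation and return (operation_name, prompt)."""
    if rng.random() < breadth_prob:
        name, directive = "in_breadth", IN_BREADTH_OP
    else:
        name = rng.choice(sorted(IN_DEPTH_OPS))
        directive = IN_DEPTH_OPS[name]
    prompt = f"{directive}\n\nOriginal instruction: {instruction}\nNew instruction:"
    return name, prompt
```

Recording the operation name alongside each evolved instruction makes it possible to check later which operations produce usable data.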

SECTION 03

WizardCoder Pipeline

WizardCoder applies Evol-Instruct to code: simple code instructions are evolved into complex ones. Example evolution: 'Write a function to sort a list' → 'Write a memory-efficient in-place merge sort that handles duplicates and returns the count of swaps performed, with O(1) extra space.'

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EVOL_PROMPT = """I want you to act as an instruction evolver. Given an existing instruction,
make it more complex using one of these methods:
1. Add more constraints or requirements
2. Increase the depth of the task
3. Add a reasoning or explanation requirement
4. Specialise to a specific domain or edge case

Original: {instruction}
Evolved: """

def evolve_instruction(instruction: str, model: str = "gpt-4o") -> str:
    """Ask the model to rewrite one instruction as a harder variant."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": EVOL_PROMPT.format(instruction=instruction)}],
        temperature=0.9,  # high temperature encourages varied rewrites
    )
    return resp.choices[0].message.content.strip()

SECTION 04

Orca Technique

Microsoft's Orca (2023) improves on Self-Instruct-style data by pairing each query with a rich system prompt and by keeping the teacher's chain-of-thought reasoning in the generated responses. The teacher (GPT-4) is prompted to 'think step by step' and 'explain your reasoning', producing detailed explanations that the student model learns from. The aim is to transfer reasoning capability rather than just answer patterns.
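A rough sketch of how such explanation-style data might be assembled follows. The system prompt texts and both helper names are invented for illustration; only the overall shape, a system message plus a user query sent to the teacher and a record that stores the system prompt with the response, reflects the technique described above.

```python
# Illustrative system prompts; the wording here is an assumption.
SYSTEM_PROMPTS = [
    "You are a helpful assistant. Think step by step and explain your reasoning.",
    "You are a teacher. Explain your answer so that a beginner can follow each step.",
]


def build_teacher_messages(instruction: str, system_prompt: str) -> list[dict]:
    """Chat messages to send to the teacher model for one instruction."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": instruction},
    ]


def to_training_record(instruction: str, system_prompt: str, teacher_response: str) -> dict:
    """Keep the system prompt with the example so the student sees it too."""
    return {
        "system": system_prompt,
        "instruction": instruction,
        "output": teacher_response.strip(),
    }
```

Storing the system prompt in the record matters: the student is fine-tuned on the same conditioning that elicited the explanation, not on the bare question.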

SECTION 05

Implementation

A minimal Evol-Instruct pipeline:

import random

def evol_instruct_pipeline(
    seed_instructions: list[str],
    n_rounds: int = 3,
    n_per_round: int = 1000,
) -> list[dict]:
    """Evolve instructions for n_rounds and answer each evolved instruction.

    Relies on `client` and `evolve_instruction` from the WizardCoder section.
    """
    pool = seed_instructions.copy()
    dataset = []
    for round_n in range(n_rounds):
        new_instructions = []
        # Sample without replacement, capped at the current pool size.
        sample = random.sample(pool, min(n_per_round, len(pool)))
        for inst in sample:
            evolved = evolve_instruction(inst)
            # Generate a response for the evolved instruction
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": evolved}],
            ).choices[0].message.content
            new_instructions.append(evolved)
            dataset.append({"instruction": evolved, "output": resp, "round": round_n})
        # Evolved instructions join the pool and can be evolved again next round.
        pool.extend(new_instructions)
        print(f"Round {round_n+1}: {len(dataset)} total examples")
    return dataset

SECTION 06

Quality Evaluation

Evaluate generated datasets with: LLM-as-judge scoring (GPT-4 rates quality 1–5), task diversity analysis (embedding cluster coverage), difficulty distribution (what fraction requires multi-step reasoning?), and downstream model performance (fine-tune and eval on benchmark). The best signal is always the downstream evaluation — a beautiful-looking dataset that doesn't improve model performance is worthless.
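As a small illustration of the LLM-as-judge step, the sketch below parses free-text judge replies into 1–5 scores and summarises them. The reply format, the helper names, and the `keep_at_or_above` cut-off are assumptions; a real judge prompt would ideally request a fixed output format so that parsing is less fragile.

```python
import re
from statistics import mean
from typing import Optional


def parse_judge_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in a judge reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


def summarize_scores(replies: list[str], keep_at_or_above: int = 4) -> dict:
    """Aggregate judge replies into counts, mean score, and keep fraction."""
    scores = [s for s in map(parse_judge_score, replies) if s is not None]
    summary = {"n_scored": len(scores), "n_unparsed": len(replies) - len(scores)}
    if scores:
        summary["mean"] = mean(scores)
        summary["keep_fraction"] = sum(s >= keep_at_or_above for s in scores) / len(scores)
    return summary
```

Tracking the unparsed count is a cheap sanity check: a high value usually means the judge prompt needs tightening rather than that the data is bad.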

SECTION 07

Best Practices & Common Pitfalls

When implementing Self-Instruct or Evol-Instruct, seed diversity is critical. Seeds that are too similar lead to homogeneous outputs; seeds that span multiple task types (classification, generation, QA, reasoning) produce richer datasets. Use topic diversity, task type variation, and complexity gradation when selecting your initial seed pool. Monitor the ROUGE-L deduplication scores — if your similarity threshold is too loose, you'll accumulate near-duplicates that don't add signal to the training set.

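| Technique | Seed Size | Iterations | Output Quality | Cost |
|---|---|---|---|---|
| Self-Instruct | 175 | 4–6 | Good | Low |
| Evol-Instruct | 100–500 | 3–5 | Very Good | Medium |
| Orca | 200–1K | 1–2 | Excellent | High |
| WizardCoder | 300–1K | 2–4 | Very Good | Medium–High |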

A simple ROUGE-L deduplication pass (requires the rouge-score package):

from rouge_score import rouge_scorer

def deduplicate_instructions(instructions: list[str], threshold: float = 0.7) -> list[str]:
    """Remove near-duplicates using ROUGE-L similarity.

    Keeps the first occurrence of each near-duplicate group. Every candidate is
    compared against all kept instructions, so the cost grows quadratically.
    """
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    unique = []
    for inst in instructions:
        is_dup = False
        for unique_inst in unique:
            score = scorer.score(inst, unique_inst)['rougeL'].fmeasure
            if score > threshold:
                is_dup = True
                break
        if not is_dup:
            unique.append(inst)
    return unique

SECTION 08

Scaling Considerations

Scaling self-instruction beyond a few thousand examples tends to reveal that generation quality degrades after roughly 5–10 iterations: the model begins to overgeneralize its own patterns, and errors accumulate. To mitigate this: (1) periodically retrain the generator model on curated subsets, (2) use an auxiliary LLM-as-judge to score candidates before adding them to the pool, (3) apply diversity filters (e.g., embedding-based clustering) to prevent topic collapse, and (4) combine multiple evolution strategies rather than repeating one method. A hybrid approach, starting with Self-Instruct for breadth and then applying Evol-Instruct for depth, generally works better than pure iteration of a single method.
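Mitigations (2) and (3) can be combined into a single admission gate, sketched below. Note the substitution: instead of embedding-based clustering, this example uses a simpler cap on cosine similarity to anything already in the pool. The embeddings and judge score are assumed to come from elsewhere, and both thresholds are placeholder values.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def admit(
    candidate_vec: list[float],
    judge_score: int,
    pool_vecs: list[list[float]],
    min_score: int = 4,           # assumed quality bar on a 1-5 scale
    max_similarity: float = 0.9,  # assumed diversity cap
) -> bool:
    """Admit a candidate only if it is good enough and not too close to the pool."""
    if judge_score < min_score:
        return False
    return all(cosine(candidate_vec, v) <= max_similarity for v in pool_vecs)
```

Checking the judge score first avoids the similarity scan for candidates that would be rejected anyway.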

The success of self-instruction depends critically on the quality and diversity of the initial seeds. If your 175 seed instructions are all classification tasks, the model will generate thousands of classification variants and never produce generation, summarization, or reasoning tasks. Diversification strategies: (1) stratify seeds by task type (10–15% QA, 10–15% classification, 10–15% generation, etc.), (2) include outliers and edge cases (adversarial inputs, multi-step reasoning), (3) vary instruction complexity (a mix of simple and complex seeds produces mixed-difficulty outputs). Some teams also apply a curriculum-style approach to seed selection: start with simple, high-confidence instructions and gradually introduce harder ones. This helps prevent early generations from being too simple or too noisy.
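Stratifying seeds by task type is straightforward once each seed carries a label. The sketch below assumes seeds arrive as hypothetical `(task_type, instruction)` pairs and draws up to `per_type` examples from each type; the labels themselves would have to come from manual tagging or a classifier.

```python
import random
from collections import defaultdict


def stratified_seeds(
    tagged_seeds: list[tuple[str, str]],  # (task_type, instruction) pairs
    per_type: int,
    seed: int = 0,
) -> list[str]:
    """Sample up to per_type instructions from every task type."""
    by_type: dict[str, list[str]] = defaultdict(list)
    for task_type, instruction in tagged_seeds:
        by_type[task_type].append(instruction)

    rng = random.Random(seed)
    chosen: list[str] = []
    for task_type in sorted(by_type):  # sorted for reproducible ordering
        group = by_type[task_type]
        chosen.extend(rng.sample(group, min(per_type, len(group))))
    return chosen
```

Capping each type at the same count prevents an over-represented type, such as classification, from dominating the seed pool.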

Iteration dynamics reveal interesting patterns: early iterations (rounds 1–2) produce high-quality outputs because the model stays close to seed patterns; middle iterations (rounds 3–4) inject creativity but accumulate errors; late iterations (rounds 5+) often become brittle. The quality-diversity frontier shows that you can optimize for either but not both: pure diversity maximization (new task types, rare patterns) produces low-quality data; pure quality maximization (conservative filtering) produces homogeneous data. Production pipelines use ensemble approaches: run multiple independent pipelines with different seeds, combine results, and deduplicate. This reduces error accumulation and improves generalization.
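The ensemble idea, several independent runs merged and deduplicated, might look like the following sketch. It assumes each run yields dicts with an `"instruction"` key, and it removes only exact matches after case and whitespace normalisation; a near-duplicate pass such as ROUGE-L filtering could follow.

```python
def merge_runs(runs: list[list[dict]]) -> list[dict]:
    """Merge examples from several pipeline runs, dropping repeated instructions."""
    seen: set[str] = set()
    merged: list[dict] = []
    for run_id, run in enumerate(runs):
        for example in run:
            # Normalise case and whitespace so trivial variants collide.
            key = " ".join(example["instruction"].lower().split())
            if key in seen:
                continue
            seen.add(key)
            merged.append({**example, "run": run_id})  # remember the source run
    return merged
```

Keeping the source run on each example makes it possible to spot a single run that contributes mostly low-quality data.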

Downstream validation is crucial. After generating a large dataset, always evaluate on a held-out benchmark or via human raters. A 100K-example Self-Instruct dataset that doesn't improve model performance is worthless—worse, it might introduce biases or spurious correlations. Use this as a feedback signal to refine your generation strategy. Some teams found that Orca-style reasoning traces (step-by-step explanations) were necessary for factual accuracy; others found that diverse task types mattered more than explanation quality. These insights are task and model specific—what works for GPT-3.5 fine-tuning might not work for Llama.

Orca's key insight was that reasoning quality matters more than answer correctness alone. When GPT-4 is prompted with "think step by step," it produces detailed explanations of why an answer is correct. Smaller models trained on these explanations (the original Orca was a 13B model; Orca 2 later added 7B and 13B variants) learn to mimic not just the output but the reasoning process. This transferred reasoning ability is particularly valuable for complex tasks: the model doesn't just produce the answer but also learns how to arrive at it. The Orca authors report that it significantly outperforms models trained on simple instruction-output pairs, especially on reasoning-heavy benchmarks. The lesson: invest in training data quality. One high-quality Orca example with reasoning traces might be worth 10 simple instruction-output pairs. This principle extends to other domains: coding (include comments explaining the logic), summarization (include key points explaining the summary), translation (include back-translation or quality scores).

A practical consideration: scaling instruction generation hits diminishing returns. The first 10K examples tend to provide large improvements; the next 100K provide smaller gains per example. At some point, effort is better spent elsewhere: finding better base models, tuning training procedures, or collecting domain-specific data. Some organizations have reported that 30K high-quality Evol-Instruct examples outperformed 500K Self-Instruct examples, which suggests that quality matters far more than raw count. A portfolio approach helps: generate different types of data (Self-Instruct for breadth, Evol-Instruct for depth, Orca for reasoning) and combine them. This diversity often beats homogeneous scale. In production, monitor downstream metrics (model performance on benchmarks) and stop generating data when returns plateau, since compute spent on generation is compute not spent on training or inference.
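A stopping rule for "returns plateau" can be made concrete with a small check over benchmark scores recorded after each data increment. The function name, the `min_gain` threshold, and the `patience` window below are assumptions chosen for illustration, and the scores in the usage note are made-up numbers.

```python
def should_stop(scores: list[float], min_gain: float = 0.5, patience: int = 2) -> bool:
    """Return True when each of the last `patience` increments gained < min_gain."""
    if len(scores) <= patience:
        return False  # not enough history to judge a plateau
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return all(gain < min_gain for gain in gains[-patience:])
```

For example, a benchmark history of `[40.0, 48.0, 52.0, 52.3, 52.4]` triggers the rule with the defaults, because the last two increments each added less than 0.5 points, while `[40.0, 48.0, 52.0]` does not.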

The economics of instruction generation matter. Each call to GPT-4 costs money; generating 100K examples might cost thousands of dollars. At that scale, efficiency matters. Avoid paying twice for the same thing: deduplicate the instruction pool before generating responses for it. Batch generation (asking for many examples per call) is cheaper than per-example generation because the prompt overhead is shared. Use cheaper models for initial generation (GPT-3.5), then use GPT-4 only for refinement or filtering. Filtering is crucial: a well-chosen 10K examples beats a poorly chosen 100K. Use LLM-as-judge to score candidates before adding them to the dataset. A two-stage pipeline (generate 100K candidates with a cheap model, filter to 10K) is often cheaper and better than generating 10K examples directly with an expensive model. For production systems, these economics are essential: the cost of data is part of the model development cost and should be optimized like any other resource.
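The two-stage versus single-stage comparison comes down to simple arithmetic. The sketch below uses a single blended price per million tokens, which is a simplification since real APIs usually price input and output tokens separately, and every number in it is a hypothetical placeholder rather than a real price.

```python
def stage_cost(n_calls: int, tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of one pipeline stage under a blended per-token price."""
    return n_calls * tokens_per_call * usd_per_million_tokens / 1_000_000


# Hypothetical prices: 30 USD/M tokens (expensive model), 1 USD/M tokens (cheap model).
single_stage = stage_cost(10_000, 500, 30.0)    # 10K examples from the expensive model

two_stage = (
    stage_cost(100_000, 500, 1.0)               # 100K candidates from the cheap model
    + stage_cost(100_000, 600, 1.0)             # cheap judge reads and scores each one
)

print(f"single-stage: ${single_stage:.2f}, two-stage: ${two_stage:.2f}")
```

With these placeholder inputs the two-stage pipeline comes out cheaper, but the result flips if the judge is the expensive model, so it is worth plugging in actual prices and token counts before committing to a design.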