Data · Annotation

Data Labeling for AI

Human annotation, weak supervision, and LLM-assisted labeling — workflows, quality control, and tooling

5 approaches
8 sections
8 tools
Contents
  1. Labeling approaches
  2. Task types for LLMs
  3. Annotation guidelines
  4. Quality control
  5. LLM-assisted labeling
  6. Workflow design
  7. Tools & platforms
  8. References
01 — Methods

Labeling Approaches

| Approach | Quality | Cost | Speed | Best for |
|---|---|---|---|---|
| Human expert | Very high | Very high | Slow | Critical, niche domains |
| Crowdsourcing | Medium–high | Medium | Fast | Large volume, clear tasks |
| Weak supervision | Medium | Low | Very fast | Rules-based labeling |
| LLM-assisted | Medium–high | Low–medium | Fast | Pre-labeling, bootstrapping |
| Model-in-the-loop | High | Medium | Medium | Active learning scenarios |
💡 Quality hierarchy: Expert > model-in-loop > crowdsourcing > LLM-assisted > weak supervision. Cost goes opposite direction. Combine multiple approaches: weak supervision → LLM pre-label → expert review → model refinement.
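The weak-supervision stage of such a pipeline can be sketched as Snorkel-style labeling functions combined by majority vote. A minimal sketch; the rule functions, keywords, and label names below are illustrative, not from any real taxonomy:

```python
# Weak supervision sketch: keyword-rule labeling functions, combined by
# majority vote. Items where every function abstains fall through to the
# next (LLM or human) stage of the pipeline.
ABSTAIN = None

def lf_refund(text: str):
    return "billing" if "refund" in text.lower() else ABSTAIN

def lf_error(text: str):
    return "technical" if "error" in text.lower() else ABSTAIN

def lf_password(text: str):
    return "technical" if "password" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_refund, lf_error, lf_password]

def weak_label(text: str):
    """Apply all labeling functions; majority vote among non-abstains."""
    votes = [v for lf in LABELING_FUNCTIONS if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return None  # no rule fired -> route to LLM pre-label / human
    return max(set(votes), key=votes.count)

print(weak_label("Error when resetting my password"))  # technical
print(weak_label("I want a refund"))                   # billing
print(weak_label("General question"))                  # None
```

Rules are cheap to write and very fast to run, which is why they sit at the front of a combined pipeline despite their lower quality.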
02 — LLM-Specific Tasks

Task Types for LLMs

Classification

Label examples into predefined categories: sentiment, topic, intent. Simple to annotate; clear ground truth. Examples: "Is this review positive or negative?" (binary), "Classify support ticket into: billing, technical, general" (multi-class).

Instruction Following

Provide a task, model responds, human judges if response follows instructions. Used for SFT: "Write a haiku about programming" → model → human: "Good" or "Bad". Harder than classification; requires judgment.

Preference Pairs

Show two model outputs A and B, human picks the better one. No absolute quality; only relative ranking. Used for DPO and RLHF. Faster than absolute scoring; harder to calibrate.
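One common way to store these judgments is the prompt/chosen/rejected record convention used by DPO-style training pipelines. A sketch, with illustrative field names:

```python
# Preference-pair record sketch: the annotator picks A or B; we store the
# result in the prompt/chosen/rejected convention common for DPO data.
def make_preference_pair(prompt: str, response_a: str, response_b: str,
                         preferred: str) -> dict:
    """Convert an A/B judgment into a chosen/rejected training record."""
    assert preferred in ("A", "B")
    chosen, rejected = ((response_a, response_b) if preferred == "A"
                        else (response_b, response_a))
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(
    prompt="Explain recursion in one sentence.",
    response_a="Recursion is when a function calls itself.",
    response_b="Recursion is a loop.",
    preferred="A",
)
print(pair["chosen"])
```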

RLHF Comparison

Gather preference data specifically for RLHF training. Questions with multiple response options (A vs B vs C). Annotators rank them. Requires clear preference definitions: helpfulness, factuality, safety.

Span Labeling

Mark named entities, key phrases, or sensitive information within text. Used for information extraction, redaction. Requires precise boundary marking; higher disagreement than classification.
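Span labels are typically stored as character offsets into the text, which makes the boundary-precision problem concrete: an off-by-one shifts the surface string. A minimal sketch with an illustrative record shape:

```python
# Span-label sketch: store character offsets plus the surface string so
# boundary errors are visible at annotation time.
def add_span(text: str, start: int, end: int, label: str) -> dict:
    """Record a span as character offsets; validate boundaries up front."""
    assert 0 <= start < end <= len(text), "invalid span boundaries"
    return {"start": start, "end": end, "label": label,
            "surface": text[start:end]}

text = "Send the report to Alice Chen by Friday."
span = add_span(text, 19, 29, "PERSON")
print(span["surface"])  # Alice Chen
```

Storing the surface string alongside the offsets is a cheap redundancy check: if the two ever disagree, the text or the offsets have drifted.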

03 — Best Practices

Annotation Guidelines

Label Schema Design

Start with the taxonomy: What are all possible labels? Make them mutually exclusive and collectively exhaustive. Example: sentiment should be {positive, negative, neutral}, not {happy, sad, angry} (overlapping and incomplete).

Clear Definitions

For each label, write: definition, when to use it, when NOT to use it, examples. Ambiguous labels cause low inter-annotator agreement (IAA). Example: "Spam = unsolicited marketing or scams. NOT product recommendations requested by user."

Edge Cases & Examples

Document tricky cases: sarcasm, multilingual, multiple interpretations. Include 3–5 worked examples per label. Show reasoning: "This is label X because..."

Calibration Sessions

Monthly meetings: review examples where annotators disagreed, align on standards, update guidelines. Prevents annotator drift (interpretations shifting over time).

⚠️ Annotation drift: Annotators naturally reinterpret guidelines over weeks. Monthly calibration sessions prevent this. Have lead annotators re-label 50 old examples monthly; if agreement drops > 5%, retrain all annotators.
04 — Quality Control

Quality Control

Inter-Annotator Agreement (IAA)

Measure consensus. Have 3 annotators label the same examples. Calculate Cohen's kappa (two annotators), Fleiss' kappa (multi-annotator), or Krippendorff's alpha. Target: kappa > 0.80 (almost-perfect agreement on the Landis–Koch scale; 0.61–0.80 is substantial). < 0.60 means guidelines are unclear.

Python Example: Cohen's Kappa

from sklearn.metrics import cohen_kappa_score

annotator_1 = [0, 1, 1, 0, 1]  # 0=neg, 1=pos
annotator_2 = [0, 0, 1, 0, 1]  # same examples, second annotator
kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")

Dispute Resolution

When 3 annotators disagree (1-1-1 split), expert adjudicates. Document decision; update guidelines if this pattern repeats.

Gold Standard Sets

Expert-labeled reference examples. Test each annotator against gold set monthly. If accuracy < 85%, retrain or remove annotator.
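The monthly gold-set check described above can be sketched as a small scoring function; the annotator names and the 85% threshold follow the text, everything else is illustrative:

```python
# Gold-standard check sketch: score each annotator against expert labels
# and flag anyone below the 85% accuracy threshold for retraining.
def gold_check(annotator_labels: dict[str, list], gold: list,
               threshold: float = 0.85) -> dict[str, dict]:
    report = {}
    for name, labels in annotator_labels.items():
        acc = sum(a == g for a, g in zip(labels, gold)) / len(gold)
        report[name] = {"accuracy": round(acc, 2),
                        "action": "ok" if acc >= threshold else "retrain"}
    return report

gold = ["pos", "neg", "pos", "neu", "neg"]
report = gold_check({
    "ann_1": ["pos", "neg", "pos", "neu", "neg"],   # matches gold
    "ann_2": ["pos", "pos", "neu", "neu", "neg"],   # 3/5 correct
}, gold)
print(report)
```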

Auditing Pipelines

Randomly sample 100 labeled examples weekly. Have lead annotator review; flag errors. Track error rate by annotator, task, and date. Trigger retraining if error rate > 5%.

05 — LLM-Assisted Labeling

LLM-Assisted Labeling

Use Claude/GPT-4 to pre-label examples → humans review → train on reviewed data. Trades a small quality risk for a large cost saving: the LLM typically gets 70–85% right, and humans catch the errors. Net result: 3–5× faster annotation with minimal quality loss if reviewers are thorough.

Workflow

1

Design LLM Prompt — capture nuance

Embed label definitions, examples, edge cases into the prompt. Prompt quality directly impacts pre-label accuracy. Test on 20 examples first.

Python · LLM-assisted labeling pipeline with confidence filtering
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Label(BaseModel):
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def auto_label(text: str, label_options: list[str],
               task_description: str) -> Label:
    """Auto-label text using LLM with confidence score."""
    options_str = ", ".join(f'"{l}"' for l in label_options)
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content":
            f"Task: {task_description}\n"
            f"Text: {text}\n"
            f"Choose from: {options_str}\n"
            f"Give label, confidence (0-1), and one-line reasoning."}],
        response_format=Label,
        temperature=0.0
    )
    return result.choices[0].message.parsed

def label_dataset(texts: list[str], labels: list[str],
                  task: str, confidence_threshold: float = 0.85) -> dict:
    """Label dataset, routing low-confidence items to human review."""
    auto_labeled, human_queue = [], []
    for text in texts:
        result = auto_label(text, labels, task)
        if result.confidence >= confidence_threshold:
            auto_labeled.append({"text": text, "label": result.label,
                                  "confidence": result.confidence})
        else:
            human_queue.append({"text": text, "predicted": result.label,
                                 "confidence": result.confidence})
    print(f"Auto-labeled: {len(auto_labeled)} | Human review: {len(human_queue)}")
    return {"auto": auto_labeled, "human_queue": human_queue}

# Example
result = label_dataset(
    texts=["Great product, very happy!", "Terrible service, never again."],
    labels=["positive", "negative", "neutral"],
    task="Classify customer review sentiment"
)
Python · Inter-annotator agreement and quality control
from collections import Counter
import statistics

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Compute Cohen's Kappa — agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)

    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance
    p_e = sum(
        (labels_a.count(cat) / n) * (labels_b.count(cat) / n)
        for cat in categories
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def majority_vote(label_sets: list[list]) -> list:
    """Aggregate multiple annotators via majority vote."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_sets)]

def annotation_stats(label_sets: list[list]) -> dict:
    """Compute agreement stats across annotators."""
    # Pairwise kappa
    kappas = []
    for i in range(len(label_sets)):
        for j in range(i+1, len(label_sets)):
            kappas.append(cohens_kappa(label_sets[i], label_sets[j]))

    final_labels = majority_vote(label_sets)
    # Confusion rate: fraction of items where annotators disagree
    confusion = sum(
        len(set(item)) > 1 for item in zip(*label_sets)
    ) / len(label_sets[0])

    return {
        "mean_kappa": round(statistics.mean(kappas), 3),
        "min_kappa": round(min(kappas), 3),
        "confusion_rate": round(confusion, 3),
        "interpretation": ("good" if statistics.mean(kappas) > 0.6
                          else "moderate" if statistics.mean(kappas) > 0.4
                          else "poor — review guidelines")
    }

# Example with 3 annotators
a1 = ["pos", "neg", "pos", "neu", "neg"]
a2 = ["pos", "neg", "neu", "neu", "neg"]
a3 = ["pos", "neg", "pos", "pos", "neg"]
print(annotation_stats([a1, a2, a3]))
2

Pre-label Batch — generate candidates

Run LLM on all unlabeled examples. Save output. Usually fast and cheap (Haiku preferable for cost).

3

Human Review — correct & validate

Show humans LLM label + example. They accept, reject, or modify. Faster than labeling from scratch; quality is high if humans are careful.

4

Calibration — track accuracy

Have LLM label 100 gold-standard examples. Calculate accuracy. If < 75%, iterate on prompt. If > 85%, proceed to full batch.
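This accuracy gate can be sketched as a one-function decision; the <75% and >85% thresholds follow the text, while the middle "spot-check" branch is an assumption for the unspecified band in between:

```python
# Calibration-gate sketch: compare LLM pre-labels against gold labels and
# decide whether to iterate on the prompt or run the full batch.
def calibration_gate(llm_labels: list, gold_labels: list) -> str:
    accuracy = (sum(p == g for p, g in zip(llm_labels, gold_labels))
                / len(gold_labels))
    if accuracy < 0.75:
        return "iterate on prompt"
    if accuracy > 0.85:
        return "proceed to full batch"
    return "spot-check more examples"  # assumed handling of the middle band

gold = ["pos"] * 90 + ["neg"] * 10
llm = ["pos"] * 88 + ["neg"] * 2 + ["neg"] * 10  # 98/100 correct
print(calibration_gate(llm, gold))  # proceed to full batch
```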

Prompt Design Tips

Be specific: "Classify as positive, negative, or neutral" vs "Is this good?". Give examples: Show label + reasoning for 2–3 examples. Set constraints: "Output only the label, no explanation" reduces hallucination. Handle uncertainty: "If unclear, output 'UNCLEAR'" — flag for manual review.
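Putting the tips together, a labeling prompt might look like the following sketch; the wording and examples are illustrative, not a tested prompt:

```python
# Prompt template sketch applying the tips above: specific labels,
# worked examples with reasoning, a constrained output, and an
# UNCLEAR escape hatch for manual review.
PROMPT_TEMPLATE = """Classify the customer review as positive, negative, or neutral.

Examples:
Review: "Arrived fast and works perfectly." -> positive (praises product)
Review: "Broke after two days." -> negative (reports failure)

Output only one label: positive, negative, neutral, or UNCLEAR
if the sentiment cannot be determined.

Review: "{text}"
Label:"""

prompt = PROMPT_TEMPLATE.format(text="Great product, very happy!")
print(prompt.splitlines()[0])
```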

💡 Cost vs accuracy tradeoff: Haiku (cheap, ~70% accuracy) + human review can beat Opus alone (~90% accuracy) on final quality, if labor is available. Math: 10K examples × $0.0005/Haiku + 2K corrections × $0.05 human = $105, vs 10K × $0.005/Opus = $50. Opus alone is cheaper in this example, but the Haiku-plus-review path ends up more accurate because humans catch the residual errors.
06 — Workflow Design

Workflow Design

Task Decomposition

Break complex tasks into simpler subtasks. Instead of "label sentiment & offensive language", have separate tasks. Easier to train, higher agreement, easier to outsource.

Batching & Assignment

Assign 20–50 examples per batch to one annotator. Let them build context; reduces errors. Assign same batch to 2 annotators for spot-checking (every 100 examples).
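A batching plan like this can be sketched as follows; the batch size of 25 and the "duplicate every fourth batch" rule (≈ every 100 examples at that size) are one way to instantiate the numbers above, and the annotator names are placeholders:

```python
# Batch-assignment sketch: chunk examples into batches of 25, rotate
# annotators, and duplicate roughly every 100th example's batch to a
# second annotator for spot-checking.
def make_batches(examples: list, batch_size: int = 25) -> list[list]:
    return [examples[i:i + batch_size]
            for i in range(0, len(examples), batch_size)]

def assign(batches: list[list], annotators: list[str],
           spot_check_every: int = 4) -> list[dict]:
    plan = []
    for i, batch in enumerate(batches):
        entry = {"batch": i, "annotator": annotators[i % len(annotators)],
                 "size": len(batch)}
        if (i + 1) % spot_check_every == 0:  # second pass on this batch
            entry["spot_checker"] = annotators[(i + 1) % len(annotators)]
        plan.append(entry)
    return plan

batches = make_batches(list(range(200)))       # 8 batches of 25
plan = assign(batches, ["alice", "bob", "carol"])
print(sum("spot_checker" in a for a in plan))  # 2 spot-checked batches
```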

Annotator Management

Track annotator productivity and quality. Create profiles: some annotators fast but loose (data augmentation), others slow but thorough (gold standard). Use data accordingly.

Version Control for Labels

Store labels with metadata: annotator ID, timestamp, version number. If guidelines change, re-label old data with new version. Track label lineage.

Iterative Refinement

After collecting 1000 examples, run basic model. Evaluate on test set. Identify low-confidence examples; send back to annotators for clarification. Refine guidelines based on model failures.

⚠️ Common pitfall: Labeling 100K examples all at once, then discovering guidelines were unclear or data was biased. Instead: label 1K → evaluate → refine → label 5K → evaluate → scale to 100K. Iteration is faster and cheaper overall.
07 — Ecosystem

Tools & Platforms

Label Studio

Open-source annotation tool. Web UI, API, multi-task support. Self-hosted. Good for in-house teams.

Scale AI

Managed labeling service. Expert annotators, quality guarantees, fast turnaround. Expensive but high quality.

Argilla

Open-source data labeling & curation. Model-in-the-loop, active learning, weak supervision.

Prodigy

Annotation tool with active learning. Iterative labeling; model suggests uncertain examples.

Labelbox

Enterprise labeling platform. Data versioning, review workflows, quality metrics built-in.

Snorkel

Weak supervision framework. Write labeling functions instead of labels; programmatic data generation.

CVAT

Computer vision annotation tool. Boxes, polygons, segmentation. Extensible; works for text too.

Cleanlab

Data quality platform. Identifies mislabeled examples; ranks data by confidence for active learning.

08 — Further Reading

References
