Human annotation, weak supervision, and LLM-assisted labeling — workflows, quality control, and tooling
| Approach | Quality | Cost | Speed | Best for |
|---|---|---|---|---|
| Human expert | Very high | Very high | Slow | Critical, niche domains |
| Crowdsourcing | Medium–high | Medium | Fast | Large volume, clear tasks |
| Weak supervision | Medium | Low | Very fast | Rules-based labeling |
| LLM-assisted | Medium–high | Low–medium | Fast | Pre-labeling, bootstrapping |
| Model-in-the-loop | High | Medium | Medium | Active learning scenarios |
Label examples into predefined categories: sentiment, topic, intent. Simple to annotate; clear ground truth. Examples: "Is this review positive or negative?" (binary), "Classify support ticket into: billing, technical, general" (multi-class).
Provide a task, the model responds, and a human judges whether the response follows the instructions. Used for SFT: "Write a haiku about programming" → model output → human verdict: "Good" or "Bad". Harder than classification; requires judgment.
Show two model outputs A and B, human picks the better one. No absolute quality; only relative ranking. Used for DPO and RLHF. Faster than absolute scoring; harder to calibrate.
Gather preference data specifically for RLHF training. Questions with multiple response options (A vs B vs C). Annotators rank them. Requires clear preference definitions: helpfulness, factuality, safety.
Mark named entities, key phrases, or sensitive information within text. Used for information extraction, redaction. Requires precise boundary marking; higher disagreement than classification.
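Span annotations are usually stored as character offsets into the original text rather than as copied substrings, so the boundaries stay verifiable. A minimal sketch (the `Span` record and its fields are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One annotated span: a label plus character offsets into the text."""
    label: str
    start: int  # inclusive
    end: int    # exclusive

def extract_spans(text: str, spans: list[Span]) -> list[tuple[str, str]]:
    """Return (label, surface text) pairs for each annotated span."""
    return [(s.label, text[s.start:s.end]) for s in spans]

text = "Contact Jane Doe at jane@example.com"
spans = [Span("PERSON", 8, 16), Span("EMAIL", 20, 36)]
# extract_spans(text, spans) -> [("PERSON", "Jane Doe"), ("EMAIL", "jane@example.com")]
```

Offset-based storage also makes boundary disagreements measurable: two annotators' spans can be compared by exact match or overlap.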
Start with the taxonomy: What are all possible labels? Make them mutually exclusive and collectively exhaustive. Example: sentiment should be {positive, negative, neutral}, not {happy, sad, angry} (overlapping and incomplete).
For each label, write: definition, when to use it, when NOT to use it, examples. Ambiguous labels cause low inter-annotator agreement (IAA). Example: "Spam = unsolicited marketing or scams. NOT product recommendations requested by user."
Document tricky cases: sarcasm, multilingual, multiple interpretations. Include 3–5 worked examples per label. Show reasoning: "This is label X because..."
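Guidelines written per the structure above can be kept as structured data, so the same definitions drive the annotator docs and any LLM pre-labeling prompt. A sketch with a hypothetical schema (field names are illustrative):

```python
# Hypothetical guideline schema: one entry per label, with definition,
# usage boundaries, and worked examples (per the structure described above).
GUIDELINES = {
    "spam": {
        "definition": "Unsolicited marketing or scams.",
        "use_when": "The message pushes a product or link the user never asked for.",
        "not_when": "Product recommendations the user explicitly requested.",
        "examples": ["'WIN A FREE PHONE, click here!' -> spam (unsolicited offer)"],
    },
}

def render_guideline(label: str) -> str:
    """Render one label's guideline as text, for annotator docs or an LLM prompt."""
    g = GUIDELINES[label]
    lines = [f"Label: {label}",
             f"Definition: {g['definition']}",
             f"Use when: {g['use_when']}",
             f"Do NOT use when: {g['not_when']}"]
    lines += [f"Example: {ex}" for ex in g["examples"]]
    return "\n".join(lines)
```

Keeping one source of truth means a guideline update propagates to both humans and the pre-labeling prompt at once.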
Monthly calibration meetings: review examples where annotators disagreed, align on standards, update guidelines. Prevents annotator drift (interpretations shifting over time).
Measure consensus. Have 3 annotators label the same examples. Calculate Cohen's kappa (two annotators), Fleiss' kappa (three or more annotators), or Krippendorff's alpha (handles missing labels). Target: kappa > 0.80 (almost perfect agreement); 0.61–0.80 is substantial. < 0.60 means guidelines are unclear.
When 3 annotators disagree (1-1-1 split), expert adjudicates. Document decision; update guidelines if this pattern repeats.
Expert-labeled reference examples. Test each annotator against gold set monthly. If accuracy < 85%, retrain or remove annotator.
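The monthly gold-set check above is a straightforward accuracy computation. A minimal sketch, assuming labels are keyed by example id (the 0.85 threshold comes from the text; the function names are hypothetical):

```python
def gold_set_accuracy(annotator_labels: dict[str, str],
                      gold_labels: dict[str, str]) -> float:
    """Fraction of gold examples the annotator labeled correctly."""
    matched = sum(annotator_labels.get(ex_id) == gold
                  for ex_id, gold in gold_labels.items())
    return matched / len(gold_labels)

def flag_annotators(all_labels: dict[str, dict[str, str]],
                    gold_labels: dict[str, str],
                    threshold: float = 0.85) -> list[str]:
    """Return annotator ids scoring below threshold on the gold set."""
    return [ann for ann, labels in all_labels.items()
            if gold_set_accuracy(labels, gold_labels) < threshold]
```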
Randomly sample 100 labeled examples weekly. Have lead annotator review; flag errors. Track error rate by annotator, task, and date. Trigger retraining if error rate > 5%.
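The weekly spot check can be sketched as a sample-and-compare step. This assumes each record already carries the lead reviewer's corrected label in a `reviewed_label` field (the record shape and function name are illustrative); the 5% trigger comes from the text:

```python
import random
from collections import defaultdict

def weekly_spot_check(records: list[dict], sample_size: int = 100,
                      max_error_rate: float = 0.05, seed: int = 0) -> dict:
    """Sample labeled records and compute per-annotator error rates.

    Assumes records shaped like
    {"annotator": str, "label": str, "reviewed_label": str | None},
    where reviewed_label is the lead annotator's verdict.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    errors, totals = defaultdict(int), defaultdict(int)
    for r in sample:
        totals[r["annotator"]] += 1
        if r["reviewed_label"] is not None and r["reviewed_label"] != r["label"]:
            errors[r["annotator"]] += 1
    rates = {a: errors[a] / totals[a] for a in totals}
    return {"error_rates": rates,
            "retrain": [a for a, rate in rates.items() if rate > max_error_rate]}
```

Tracking the same rates by task and date (not shown) is a matter of adding those fields to the grouping key.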
Use Claude/GPT-4 to pre-label examples → humans review → train on the reviewed data. Trades a small quality risk for a large cost saving: the LLM often gets 70–85% right, and humans catch the errors. Net result: 3–5× faster annotation with minimal quality loss if reviewers are thorough.
Embed label definitions, examples, edge cases into the prompt. Prompt quality directly impacts pre-label accuracy. Test on 20 examples first.
```python
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Label(BaseModel):
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def auto_label(text: str, label_options: list[str],
               task_description: str) -> Label:
    """Auto-label text using LLM with confidence score."""
    options_str = ", ".join(f'"{l}"' for l in label_options)
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Task: {task_description}\n"
                   f"Text: {text}\n"
                   f"Choose from: {options_str}\n"
                   f"Give label, confidence (0-1), and one-line reasoning."}],
        response_format=Label,
        temperature=0.0,
    )
    return result.choices[0].message.parsed
```
```python
def label_dataset(texts: list[str], labels: list[str],
                  task: str, confidence_threshold: float = 0.85) -> dict:
    """Label dataset, routing low-confidence items to human review."""
    auto_labeled, human_queue = [], []
    for text in texts:
        result = auto_label(text, labels, task)
        if result.confidence >= confidence_threshold:
            auto_labeled.append({"text": text, "label": result.label,
                                 "confidence": result.confidence})
        else:
            human_queue.append({"text": text, "predicted": result.label,
                                "confidence": result.confidence})
    print(f"Auto-labeled: {len(auto_labeled)} | Human review: {len(human_queue)}")
    return {"auto": auto_labeled, "human_queue": human_queue}

# Example
result = label_dataset(
    texts=["Great product, very happy!", "Terrible service, never again."],
    labels=["positive", "negative", "neutral"],
    task="Classify customer review sentiment",
)
```
```python
from collections import Counter
import statistics

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa — two-annotator agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance
    p_e = sum(
        (labels_a.count(cat) / n) * (labels_b.count(cat) / n)
        for cat in categories
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def majority_vote(label_sets: list[list]) -> list:
    """Aggregate multiple annotators via majority vote."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_sets)]

def annotation_stats(label_sets: list[list]) -> dict:
    """Compute agreement stats across annotators."""
    # Pairwise kappa over every annotator pair
    kappas = []
    for i in range(len(label_sets)):
        for j in range(i + 1, len(label_sets)):
            kappas.append(cohens_kappa(label_sets[i], label_sets[j]))
    final_labels = majority_vote(label_sets)
    # Confusion rate: fraction of items where annotators disagree
    confusion = sum(
        len(set(item)) > 1 for item in zip(*label_sets)
    ) / len(label_sets[0])
    return {
        "majority_labels": final_labels,
        "mean_kappa": round(statistics.mean(kappas), 3),
        "min_kappa": round(min(kappas), 3),
        "confusion_rate": round(confusion, 3),
        "interpretation": ("good" if statistics.mean(kappas) > 0.6
                           else "moderate" if statistics.mean(kappas) > 0.4
                           else "poor — review guidelines"),
    }

# Example with 3 annotators
a1 = ["pos", "neg", "pos", "neu", "neg"]
a2 = ["pos", "neg", "neu", "neu", "neg"]
a3 = ["pos", "neg", "pos", "pos", "neg"]
print(annotation_stats([a1, a2, a3]))
```
Run LLM on all unlabeled examples. Save output. Usually fast and cheap (Haiku preferable for cost).
Show humans LLM label + example. They accept, reject, or modify. Faster than labeling from scratch; quality is high if humans are careful.
Have LLM label 100 gold-standard examples. Calculate accuracy. If < 75%, iterate on prompt. If > 85%, proceed to full batch.
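The accuracy gate above reduces to a small comparison function. A sketch using the 75%/85% thresholds from the text (the function name and the "borderline" branch are illustrative):

```python
def validate_prompt(llm_labels: list[str], gold_labels: list[str]) -> str:
    """Gate the pipeline on gold-set accuracy, per the thresholds above."""
    accuracy = sum(p == g for p, g in zip(llm_labels, gold_labels)) / len(gold_labels)
    if accuracy > 0.85:
        return "proceed"       # run the full batch
    if accuracy < 0.75:
        return "iterate"       # rework the prompt
    return "borderline"        # tighten guidelines or expand the gold set
```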
Be specific: "Classify as positive, negative, or neutral" vs "Is this good?". Give examples: Show label + reasoning for 2–3 examples. Set constraints: "Output only the label, no explanation" reduces hallucination. Handle uncertainty: "If unclear, output 'UNCLEAR'" — flag for manual review.
Break complex tasks into simpler subtasks. Instead of "label sentiment & offensive language", have separate tasks. Easier to train, higher agreement, easier to outsource.
Assign 20–50 examples per batch to one annotator. Let them build context; reduces errors. Assign same batch to 2 annotators for spot-checking (every 100 examples).
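One way to implement the batching-with-overlap scheme above: round-robin batches to annotators and double-assign a periodic batch so agreement can be spot-checked (batch sizes and the `double_every` knob are assumptions, not a fixed recipe):

```python
def make_batches(examples: list, annotators: list[str],
                 batch_size: int = 30, double_every: int = 4) -> list[dict]:
    """Split examples into batches, round-robin to annotators; every
    `double_every`-th batch also goes to a second annotator for spot-checking."""
    batches = []
    for b, i in enumerate(range(0, len(examples), batch_size)):
        assigned = [annotators[b % len(annotators)]]
        if b % double_every == 0 and len(annotators) > 1:
            assigned.append(annotators[(b + 1) % len(annotators)])
        batches.append({"examples": examples[i:i + batch_size],
                        "annotators": assigned})
    return batches
```

The doubled batches feed directly into the pairwise-kappa computation shown earlier.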
Track annotator productivity and quality. Create profiles: some annotators are fast but loose (use their labels for data augmentation), others slow but thorough (use them for gold standards). Route work accordingly.
Store labels with metadata: annotator ID, timestamp, version number. If guidelines change, re-label old data with new version. Track label lineage.
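A label record carrying the metadata above might look like this (the schema is a sketch; field names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LabelRecord:
    """One label plus the lineage metadata described above."""
    example_id: str
    label: str
    annotator_id: str
    guideline_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def needs_relabel(record: LabelRecord, current_version: str) -> bool:
    """A label is stale once guidelines move past the version it was made under."""
    return record.guideline_version != current_version
```

Filtering a dataset with `needs_relabel` yields exactly the items to send back when guidelines change.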
After collecting 1000 examples, run basic model. Evaluate on test set. Identify low-confidence examples; send back to annotators for clarification. Refine guidelines based on model failures.
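The "send low-confidence examples back" step can be sketched as a selection over the baseline model's predictions, whatever model produced them (the prediction shape and threshold here are assumptions):

```python
def select_for_review(predictions: list[dict], threshold: float = 0.6,
                      max_items: int = 50) -> list[dict]:
    """Pick the least-confident model predictions to send back to annotators.

    Assumes predictions shaped like [{"example_id": ..., "confidence": float}, ...]
    from the baseline model trained on the first ~1000 labels.
    """
    low = [p for p in predictions if p["confidence"] < threshold]
    return sorted(low, key=lambda p: p["confidence"])[:max_items]
```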
Open-source annotation tool. Web UI, API, multi-task support. Self-hosted. Good for in-house teams.
Managed labeling service. Expert annotators, quality guarantees, fast turnaround. Expensive but high quality.
Open-source data labeling & curation. Model-in-the-loop, active learning, weak supervision.
Annotation tool with active learning. Iterative labeling; model suggests uncertain examples.
Enterprise labeling platform. Data versioning, review workflows, quality metrics built-in.
Weak supervision framework. Write labeling functions instead of labels; programmatic data generation.
Computer vision annotation tool. Boxes, polygons, segmentation. Extensible; works for text too.
Data quality platform. Identifies mislabeled examples; ranks data by confidence for active learning.