Human annotation, weak supervision, and LLM-assisted labeling — workflows, quality control, and tooling
| Approach | Quality | Cost | Speed | Best for |
|---|---|---|---|---|
| Human expert | Very high | Very high | Slow | Critical, niche domains |
| Crowdsourcing | Medium–high | Medium | Fast | Large volume, clear tasks |
| Weak supervision | Medium | Low | Very fast | Rules-based labeling |
| LLM-assisted | Medium–high | Low–medium | Fast | Pre-labeling, bootstrapping |
| Model-in-the-loop | High | Medium | Medium | Active learning scenarios |
Label examples into predefined categories: sentiment, topic, intent. Simple to annotate; clear ground truth. Examples: "Is this review positive or negative?" (binary), "Classify support ticket into: billing, technical, general" (multi-class).
Provide a task, the model responds, and a human judges whether the response follows the instructions. Used for SFT: "Write a haiku about programming" → model output → human verdict: "Good" or "Bad". Harder than classification; requires judgment.
Show two model outputs A and B, human picks the better one. No absolute quality; only relative ranking. Used for DPO and RLHF. Faster than absolute scoring; harder to calibrate.
Gather preference data specifically for RLHF training. Questions with multiple response options (A vs B vs C). Annotators rank them. Requires clear preference definitions: helpfulness, factuality, safety.
Mark named entities, key phrases, or sensitive information within text. Used for information extraction, redaction. Requires precise boundary marking; higher disagreement than classification.
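Span annotations are usually stored as character offsets into the original text rather than as copied substrings, so the boundaries stay verifiable. A minimal sketch (the `Span` record and its fields are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class Span:
    """One annotated span: a label plus character offsets into the text."""
    label: str
    start: int  # inclusive
    end: int    # exclusive

def extract_spans(text: str, spans: list[Span]) -> list[tuple[str, str]]:
    """Return (label, surface text) pairs for each annotated span."""
    return [(s.label, text[s.start:s.end]) for s in spans]

text = "Contact Jane Doe at jane@example.com"
spans = [Span("PERSON", 8, 16), Span("EMAIL", 20, 36)]
# extract_spans(text, spans) -> [("PERSON", "Jane Doe"), ("EMAIL", "jane@example.com")]
```

Offset-based storage also makes boundary disagreements measurable: two annotators' spans can be compared by exact match or overlap.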
Start with the taxonomy: What are all possible labels? Make them mutually exclusive and collectively exhaustive. Example: sentiment should be {positive, negative, neutral}, not {happy, sad, angry} (overlapping and incomplete).
For each label, write: definition, when to use it, when NOT to use it, examples. Ambiguous labels cause low inter-annotator agreement (IAA). Example: "Spam = unsolicited marketing or scams. NOT product recommendations requested by user."
Document tricky cases: sarcasm, multilingual, multiple interpretations. Include 3–5 worked examples per label. Show reasoning: "This is label X because..."
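Guidelines written per the structure above can be kept as structured data, so the same definitions drive the annotator docs and any LLM pre-labeling prompt. A sketch with a hypothetical schema (field names are illustrative):

```python
# Hypothetical guideline schema: one entry per label, with definition,
# usage boundaries, and worked examples (per the structure described above).
GUIDELINES = {
    "spam": {
        "definition": "Unsolicited marketing or scams.",
        "use_when": "The message pushes a product or link the user never asked for.",
        "not_when": "Product recommendations the user explicitly requested.",
        "examples": ["'WIN A FREE PHONE, click here!' -> spam (unsolicited offer)"],
    },
}

def render_guideline(label: str) -> str:
    """Render one label's guideline as text, for annotator docs or an LLM prompt."""
    g = GUIDELINES[label]
    lines = [f"Label: {label}",
             f"Definition: {g['definition']}",
             f"Use when: {g['use_when']}",
             f"Do NOT use when: {g['not_when']}"]
    lines += [f"Example: {ex}" for ex in g["examples"]]
    return "\n".join(lines)
```

Keeping one source of truth means a guideline update propagates to both humans and the pre-labeling prompt at once.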
Monthly calibration meetings: review examples where annotators disagreed, align on standards, update guidelines. Prevents annotator drift (interpretations shifting over time).
Measure consensus. Have 3 annotators label the same examples. Calculate Cohen's kappa (two annotators), Fleiss' kappa (three or more annotators), or Krippendorff's alpha (handles missing labels). Target: kappa > 0.80 (almost perfect agreement); 0.61–0.80 is substantial. < 0.60 means guidelines are unclear.
When 3 annotators disagree (1-1-1 split), expert adjudicates. Document decision; update guidelines if this pattern repeats.
Expert-labeled reference examples. Test each annotator against gold set monthly. If accuracy < 85%, retrain or remove annotator.
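The monthly gold-set check above is a straightforward accuracy computation. A minimal sketch, assuming labels are keyed by example id (the 0.85 threshold comes from the text; the function names are hypothetical):

```python
def gold_set_accuracy(annotator_labels: dict[str, str],
                      gold_labels: dict[str, str]) -> float:
    """Fraction of gold examples the annotator labeled correctly."""
    matched = sum(annotator_labels.get(ex_id) == gold
                  for ex_id, gold in gold_labels.items())
    return matched / len(gold_labels)

def flag_annotators(all_labels: dict[str, dict[str, str]],
                    gold_labels: dict[str, str],
                    threshold: float = 0.85) -> list[str]:
    """Return annotator ids scoring below threshold on the gold set."""
    return [ann for ann, labels in all_labels.items()
            if gold_set_accuracy(labels, gold_labels) < threshold]
```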
Randomly sample 100 labeled examples weekly. Have lead annotator review; flag errors. Track error rate by annotator, task, and date. Trigger retraining if error rate > 5%.
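The weekly spot check can be sketched as a sample-and-compare step. This assumes each record already carries the lead reviewer's corrected label in a `reviewed_label` field (the record shape and function name are illustrative); the 5% trigger comes from the text:

```python
import random
from collections import defaultdict

def weekly_spot_check(records: list[dict], sample_size: int = 100,
                      max_error_rate: float = 0.05, seed: int = 0) -> dict:
    """Sample labeled records and compute per-annotator error rates.

    Assumes records shaped like
    {"annotator": str, "label": str, "reviewed_label": str | None},
    where reviewed_label is the lead annotator's verdict.
    """
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    errors, totals = defaultdict(int), defaultdict(int)
    for r in sample:
        totals[r["annotator"]] += 1
        if r["reviewed_label"] is not None and r["reviewed_label"] != r["label"]:
            errors[r["annotator"]] += 1
    rates = {a: errors[a] / totals[a] for a in totals}
    return {"error_rates": rates,
            "retrain": [a for a, rate in rates.items() if rate > max_error_rate]}
```

Tracking the same rates by task and date (not shown) is a matter of adding those fields to the grouping key.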
Use Claude/GPT-4 to pre-label examples → humans review → train on the reviewed data. Trades a small quality risk for a large cost saving: the LLM often gets 70–85% right, and humans catch the errors. Net result: 3–5× faster annotation with minimal quality loss if reviewers are thorough.
Embed label definitions, examples, edge cases into the prompt. Prompt quality directly impacts pre-label accuracy. Test on 20 examples first.
```python
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class Label(BaseModel):
    label: str
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

def auto_label(text: str, label_options: list[str],
               task_description: str) -> Label:
    """Auto-label text using LLM with confidence score."""
    options_str = ", ".join(f'"{l}"' for l in label_options)
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Task: {task_description}\n"
                   f"Text: {text}\n"
                   f"Choose from: {options_str}\n"
                   f"Give label, confidence (0-1), and one-line reasoning."}],
        response_format=Label,
        temperature=0.0,
    )
    return result.choices[0].message.parsed
```
```python
def label_dataset(texts: list[str], labels: list[str],
                  task: str, confidence_threshold: float = 0.85) -> dict:
    """Label dataset, routing low-confidence items to human review."""
    auto_labeled, human_queue = [], []
    for text in texts:
        result = auto_label(text, labels, task)
        if result.confidence >= confidence_threshold:
            auto_labeled.append({"text": text, "label": result.label,
                                 "confidence": result.confidence})
        else:
            human_queue.append({"text": text, "predicted": result.label,
                                "confidence": result.confidence})
    print(f"Auto-labeled: {len(auto_labeled)} | Human review: {len(human_queue)}")
    return {"auto": auto_labeled, "human_queue": human_queue}

# Example
result = label_dataset(
    texts=["Great product, very happy!", "Terrible service, never again."],
    labels=["positive", "negative", "neutral"],
    task="Classify customer review sentiment",
)
```
```python
from collections import Counter
import statistics

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa — two-annotator agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance
    p_e = sum(
        (labels_a.count(cat) / n) * (labels_b.count(cat) / n)
        for cat in categories
    )
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

def majority_vote(label_sets: list[list]) -> list:
    """Aggregate multiple annotators via majority vote."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_sets)]

def annotation_stats(label_sets: list[list]) -> dict:
    """Compute agreement stats across annotators."""
    # Pairwise kappa over every annotator pair
    kappas = []
    for i in range(len(label_sets)):
        for j in range(i + 1, len(label_sets)):
            kappas.append(cohens_kappa(label_sets[i], label_sets[j]))
    final_labels = majority_vote(label_sets)
    # Confusion rate: fraction of items where annotators disagree
    confusion = sum(
        len(set(item)) > 1 for item in zip(*label_sets)
    ) / len(label_sets[0])
    return {
        "majority_labels": final_labels,
        "mean_kappa": round(statistics.mean(kappas), 3),
        "min_kappa": round(min(kappas), 3),
        "confusion_rate": round(confusion, 3),
        "interpretation": ("good" if statistics.mean(kappas) > 0.6
                           else "moderate" if statistics.mean(kappas) > 0.4
                           else "poor — review guidelines"),
    }

# Example with 3 annotators
a1 = ["pos", "neg", "pos", "neu", "neg"]
a2 = ["pos", "neg", "neu", "neu", "neg"]
a3 = ["pos", "neg", "pos", "pos", "neg"]
print(annotation_stats([a1, a2, a3]))
```
Run LLM on all unlabeled examples. Save output. Usually fast and cheap (Haiku preferable for cost).
Show humans LLM label + example. They accept, reject, or modify. Faster than labeling from scratch; quality is high if humans are careful.
Have LLM label 100 gold-standard examples. Calculate accuracy. If < 75%, iterate on prompt. If > 85%, proceed to full batch.
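The accuracy gate above reduces to a small comparison function. A sketch using the 75%/85% thresholds from the text (the function name and the "borderline" branch are illustrative):

```python
def validate_prompt(llm_labels: list[str], gold_labels: list[str]) -> str:
    """Gate the pipeline on gold-set accuracy, per the thresholds above."""
    accuracy = sum(p == g for p, g in zip(llm_labels, gold_labels)) / len(gold_labels)
    if accuracy > 0.85:
        return "proceed"       # run the full batch
    if accuracy < 0.75:
        return "iterate"       # rework the prompt
    return "borderline"        # tighten guidelines or expand the gold set
```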
Be specific: "Classify as positive, negative, or neutral" vs "Is this good?". Give examples: Show label + reasoning for 2–3 examples. Set constraints: "Output only the label, no explanation" reduces hallucination. Handle uncertainty: "If unclear, output 'UNCLEAR'" — flag for manual review.
Break complex tasks into simpler subtasks. Instead of "label sentiment & offensive language", have separate tasks. Easier to train, higher agreement, easier to outsource.
Assign 20–50 examples per batch to one annotator. Let them build context; reduces errors. Assign same batch to 2 annotators for spot-checking (every 100 examples).
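One way to implement the batching-with-overlap scheme above: round-robin batches to annotators and double-assign a periodic batch so agreement can be spot-checked (batch sizes and the `double_every` knob are assumptions, not a fixed recipe):

```python
def make_batches(examples: list, annotators: list[str],
                 batch_size: int = 30, double_every: int = 4) -> list[dict]:
    """Split examples into batches, round-robin to annotators; every
    `double_every`-th batch also goes to a second annotator for spot-checking."""
    batches = []
    for b, i in enumerate(range(0, len(examples), batch_size)):
        assigned = [annotators[b % len(annotators)]]
        if b % double_every == 0 and len(annotators) > 1:
            assigned.append(annotators[(b + 1) % len(annotators)])
        batches.append({"examples": examples[i:i + batch_size],
                        "annotators": assigned})
    return batches
```

The doubled batches feed directly into the pairwise-kappa computation shown earlier.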
Track annotator productivity and quality. Create profiles: some annotators are fast but loose (use their labels for data augmentation), others slow but thorough (use them for gold standards). Route work accordingly.
Store labels with metadata: annotator ID, timestamp, version number. If guidelines change, re-label old data with new version. Track label lineage.
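A label record carrying the metadata above might look like this (the schema is a sketch; field names are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LabelRecord:
    """One label plus the lineage metadata described above."""
    example_id: str
    label: str
    annotator_id: str
    guideline_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

def needs_relabel(record: LabelRecord, current_version: str) -> bool:
    """A label is stale once guidelines move past the version it was made under."""
    return record.guideline_version != current_version
```

Filtering a dataset with `needs_relabel` yields exactly the items to send back when guidelines change.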
After collecting 1000 examples, run basic model. Evaluate on test set. Identify low-confidence examples; send back to annotators for clarification. Refine guidelines based on model failures.
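The "send low-confidence examples back" step can be sketched as a selection over the baseline model's predictions, whatever model produced them (the prediction shape and threshold here are assumptions):

```python
def select_for_review(predictions: list[dict], threshold: float = 0.6,
                      max_items: int = 50) -> list[dict]:
    """Pick the least-confident model predictions to send back to annotators.

    Assumes predictions shaped like [{"example_id": ..., "confidence": float}, ...]
    from the baseline model trained on the first ~1000 labels.
    """
    low = [p for p in predictions if p["confidence"] < threshold]
    return sorted(low, key=lambda p: p["confidence"])[:max_items]
```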
Open-source annotation tool. Web UI, API, multi-task support. Self-hosted. Good for in-house teams.
Managed labeling service. Expert annotators, quality guarantees, fast turnaround. Expensive but high quality.
Open-source data labeling & curation. Model-in-the-loop, active learning, weak supervision.
Annotation tool with active learning. Iterative labeling; model suggests uncertain examples.
Enterprise labeling platform. Data versioning, review workflows, quality metrics built-in.
Weak supervision framework. Write labeling functions instead of labels; programmatic data generation.
Computer vision annotation tool. Boxes, polygons, segmentation. Extensible; works for text too.
Data quality platform. Identifies mislabeled examples; ranks data by confidence for active learning.