Techniques for detecting and correcting label errors in training datasets — using disagreement analysis, model confidence, Cleanlab, and inter-annotator agreement metrics.
Research shows roughly 3–10% of labels in widely used benchmarks (ImageNet, CIFAR, etc.) are incorrect. Label noise degrades model performance, destabilizes loss curves, and leads models to memorize the noise rather than the underlying patterns. In practice, a model trained on clean data with 5% fewer examples often outperforms one trained on the noisy data with those extra examples.
Measure agreement before training to quantify label quality. Cohen's kappa > 0.8 is good; > 0.7 is acceptable; < 0.6 signals annotation problems.
```python
from sklearn.metrics import cohen_kappa_score

def measure_agreement(annotator_a: list, annotator_b: list) -> dict:
    """Compare two annotators' labels on the same items."""
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    agreement_rate = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
    return {
        "cohens_kappa": round(kappa, 3),
        "raw_agreement": round(agreement_rate, 3),
        "interpretation": "good" if kappa > 0.8 else "acceptable" if kappa > 0.7 else "poor",
    }

# Find specific disagreement cases for manual review
def find_disagreements(annotator_a: list, annotator_b: list, texts: list) -> list:
    return [
        {"text": t, "label_a": a, "label_b": b}
        for t, a, b in zip(texts, annotator_a, annotator_b)
        if a != b
    ]
```
Confident Learning (Northcutt et al. 2021, implemented in Cleanlab) detects label errors by cross-referencing model confidence with assigned labels. A model trained on the dataset assigns low confidence to its own training examples that have incorrect labels — because the label contradicts the pattern the model learned. This self-signal identifies likely label errors without re-annotation.
Cleanlab automates confident learning for classification tasks.
```python
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Method 1: find label issues.
# Get out-of-sample predicted probabilities via cross-validation.
pred_probs = cross_val_predict(LogisticRegression(), X_train, y_train,
                               cv=5, method="predict_proba")

# Find likely mislabelled examples, most suspicious first.
label_issues = find_label_issues(
    labels=y_train,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Found {len(label_issues)} potential label errors out of {len(y_train)}")

# Method 2: train with automatic label-noise handling.
cl = CleanLearning(clf=LogisticRegression())
cl.fit(X_train, y_train)
```
Don't automatically remove all flagged examples — review them. Sort flagged examples by confidence score (lowest confidence = most likely wrong). Review the top 20% manually: correct the label if clearly wrong, discard if ambiguous. Keep a review log: date, reviewer, original label, corrected label, decision rationale. Re-run Cleanlab after corrections to verify improvement.
Label quality improvement is iterative: (1) Train model on current labels. (2) Identify low-confidence examples with Cleanlab. (3) Review and correct 500–1000 highest-priority examples. (4) Retrain and re-evaluate. Repeat 2–3 cycles. Each cycle typically improves accuracy by 1–3pp with no new data collection — pure quality improvement on existing labels.
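The "self_confidence" ranking used above can be approximated by hand to build the review queue: each example's score is the model's predicted probability of its *assigned* label, and the lowest-scoring examples are reviewed first. A minimal sketch (function names are illustrative, not Cleanlab's API):

```python
def self_confidence_scores(labels, pred_probs):
    """Score each example by the predicted probability of its assigned label.

    Low scores mean the model disagrees with the label -- review those first.
    """
    return [probs[label] for label, probs in zip(labels, pred_probs)]

def review_queue(labels, pred_probs, top_k):
    """Indices of the top_k most suspicious examples, most suspicious first."""
    scores = self_confidence_scores(labels, pred_probs)
    return sorted(range(len(scores)), key=lambda i: scores[i])[:top_k]

# Example 1 is labelled 0, but the model puts 0.9 on class 1 -- it tops the queue.
labels = [0, 0, 1]
pred_probs = [[0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
print(review_queue(labels, pred_probs, top_k=2))  # [1, 2]
```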
Label quality depends on annotator agreement and ground truth accuracy. When multiple annotators label the same item, disagreements reveal ambiguity: either the task is unclear (guideline issue) or the item is genuinely ambiguous (requires context or domain expertise). Consensus approaches aggregate: majority vote is simplest; Dawid-Skene models annotator reliability; Bayesian approaches integrate uncertainty. For gold-standard evaluation, use a 3-annotator consensus on a subset, then have a senior expert resolve remaining disagreements. Track which examples had low agreement — these are learning opportunities for your model.
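Majority vote is simple enough to sketch directly, together with the escalation step: items whose winning label falls below a consensus threshold go to a senior expert. The threshold and data here are illustrative:

```python
from collections import Counter

def aggregate_labels(annotations, min_consensus=0.7):
    """Majority-vote each item's annotator labels; flag items whose winning
    label falls below the consensus threshold for expert escalation."""
    consensus, escalate = {}, []
    for item_id, item_labels in annotations.items():
        top_label, count = Counter(item_labels).most_common(1)[0]
        consensus[item_id] = top_label
        if count / len(item_labels) < min_consensus:
            escalate.append(item_id)
    return consensus, escalate

annotations = {"a": ["pos", "pos", "neg"], "b": ["neg", "neg", "neg"]}
labels, flagged = aggregate_labels(annotations)  # "a" has only 2/3 agreement
```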
| Quality Metric | Scale | Interpretation | Action |
|---|---|---|---|
| Inter-Annotator Agreement | 0–1 | >0.6 good, <0.4 poor | Retrain annotators |
| Confident Learning | 0–1 | Identifies label errors | Correct or remove |
| Worker Accuracy | 0–1 | Accuracy vs. gold set | Audit or replace worker |
| Crowd Consensus | 0–1 | Fraction agreeing | Escalate low-consensus items |
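The "Worker Accuracy" row above is straightforward to implement: score each annotator against the gold set and flag anyone below an audit threshold. A small sketch (names and the 0.8 cutoff are illustrative):

```python
def worker_accuracy(worker_labels, gold_labels):
    """Fraction of a worker's gold-set items labelled correctly."""
    matches = sum(w == g for w, g in zip(worker_labels, gold_labels))
    return matches / len(gold_labels)

def audit_workers(all_labels, gold_labels, threshold=0.8):
    """Per-worker accuracy on gold, plus the workers falling below threshold."""
    scores = {w: worker_accuracy(lbls, gold_labels)
              for w, lbls in all_labels.items()}
    flagged = [w for w, acc in scores.items() if acc < threshold]
    return scores, flagged

gold = [1, 0, 1, 1]
scores, flagged = audit_workers({"w1": [1, 0, 1, 1], "w2": [0, 0, 1, 0]}, gold)
# w1 scores 1.0; w2 scores 0.5 and is flagged for audit or replacement
```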
The noise-transition estimate at the heart of confident learning can be sketched directly:

```python
import numpy as np

def estimate_noise_transitions(given_labels: list, pred_labels: list) -> np.ndarray:
    """Estimate the label-noise transition matrix used in confident learning.

    Entry [i, j] is the fraction of items with given label i that the model
    predicts as class j; large off-diagonal entries point to classes that
    are systematically mislabelled.
    """
    num_classes = len(set(given_labels))
    confusion = np.zeros((num_classes, num_classes))
    for given, pred in zip(given_labels, pred_labels):
        confusion[given, pred] += 1
    return confusion / confusion.sum(axis=1, keepdims=True)
```

Modern label quality methods go beyond agreement. Confident Learning (cleanlab) identifies likely label errors by comparing model predictions to labels — if a model trained on the labels consistently predicts differently on a subset, those examples are suspicious. Worker accuracy tracking measures each annotator's agreement with the gold set, enabling quality-based weighting. Statistical process control charts (track accuracy over time) catch drift. Regular audits (sample 100 items, manually review) act as a sanity check. A robust labeling system combines agreement metrics, error detection, and continuous monitoring — not as one-time checks, but as ongoing quality assurance throughout the project.
Disagreement is information, not noise. When annotators disagree, it reveals task ambiguity. High disagreement (inter-annotator agreement <0.4) suggests the task definition is unclear, guidelines are ambiguous, or the phenomenon is genuinely hard. The fix: revise guidelines, do calibration rounds (all annotators label the same 20 items, discuss disagreements, align), or select only unambiguous examples for training. Some disagreement is unavoidable and acceptable; machine learning models also make mistakes. The goal isn't agreement on 100% of examples but agreement on the core concepts. Identify examples with high disagreement and exclude them or flag them as "inherently ambiguous" (useful for understanding model behavior on unclear inputs).
Worker reliability estimation is underrated. Different annotators have different skill levels, biases, and consistency. Assign reliability scores based on historical performance: how often does annotator A agree with annotator B? Are their errors systematic (always misses rare classes) or random? Use this information: weight high-reliability annotators more heavily in consensus decisions, or use their labels to calibrate low-reliability annotators. Dawid-Skene models are a principled approach: estimate both the true labels and annotator confusion matrices simultaneously. Simpler methods (reliability as agreement percentage) work surprisingly well. Tools like Prodigy track this automatically; manual systems require careful logging.
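A reliability-weighted vote is a lightweight step toward Dawid-Skene: weight each annotator's vote by their historical agreement score instead of counting votes equally. The scores and names here are illustrative:

```python
from collections import defaultdict

def weighted_vote(votes, reliability):
    """Pick the label with the highest reliability-weighted support.

    votes: {annotator: label}; reliability: {annotator: score in [0, 1]}.
    """
    support = defaultdict(float)
    for annotator, label in votes.items():
        support[label] += reliability[annotator]
    return max(support, key=support.get)

reliability = {"alice": 0.95, "bob": 0.5, "carol": 0.4}
# One high-reliability vote (0.95) outweighs two low-reliability votes (0.9).
print(weighted_vote({"alice": "spam", "bob": "ham", "carol": "ham"}, reliability))
```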
Quality assurance over time is challenging. New annotators often have a ramp-up period where quality improves. Experienced annotators can suffer drift—their standards shift over time. Monitor this: track each annotator's agreement with the gold set over time (does it trend down?). Set thresholds: if an annotator drops below 80% accuracy on gold, trigger retraining or retirement. Regular audits (spot-check 10% of annotations monthly) catch drift before it becomes a major problem. For long-running projects (months or years), the gold set itself might drift or become outdated—periodically refresh it with new expert reviews. A robust labeling system treats quality as an ongoing concern, not a one-time setup.
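The gold-set monitoring described above can be a rolling window of per-annotator correctness with a threshold trigger. The 0.8 threshold mirrors the text; the class name and window size are illustrative:

```python
from collections import deque

class GoldMonitor:
    """Track each annotation against gold in a rolling window; trigger
    retraining when an annotator's windowed accuracy drops below threshold."""

    def __init__(self, window=50, threshold=0.8):
        self.window, self.threshold = window, threshold
        self.history = {}  # annotator -> deque of 0/1 correctness

    def record(self, annotator, label, gold_label):
        hits = self.history.setdefault(annotator, deque(maxlen=self.window))
        hits.append(int(label == gold_label))

    def needs_retraining(self, annotator):
        hits = self.history.get(annotator, [])
        return bool(hits) and sum(hits) / len(hits) < self.threshold

mon = GoldMonitor(window=4)
for label, gold in [(1, 1), (0, 1), (0, 1), (1, 1)]:  # 2/4 correct
    mon.record("dave", label, gold)
print(mon.needs_retraining("dave"))  # True: 0.5 < 0.8
```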
Active learning for labeling (which examples to send to annotators?) is an underused technique. Uncertainty-based selection (send examples the current model is most uncertain about) can reduce labeling effort. The idea: annotate the examples your model is struggling with, not random examples. Disagreement-based selection (send examples to multiple annotators if the model is uncertain) also works. Some systems use model-driven sampling: train a classifier on labeled data, identify uncertain examples, request labels for those, retrain. This is an outer loop of active learning applied to the annotation process itself. Savings can be 30–50% fewer annotations for the same model performance. The downside: you need a model to guide selection, which requires initial labeled data (chicken-and-egg problem). For new tasks without any labels, start with random sampling or heuristic selection (diversity-based, importance-based) until you have enough labels to train a model.
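Uncertainty-based selection can be sketched with entropy over the current model's predicted probabilities: send the highest-entropy examples to annotators first. The batch size and data are illustrative:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pred_probs, batch_size):
    """Indices of the batch_size most uncertain examples, most uncertain first."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]), reverse=True)
    return ranked[:batch_size]

pred_probs = [[0.98, 0.02], [0.5, 0.5], [0.6, 0.4]]
print(select_for_labeling(pred_probs, batch_size=2))  # [1, 2]
```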
Crowdsourcing platforms introduce a scale dimension. Mechanical Turk, Scale AI, and others provide pools of annotators. Scale and quality are in tension: fast turnaround (1000 labels in an hour) often means lower quality; high quality (expert annotators) means slower turnaround. Best practices: (1) Qualification tests before real work (verify annotator competence), (2) Gold examples in the task (catch careless annotators), (3) Attention checks (random questions to detect automation), (4) Feedback loops (show annotators their accuracy to encourage improvement). Tools like Kili and Label Studio integrate with crowd platforms. For internal teams (your employees annotating), the dynamics differ: longer-term relationships enable better training, but turnover and scaling are harder. Many organizations use hybrid approaches: internal teams handle complex or sensitive tasks, crowdsourcing handles volume work. The mix depends on your data security needs, budget, and timeline.
Beyond aggregate metrics, consider per-example quality. Some examples are harder than others; disagreement might indicate inherent ambiguity rather than annotator error. Identify hard examples and decide: exclude them (simpler model, clean labels), keep them (model learns uncertainty), or relabel with more annotators (consensus on hard examples). Some systems use different training procedures for hard vs. easy examples: weight hard examples higher (focus learning), or weight them lower (focus on core patterns). There's no universal rule; it's task-dependent. For classification with natural class imbalance (rare disease detection), hard examples are valuable signal; for sentiment analysis on uniform text, hard examples might just be noise. Analyze your disagreement patterns: is it systematic (some annotators always rate higher) or example-specific? Systematic disagreement suggests annotator bias; example-specific disagreement suggests true ambiguity. Correcting bias improves overall quality; accepting ambiguity improves generalization.