Training · Data Quality

Data-Centric AI

Shifting focus from model to data — quality over quantity, curation pipelines, and dataset management

5 techniques
8 sections
Python first
Contents
  1. The data-centric shift
  2. Quality dimensions
  3. Curation pipeline
  4. Dedup & filtering
  5. Annotation best practices
  6. Dataset management
  7. Tools & frameworks
  8. References
01 — Philosophy

The Data-Centric Shift

Model-centric: Iterate on the architecture, optimize hyperparameters, scale training. The data is assumed fixed. Data-centric: Hold the model fixed (use a standard architecture) and optimize the data. Data quality is assumed to be the constraint.

Evidence: 10,000 high-quality examples can beat 1 million noisy ones. Chinchilla scaling laws show that compute-optimal training scales model size and training tokens in equal proportion; data is half the equation. OpenAI's o1 spends more inference compute on reasoning and verification than on generation.

Why Data Matters More Now

Scaling Plateau

  • Model scaling has limits (Chinchilla, Kaplan laws)
  • Quality improvements outlast size improvements
  • Data diversity matters more than size

Diminishing Returns

  • 1B more parameters → small gain
  • 1B better examples → large gain
  • Data cleaning beats model tuning

Cost Structure

  • Model training is fixed cost
  • Data quality impacts all downstream uses
  • Bad data propagates downstream forever

Competitive Moat

  • Models open-source quickly
  • Data quality is proprietary/hard to replicate
  • Dataset quality = competitive advantage
💡 Spend 70% of effort on data, 30% on models. Most teams flip this. Data work feels less impressive but compounds.
02 — Measurement

Data Quality Dimensions

Accuracy

Does the label match the example? Image of cat labeled "dog" = error. Typos in text. Misaligned (prompt, response) pairs.

Completeness

Is all necessary information present? Truncated sentences. Incomplete images. Missing context that makes labels ambiguous.

Consistency

Are labels consistent across similar examples? Same cat image labeled "cat" once, "animal" another time. Inconsistent annotation guidelines.

Coverage

Does the dataset represent the distribution you care about? All examples from one domain. Underrepresented edge cases.

Freshness

Is data up-to-date? Old training data used for current tasks. Concept drift (examples become stale).

Quality Checklist

📋 Before training:
  ✓ Remove duplicates
  ✓ Check accuracy (sample 100)
  ✓ Verify label distribution
  ✓ Remove PII
  ✓ Document provenance
  ✓ Version the dataset
Quality Dimension | Problem | Detection Method | Fix
Correctness | Wrong labels, factual errors | Cross-annotator agreement, LLM audit | Re-label, expert review
Completeness | Missing fields, truncated text | Schema validation, length histograms | Filter or impute
Consistency | Same input, different labels | Duplicate detection, inter-rater reliability | Adjudication, majority vote
Relevance | Off-domain examples polluting the set | Embedding clustering, topic modeling | Domain classifier, manual curation
Diversity | Over-represented topics/styles | Embedding coverage analysis | Stratified sampling, synthetic data
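Several of the detection methods above (duplicate detection, label distribution, length histograms) can be automated in a few lines. A minimal sketch, assuming each example is a dict with `input` and `label` keys (that schema is an assumption, not part of the original):

```python
from collections import Counter

def audit_dataset(examples):
    """Pre-training audit: exact duplicates, label balance, length outliers."""
    texts = [ex["input"] for ex in examples]
    lengths = sorted(len(t) for t in texts)
    return {
        "n": len(examples),
        "exact_duplicates": len(texts) - len(set(texts)),
        "label_distribution": dict(Counter(ex["label"] for ex in examples)),
        "min_len": lengths[0],
        "median_len": lengths[len(lengths) // 2],
        "max_len": lengths[-1],
    }

examples = [
    {"input": "the sky is blue", "label": "fact"},
    {"input": "the sky is blue", "label": "fact"},       # exact duplicate
    {"input": "cats always land on their feet", "label": "myth"},
]
report = audit_dataset(examples)
print(report)
```

Run this before every training job; near-duplicate detection and deeper checks (PII, provenance) need the heavier tooling covered later.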
03 — Process

Curation Pipeline

Workflow

1. Collection: Gather raw data from sources (logs, APIs, crowdsourcing).
2. Deduplication: Remove duplicates, both exact and near-duplicate.
3. Quality filter: Remove low-quality examples (typos, truncation, irrelevance).
4. Annotation: Label if needed; ensure quality via inter-annotator agreement.
5. Versioning: Track changes; commit the dataset to version control.
Python · LLM-assisted data quality scoring and filtering
import json
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class QualityScore(BaseModel):
    score: int = Field(ge=1, le=5, description="Overall quality 1-5")
    issues: list[str] = Field(description="List of quality issues found")
    keep: bool = Field(description="Whether to keep this example in the dataset")

def score_training_example(input_text: str, output_text: str,
                            task_description: str) -> QualityScore:
    """Use LLM to score a training example for quality."""
    prompt = f"""Evaluate this training example for a {task_description} model.

Input: {input_text[:500]}
Output: {output_text[:500]}

Score on: accuracy, completeness, clarity, and task-relevance.
Score 1=poor/harmful, 3=acceptable, 5=excellent."""

    result = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=QualityScore,
        temperature=0.0
    )
    return result.choices[0].message.parsed

def filter_dataset(examples: list[dict], task: str,
                   min_score: int = 3) -> list[dict]:
    """Filter a dataset, keeping only high-quality examples."""
    filtered = []
    for ex in examples:
        score = score_training_example(ex["input"], ex["output"], task)
        if score.keep and score.score >= min_score:
            filtered.append({**ex, "_quality_score": score.score})
        else:
            print(f"Dropped (score={score.score}): {ex['input'][:50]}... Issues: {score.issues}")
    print(f"Kept {len(filtered)}/{len(examples)} examples")
    return filtered

Iterate: After first training run, analyze errors. Where does the model fail? Collect more examples for those cases.

04 — Cleaning

Deduplication & Filtering

MinHash LSH for Near-Duplicates

Exact string matching isn't enough: paraphrases and restatements of the same fact are near-duplicates. MinHash with Locality-Sensitive Hashing (LSH) finds them efficiently.

Python · Near-duplicate removal with MinHash LSH
from datasketch import MinHash, MinHashLSH

def minhash_dedup(texts, threshold=0.7):
    """Find near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    deduped = []
    for i, text in enumerate(texts):
        # Create a MinHash signature from the words
        m = MinHash()
        for word in text.split():
            m.update(word.encode('utf8'))
        # Skip anything similar to an already-kept example
        if lsh.query(m):
            continue
        lsh.insert(str(i), m)
        deduped.append(text)
    return deduped

# Usage
texts = ["The sky is blue", "The sky is blue", "Sky is blue", "Water is wet"]
print(minhash_dedup(texts))  # near-duplicates of the first sentence are dropped

Perplexity Filtering for Quality

Train a language model on high-quality data. Use its perplexity to score new examples. Low perplexity = looks like good data. High perplexity = outlier (possibly noisy).
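Real pipelines typically score with an n-gram LM (e.g., KenLM) or a small neural LM. As an illustration of the scoring logic only, here's a toy add-one-smoothed unigram model; the function names and the tiny reference corpus are invented for the sketch:

```python
import math
from collections import Counter

def train_unigram(reference_texts):
    """Fit a unigram model with add-one smoothing on high-quality reference text."""
    counts = Counter(w for t in reference_texts for w in t.lower().split())
    return counts, sum(counts.values()), len(counts) + 1  # +1 slot for unseen words

def perplexity(text, model):
    """Per-word perplexity under the unigram model; higher = less like the reference."""
    counts, total, vocab = model
    words = text.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

model = train_unigram([
    "the cat sat on the mat",
    "the dog chased the cat across the yard",
])
print(perplexity("the cat sat on the mat", model))  # low: looks like reference data
print(perplexity("zxqv flurm blarg", model))        # high: every word is unseen
```

The filtering step is then just a threshold on the score, tuned on a held-out sample you've inspected by hand.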

fastText Quality Classifier: Train a binary classifier (good / bad) on labeled examples, score new examples, and threshold to filter out the low-scoring ones.
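fastText itself trains a supervised classifier over bags of word n-grams (`fasttext.train_supervised`). As a dependency-free stand-in that shows the same idea in miniature, here's a tiny bag-of-words Naive Bayes; the class name and the training texts/labels below are all invented:

```python
import math
from collections import Counter

class NaiveBayesQuality:
    """Tiny bag-of-words Naive Bayes as a stand-in for a fastText quality classifier."""

    def fit(self, texts, labels):
        self.vocab = {w for t in texts for w in t.lower().split()}
        self.priors, self.word_counts, self.totals = {}, {}, {}
        for label in set(labels):
            docs = [t for t, l in zip(texts, labels) if l == label]
            self.priors[label] = math.log(len(docs) / len(texts))
            counts = Counter(w for t in docs for w in t.lower().split())
            self.word_counts[label] = counts
            self.totals[label] = sum(counts.values())
        return self

    def predict(self, text):
        best_label, best_score = None, -math.inf
        for label in self.priors:
            score = self.priors[label]
            for w in text.lower().split():
                # add-one smoothing over the shared vocabulary
                p = (self.word_counts[label][w] + 1) / (self.totals[label] + len(self.vocab) + 1)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesQuality().fit(
    ["clear complete sentence with proper grammar",
     "well formed answer with detail",
     "asdf lol ???",
     "CLICK HERE buy now!!!"],
    ["good", "good", "bad", "bad"],
)
print(clf.predict("well formed detailed answer"))
```

In production, swap this for fastText (fast, handles character n-grams) and train on a few thousand hand-labeled good/bad examples.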

💡 Dedup + filter = 30–50% size reduction with quality gains. Spend time here; it compounds across all downstream work.
Python · Dataset deduplication using MinHash LSH
from datasketch import MinHash, MinHashLSH
import re

def text_to_shingles(text: str, k: int = 5) -> set[str]:
    """Convert text to k-character shingles for MinHash."""
    text = re.sub(r'\s+', ' ', text.lower().strip())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in text_to_shingles(text):
        m.update(shingle.encode('utf8'))
    return m

def deduplicate_dataset(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of unique examples after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    keep = []
    for i, text in enumerate(texts):
        mh = build_minhash(text)
        if not lsh.query(mh):  # no near-duplicates found
            lsh.insert(f"doc_{i}", mh)
            keep.append(i)
    return keep

# Example
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",   # near-duplicate
    "A completely different sentence about machine learning.",
    "The quick brown fox jumped over the lazy dog.",  # near-duplicate
]
unique_indices = deduplicate_dataset(documents, threshold=0.7)
print(f"Kept {len(unique_indices)}/{len(documents)}: {unique_indices}")
# the two near-duplicates of document 0 should be dropped
05 — Labeling

Annotation Best Practices

Annotation Guidelines

Write clear rubrics. Show examples of edge cases. Define corner cases explicitly. Annotators need clarity.

Inter-Annotator Agreement

Have 2–3 people label the same examples. Measure agreement (Cohen's kappa, Krippendorff's alpha). If agreement is low (<0.8), guidelines are unclear. Refine.
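Cohen's kappa for two annotators is simple enough to compute directly; a minimal sketch (the toy labels below are made up):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: both pick class c independently at their marginal rates
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (observed - expected) / (1 - expected)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "cat", "dog", "cat", "cat", "dog"]
k = cohens_kappa(a, b)
print(round(k, 3))  # 0.667
```

For more than two annotators or missing labels, use Krippendorff's alpha from a library rather than rolling your own.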

Active Learning Loops

Don't label randomly. Train a model on labeled data. Find examples the model is uncertain about. Label those first. Maximizes information per label.
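The selection step is just "rank by model uncertainty, take the top of the list". A sketch using predictive entropy, with a stand-in probability table where your real model's `predict_proba` would go (all names below are hypothetical):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(unlabeled, predict_proba, budget=2):
    """Rank unlabeled examples by predictive entropy; label the most uncertain first."""
    return sorted(unlabeled, key=lambda x: entropy(predict_proba(x)), reverse=True)[:budget]

# stand-in for a trained model's predicted class probabilities
fake_probs = {
    "obvious spam":     [0.98, 0.02],
    "borderline promo": [0.55, 0.45],
    "ambiguous reply":  [0.50, 0.50],
    "clear ham":        [0.95, 0.05],
}
picked = select_for_labeling(list(fake_probs), fake_probs.get, budget=2)
print(picked)
```

Entropy is one of several usable acquisition functions; margin sampling (difference between top-2 probabilities) and least-confidence sampling behave similarly for binary tasks.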

Quality Control

Gold set: Maintain a small set of high-quality, carefully labeled examples. Use to spot-check annotator quality. Attention checks: Include obvious examples. Annotators who fail are unreliable.
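Scoring annotators against the gold set is a one-function job; a minimal sketch, with invented annotator IDs and items:

```python
def annotator_accuracy(annotations, gold):
    """Score each annotator's labels against the gold set."""
    report = {}
    for annotator, labels in annotations.items():
        correct = sum(labels.get(item) == answer for item, answer in gold.items())
        report[annotator] = correct / len(gold)
    return report

gold = {"ex1": "cat", "ex2": "dog", "ex3": "cat"}
annotations = {
    "ann_a": {"ex1": "cat", "ex2": "dog", "ex3": "cat"},
    "ann_b": {"ex1": "dog", "ex2": "dog", "ex3": "dog"},
}
report = annotator_accuracy(annotations, gold)
print(report)
```

Run this continuously, not once: annotator quality drifts, especially on long projects.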

⚠️ Cheap annotation produces garbage data. Pay annotators fairly, provide training, and monitor quality continuously. Bad annotation is worse than no annotation.
06 — Operations

Dataset Management

Versioning with DVC / Hugging Face Datasets

DVC (Data Version Control): Track dataset versions like code. Commit dataset versions alongside code. Reproducible experiments.

Hugging Face Datasets: Upload and version datasets. Easy sharing. Built-in filtering and split management.

Lineage Tracking

Record: Where did this data come from? What transformations were applied? Who labeled it? When? Track this in metadata.
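A lineage record can be as simple as a content hash plus provenance fields. A minimal sketch; the field names and the example source/transform strings are assumptions, not a standard:

```python
import hashlib
import json
from datetime import datetime, timezone

def lineage_record(content: bytes, source: str, transforms: list[str],
                   annotator=None) -> dict:
    """Minimal provenance record: content hash plus where it came from and what touched it."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),  # identifies this exact content
        "source": source,
        "transforms": transforms,                        # applied in order
        "annotator": annotator,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

rec = lineage_record(
    b'{"input": "...", "output": "..."}',
    source="support_logs_2024_q3",
    transforms=["minhash_dedup", "llm_quality_filter>=3"],
    annotator="ann_a",
)
print(json.dumps(rec, indent=2))
```

Store one record per example (or per shard) next to the data; the hash lets you detect silent edits later.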

Train / Val / Test Splits

Temporal split: Time-series data? Use time as split criterion. Stratified split: Imbalanced classes? Maintain proportions in train/val/test. Domain split: Different domains? Test on unseen domain.
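Stratified splitting is easy to get right by hand: shuffle within each class, then cut. A minimal sketch, assuming examples are dicts with a `label` key (that schema is an assumption):

```python
import random
from collections import defaultdict

def stratified_split(examples, label_key="label", val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle within each class so train/val/test keep the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = int(len(group) * val_frac)
        n_test = int(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test

data = [{"label": "pos"}] * 80 + [{"label": "neg"}] * 20
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 80 10 10
```

For temporal data, replace the shuffle with a sort by timestamp and cut at fixed dates instead.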

Maintenance

Monitor model performance over time. If it drops, check the data. Distribution shift? New types of queries? Update dataset quarterly.
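One common way to quantify "distribution shift?" is the Population Stability Index (PSI) over bucketed feature or topic proportions. A minimal sketch; the example distributions are invented, and the >0.25 cutoff is a widely used rule of thumb, not a law:

```python
import math

def psi(expected_props, actual_props, eps=1e-4):
    """Population Stability Index between two bucketed distributions (same buckets)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty buckets
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.5, 0.3, 0.2]  # e.g. topic shares at training time
live_dist = [0.2, 0.3, 0.5]   # topic shares in current traffic
drift = psi(train_dist, live_dist)
print(round(drift, 3))  # well above the ~0.25 "significant shift" rule of thumb
```

PSI near 0 means stable; values above roughly 0.25 are usually treated as a signal to re-examine and refresh the dataset.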

07 — Ecosystem

Tools & Frameworks

Labeling
Argilla
Annotation platform. Crowdsourcing, active learning, quality control.
Labeling
Label Studio
Open-source annotation tool. Images, text, audio. Export to standard formats.
Cleaning
Cleanlab
Find mislabeled examples. Estimate label quality. Data cleaning automation.
Versioning
DVC
Data version control. Track datasets like git. Reproducible pipelines.
Hub
Hugging Face Datasets
Dataset versioning, sharing, processing. 10K+ community datasets.
Benchmarking
DataComp
Benchmark for dataset curation: model and compute are fixed, participants compete on the training data.
Curation
Dolma
Allen AI's pretraining corpus. 3 trillion tokens. Open for research.
Corpus
The Stack
3.1 TB of permissively licensed source code from open repositories. Training data for code models.
08 — Further Reading

References

Research & Papers
  • Zhu, X., Lafferty, J., & Rosenfeld, R. (2003). Semi-Supervised Learning With Graphs. CMU.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
  • Northcutt, C. G., et al. (2021). Confident Learning: Estimating Uncertainty in Dataset Labels. arXiv:1911.00068.