Training · Data Quality

Data-Centric AI

Shifting focus from model to data — quality over quantity, curation pipelines, and dataset management

5 techniques
8 sections
Python first
Contents
  1. The data-centric shift
  2. Quality dimensions
  3. Curation pipeline
  4. Dedup & filtering
  5. Annotation best practices
  6. Dataset management
  7. Tools & frameworks
  8. References
01 — Philosophy

The Data-Centric Shift

Model-centric: Iterate on the architecture, optimize hyperparameters, scale training. The data is assumed fixed. Data-centric: Hold the model fixed (use a standard architecture) and optimize the data. Data quality is assumed to be the constraint.

Evidence: 10,000 high-quality examples can beat 1 million noisy ones. Chinchilla scaling laws show that compute-optimal training scales model size and training tokens in equal proportion; data is half the equation. OpenAI's o1 spends more inference compute on reasoning and verification than on generation.

Why Data Matters More Now

Scaling Plateau

  • Model scaling has limits (Chinchilla, Kaplan laws)
  • Quality improvements outlast size improvements
  • Data diversity matters more than size

Diminishing Returns

  • 1B more parameters → small gain
  • 1B better examples → large gain
  • Data cleaning beats model tuning

Cost Structure

  • Model training is fixed cost
  • Data quality impacts all downstream uses
  • Bad data propagates downstream forever

Competitive Moat

  • Models open-source quickly
  • Data quality is proprietary/hard to replicate
  • Dataset quality = competitive advantage
💡 Spend 70% of effort on data, 30% on models. Most teams flip this. Data work feels less impressive but compounds.
02 — Measurement

Data Quality Dimensions

Accuracy

Does the label match the example? Image of cat labeled "dog" = error. Typos in text. Misaligned (prompt, response) pairs.

Completeness

Is all necessary information present? Truncated sentences. Incomplete images. Missing context that makes labels ambiguous.

Consistency

Are labels consistent across similar examples? Same cat image labeled "cat" once, "animal" another time. Inconsistent annotation guidelines.

Coverage

Does the dataset represent the distribution you care about? All examples from one domain. Underrepresented edge cases.

Freshness

Is data up-to-date? Old training data used for current tasks. Concept drift (examples become stale).

Quality Checklist

📋 Before training:
  ✓ Remove duplicates
  ✓ Check accuracy (sample 100)
  ✓ Verify label distribution
  ✓ Remove PII
  ✓ Document provenance
  ✓ Version the dataset
Quality Dimension | Problem | Detection Method | Fix
Correctness | Wrong labels, factual errors | Cross-annotator agreement, LLM audit | Re-label, expert review
Completeness | Missing fields, truncated text | Schema validation, length histograms | Filter or impute
Consistency | Same input, different labels | Duplicate detection, inter-rater reliability | Adjudication, majority vote
Relevance | Off-domain examples polluting the set | Embedding clustering, topic modeling | Domain classifier, manual curation
Diversity | Over-represented topics/styles | Embedding coverage analysis | Stratified sampling, synthetic data
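Several of the detection methods above (duplicate detection, label distribution, length histograms) can be automated in a few lines. A minimal sketch, assuming each example is a dict with `input` and `label` keys (that schema is an assumption, not part of the original):

```python
from collections import Counter

def audit_dataset(examples):
    """Pre-training audit: exact duplicates, label balance, length outliers."""
    texts = [ex["input"] for ex in examples]
    lengths = sorted(len(t) for t in texts)
    return {
        "n": len(examples),
        "exact_duplicates": len(texts) - len(set(texts)),
        "label_distribution": dict(Counter(ex["label"] for ex in examples)),
        "min_len": lengths[0],
        "median_len": lengths[len(lengths) // 2],
        "max_len": lengths[-1],
    }

examples = [
    {"input": "the sky is blue", "label": "fact"},
    {"input": "the sky is blue", "label": "fact"},       # exact duplicate
    {"input": "cats always land on their feet", "label": "myth"},
]
report = audit_dataset(examples)
print(report)
```

Run this before every training job; near-duplicate detection and deeper checks (PII, provenance) need the heavier tooling covered later.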
03 — Process

Curation Pipeline

Workflow

1. Collection: Gather raw data from sources (logs, APIs, crowdsourcing).
2. Deduplication: Remove duplicates, both exact and near-duplicate.
3. Quality filter: Remove low-quality examples (typos, truncation, irrelevance).
4. Annotation: Label if needed; ensure quality via inter-annotator agreement.
5. Versioning: Track changes; commit the dataset to version control.
Python · LLM-assisted data quality scoring and filtering
import json
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class QualityScore(BaseModel):
    score: int = Field(ge=1, le=5, description="Overall quality 1-5")
    issues: list[str] = Field(description="List of quality issues found")
    keep: bool = Field(description="Whether to keep this example in the dataset")

def score_training_example(input_text: str, output_text: str,
                            task_description: str) -> QualityScore:
    """Use LLM to score a training example for quality."""
    prompt = f"""Evaluate this training example for a {task_description} model.

Input: {input_text[:500]}
Output: {output_text[:500]}

Score on: accuracy, completeness, clarity, and task-relevance.
Score 1=poor/harmful, 3=acceptable, 5=excellent."""

    result = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=QualityScore,
        temperature=0.0
    )
    return result.choices[0].message.parsed

def filter_dataset(examples: list[dict], task: str,
                   min_score: int = 3) -> list[dict]:
    """Filter a dataset, keeping only high-quality examples."""
    filtered = []
    for ex in examples:
        score = score_training_example(ex["input"], ex["output"], task)
        if score.keep and score.score >= min_score:
            filtered.append({**ex, "_quality_score": score.score})
        else:
            print(f"Dropped (score={score.score}): {ex['input'][:50]}... Issues: {score.issues}")
    print(f"Kept {len(filtered)}/{len(examples)} examples")
    return filtered

Iterate: After first training run, analyze errors. Where does the model fail? Collect more examples for those cases.

04 — Cleaning

Deduplication & Filtering

MinHash LSH for Near-Duplicates

Exact string matching isn't enough: paraphrases and restatements of the same fact are near-duplicates. MinHash with Locality-Sensitive Hashing (LSH) finds them efficiently.

Python · Near-duplicate removal with MinHash LSH
from datasketch import MinHash, MinHashLSH

def minhash_dedup(texts, threshold=0.7):
    """Find near-duplicates using MinHash LSH."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    deduped = []
    for i, text in enumerate(texts):
        # Create a MinHash signature from the words
        m = MinHash()
        for word in text.split():
            m.update(word.encode('utf8'))
        # Skip anything similar to an already-kept example
        if lsh.query(m):
            continue
        lsh.insert(str(i), m)
        deduped.append(text)
    return deduped

# Usage
texts = ["The sky is blue", "The sky is blue", "Sky is blue", "Water is wet"]
print(minhash_dedup(texts))  # near-duplicates of the first sentence are dropped

Perplexity Filtering for Quality

Train a language model on high-quality data. Use its perplexity to score new examples. Low perplexity = looks like good data. High perplexity = outlier (possibly noisy).
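Real pipelines typically score with an n-gram LM (e.g., KenLM) or a small neural LM. As an illustration of the scoring logic only, here's a toy add-one-smoothed unigram model; the function names and the tiny reference corpus are invented for the sketch:

```python
import math
from collections import Counter

def train_unigram(reference_texts):
    """Fit a unigram model with add-one smoothing on high-quality reference text."""
    counts = Counter(w for t in reference_texts for w in t.lower().split())
    return counts, sum(counts.values()), len(counts) + 1  # +1 slot for unseen words

def perplexity(text, model):
    """Per-word perplexity under the unigram model; higher = less like the reference."""
    counts, total, vocab = model
    words = text.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))

model = train_unigram([
    "the cat sat on the mat",
    "the dog chased the cat across the yard",
])
print(perplexity("the cat sat on the mat", model))  # low: looks like reference data
print(perplexity("zxqv flurm blarg", model))        # high: every word is unseen
```

The filtering step is then just a threshold on the score, tuned on a held-out sample you've inspected by hand.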

fastText Quality Classifier: Train a binary classifier (good / bad) on labeled examples, score new examples, and threshold to filter out the low-scoring ones.
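fastText itself trains a supervised classifier over bags of word n-grams (`fasttext.train_supervised`). As a dependency-free stand-in that shows the same idea in miniature, here's a tiny bag-of-words Naive Bayes; the class name and the training texts/labels below are all invented:

```python
import math
from collections import Counter

class NaiveBayesQuality:
    """Tiny bag-of-words Naive Bayes as a stand-in for a fastText quality classifier."""

    def fit(self, texts, labels):
        self.vocab = {w for t in texts for w in t.lower().split()}
        self.priors, self.word_counts, self.totals = {}, {}, {}
        for label in set(labels):
            docs = [t for t, l in zip(texts, labels) if l == label]
            self.priors[label] = math.log(len(docs) / len(texts))
            counts = Counter(w for t in docs for w in t.lower().split())
            self.word_counts[label] = counts
            self.totals[label] = sum(counts.values())
        return self

    def predict(self, text):
        best_label, best_score = None, -math.inf
        for label in self.priors:
            score = self.priors[label]
            for w in text.lower().split():
                # add-one smoothing over the shared vocabulary
                p = (self.word_counts[label][w] + 1) / (self.totals[label] + len(self.vocab) + 1)
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesQuality().fit(
    ["clear complete sentence with proper grammar",
     "well formed answer with detail",
     "asdf lol ???",
     "CLICK HERE buy now!!!"],
    ["good", "good", "bad", "bad"],
)
print(clf.predict("well formed detailed answer"))
```

In production, swap this for fastText (fast, handles character n-grams) and train on a few thousand hand-labeled good/bad examples.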

💡 Dedup + filter = 30–50% size reduction with quality gains. Spend time here; it compounds across all downstream work.
Python · Dataset deduplication using MinHash LSH
from datasketch import MinHash, MinHashLSH
import re

def text_to_shingles(text: str, k: int = 5) -> set[str]:
    """Convert text to k-character shingles for MinHash."""
    text = re.sub(r'\s+', ' ', text.lower().strip())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in text_to_shingles(text):
        m.update(shingle.encode('utf8'))
    return m

def deduplicate_dataset(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of unique examples after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    keep = []
    for i, text in enumerate(texts):
        mh = build_minhash(text)
        if not lsh.query(mh):  # no near-duplicates found
            lsh.insert(f"doc_{i}", mh)
            keep.append(i)
    return keep

# Example
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",   # near-duplicate
    "A completely different sentence about machine learning.",
    "The quick brown fox jumped over the lazy dog.",  # near-duplicate
]
unique_indices = deduplicate_dataset(documents, threshold=0.7)
print(f"Kept {len(unique_indices)}/{len(documents)}: {unique_indices}")
# the two near-duplicates of document 0 should be dropped
05 — Labeling

Annotation Best Practices

Annotation Guidelines

Write clear rubrics. Show examples of edge cases. Define corner cases explicitly. Annotators need clarity.

Inter-Annotator Agreement

Have 2–3 people label the same examples. Measure agreement (Cohen's kappa, Krippendorff's alpha). If agreement is low (<0.8), guidelines are unclear. Refine.
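Cohen's kappa for two annotators is simple enough to compute directly; a minimal sketch (the toy labels below are made up):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement: both pick class c independently at their marginal rates
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / n**2
    return (observed - expected) / (1 - expected)

a = ["cat", "cat", "dog", "dog", "cat", "dog"]
b = ["cat", "cat", "dog", "cat", "cat", "dog"]
k = cohens_kappa(a, b)
print(round(k, 3))  # 0.667
```

For more than two annotators or missing labels, use Krippendorff's alpha from a library rather than rolling your own.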

Active Learning Loops

Don't label randomly. Train a model on labeled data. Find examples the model is uncertain about. Label those first. Maximizes information per label.
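The selection step is just "rank by model uncertainty, take the top of the list". A sketch using predictive entropy, with a stand-in probability table where your real model's `predict_proba` would go (all names below are hypothetical):

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(unlabeled, predict_proba, budget=2):
    """Rank unlabeled examples by predictive entropy; label the most uncertain first."""
    return sorted(unlabeled, key=lambda x: entropy(predict_proba(x)), reverse=True)[:budget]

# stand-in for a trained model's predicted class probabilities
fake_probs = {
    "obvious spam":     [0.98, 0.02],
    "borderline promo": [0.55, 0.45],
    "ambiguous reply":  [0.50, 0.50],
    "clear ham":        [0.95, 0.05],
}
picked = select_for_labeling(list(fake_probs), fake_probs.get, budget=2)
print(picked)
```

Entropy is one of several usable acquisition functions; margin sampling (difference between top-2 probabilities) and least-confidence sampling behave similarly for binary tasks.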

Quality Control

Gold set: Maintain a small set of high-quality, carefully labeled examples. Use to spot-check annotator quality. Attention checks: Include obvious examples. Annotators who fail are unreliable.
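Scoring annotators against the gold set is a one-function job; a minimal sketch, with invented annotator IDs and items:

```python
def annotator_accuracy(annotations, gold):
    """Score each annotator's labels against the gold set."""
    report = {}
    for annotator, labels in annotations.items():
        correct = sum(labels.get(item) == answer for item, answer in gold.items())
        report[annotator] = correct / len(gold)
    return report

gold = {"ex1": "cat", "ex2": "dog", "ex3": "cat"}
annotations = {
    "ann_a": {"ex1": "cat", "ex2": "dog", "ex3": "cat"},
    "ann_b": {"ex1": "dog", "ex2": "dog", "ex3": "dog"},
}
report = annotator_accuracy(annotations, gold)
print(report)
```

Run this continuously, not once: annotator quality drifts, especially on long projects.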

⚠️ Cheap annotation produces garbage data. Pay annotators fairly, provide training, and monitor quality continuously. Bad annotation is worse than no annotation.
06 — Operations

Dataset Management

Versioning with DVC / Hugging Face Datasets

DVC (Data Version Control): Track dataset versions like code. Commit dataset versions alongside code. Reproducible experiments.

Hugging Face Datasets: Upload and version datasets. Easy sharing. Built-in filtering and split management.

Lineage Tracking

Record: Where did this data come from? What transformations were applied? Who labeled it? When? Track this in metadata.
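A lineage record can be as simple as a content hash plus provenance fields. A minimal sketch; the field names and the example source/transform strings are assumptions, not a standard:

```python
import hashlib
import json
from datetime import datetime, timezone

def lineage_record(content: bytes, source: str, transforms: list[str],
                   annotator=None) -> dict:
    """Minimal provenance record: content hash plus where it came from and what touched it."""
    return {
        "sha256": hashlib.sha256(content).hexdigest(),  # identifies this exact content
        "source": source,
        "transforms": transforms,                        # applied in order
        "annotator": annotator,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

rec = lineage_record(
    b'{"input": "...", "output": "..."}',
    source="support_logs_2024_q3",
    transforms=["minhash_dedup", "llm_quality_filter>=3"],
    annotator="ann_a",
)
print(json.dumps(rec, indent=2))
```

Store one record per example (or per shard) next to the data; the hash lets you detect silent edits later.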

Train / Val / Test Splits

Temporal split: Time-series data? Use time as split criterion. Stratified split: Imbalanced classes? Maintain proportions in train/val/test. Domain split: Different domains? Test on unseen domain.
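Stratified splitting is easy to get right by hand: shuffle within each class, then cut. A minimal sketch, assuming examples are dicts with a `label` key (that schema is an assumption):

```python
import random
from collections import defaultdict

def stratified_split(examples, label_key="label", val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle within each class so train/val/test keep the label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = int(len(group) * val_frac)
        n_test = int(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test

data = [{"label": "pos"}] * 80 + [{"label": "neg"}] * 20
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 80 10 10
```

For temporal data, replace the shuffle with a sort by timestamp and cut at fixed dates instead.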

Maintenance

Monitor model performance over time. If it drops, check the data. Distribution shift? New types of queries? Update dataset quarterly.
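One common way to quantify "distribution shift?" is the Population Stability Index (PSI) over bucketed feature or topic proportions. A minimal sketch; the example distributions are invented, and the >0.25 cutoff is a widely used rule of thumb, not a law:

```python
import math

def psi(expected_props, actual_props, eps=1e-4):
    """Population Stability Index between two bucketed distributions (same buckets)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty buckets
        total += (a - e) * math.log(a / e)
    return total

train_dist = [0.5, 0.3, 0.2]  # e.g. topic shares at training time
live_dist = [0.2, 0.3, 0.5]   # topic shares in current traffic
drift = psi(train_dist, live_dist)
print(round(drift, 3))  # well above the ~0.25 "significant shift" rule of thumb
```

PSI near 0 means stable; values above roughly 0.25 are usually treated as a signal to re-examine and refresh the dataset.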

07 — Ecosystem

Tools & Frameworks

Labeling
Argilla
Annotation platform. Crowdsourcing, active learning, quality control.
Labeling
Label Studio
Open-source annotation tool. Images, text, audio. Export to standard formats.
Cleaning
Cleanlab
Find mislabeled examples. Estimate label quality. Data cleaning automation.
Versioning
DVC
Data version control. Track datasets like git. Reproducible pipelines.
Hub
Hugging Face Datasets
Dataset versioning, sharing, processing. 10K+ community datasets.
Benchmarking
DataComp
Benchmark for dataset curation: model and compute are fixed, participants compete on the training data.
Curation
Dolma
Allen AI's pretraining corpus. 3 trillion tokens. Open for research.
Corpus
The Stack
3.1 TB of permissively licensed source code from open repositories. Training data for code models.
08 — Further Reading

References

Research & Papers
  • Zhu, X., Lafferty, J., & Rosenfeld, R. (2003). Semi-Supervised Learning With Graphs. CMU.
  • Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
  • Northcutt, C. G., et al. (2021). Confident Learning: Estimating Uncertainty in Dataset Labels. arXiv:1911.00068.