Shifting focus from model to data — quality over quantity, curation pipelines, and dataset management
Model-centric: Hold the data fixed; improve the architecture, tune hyperparameters, scale training. Data-centric: Hold the model fixed (use a standard architecture); improve the data. The bet is that data quality, not the model, is the binding constraint.
Evidence: 10,000 high-quality examples can beat 1 million noisy ones. The Chinchilla scaling laws show that, for a fixed compute budget, model size and training tokens should scale in roughly equal proportion; data matters as much as parameters. OpenAI's o1 shifts compute toward test-time reasoning (verification) rather than generation alone.
Correctness: Does the label match the example? An image of a cat labeled "dog" is an error. Typos in text. Misaligned (prompt, response) pairs.
Completeness: Is all necessary information present? Truncated sentences. Incomplete images. Missing context that makes labels ambiguous.
Consistency: Are labels consistent across similar examples? The same cat image labeled "cat" once and "animal" another time. Inconsistent annotation guidelines.
Relevance: Does the dataset represent the distribution you care about? All examples from one domain. Underrepresented edge cases.
Timeliness: Is the data up to date? Old training data used for current tasks. Concept drift makes examples go stale.
| Quality Dimension | Problem | Detection Method | Fix |
|---|---|---|---|
| Correctness | Wrong labels, factual errors | Cross-annotator agreement, LLM audit | Re-label, expert review |
| Completeness | Missing fields, truncated text | Schema validation, length histograms | Filter or impute |
| Consistency | Same input, different labels | Duplicate detection, inter-rater reliability | Adjudication, majority vote |
| Relevance | Off-domain examples polluting set | Embedding clustering, topic modeling | Domain classifier, manual curation |
| Diversity | Over-represented topics/styles | Embedding coverage analysis | Stratified sampling, synthetic data |
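The first two rows of the table (schema validation, length checks) need nothing more than plain Python. A minimal sketch; the `input`/`output` field names and the 10-character floor are assumptions, not a standard:

```python
def completeness_report(examples: list[dict], min_len: int = 10) -> dict:
    """Flag examples with missing required fields or suspiciously short text."""
    required = {"input", "output"}
    missing, too_short = [], []
    for i, ex in enumerate(examples):
        absent = required - ex.keys()
        if absent:
            missing.append((i, sorted(absent)))
        elif len(ex["input"]) < min_len or len(ex["output"]) < min_len:
            too_short.append(i)
    return {
        "missing_fields": missing,   # (index, [missing field names])
        "too_short": too_short,      # indices below the length floor
        "clean": len(examples) - len(missing) - len(too_short),
    }
```

In practice you would plot the length histogram rather than hard-code a floor, but a fixed floor catches the worst truncations.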
```python
from openai import OpenAI
from pydantic import BaseModel, Field

client = OpenAI()

class QualityScore(BaseModel):
    score: int = Field(ge=1, le=5, description="Overall quality 1-5")
    issues: list[str] = Field(description="List of quality issues found")
    keep: bool = Field(description="Whether to keep this example in the dataset")

def score_training_example(input_text: str, output_text: str,
                           task_description: str) -> QualityScore:
    """Use an LLM to score a training example for quality."""
    prompt = f"""Evaluate this training example for a {task_description} model.

Input: {input_text[:500]}
Output: {output_text[:500]}

Score on: accuracy, completeness, clarity, and task-relevance.
Score 1=poor/harmful, 3=acceptable, 5=excellent."""
    result = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format=QualityScore,
        temperature=0.0,
    )
    return result.choices[0].message.parsed

def filter_dataset(examples: list[dict], task: str,
                   min_score: int = 3) -> list[dict]:
    """Filter a dataset, keeping only high-quality examples."""
    filtered = []
    for ex in examples:
        score = score_training_example(ex["input"], ex["output"], task)
        if score.keep and score.score >= min_score:
            filtered.append({**ex, "_quality_score": score.score})
        else:
            print(f"Dropped (score={score.score}): {ex['input'][:50]}... "
                  f"Issues: {score.issues}")
    print(f"Kept {len(filtered)}/{len(examples)} examples")
    return filtered
```
Iterate: After first training run, analyze errors. Where does the model fail? Collect more examples for those cases.
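That error analysis can start as a simple tally. A sketch assuming each failure has been hand-tagged with a `category` field (the tag names are illustrative):

```python
from collections import Counter

def error_buckets(failures: list[dict]) -> list[tuple[str, int]]:
    """Count model failures by category, most common first, to guide data collection."""
    return Counter(f["category"] for f in failures).most_common()

failures = [
    {"category": "long_context"},
    {"category": "math"},
    {"category": "long_context"},
]
print(error_buckets(failures))  # [('long_context', 2), ('math', 1)]
```

The top buckets tell you where to collect more examples.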
Don't rely on exact string matching alone. Similar examples (paraphrases, the same fact rephrased) are near-duplicates. MinHash with locality-sensitive hashing (LSH) finds them efficiently.
Train a language model on high-quality data. Use its perplexity to score new examples. Low perplexity = looks like good data. High perplexity = outlier (possibly noisy).
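To make the idea concrete without a neural LM, here is a character-bigram model used the same way; this toy stand-in is my assumption, not a production choice (real pipelines typically use something like KenLM or a small transformer):

```python
import math
from collections import Counter

class CharBigramModel:
    """Tiny character-bigram LM trained on trusted text; scores new text by perplexity."""

    def __init__(self, corpus: list[str]):
        self.unigrams: Counter = Counter()
        self.bigrams: Counter = Counter()
        for text in corpus:
            padded = "^" + text          # "^" marks start-of-text
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab_size = len(self.unigrams) + 1   # +1 for unseen symbols

    def perplexity(self, text: str) -> float:
        padded = "^" + text
        log_prob = 0.0
        for a, b in zip(padded, padded[1:]):
            # Add-one smoothing so unseen bigrams get small but nonzero probability.
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / max(len(text), 1))

lm = CharBigramModel([
    "the quick brown fox jumps over the lazy dog",
    "she sells sea shells by the sea shore",
    "the cat sat on the mat",
])
# Fluent text scores much lower perplexity than keyboard mash.
print(lm.perplexity("the cat sat on the mat") < lm.perplexity("zzxqjv kkwq vvxz"))  # True
```

Filter by keeping examples whose perplexity falls below some percentile of the trusted set's own scores.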
fastText Quality Classifier: Train a binary good/bad classifier on labeled examples, score new examples, and filter those below a probability threshold.
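fastText itself is an external library; to show the mechanism with only the standard library, here is a bag-of-words Naive Bayes classifier playing the same good/bad role (the training texts are invented):

```python
import math
from collections import Counter, defaultdict

class GoodBadClassifier:
    """Bag-of-words Naive Bayes; a stdlib stand-in for a fastText quality classifier."""

    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.word_counts: dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocab: set[str] = set()

    def fit(self, texts: list[str], labels: list[str]) -> None:
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.label_counts[label] += 1
            self.vocab.update(words)

    def _log_score(self, text: str, label: str) -> float:
        log_p = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        denom = sum(self.word_counts[label].values()) + self.smoothing * len(self.vocab)
        for word in text.lower().split():
            # Laplace-smoothed per-word likelihood under this label.
            log_p += math.log((self.word_counts[label][word] + self.smoothing) / denom)
        return log_p

    def predict(self, text: str) -> str:
        return max(self.label_counts, key=lambda label: self._log_score(text, label))

clf = GoodBadClassifier()
clf.fit(
    ["the report summarizes quarterly revenue clearly",
     "the answer explains the steps accurately",
     "buy now click here free free free",
     "click spam click spam win money now"],
    ["good", "good", "bad", "bad"],
)
print(clf.predict("click here to win free money"))  # bad
```

With real data you would train on thousands of labeled examples; the filtering step is the same either way.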
```python
from datasketch import MinHash, MinHashLSH
import re

def text_to_shingles(text: str, k: int = 5) -> set[str]:
    """Convert text to k-character shingles for MinHash."""
    text = re.sub(r'\s+', ' ', text.lower().strip())
    return {text[i:i+k] for i in range(len(text) - k + 1)}

def build_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in text_to_shingles(text):
        m.update(shingle.encode('utf8'))
    return m

def deduplicate_dataset(texts: list[str], threshold: float = 0.8) -> list[int]:
    """Return indices of unique examples after near-duplicate removal."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    keep = []
    for i, text in enumerate(texts):
        mh = build_minhash(text)
        if not lsh.query(mh):  # no near-duplicates found
            lsh.insert(f"doc_{i}", mh)
            keep.append(i)
    return keep

# Example
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",   # near-duplicate
    "A completely different sentence about machine learning.",
    "The quick brown fox jumped over the lazy dog.",  # looser near-duplicate
]
# A 0.7 threshold catches the looser "jumped" paraphrase; at 0.8 its
# shingle similarity (~0.76) would likely slip through.
unique_indices = deduplicate_dataset(documents, threshold=0.7)
print(f"Kept {len(unique_indices)}/{len(documents)}: {unique_indices}")
# Typically: Kept 2/4: [0, 2]  (MinHash is probabilistic, so borderline pairs can vary)
```
Write clear rubrics. Show examples of edge cases. Define corner cases explicitly. Annotators need clarity.
Have 2–3 people label the same examples. Measure agreement (Cohen's kappa, Krippendorff's alpha). If agreement is low (<0.8), guidelines are unclear. Refine.
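Cohen's kappa corrects raw agreement for the agreement expected by chance. A self-contained implementation for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0          # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)

print(cohens_kappa(["cat", "cat", "dog", "dog"],
                   ["cat", "dog", "dog", "dog"]))  # 0.5
```

In this example raw agreement is 0.75 but kappa is only 0.5, which is why percent agreement alone overstates guideline quality.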
Don't label randomly. Train a model on labeled data. Find examples the model is uncertain about. Label those first. Maximizes information per label.
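Uncertainty sampling as described above can be sketched as entropy-based selection over the model's predicted class probabilities (the probabilities below are made up):

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions: list[list[float]], budget: int) -> list[int]:
    """Return indices of the `budget` most uncertain examples."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:budget]

preds = [[0.99, 0.01],   # confident
         [0.50, 0.50],   # maximally uncertain
         [0.60, 0.40]]   # somewhat uncertain
print(select_for_labeling(preds, 2))  # [1, 2]
```

Other selection criteria (margin sampling, query-by-committee) slot into the same loop.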
Gold set: Maintain a small set of high-quality, carefully labeled examples; use it to spot-check annotator quality. Attention checks: Include examples with obvious answers; annotators who miss them are unreliable.
DVC (Data Version Control): Track dataset versions like code; commit DVC pointer files alongside code for reproducible experiments.
Hugging Face Datasets: Upload and version datasets. Easy sharing. Built-in filtering and split management.
Record: Where did this data come from? What transformations were applied? Who labeled it? When? Track this in metadata.
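A provenance record can be as small as a dict attached to each dataset artifact; the field names below are illustrative, not a schema:

```python
from datetime import datetime, timezone

def provenance_record(source: str, transforms: list[str], annotator: str) -> dict:
    """Minimal provenance metadata for a dataset artifact."""
    return {
        "source": source,              # where the raw data came from
        "transforms": transforms,      # ordered list of processing steps applied
        "annotator": annotator,        # who produced the labels
        "labeled_at": datetime.now(timezone.utc).isoformat(),
    }

rec = provenance_record("support_tickets_export", ["dedup", "pii_scrub"], "annotator_07")
```

Store the record next to the data version so every experiment can answer "what exactly was this trained on?"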
Temporal split: Time-series data? Use time as the split criterion.
Stratified split: Imbalanced classes? Maintain class proportions in train/val/test.
Domain split: Different domains? Test on an unseen domain.
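A stratified split needs only a few lines; per-class shuffling with a fixed seed keeps it reproducible (a sketch, with `label_key` naming whatever field holds the class):

```python
import random
from collections import defaultdict

def stratified_split(examples: list[dict], label_key: str,
                     test_frac: float = 0.2, seed: int = 0):
    """Train/test split that preserves per-class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_frac))  # at least one per class
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test
```

A temporal split is even simpler: sort by timestamp and cut at a date, never randomly.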
Monitor model performance over time. If it drops, check the data. Distribution shift? New types of queries? Update dataset quarterly.
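One lightweight check for distribution shift is the population stability index (PSI) over a categorical feature such as query type; by a common rule of thumb (a convention, not a universal standard), PSI above roughly 0.2 signals meaningful drift:

```python
import math
from collections import Counter

def population_stability_index(baseline: list[str], current: list[str],
                               eps: float = 1e-6) -> float:
    """PSI between a baseline and a current categorical distribution."""
    base_counts, cur_counts = Counter(baseline), Counter(current)
    psi = 0.0
    for cat in set(base_counts) | set(cur_counts):
        p = base_counts[cat] / len(baseline) + eps   # eps avoids log(0)
        q = cur_counts[cat] / len(current) + eps
        psi += (q - p) * math.log(q / p)
    return psi

baseline = ["billing"] * 50 + ["tech"] * 50
print(population_stability_index(baseline, ["billing"] * 10 + ["tech"] * 40) > 0.2)  # True
```

Run it on each release's query mix; a spike tells you to refresh the dataset before the model visibly degrades.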