Techniques for detecting and remediating quality issues in AI training and retrieval datasets: from deduplication and outlier detection to format validation and coverage analysis.
- **Completeness:** are required fields present? Are there gaps in coverage by topic?
- **Accuracy:** is the content factually correct? Does it match the source?
- **Consistency:** are the same facts stated the same way, with no contradictions?
- **Freshness:** is time-sensitive content up to date?
- **Diversity:** is coverage balanced across topics, styles, and formats?
- **Relevance:** is the content actually useful for the target task?
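A few of these dimensions can be spot-checked mechanically. The sketch below is illustrative only: the field names (`text`, `label`, `source`, `updated_at`, `topic`) and the required-field set are assumptions, not a fixed schema.

```python
from datetime import datetime, timezone

REQUIRED_FIELDS = {"text", "label", "source", "updated_at"}  # illustrative schema

def completeness(record: dict) -> bool:
    """Completeness: every required field present and non-empty."""
    return all(record.get(f) for f in REQUIRED_FIELDS)

def freshness_days(record: dict) -> float:
    """Freshness: age of the record in days, from an ISO-8601 timestamp."""
    updated = datetime.fromisoformat(record["updated_at"])
    return (datetime.now(timezone.utc) - updated).total_seconds() / 86400

def coverage_by_topic(records: list[dict]) -> dict[str, int]:
    """Diversity: record counts per topic, to spot over-represented areas."""
    counts: dict[str, int] = {}
    for r in records:
        topic = r.get("topic", "unknown")
        counts[topic] = counts.get(topic, 0) + 1
    return counts
```

Accuracy and relevance resist this kind of mechanical check; they usually need source comparison or human (or LLM) review.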
Near-duplicate documents inflate training data, cause retrieval redundancy, and bias models toward over-represented content. Use MinHash LSH for fast near-duplicate detection at scale.
```python
from datasketch import MinHash, MinHashLSH

def build_lsh_index(texts: list[str], threshold: float = 0.85) -> MinHashLSH:
    """Index a MinHash signature per document for fast similarity queries."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for i, text in enumerate(texts):
        m = MinHash(num_perm=128)
        for word in text.lower().split():
            m.update(word.encode())
        lsh.insert(f"doc_{i}", m)
    return lsh
```
```python
def find_duplicates(texts: list[str], threshold: float = 0.85) -> list[tuple]:
    """Return index pairs (i, j) of documents whose estimated Jaccard
    similarity exceeds the threshold."""
    lsh = build_lsh_index(texts, threshold)
    duplicates = []
    for i, text in enumerate(texts):
        m = MinHash(num_perm=128)
        for word in text.lower().split():
            m.update(word.encode())
        for match in lsh.query(m):
            j = int(match.split("_")[1])
            # Record each pair only once, regardless of query order.
            if j != i and (j, i) not in duplicates:
                duplicates.append((i, j))
    return duplicates
```
Validate every record before it enters the training pipeline. Check: minimum and maximum text length, required metadata fields present, no null or empty values in key fields, text encoding (UTF-8 with no mojibake), and no binary or HTML artefacts in the text.
```python
from pydantic import BaseModel, validator

class TrainingRecord(BaseModel):
    text: str
    label: str
    source: str

    @validator("text")
    def text_length(cls, v):
        # Reject records too short to carry signal, or suspiciously
        # long (often concatenation or scraping errors).
        if len(v.split()) < 10:
            raise ValueError("Text too short")
        if len(v) > 100_000:
            raise ValueError("Text too long")
        return v

    @validator("text")
    def no_html(cls, v):
        # Crude HTML-artefact check; extend with a real parser if needed.
        if "<html" in v.lower() or "</div>" in v.lower():
            raise ValueError("HTML artefacts detected")
        return v
```
Map your dataset to the intended topic space and identify gaps. Cluster embeddings of your training documents (k-means with k=50–200). Compare cluster sizes: are some topics massively over-represented? Map clusters to topics manually or with an LLM. Under-represented topics predict poor model performance on those topics; collect more data or use synthetic generation to fill gaps.
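The cluster-size comparison can be sketched as below. The toy k-means keeps the example self-contained; in practice you would use a library implementation (e.g. scikit-learn's KMeans) on real embeddings, and the imbalance ratio of 3x is an illustrative threshold, not a recommendation.

```python
import numpy as np

def cluster_sizes(embeddings: np.ndarray, k: int, iters: int = 20, seed: int = 0) -> np.ndarray:
    """Toy k-means; returns the number of documents assigned to each cluster."""
    rng = np.random.default_rng(seed)
    centroids = embeddings[rng.choice(len(embeddings), size=k, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster empties out.
        for c in range(k):
            members = embeddings[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return np.bincount(labels, minlength=k)

def imbalance_report(sizes: np.ndarray, ratio: float = 3.0) -> dict:
    """Flag clusters far above or below the mean cluster size."""
    expected = sizes.mean()
    return {
        "over": np.where(sizes > ratio * expected)[0].tolist(),
        "under": np.where(sizes < expected / ratio)[0].tolist(),
    }
```

Clusters in the "under" list are the candidates for targeted collection or synthetic generation.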
Use embedding-based outlier detection: compute the mean embedding of your dataset, flag documents more than N standard deviations from the centroid. High-distance documents are often: wrong language, corrupted text, off-topic content, or near-duplicates of very different content. Review a sample of flagged outliers before removing; some are valuable edge cases.
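A minimal sketch of the centroid-distance check, assuming embeddings arrive as a 2-D NumPy array (one row per document); the 3-sigma default is a common starting point, not a tuned value:

```python
import numpy as np

def flag_outliers(embeddings: np.ndarray, n_std: float = 3.0) -> np.ndarray:
    """Indices of documents whose distance to the dataset centroid exceeds
    mean + n_std * std of all centroid distances."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    cutoff = dists.mean() + n_std * dists.std()
    return np.where(dists > cutoff)[0]
```

Route the flagged indices to a review queue rather than deleting outright, per the caveat above.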
Automate quality checks as a pre-training pipeline step: dedup → format validation → length filter → language detection → outlier scoring → human review queue. Track per-check rejection rates over time; a sudden increase usually means a data source changed format or started producing low-quality content. Never ship training data that hasn't passed automated quality checks.
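One way to structure such a pipeline is as an ordered list of named checks, so the per-check rejection counts described above come for free. The two example checks are illustrative stand-ins, not real length or language-detection logic:

```python
from collections import Counter
from typing import Callable, Iterable

# A check returns True to keep a record.
Check = Callable[[dict], bool]

def run_quality_pipeline(records: Iterable[dict], checks: list[tuple[str, Check]]):
    """Apply checks in order; track per-check rejection counts."""
    rejections = Counter()
    kept = []
    for record in records:
        for name, check in checks:
            if not check(record):
                rejections[name] += 1
                break  # the first failing check claims the rejection
        else:
            kept.append(record)
    return kept, rejections

# Illustrative checks; real pipelines would use proper language detection.
checks = [
    ("length", lambda r: 10 <= len(r["text"].split()) <= 20_000),
    ("language", lambda r: r.get("lang") == "en"),
]
```

Because rejections are keyed by check name, a time series of these counters directly supports the "sudden increase" alerting described above.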
Data quality profiling is foundational. Before training, run statistical audits to detect missing values, outliers, duplicates, and type mismatches. Null rates above 10% in a column suggest data collection issues. Outliers (detected via IQR or Z-score) can be valid edge cases or data errors; deciding which requires domain knowledge. Duplicates indicate measurement errors or data pipeline bugs. Type mismatches (a numeric column containing strings) break downstream processing. Tools like Great Expectations provide declarative data quality assertions: define once, monitor continuously. The goal is to document baseline quality and establish thresholds (e.g., "we accept <2% nulls; >2% triggers investigation").
| Data Quality Issue | Detection Method | Severity | Fix Approach |
|---|---|---|---|
| Missing values | Null counts | Medium | Imputation or removal |
| Outliers | IQR, Z-score | Medium–High | Cap, remove, or transform |
| Duplicates | Hash, dedup | High | Remove exact duplicates |
| Type mismatches | Schema validation | Critical | Coerce or reject |
```python
import numpy as np
import pandas as pd

def profile_data_quality(df: pd.DataFrame) -> dict:
    """Quick data quality audit: nulls, duplicates, and IQR outliers."""
    report = {}
    report['nulls'] = df.isnull().sum()
    report['duplicates'] = df.duplicated().sum()
    for col in df.select_dtypes(include=[np.number]).columns:
        # Flag values outside the 1.5 * IQR fences per numeric column.
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        outliers = ((df[col] < (Q1 - 1.5 * IQR)) | (df[col] > (Q3 + 1.5 * IQR))).sum()
        report[f'{col}_outliers'] = outliers
    return report
```

Cleaning is iterative: identify issues, decide on fixes (impute, cap, remove, or transform), and validate that the fix didn't introduce new problems. Over-cleaning (removing too many rows) loses signal; under-cleaning (leaving bad data) ruins models. A systematic pipeline: (1) structural cleaning (fix types, handle nulls), (2) semantic cleaning (domain-specific validations, e.g., age should be 0–150), (3) deduplication (exact and near-duplicates), (4) outlier treatment (cap or remove), (5) validation (compare before/after statistics, spot-check results). Keep audit logs of what was removed and why; this helps debug downstream issues and justifies decisions to stakeholders.
Data quality issues compound downstream. A 2% null rate might not seem significant, but if you have 10 such columns, ~20% of rows are affected. Outlier handling is context-dependent: in a sales dataset, a $10,000 order is an outlier that's still valid; in a heights dataset, a 300cm tall person is almost certainly a data error. Use domain knowledge: talk to subject matter experts before deciding what's an outlier. Statistical methods (IQR, Z-score) are starting points, not gospel. Duplicates can be exact (identical rows) or near-duplicates (same entity, recorded twice with minor differences). Deduplication is surprisingly hard: are "John Smith" and "Jon Smith" the same person? You typically need record linkage techniques (fuzzy matching, entity resolution). For large datasets, probabilistic deduplication (set a similarity threshold, probabilistically assign matching pairs) works better than all-or-nothing matching.
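A rough sketch of threshold-based fuzzy matching using only stdlib string similarity; real entity resolution uses dedicated record-linkage libraries and blocking strategies, and the 0.85 threshold here is illustrative:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def probable_matches(names: list[str], threshold: float = 0.85) -> list[tuple[str, str]]:
    """All pairs whose similarity exceeds the threshold.

    O(n^2) pairwise comparison; large datasets need blocking or LSH first.
    """
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if similarity(names[i], names[j]) >= threshold
    ]
```

With this measure, "John Smith" vs "Jon Smith" scores well above the threshold, while unrelated names score far below it, which is exactly the all-or-nothing-versus-probabilistic distinction drawn above.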
Data validation prevents downstream surprises. Define a schema (column names, types, ranges) and validate against it. Great Expectations is a popular framework: write expectations like "column age should be between 0 and 150" and check them on every new data batch. Automate drift detection: if the null rate in a column suddenly jumps from 1% to 5%, alert the team. Statistical process control (SPC) charts (plot metrics over time, alert on out-of-control signals) work surprisingly well for automated monitoring. Implement versioning: keep historical data quality snapshots so you can trace when quality degraded. Was it a data collection change? A schema update? A processing bug? Historical data helps you investigate.
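Null-rate drift alerting against a stored baseline can be sketched in a few lines; the 3-percentage-point jump threshold and the dict-based record format are illustrative assumptions:

```python
def null_rates(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Fraction of rows where each column is missing or None."""
    n = len(rows)
    return {c: sum(1 for r in rows if r.get(c) is None) / n for c in columns}

def drift_alerts(baseline: dict[str, float], current: dict[str, float],
                 max_jump: float = 0.03) -> list[str]:
    """Columns whose null rate rose more than max_jump over the baseline."""
    return [c for c in baseline if current.get(c, 0.0) - baseline[c] > max_jump]
```

Storing each batch's `null_rates` output as a dated snapshot gives you the versioned history described above; an SPC chart is just these numbers plotted over time with control limits.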
Cleaning pipelines are best implemented as reproducible code, not manual steps. If cleaning is done in Excel or a notebook without version control, you can't audit it or reproduce it. Use tools like dbt (SQL-based) or Pandas with documented steps. Test your cleaning code: assert expected invariants after each step (e.g., "after deduplication, every ID appears once"). Keep cleaning separate from transformation: first fix structural issues (missing values, type mismatches), then apply business logic (feature engineering). This separation makes debugging easier. Document why you made each decision: "removed 50 rows with age <0 because age should be non-negative"; "imputed missing income with median because mechanism is missing-at-random". Future maintainers will thank you.
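Invariant checks of the kind described above can live directly in the pipeline code; the `id` and `age` column names are illustrative:

```python
import pandas as pd

def assert_clean_invariants(df: pd.DataFrame, id_col: str = "id") -> None:
    """Fail fast if a cleaning step violated an expected invariant."""
    # After deduplication, every ID appears exactly once.
    assert df[id_col].is_unique, "duplicate IDs survived deduplication"
    # After structural cleaning, the key field has no nulls.
    assert df[id_col].notna().all(), "null IDs survived cleaning"
    # Semantic rule: age, if present, is non-negative.
    if "age" in df.columns:
        assert (df["age"] >= 0).all(), "negative ages survived cleaning"
```

Calling this after each cleaning stage turns silent data corruption into an immediate, attributable failure.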
Machine learning on low-quality data is like building on sand. Garbage in, garbage out (GIGO) is an old principle but remains true. However, some models are surprisingly robust to bad data: deep neural networks can learn despite noise if the signal-to-noise ratio is high (e.g., 90% good, 10% bad). Tree-based models (Random Forest, XGBoost) are more robust to outliers than linear models. But this robustness has limits: if >30% of data is corrupted, most models fail. The practical lesson: don't rely on model robustness; clean your data first. Quantify the cost of bad data: how much does training with 10% corrupted data hurt model performance? If it hurts 1% in accuracy, the cost might be acceptable. If it hurts 10%, cleaning becomes critical. Run experiments: train models on clean data, on contaminated data, and on cleaned data; compare. This gives you ROI for cleaning efforts.
Data quality is dynamic: fresh data is often worse than historical data (edge cases, format changes, new anomalies). Implement continuous quality monitoring. Set up alerts for anomalies: if a column's distribution changes (e.g., null rate jumps from 1% to 20%), investigate immediately. Was there a collection change? A pipeline bug? A data format error? Quick response prevents bad data from accumulating. Some organizations assign a "data quality owner" responsible for monitoring and remediating issues. This person is like a software engineer on-call: when quality degrades, they investigate and fix root causes. For large-scale data (billions of records), automated detection (using statistical process control, machine learning anomaly detectors) is necessary. A hybrid approach (automated alerts + human review) catches both obvious and subtle quality issues.
Data quality budgets are a framework for resource allocation. Decide: how much compute/money/time do I invest in cleaning vs. collecting more data? If cleaning is expensive (requires expert review), collecting more raw data might be better ROI. If collection is expensive (medical trials), cleaning existing data is worth it. Budget constraints force trade-offs. A data quality dashboard (showing quality metrics and cost) enables data-driven decisions. "This column has 5% nulls; fixing it costs $2K and improves model accuracy by 1%." Is it worth it? Depends on your margins. For consumer products (high volume, low margins), 1% accuracy improvement might save millions; for specialized services (low volume, high margins), it might not justify the cost. Make these decisions explicitly, not ad-hoc. Data quality is often treated as a side task; treating it as a first-class engineering problem (with budgets, metrics, accountability) improves outcomes and shows ROI to stakeholders.