Data Engineering

Metadata Design

Designing metadata schemas for AI datasets and vector stores: the structured information attached to every document chunk that enables filtering, attribution, and better retrieval quality.

Retrieval improvement: 20–40% with good metadata
Fields: 5–15 per document
Storage overhead: <5%

SECTION 01

Why Metadata Matters

Two chunks may be equally similar to a query by embedding distance, but only one is from a verified source published this year. Without metadata, you can't distinguish them. Good metadata enables: filtered retrieval (only documents from source X), temporal filtering (only documents after date Y), access control (a user can only see their tenant's documents), and attribution (showing the user which document the answer came from).

SECTION 02

Metadata Taxonomy

Provenance: source URL, document title, author, publication date, version.
Structure: chunk index, section heading, page number, document type (PDF/HTML/code).
Content: summary, keywords, entity types, language, topic category.
Access control: tenant ID, user group, permission level.
Quality: confidence score, last verified date, human-reviewed flag.
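To make the taxonomy concrete, here is what one chunk's metadata record might look like; every field name and value below is a hypothetical example, not a required standard:

```python
# Hypothetical metadata record for one chunk, covering each taxonomy category.
chunk_metadata = {
    # Provenance
    "source_url": "https://example.com/annual-report-2024.pdf",
    "doc_title": "Annual Report 2024",
    "author": "Finance Team",
    "published_date": "2024-03-01",
    "version": "1.2",
    # Structure
    "chunk_index": 7,
    "section": "Revenue",
    "page_number": 12,
    "doc_type": "pdf",
    # Content
    "keywords": ["revenue", "Q4"],
    "language": "en",
    "topic": "finance",
    # Access control
    "tenant_id": "acme-corp",
    "user_group": "analysts",
    # Quality
    "confidence": 0.92,
    "last_verified": "2024-06-01",
    "human_reviewed": True,
}
```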

SECTION 03

Schema Design Principles

Design for query patterns, not for completeness. Ask: what filters will users apply? What attribution do they need? Keep filterable metadata values low-cardinality (5–20 unique values), or store multi-value attributes as lists. Avoid free-text metadata fields that you'll try to filter on; normalise them to enums or IDs instead.

from pydantic import BaseModel
from datetime import date
from typing import Optional


class ChunkMetadata(BaseModel):
    # Provenance
    source_url: str
    doc_title: str
    author: Optional[str] = None
    published_date: Optional[date] = None
    # Structure
    chunk_index: int
    section: Optional[str] = None
    page_number: Optional[int] = None
    doc_type: str  # "pdf", "html", "markdown", "code"
    # Access
    tenant_id: str
    is_public: bool = False
    # Quality
    human_reviewed: bool = False
SECTION 04

Filtering at Query Time

Use metadata filters to narrow the search space before or alongside vector similarity. Most vector stores support hybrid filtering (metadata + vector).

from datetime import datetime

from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange

client = QdrantClient("localhost", port=6333)

results = client.search(
    collection_name="documents",
    query_vector=query_embedding,
    query_filter=Filter(
        must=[
            FieldCondition(key="tenant_id", match=MatchValue(value="acme-corp")),
            FieldCondition(key="doc_type", match=MatchValue(value="pdf")),
            # Date comparisons need DatetimeRange; the numeric Range type
            # does not accept date values
            FieldCondition(key="published_date",
                           range=DatetimeRange(gte=datetime(2024, 1, 1))),
        ]
    ),
    limit=10,
)
SECTION 05

Embedding Metadata

For richer semantic retrieval, embed metadata alongside content: prepend the document title, section heading, and keywords to the chunk text before embedding, e.g. "[Title: Annual Report 2024] [Section: Revenue] Q4 revenue was $1.2B...". This shifts the embedding toward the document's semantic context, improving retrieval for queries like "what were Q4 revenues?".
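A minimal sketch of this prepending step, assuming the metadata dict uses the field names from the schema earlier (`doc_title`, `section`, `keywords`); the helper name `build_embed_text` is illustrative:

```python
def build_embed_text(chunk_text: str, metadata: dict) -> str:
    """Prepend selected metadata fields to a chunk before embedding it."""
    parts = []
    if metadata.get("doc_title"):
        parts.append(f"[Title: {metadata['doc_title']}]")
    if metadata.get("section"):
        parts.append(f"[Section: {metadata['section']}]")
    if metadata.get("keywords"):
        parts.append(f"[Keywords: {', '.join(metadata['keywords'])}]")
    return " ".join(parts + [chunk_text])

text = build_embed_text(
    "Q4 revenue was $1.2B...",
    {"doc_title": "Annual Report 2024", "section": "Revenue"},
)
```

The embedding model then sees `text` instead of the raw chunk; the raw chunk is still what you return to the user.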

SECTION 06

Tooling & Standards

LangChain Document: standard metadata dict attached to every Document object.
LlamaIndex: rich metadata with per-field embedding control.
Unstructured.io: auto-extracts metadata from PDFs, HTML, and Office docs.
Dublin Core: standard for provenance metadata (title, creator, date, format).

Keep metadata versioned alongside the corpus; schema changes require re-ingestion.

SECTION 07

Schema Design Patterns

Metadata design starts with purpose: why are we tracking this metadata? What queries will we need to answer? Common metadata includes provenance (source, timestamp, author), quality signals (annotation confidence, model score), semantic tags (domain, task type, language), and lineage (how was this example generated?). Store metadata alongside the data (in the same JSON record) rather than in separate tables; this ensures they don't drift out of sync. Use schema versioning: as you add new metadata fields, version your schema and handle backwards compatibility. Document the meaning and units of each field; metadata without context becomes technical debt.
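The store-alongside pattern can be sketched as one self-describing JSON record per example, carrying its own `schema_version`; the helper name `make_record` and the field set are assumptions for illustration:

```python
import json
from typing import Optional

SCHEMA_VERSION = "1.0"

def make_record(text: str, source: str, author: Optional[str] = None) -> dict:
    """One JSON record: data and metadata travel together, so they can't drift."""
    return {
        "text": text,
        "metadata": {
            "schema_version": SCHEMA_VERSION,  # versioned for backwards compatibility
            "source": source,
            "author": author,
        },
    }

record = make_record("Q4 revenue was $1.2B...", source="annual-report-2024")
jsonl_line = json.dumps(record)  # ready to append to a .jsonl corpus file
```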

Metadata Type         Storage Format    Query Pattern       Example
Document provenance   JSON in record    Filter by source    source, timestamp, author
Quality scores        Numeric field     Sort/range filter   bleu_score, fluency_1_5
Categorical tags      Array/set         Multi-filter        ["toxicity", "bias"]
Lineage               Nested JSON       Traverse graph      parent_id, transformations
import hashlib
from datetime import datetime

def enrich_metadata(record: dict, model_name: str) -> dict:
    """Add computed metadata: provenance, quality scores, embeddings, tags."""
    # compute_quality, embed, and infer_tags are assumed helper functions
    # defined elsewhere in the pipeline.
    record['metadata'] = {
        'source': record.get('source', 'unknown'),
        'timestamp': datetime.now().isoformat(),
        'model_version': model_name,
        # Hash of the raw record, computed before the metadata key is attached
        'hash': hashlib.md5(str(record).encode()).hexdigest(),
        'quality_score': compute_quality(record),
        'embedding': embed(record['text']),
        'tags': infer_tags(record['text']),
    }
    return record
SECTION 08

Metadata Storage & Retrieval

Metadata storage has trade-offs: embedded in records (fast queries, tightly coupled), in a relational database (flexible, scales to millions of queries), or in a document store (semi-structured, good for heterogeneous metadata). For datasets >100K examples, consider indexed backends (Elasticsearch, PostgreSQL) to enable efficient filtering and aggregation. Implement metadata retrieval patterns: (1) Extract by field (all quality scores >0.8), (2) Filter by tag (all examples with toxicity tag), (3) Traverse lineage (find the original source of a derived example), (4) Temporal queries (all data added in March 2026). Lazy evaluation (compute metadata on-demand) works for small datasets; large pipelines need precomputed metadata.
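The four retrieval patterns above can be sketched over a plain list of records (a tiny in-memory stand-in for an indexed backend; field names are illustrative):

```python
from datetime import datetime

# Tiny in-memory corpus; in production this would be an indexed backend.
records = [
    {"id": 1, "metadata": {"quality_score": 0.9, "tags": ["toxicity"],
                           "added": "2026-03-05", "parent_id": None}},
    {"id": 2, "metadata": {"quality_score": 0.6, "tags": [],
                           "added": "2026-02-10", "parent_id": 1}},
]
by_id = {r["id"]: r for r in records}

# (1) Extract by field: quality scores above a threshold
high_quality = [r for r in records if r["metadata"]["quality_score"] > 0.8]

# (2) Filter by tag
flagged = [r for r in records if "toxicity" in r["metadata"]["tags"]]

# (3) Traverse lineage: follow parent_id links back to the original record
def trace_origin(record):
    while record["metadata"]["parent_id"] is not None:
        record = by_id[record["metadata"]["parent_id"]]
    return record

# (4) Temporal query: everything added in a given month
def added_in(record, year, month):
    added = datetime.fromisoformat(record["metadata"]["added"])
    return (added.year, added.month) == (year, month)

march_2026 = [r for r in records if added_in(r, 2026, 3)]
```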

Metadata is often an afterthought but becomes critical at scale. Start simple (source, timestamp, author), then iterate based on what you actually need to query. As your dataset evolves, you'll discover new metadata needs: "which examples came from this data source?" or "what was the annotation quality for this example?" If metadata isn't tracked from the start, you're stuck inferring it retroactively (slow, error-prone). Implement a metadata schema as part of your data pipeline: before data reaches your ML system, enrich it with computed metadata (quality scores, embeddings, tags). Make metadata queryable: support filtering ("show me high-confidence labels"), aggregation ("average quality score by source"), and joins ("which annotator labeled this example?").

Data lineage is metadata for data: tracking where an example came from, what transformations it underwent, and where it ended up. For reproducibility and debugging, lineage is invaluable. If a model makes a bad prediction, tracing back to the source training data reveals whether the issue is data quality, a labeling error, or a model problem. Lineage also has regulatory weight: GDPR's "right to be forgotten" requires knowing which models were trained on a person's data. Implementations range from simple (log file paths in a README) to sophisticated (semantic lineage graphs showing data flow through the system). DAGs (directed acyclic graphs) in tools like Airflow provide lineage for pipelines; embedding versioning (tracking which version of embeddings each record used) provides lineage for model inputs.
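The simple end of that spectrum can be sketched as derived records that each carry a `parent_id` and an accumulated list of transformations; the `derive` helper and field names are hypothetical:

```python
def derive(parent: dict, transform_name: str, new_text: str) -> dict:
    """Create a derived record that keeps a lineage trail back to its parent."""
    parent_lineage = parent.get("lineage", {})
    return {
        "id": f"{parent['id']}/{transform_name}",
        "text": new_text,
        "lineage": {
            "parent_id": parent["id"],
            "transformations": parent_lineage.get("transformations", [])
                               + [transform_name],
        },
    }

raw = {"id": "doc-001", "text": "Q4 REVENUE WAS $1.2B"}
cleaned = derive(raw, "lowercase", raw["text"].lower())
deduped = derive(cleaned, "dedupe", cleaned["text"])
```

Walking `parent_id` links from `deduped` recovers the original `raw` record, which is exactly the trace-back-to-source debugging step described above.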

Privacy and metadata interact in subtle ways. If you store metadata like "user_id", that's PII and needs protection. Some teams separate metadata from data (store data in a data lake, metadata in a separate system with access controls). Others encrypt sensitive metadata. Metadata governance is important: who can see quality scores? Can anyone see which annotator labeled an example? (No: that's annotator privacy.) Define access levels: "researchers can see aggregate quality statistics but not per-example scores." Implement this with database views or application-level filtering. As your system matures, invest in a metadata platform (metadata catalog, lineage tool, quality monitoring); it multiplies the value of your data investment.
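One way to sketch application-level filtering by access level; the role names and sensitive-field list are illustrative assumptions, not a standard:

```python
# Hypothetical access policy; role names and sensitive fields are illustrative.
SENSITIVE_FIELDS = {"annotator_id", "user_id", "quality_score"}

ROLE_CAN_SEE = {
    "admin": SENSITIVE_FIELDS,  # admins see every field
    "researcher": set(),        # researchers get aggregates only, no per-example fields
}

def view_for_role(record: dict, role: str) -> dict:
    """Application-level filtering, playing the part of a database view."""
    hidden = SENSITIVE_FIELDS - ROLE_CAN_SEE.get(role, set())
    visible = {k: v for k, v in record["metadata"].items() if k not in hidden}
    return {**record, "metadata": visible}
```

Unknown roles fall through to the most restrictive view, which is the safer default for access control.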

Standardization helps metadata be useful. Define a metadata schema, document it, and require all data producers to follow it. If every team has its own "quality score" (different scales, different meanings), comparing across teams becomes impossible. Standards like Dublin Core (for bibliographic metadata) or schema.org (for web data) provide templates. For domain-specific data (medical imaging), industry consortia often define standards (DICOM for radiology). Adopting standards is overhead initially but pays dividends: tools built for standard metadata work across datasets, teams can share data easily, and onboarding newcomers is faster. Version standards as your needs evolve: v1.0 has fields A, B, C; v2.0 adds D and E. Old data can be upgraded to v2.0 via migration scripts. Tools like Avro and Protocol Buffers provide versioning built-in.
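A migration script for the v1.0-to-v2.0 example might look like the sketch below; the concrete fields `language` and `license` are invented stand-ins for the new fields D and E:

```python
def migrate_v1_to_v2(meta: dict) -> dict:
    """Upgrade a v1.0 metadata record in a copy, leaving the original intact."""
    assert meta.get("schema_version") == "1.0", "migration expects v1.0 input"
    upgraded = dict(meta)
    upgraded["schema_version"] = "2.0"
    upgraded.setdefault("language", "unknown")     # new field in v2.0
    upgraded.setdefault("license", "unspecified")  # new field in v2.0
    return upgraded
```

Running such a script over the whole corpus is the "old data can be upgraded" step; defaults like "unknown" make the gap explicit rather than silently absent.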

Metadata governance is the policy side: who can modify metadata? Who can add new fields? Are breaking changes (removing a field) allowed? A permissive approach (anyone can add metadata) is flexible but can lead to chaos (inconsistent naming, unused fields). A restrictive approach (only the data governance team can modify the schema) is slow but ensures consistency. A balanced approach: a central governance board reviews proposed changes; changes are approved if they have clear use cases and owner commitment. Metadata stewardship is important: each metadata field should have an owner (a person or team responsible for it). This ensures fields don't bitrot. A living metadata catalog (searchable, linked to datasets) is invaluable: where is the "quality_score" field used? Which teams compute it? What's the current version? Tools like Apache Atlas, or custom solutions, provide this visibility.

Metadata debt is real. Every metadata field you add is technical debt: you need to maintain it, migrate it across versions, and explain it to new users. Over-engineering metadata leads to sprawl: fields no one uses, inconsistent naming, unused tags. Start minimal (just source and timestamp), add fields only when you have concrete use cases. Retire fields that aren't queried; don't keep them "just in case." Regular audits (quarterly review of metadata usage) prevent bloat. For each field, ask: who uses this? Is it accurate? Can we delete it? Metadata lifecycle management (create, use, deprecate, remove) keeps systems healthy. Some organizations freeze their metadata schema after initial design, accepting that it's suboptimal; others evolve it constantly. A middle ground, controlled evolution (changes require justification and review), prevents chaos while remaining flexible.
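The quarterly audit can be approximated by comparing the fields present in records against the fields that actually appear in queries; this is a sketch, and `queried_fields` would in practice come from your query logs:

```python
from collections import Counter

def audit_fields(records, queried_fields):
    """Report metadata fields that exist in records but never appear in queries."""
    present = Counter()
    for r in records:
        present.update(r["metadata"].keys())
    unused = sorted(f for f in present if f not in queried_fields)
    return {"field_counts": dict(present), "candidates_for_removal": unused}
```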

Metadata extraction can be automated. For provenance, log sources automatically. For quality scores, compute on-the-fly during data ingestion. For semantic tags, use NLP models (topic modeling, entity extraction). For embeddings, precompute and cache. Automation reduces human effort and prevents inconsistency. But automation isn't perfect; errors in extracted metadata propagate downstream. Validate automated metadata: sample and manually check. Set aside a small fraction of manual effort (5–10%) for validation and correction. This hybrid approach (automation + sampling validation) scales well. As your automation improves, reduce validation; if quality degrades, increase validation. Treat metadata extraction as an ongoing calibration problem.
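The sampling-validation step might be sketched as below; the 5% default mirrors the 5-10% figure above, and the fixed seed keeps the sample reproducible across runs:

```python
import random

def sample_for_validation(records, fraction=0.05, seed=0):
    """Set aside a small, reproducible sample of records for manual checks."""
    rng = random.Random(seed)
    k = max(1, int(len(records) * fraction))
    return rng.sample(records, k)
```

Tuning `fraction` up or down over time is the calibration loop described above: raise it when automated metadata quality degrades, lower it as the extractors improve.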