Formal agreements between data producers and consumers that define schema, quality guarantees, SLAs, and ownership — preventing the silent breaking changes that corrupt AI pipelines.
A data contract is a versioned, machine-readable specification between a data producer (the team or pipeline that generates data) and data consumers (models, pipelines, applications that depend on it). It says: 'I guarantee this table will have these columns, with these types, meeting these quality rules, refreshed on this schedule.' Breaking the contract requires a version bump and consumer notification.
A data contract typically includes: dataset identifier and version, owner and steward contact, schema (column names, types, nullability), quality assertions (row counts, null rates, value ranges), SLA (freshness guarantee, availability target), and deprecation policy.
```python
# Example data contract in YAML
DATA_CONTRACT = """
id: com.example.user_events
version: 2.1.0
owner: data-platform@example.com
schema:
  - name: user_id
    type: string
    nullable: false
    description: Unique user identifier
  - name: event_type
    type: string
    enum: [click, view, purchase, logout]
  - name: timestamp
    type: timestamp
    nullable: false
quality:
  - type: row_count_minimum
    value: 1000
    window: 1h
  - type: null_rate_maximum
    column: user_id
    value: 0.0
sla:
  freshness_minutes: 15
  availability_percent: 99.9
"""
```
Validate incoming data against the contract schema before it enters downstream pipelines. Reject or quarantine records that don't match; never silently coerce types.
```python
import pandas as pd
from pydantic import BaseModel, validator  # pydantic v1-style API
from typing import Literal

class UserEvent(BaseModel):
    user_id: str
    event_type: Literal["click", "view", "purchase", "logout"]
    timestamp: str

    @validator("user_id")
    def not_empty(cls, v):
        if not v.strip():
            raise ValueError("user_id cannot be empty")
        return v

def validate_batch(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split a batch into contract-conforming rows and quarantined rows."""
    valid_rows, invalid_rows = [], []
    for _, row in df.iterrows():
        try:
            UserEvent(**row.to_dict())
            valid_rows.append(row)
        except Exception as e:
            invalid_rows.append({**row.to_dict(), "error": str(e)})
    return pd.DataFrame(valid_rows), pd.DataFrame(invalid_rows)
```
Quality assertions are testable rules that run after each data refresh: row count is between X and Y, null rate for column Z is below 1%, all values in column W are within a valid set, no duplicate primary keys. Tools: Great Expectations (Python), dbt tests (SQL), Soda Core (YAML-driven). Failed assertions block downstream pipeline stages and trigger alerts.
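The tools above are the production route; as a minimal hand-rolled sketch, the contract's quality rules can also be expressed as plain pandas checks. The thresholds and column names below come from the example contract; the function name is illustrative:

```python
import pandas as pd

def run_quality_assertions(df: pd.DataFrame) -> list[str]:
    """Run the contract's quality rules against a refreshed batch; return failures."""
    failures = []
    # row_count_minimum: 1000 (from the contract's quality block)
    if len(df) < 1000:
        failures.append(f"row count {len(df)} below minimum 1000")
    # null_rate_maximum for user_id: 0.0
    null_rate = df["user_id"].isna().mean()
    if null_rate > 0.0:
        failures.append(f"user_id null rate {null_rate:.2%} exceeds 0%")
    # enum check for event_type
    allowed = {"click", "view", "purchase", "logout"}
    bad = set(df["event_type"].dropna()) - allowed
    if bad:
        failures.append(f"unexpected event_type values: {sorted(bad)}")
    return failures
```

An empty return value means the batch passes; any entry in the list should block downstream stages and page the owner.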
Every data asset should have a named owner (team or individual) who is accountable for the SLA. Freshness SLA: data must be updated within N minutes of source event. Availability SLA: pipeline succeeds at least X% of scheduled runs. Track SLA compliance over a rolling 30-day window and alert owners when compliance drops below target.
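A rolling compliance check might look like the following sketch, assuming a simple run log of `{"ts": datetime, "success": bool}` entries (an illustrative shape, not a standard API):

```python
from datetime import datetime, timedelta

def sla_compliance(run_log: list[dict], target_percent: float = 99.9,
                   window_days: int = 30) -> tuple[float, bool]:
    """Compute availability over a rolling window from a pipeline run log.

    Returns (compliance percent, whether the owner should be alerted).
    """
    cutoff = datetime.utcnow() - timedelta(days=window_days)
    recent = [r for r in run_log if r["ts"] >= cutoff]
    if not recent:
        return 100.0, False  # no runs scheduled in the window
    pct = 100.0 * sum(r["success"] for r in recent) / len(recent)
    return pct, pct < target_percent
```

In practice this query usually runs against a scheduler's run-history table rather than an in-memory list, but the windowing logic is the same.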
Tooling options:

- Soda Core: YAML-defined quality checks that run against any database or file.
- Great Expectations: Python-first, rich assertion library, generates data docs.
- dbt tests: SQL-native tests that run as part of dbt build pipelines.
- Data Contract CLI (open-source): validates YAML contracts and generates test suites.

Start simple: even a manually reviewed YAML file checked into git is better than no contract.
Data contracts can be enforced at multiple layers: at the data source (schema validation), in ETL pipelines (quality gates), at consumption time (type checking), and through monitoring dashboards (SLA tracking). Each layer adds redundancy and catches different failure modes.
```python
# Contract enforcement at pipeline entry
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
batch_req = RuntimeBatchRequest(
    datasource_name="my_datasource",
    data_connector_name="runtime",
    data_asset_name="user_events",
    runtime_parameters={"path": "/data/events.parquet"},
    batch_identifiers={"run_id": "latest"},  # keys must match the runtime data connector config
)
validator = context.get_validator(
    batch_request=batch_req,
    expectation_suite_name="user_events_suite",
)
results = validator.validate()  # block the pipeline if results.success is False
```

Contract versioning is critical: when you change the contract, old consumers might still expect the old schema. Use semantic versioning: MAJOR for breaking changes (column removal), MINOR for additions (new optional column), PATCH for metadata changes. Require active agreement from consumers before deploying breaking changes.
```python
# Contract versioning and migration
contract_versions = {
    "v1.0": {"columns": ["user_id", "event_type"]},
    "v1.1": {"columns": ["user_id", "event_type", "timestamp"]},               # MINOR: added timestamp
    "v1.2": {"columns": ["user_id", "event_type", "timestamp", "session_id"]}, # MINOR: added session_id
}
# Breaking change: removing event_type would require a v2.0 MAJOR bump.
# Plan: 30-day deprecation period; notify all consumers.
```

| Layer | Responsibility | Tool Example |
|---|---|---|
| Source | Schema validation, type enforcement | Database constraints, Avro schema |
| ETL | Quality assertions, row-level rules | dbt tests, Soda Core |
| Consumption | Type safety, runtime checks | Pydantic, TypeScript types |
| Monitoring | SLA compliance, freshness alerts | Datadog, Prometheus |
At scale, data contract enforcement becomes infrastructure. Netflix publishes data contracts as part of their metadata service: every table has a published contract that tools can query. Consumers run contract validators before touching data. When contracts break, the system quarantines data and creates a ticket automatically. This prevents silent data quality issues from propagating downstream. The investment pays off: one broken contract that would have created 2 days of debugging is caught and blocked in seconds.
Contract versioning strategies: semantic versioning works well (v1.0 → v1.1 for non-breaking, v2.0 for breaking changes). Alternative: immutable contracts (never change existing contracts, create new ones instead). This is simpler operationally but leads to contract sprawl. Hybrid: minimal immutable contracts for critical pipelines, flexible contracts for exploratory data. The key is intentionality: version changes should be deliberate, reviewed, and communicated.
Data contracts succeed when teams adopt them incrementally. Start by documenting one critical pipeline: define its output schema, quality rules, and SLA. Make that contract non-negotiable—all consumers use it, all producers respect it. Once that works, expand to adjacent pipelines. GitHub has documented their journey: they went from ad-hoc data changes breaking dozens of downstream systems to zero-friction data evolution through formalized contracts.
The contracts themselves evolve. Version 1: YAML files in a git repo, manually updated. Version 2: auto-generated from data sources (e.g., dbt models auto-export their contracts). Version 3: runtime enforcement in data pipelines with automatic discovery. Teams at scale (Uber, Airbnb) operate at version 3: contracts are living documents enforced by infra, updated automatically as schemas change, with breaking change detection built-in.
Common pitfalls: over-specification (contracts that are too strict, requiring constant updates), no enforcement (contracts exist but nobody validates them), and silent failures (broken contracts aren't detected). Solutions: keep contracts minimal (only enforce what matters), automate validation (make it frictionless), and alert visibly (break the build, don't silently log).
Breaking changes: removing a column, renaming a column, changing type (string → int), narrowing allowed values (enum ["a","b","c"] → ["a","b"]). These break consumers. Non-breaking changes: adding an optional column, widening allowed values (["a","b"] → ["a","b","c"]), adding a new enum option. Document the matrix clearly. Enforce via CI: reject any pull request that changes a contract in a breaking way without explicit version bump.
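Such a CI gate can be sketched as a small classifier over two contract snapshots. The simplified contract shape here (`columns` as a name-to-type map, `enums` as sets of allowed values) is illustrative:

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a contract change per semver: MAJOR (breaking), MINOR, or PATCH."""
    old_cols, new_cols = old.get("columns", {}), new.get("columns", {})
    old_enums, new_enums = old.get("enums", {}), new.get("enums", {})
    # Breaking: removed column, changed type, or narrowed enum
    if set(old_cols) - set(new_cols):
        return "MAJOR"
    if any(old_cols[c] != new_cols[c] for c in old_cols):
        return "MAJOR"
    if any(not old_enums[c] <= new_enums.get(c, old_enums[c]) for c in old_enums):
        return "MAJOR"
    # Non-breaking additions: new column or widened enum
    if set(new_cols) - set(old_cols):
        return "MINOR"
    if any(new_enums.get(c, set()) - old_enums.get(c, set()) for c in new_enums):
        return "MINOR"
    return "PATCH"
```

A CI job would load the contract from the base branch and the pull request, call this classifier, and fail the build if the result is MAJOR without a matching major version bump in the contract file.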
Migration strategy for breaking changes: announce 30-60 days in advance. Publish the new contract version (v2.0) alongside the old one (v1.5) and let consumers migrate gradually. Set a deprecation date, after which v1.5 is removed and only v2.0 remains. Gradual migration prevents production breakage; an immediate switchover is risky because some consumers will miss the deadline.
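One lightweight way to operationalize the deprecation window is a version registry with sunset dates; the registry shape and helper below are illustrative:

```python
from datetime import date

# Illustrative registry: each contract version carries its deprecation window
REGISTRY = {
    "v1.5": {"deprecated_on": date(2024, 6, 1), "sunset_on": date(2024, 7, 1)},
    "v2.0": {"deprecated_on": None, "sunset_on": None},  # current version
}

def usable_versions(today: date) -> list[str]:
    """Versions a consumer may still read from on a given day."""
    return [v for v, meta in REGISTRY.items()
            if meta["sunset_on"] is None or today < meta["sunset_on"]]
```

Consumers check the registry at startup; once the sunset date passes, reads against v1.5 fail fast instead of silently returning stale-schema data.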
Contract testing frameworks: define unit tests for contracts themselves. Test that your validator correctly rejects invalid records (missing required fields, wrong types) and accepts valid ones. Test that schema evolution is handled correctly. Use pytest or similar to ensure contracts stay valid. These tests are as important as code tests—broken contracts are as bad as broken code.
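A minimal pytest-style suite for a contract validator might look like the following; the `validate_record` helper is a dependency-free stand-in for whatever validator you actually use:

```python
# test_user_events_contract.py -- illustrative pytest suite for the contract validator
ALLOWED_EVENTS = {"click", "view", "purchase", "logout"}

def validate_record(rec: dict) -> list[str]:
    """Minimal stand-in validator; returns a list of contract violations."""
    errors = []
    if not str(rec.get("user_id", "")).strip():
        errors.append("user_id missing or empty")
    if rec.get("event_type") not in ALLOWED_EVENTS:
        errors.append(f"invalid event_type: {rec.get('event_type')!r}")
    if not rec.get("timestamp"):
        errors.append("timestamp missing")
    return errors

def test_accepts_valid_record():
    rec = {"user_id": "u1", "event_type": "click", "timestamp": "2024-01-01T00:00:00Z"}
    assert validate_record(rec) == []

def test_rejects_missing_required_field():
    errors = validate_record({"event_type": "click", "timestamp": "2024-01-01T00:00:00Z"})
    assert "user_id missing or empty" in errors

def test_rejects_unknown_enum_value():
    rec = {"user_id": "u1", "event_type": "swipe", "timestamp": "2024-01-01T00:00:00Z"}
    assert any("event_type" in e for e in validate_record(rec))
```

Run these in the same CI pipeline as your application tests so a contract regression fails the build just like a code regression.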
Contract documentation: each contract should include examples. Show what valid data looks like. Show edge cases. This makes consumers understand the contract quickly without reading the spec. Auto-generate documentation from contracts: if contracts are YAML, generate readable HTML docs automatically. This ensures docs stay in sync with the actual contract.
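As a sketch, a generator that renders a parsed contract dict (for example, the YAML above after loading) into Markdown might look like this; the function name and output layout are illustrative:

```python
def contract_to_markdown(contract: dict) -> str:
    """Render a parsed contract dict as Markdown documentation."""
    lines = [
        f"# {contract['id']} (v{contract['version']})",
        f"Owner: {contract['owner']}",
        "",
        "| Column | Type | Nullable |",
        "|---|---|---|",
    ]
    for col in contract.get("schema", []):
        lines.append(f"| {col['name']} | {col['type']} | {col.get('nullable', True)} |")
    return "\n".join(lines)
```

Regenerating the docs in CI whenever the contract file changes keeps the human-readable view in lockstep with the enforced spec.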
Best practices: write simple contracts that focus on what matters. Over-specify and you'll constantly update it. Under-specify and consumers make wrong assumptions. Include schema, quality assertions, SLA, and policy. Evolve intentionally. Distribute systematically. Validate ruthlessly. This discipline prevents silent data corruption in production AI pipelines.
Contract governance: establish clear ownership and review processes. Some teams have review boards, others use lightweight models where producers own contracts and consumers approve changes. The governance structure should match organizational maturity. Startups can use informal processes. Large enterprises need formal governance. Establish clear ownership early to avoid confusion.
Key takeaway: the value of data contracts compounds over time. In month one the benefits may look marginal; by month six they are clearly apparent; by year two they are transformative. Build strong foundations, enforce them consistently, and let the benefits accumulate.
Implementation roadmap: phase one, establish a single critical contract and make it non-negotiable. Phase two, expand to adjacent pipelines. Phase three, add automation and tooling. Phase four, enable self-service contract management. Most organizations stall at phase two because the overhead feels high relative to the benefits. Push through: the ROI compounds dramatically once you have multiple contracts and automated enforcement.