01 — The Problem
Why Data Governance Matters for AI
LLMs are trained on vast datasets, many of which contain PII, copyrighted content, or data from users who never consented. When models regurgitate training data or make decisions based on biased datasets, governance failures compound harm: regulatory, reputational, and legal.
Governance Challenges
- Training data provenance: Who owns the training data? Was it collected with consent?
- Inference PII leakage: Can users extract training data by prompting?
- GDPR compliance: Can you honor data deletion requests after training?
- Copyright issues: Was the training data legally obtained?
- Bias & fairness: Do training data distributions favor/harm specific groups?
💡
Governance isn't compliance theater: Good governance improves model quality (clean data), reduces legal risk, and enables trust. Invest here early.
02 — Framework
The Five Governance Pillars
| Pillar | What It Covers | Tools/Technologies | Key Regulation |
|---|---|---|---|
| Lineage | Dataset origin, transformations, versions | Great Expectations, OpenLineage, dbt | Data provenance audit trails |
| Quality | Data completeness, correctness, freshness | Great Expectations, Soda, Monte Carlo | Data accuracy requirements |
| Privacy | PII detection, redaction, anonymization | Presidio, spaCy NER, cryptography | GDPR, CCPA, HIPAA |
| Consent | User opt-in/opt-out, licensing compliance | Consent management platforms, audit logs | GDPR lawful basis, CC licenses |
| Retention | Data deletion, archival, lifecycle policies | Data deletion APIs, log rotation, backup management | GDPR right-to-erasure, SOC 2 |
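The Quality pillar can start as plain assertions on completeness and freshness before adopting a dedicated tool like Great Expectations. A minimal sketch (column names and the 24-hour threshold are illustrative assumptions):

```python
from datetime import datetime, timedelta, timezone

def check_quality(rows, max_age_hours=24):
    """Run basic completeness and freshness checks on a list of records."""
    issues = []
    now = datetime.now(timezone.utc)
    for i, row in enumerate(rows):
        # Completeness: required fields must be present and non-empty
        for field in ("id", "email", "loaded_at"):
            if not row.get(field):
                issues.append(f"row {i}: missing {field}")
        # Freshness: data older than the threshold is flagged
        loaded = row.get("loaded_at")
        if loaded and now - loaded > timedelta(hours=max_age_hours):
            issues.append(f"row {i}: stale (loaded {loaded.isoformat()})")
    return issues

rows = [
    {"id": 1, "email": "a@example.com", "loaded_at": datetime.now(timezone.utc)},
    {"id": 2, "email": "", "loaded_at": datetime.now(timezone.utc)},
]
print(check_quality(rows))  # row 1 flagged for its missing email
```

Dedicated tools add scheduling, alerting, and per-dataset expectation suites on top of exactly this kind of check.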
03 — Detection & Redaction
PII Detection and Redaction
PII (personally identifiable information) must be detected and removed before training. Tools like Presidio use entity recognition + pattern matching to find names, emails, SSNs, addresses, etc., then redact or pseudonymize.
Python Example: Presidio Redaction
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
text = "John Smith's email is john@example.com. SSN: 123-45-6789"
# Detect PII
results = analyzer.analyze(text=text, language="en")
# Redact
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)
# Output: <PERSON>'s email is <EMAIL_ADDRESS>. SSN: <US_SSN>
Common PII Types to Detect
- Names: Person names (spaCy PERSON, NER models)
- Contact: Email, phone numbers (regex + validation)
- Identifiers: SSN, credit card, passport numbers (pattern matching)
- Location: Addresses, coordinates, workplace
- Sensitive: Health info, financial records, biometrics
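For the structured identifiers above, simple pattern matching complements NER-based tools. A rough sketch (the patterns are deliberately simplified, not production-grade validators):

```python
import re

# Simplified patterns for structured PII; real validators are stricter
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text):
    """Replace each matched PII span with its entity-type placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Reach me at jane@example.com or 555-867-5309. SSN: 123-45-6789"))
```

Regexes catch well-formed identifiers but miss free-text PII like names, which is why tools such as Presidio combine both approaches.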
⚠️
PII detection is imperfect: Context matters. "Paris" is a person name AND a city. Always manually validate before training on sensitive data.
04 — Provenance
Data Lineage Tracking
Know where your data came from: which sources, which transformations, which versions. Lineage tracking is critical for compliance audits, reproducibility, and debugging data quality issues.
Lineage Tools
- OpenLineage: Open standard for data lineage (multiple tools integrate)
- dbt: SQL transformation lineage, native lineage visualization
- Airbyte: Data pipeline lineage (sources → transformations → warehouses)
- Great Expectations: Data quality + implicit lineage (expectations per dataset version)
Example: dbt Lineage
-- dbt builds the lineage graph automatically from source() and ref() calls
-- models/staging/stg_users.sql
{{ config(materialized='table') }}

SELECT
    id,
    email,
    created_at,
    CURRENT_TIMESTAMP AS loaded_at
FROM {{ source('raw', 'users') }}  -- lineage: raw.users
WHERE deleted_at IS NULL

-- models/marts/fct_users.sql
{{ config(materialized='table') }}

SELECT * FROM {{ ref('stg_users') }}  -- lineage: stg_users → fct_users
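Outside SQL pipelines, the same provenance can be captured as a plain record per dataset version: origin, transformation, and a content hash. A minimal sketch (the field names are illustrative, not the OpenLineage schema):

```python
import hashlib
import json
from datetime import datetime, timezone

def lineage_record(source, transform, rows):
    """Capture origin, transformation, and a content hash for one dataset version."""
    content = json.dumps(rows, sort_keys=True, default=str).encode()
    return {
        "source": source,          # where the data came from
        "transform": transform,    # what was applied to it
        "version": hashlib.sha256(content).hexdigest()[:12],  # content-addressed version
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

rec = lineage_record(
    source="raw.users",
    transform="stg_users: drop soft-deleted rows",
    rows=[{"id": 1, "email": "a@example.com"}],
)
print(rec["source"], "->", rec["transform"])
```

Content-addressing the version means any change to the data yields a new version identifier, which is exactly what a compliance audit needs to pin down "which data trained which model."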
05 — Consent & Licensing
Consent Management & Data Licensing
GDPR requires a lawful basis for using personal data in training. Common bases: consent (opt-in), legitimate interest, contract, legal obligation. For public data, respect Creative Commons license terms (CC-BY requires attribution; NC variants forbid commercial use).
GDPR Lawful Bases for Training
- Consent: User explicitly opts in to training use
- Legitimate interest: Organization has a genuine interest (e.g., improving customer service) that is not overridden by the user's rights and freedoms; requires a documented balancing test
- Contract: Training is necessary for contractual obligation
- Legal obligation: Law requires training (rare)
For copyrighted content, respect the author's licensing. If data carries a CC-BY-SA license, derivative works must use a compatible share-alike license.
💡
Best practice: Document lawful basis for each dataset. Include in model cards (see model-governance concept).
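Documenting lawful basis per dataset can be as simple as a manifest that the training pipeline checks before ingesting anything. A sketch (the basis values follow the GDPR list above; the dataset names are made up):

```python
# Manifest mapping each dataset to its documented lawful basis and license
MANIFEST = {
    "support_tickets": {"basis": "legitimate_interest", "license": "proprietary"},
    "forum_posts": {"basis": "consent", "license": "CC-BY-SA"},
    "scraped_news": {"basis": None, "license": "unknown"},
}

ALLOWED_BASES = {"consent", "legitimate_interest", "contract", "legal_obligation"}
COPYLEFT = {"CC-BY-SA"}  # derivatives need a compatible share-alike license

def approve_for_training(name):
    """Return (approved, reason) for a dataset based on its manifest entry."""
    entry = MANIFEST[name]
    if entry["basis"] not in ALLOWED_BASES:
        return False, "no documented lawful basis"
    if entry["license"] in COPYLEFT:
        return True, "approved; share-alike terms apply to derivatives"
    return True, "approved"

for name in MANIFEST:
    print(name, approve_for_training(name))
```

The point is mechanical enforcement: a dataset with no documented basis never reaches the training job, and the manifest doubles as the documentation for model cards.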
06 — Audit Logging
Audit Trails & Immutable Logs
Log who accessed what data when. Audit logs prove compliance: "We can show that user X's data was deleted on date Y." Use immutable logging (append-only) and SIEM integration for security.
Structured Audit Logging (Python)
import structlog

# Configure structured JSON logging
structlog.configure(
    processors=[structlog.processors.JSONRenderer()],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

# Log data access
log.info(
    "data_accessed",
    user_id="user123",
    dataset_id="dataset_xyz",
    action="training_query",
    timestamp="2025-03-24T10:30:00Z",
    ip="192.168.1.1",
)

# Log deletion request
log.info(
    "deletion_request",
    user_id="user456",
    request_id="req_789",
    status="approved",
    deletion_date="2025-03-25",
)
SIEM Integration
Forward logs to Splunk, ELK, or cloud SIEM (CloudTrail, Azure Monitor) for centralized audit. Enables compliance reporting and breach investigation.
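"Append-only" can also be enforced in application code by chaining each entry to the hash of the previous one, so altering any record breaks every later hash. A minimal sketch (a complement to, not a substitute for, WORM storage and SIEM controls):

```python
import hashlib
import json

class AuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, event):
        record = {"event": event, "prev": self._prev}
        digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        record["hash"] = digest
        self.entries.append(record)
        self._prev = digest
        return digest

    def verify(self):
        """Recompute the chain; returns False if any entry was altered."""
        prev = "0" * 64
        for record in self.entries:
            body = {"event": record["event"], "prev": record["prev"]}
            digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
            if record["prev"] != prev or record["hash"] != digest:
                return False
            prev = digest
        return True

log = AuditLog()
log.append({"action": "data_accessed", "user_id": "user123"})
log.append({"action": "deletion_request", "user_id": "user456"})
print(log.verify())  # True

# Tampering with an earlier entry is detectable
log.entries[0]["event"]["user_id"] = "attacker"
print(log.verify())  # False
```

This is the property that lets an auditor trust "user X's data was deleted on date Y": the deletion record cannot be silently rewritten after the fact.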