Data Governance for AI

Data lineage, consent tracking, PII handling, retention policies, and audit trails for LLM training and inference

5 pillars · 7 sections · Python-first tools
Contents
  1. Why governance matters
  2. Five pillars
  3. PII detection
  4. Data lineage
  5. Consent & licensing
  6. Audit logging
  7. References
01 — The Problem

Why Data Governance Matters for AI

LLMs are trained on vast datasets, many of which contain PII, copyrighted content, private information, or data from users who never consented. When a model regurgitates training data or makes decisions based on biased datasets, governance failures compound the harm on three fronts: regulatory, reputational, and legal.

Governance Challenges

💡 Governance isn't compliance theater: Good governance improves model quality (clean data), reduces legal risk, and enables trust. Invest here early.
02 — Framework

The Five Governance Pillars

| Pillar | What It Covers | Tools/Technologies | Key Regulation |
|---|---|---|---|
| Lineage | Dataset origin, transformations, versions | Great Expectations, OpenLineage, dbt | Data provenance audit trails |
| Quality | Data completeness, correctness, freshness | Great Expectations, Soda, Monte Carlo | Data accuracy requirements |
| Privacy | PII detection, redaction, anonymization | Presidio, spaCy NER, cryptography | GDPR, CCPA, HIPAA |
| Consent | User opt-in/opt-out, licensing compliance | Consent management platforms, audit logs | GDPR lawful basis, CC licenses |
| Retention | Data deletion, archival, lifecycle policies | Data deletion APIs, log rotation, backup management | GDPR right-to-erasure, SOC 2 |
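The Consent and Retention pillars can be read as a filter applied before data reaches a training run. The sketch below is one minimal way to express that, assuming a hypothetical `Record` schema with a `consented` flag and a `collected_at` timestamp; the field names and the 365-day window are illustrative assumptions, not requirements taken from any regulation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Record:
    """One training example plus governance metadata (hypothetical schema)."""
    record_id: str
    text: str
    consented: bool          # user opted in to training use
    collected_at: datetime   # timezone-aware collection time


def eligible_for_training(records, retention, now=None):
    """Keep records that have consent and are still inside the retention window."""
    now = now or datetime.now(timezone.utc)
    return [
        r for r in records
        if r.consented and now - r.collected_at <= retention
    ]


now = datetime(2025, 3, 24, tzinfo=timezone.utc)
records = [
    Record("a", "fresh, consented", True, now - timedelta(days=10)),
    Record("b", "fresh, no consent", False, now - timedelta(days=10)),
    Record("c", "consented but expired", True, now - timedelta(days=400)),
]

# Illustrative 365-day retention window
kept = eligible_for_training(records, retention=timedelta(days=365), now=now)
print([r.record_id for r in kept])  # ['a']
```

Passing `now` explicitly keeps the check reproducible, which helps when the same filter has to be re-run later for an audit.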
03 — Detection & Redaction

PII Detection and Redaction

PII (personally identifiable information) must be detected and removed before training. Tools like Presidio use entity recognition + pattern matching to find names, emails, SSNs, addresses, etc., then redact or pseudonymize.

Python Example: Presidio Redaction

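from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# The default AnalyzerEngine relies on an installed spaCy English model
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "John Smith's email is john@example.com. SSN: 123-45-6789"

# Detect PII
results = analyzer.analyze(text=text, language="en")

# Redact: by default each detected span is replaced with an <ENTITY_TYPE> placeholder
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)
# Output shape: detected spans become placeholders such as <PERSON> and <EMAIL_ADDRESS>.
# Which spans are caught depends on the recognizers and NLP model in use.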
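Redaction drops the value entirely; pseudonymization, the other option mentioned above, replaces it with a stable token so that records from the same person can still be linked. The following is a standard-library sketch, not Presidio's implementation: the regular expressions are deliberately simplified, and the `pseudonymize` helper, token format, and demo key are assumptions made for illustration.

```python
import hashlib
import hmac
import re

# Simplified patterns for illustration only; real-world formats vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pseudonymize(text, secret_key):
    """Replace emails and SSN-shaped strings with stable keyed tokens."""
    def token(kind, value):
        digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        return f"<{kind}_{digest[:8]}>"

    # Lowercase emails first so case variants map to the same token
    text = EMAIL_RE.sub(lambda m: token("EMAIL", m.group().lower()), text)
    text = SSN_RE.sub(lambda m: token("SSN", m.group()), text)
    return text


key = b"demo-key-load-from-a-secret-store"  # placeholder; never hard-code real keys
first = pseudonymize("Contact john@example.com or JOHN@example.com. SSN: 123-45-6789", key)
second = pseudonymize("john@example.com wrote again", key)
print(first)
print(second)
```

A keyed HMAC is used rather than a plain hash so that someone without the key cannot confirm a guess by hashing candidate emails. Whether keyed tokens are sufficient for a given regulation is a legal question this sketch does not settle.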

Common PII Types to Detect

⚠️ PII detection is imperfect: Context matters. "Paris" can be a person's name or a city, and a detector may label it either way. Manually review a sample of the detector's output before training on sensitive data.
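Manual validation is easier to act on when it produces numbers. One possible approach, sketched here under the assumption that a reviewer has hand-labeled a sample as `(start, end, entity_type)` spans, is to compute precision and recall for the detector on that sample. The span format, the exact-match rule, and the example labels are all illustrative assumptions.

```python
def span_precision_recall(predicted, labeled):
    """Score detector spans against hand labels using exact (start, end, type) matches."""
    predicted, labeled = set(predicted), set(labeled)
    true_positives = len(predicted & labeled)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(labeled) if labeled else 1.0
    return precision, recall


text = "Paris emailed paris@example.com from Paris"

# Hand labels: the first "Paris" is a person, the last one is a city
labeled = {(0, 5, "PERSON"), (14, 31, "EMAIL_ADDRESS"), (37, 42, "LOCATION")}
# Hypothetical detector output that mistakes the person for a city
predicted = {(0, 5, "LOCATION"), (14, 31, "EMAIL_ADDRESS"), (37, 42, "LOCATION")}

precision, recall = span_precision_recall(predicted, labeled)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For pre-training redaction, recall usually matters more than precision, since a missed span leaks PII while a false positive only removes some harmless text.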
04 — Provenance

Data Lineage Tracking

Know where your data came from: which sources, which transformations, which versions. Lineage tracking is critical for compliance audits, reproducibility, and debugging data quality issues.

Lineage Tools

Example: dbt Lineage

-- dbt creates automatic lineage

-- models/staging/stg_users.sql
{{ config(materialized='table') }}

SELECT
    id,
    email,
    created_at,
    CURRENT_TIMESTAMP AS loaded_at
FROM {{ source('raw', 'users') }}  -- lineage: raw.users
WHERE deleted_at IS NULL

-- models/marts/fct_users.sql
{{ config(materialized='table') }}

SELECT *
FROM {{ ref('stg_users') }}  -- lineage: stg_users → fct_users
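Outside a dbt project, the same chain of source, transformation, and version can be recorded by hand. The sketch below uses only the standard library and is not the OpenLineage event format; the `LineageRecord` class, its fields, and the content-hash versioning scheme are assumptions chosen to keep the example small.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def content_hash(rows):
    """Fingerprint a dataset version by hashing its rows in canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class LineageRecord:
    """Where a dataset version came from (hypothetical schema)."""
    dataset: str
    version: str          # content hash of this dataset's rows
    transformation: str   # human-readable description of the step
    parents: tuple = ()   # (dataset, version) pairs this version was built from


raw_users = [
    {"id": 1, "email": "a@example.com", "deleted_at": None},
    {"id": 2, "email": "b@example.com", "deleted_at": "2025-01-01"},
]
# Same filter as the staging model: drop soft-deleted users
stg_users = [row for row in raw_users if row["deleted_at"] is None]

raw_rec = LineageRecord("raw.users", content_hash(raw_users), "ingest")
stg_rec = LineageRecord(
    "stg_users",
    content_hash(stg_users),
    "filter deleted_at IS NULL",
    parents=((raw_rec.dataset, raw_rec.version),),
)
print(json.dumps(asdict(stg_rec), indent=2))
```

Using a content hash as the version means any change to the rows yields a new version identifier, so a model can be tied to the exact data it was trained on.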
06 — Audit Logging

Audit Trails & Immutable Logs

Log who accessed which data, and when. Audit logs are the evidence of compliance: "We can show that user X's data was deleted on date Y." Use append-only (immutable) logging, and integrate with a SIEM (security information and event management) system for security monitoring.

Structured Audit Logging (Python)

import structlog

# Configure structured logging: stamp each event with a UTC ISO-8601 time, then render JSON
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

# Log data access
log.info(
    "data_accessed",
    user_id="user123",
    dataset_id="dataset_xyz",
    action="training_query",
    ip="192.168.1.1",
)

# Log deletion request
log.info(
    "deletion_request",
    user_id="user456",
    request_id="req_789",
    status="approved",
    deletion_date="2025-03-25",
)
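Structured logging alone does not make a log immutable. One common way to make an append-only log tamper-evident is a hash chain, where each entry stores the hash of the previous one. This is a technique the text above does not prescribe, shown here as an in-memory sketch; the `HashChainedLog` class and its field names are hypothetical, and a real deployment would also need durable, write-once storage.

```python
import hashlib
import json


class HashChainedLog:
    """In-memory append-only log where each entry commits to the previous one."""

    GENESIS = "0" * 64  # stand-in "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(entry):
        # Hash every field except the stored hash itself, in canonical JSON form
        payload = {k: v for k, v in entry.items() if k != "hash"}
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def append(self, event, **fields):
        if {"hash", "prev_hash"} & fields.keys():
            raise ValueError("'hash' and 'prev_hash' are reserved field names")
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {"event": event, "prev_hash": prev_hash, **fields}
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)
        return entry

    def verify(self):
        """Return True only if no entry was altered, removed, or reordered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


audit = HashChainedLog()
audit.append("data_accessed", user_id="user123", dataset_id="dataset_xyz")
audit.append("deletion_request", user_id="user456", request_id="req_789")
print(audit.verify())  # True

audit._entries[0]["user_id"] = "someone_else"  # simulate tampering
print(audit.verify())  # False
```

The chain detects edits but does not prevent them, and truncating the tail goes unnoticed unless the latest hash is also recorded somewhere outside the log.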

SIEM Integration

Forward logs to a central platform such as Splunk, ELK, or a cloud logging service (AWS CloudTrail, Azure Monitor) for centralized auditing. Centralization supports compliance reporting and breach investigation.

Tools & Platforms

Governance Tools

| Category | Tool | Description |
|---|---|---|
| PII Detection | Microsoft Presidio | Entity recognition + pattern-based PII detection and redaction |
| NLP | spaCy | Named entity recognition for PII and entities |
| Lineage | Great Expectations | Data quality + implicit lineage tracking |
| Lineage | OpenLineage | Open standard for data lineage and observability |
| Lineage | dbt | SQL transformation lineage and orchestration |
| Logging | structlog | Structured logging for Python (JSON-formatted audit logs) |
| SIEM | Splunk | Enterprise log aggregation and security monitoring |
| Consent | OneTrust | Consent and privacy management platform |
07 — Further Reading

References

Documentation
Regulatory & Standards
Research Papers