Immutable, append-only logs of every LLM interaction, tool call, and decision in an AI system — for debugging, compliance, cost attribution, and retroactive evaluation.
Log everything that affects an LLM decision: full prompt (system + user + history), model response (full text, not truncated), model name and version, temperature and sampling parameters, token counts, latency, cost, tool calls and their results, user ID, session ID, and request ID. Log at millisecond precision for latency analysis.
Audit logs must be append-only and tamper-evident. Use write-once storage: S3 with Object Lock, Postgres with triggers and revoked privileges blocking UPDATEs and DELETEs, or a dedicated audit log service (Datadog, Splunk). Never modify or delete audit records — retain errors and corrections as new log entries.
Store as structured JSON with an event envelope for queryability.
from pydantic import BaseModel, Field
from datetime import datetime
from typing import Any

class AuditEvent(BaseModel):
    event_id: str        # UUID
    timestamp: datetime
    event_type: str      # "llm_request", "tool_call", "decision"
    request_id: str      # Traces all events for one user request
    session_id: str
    user_id: str
    tenant_id: str

    # LLM-specific fields
    model: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0

    # Payload (encrypted if sensitive)
    payload: dict[str, Any] = Field(default_factory=dict)

    # Metadata
    app_version: str = ""
    environment: str = "production"
PII in prompts (names, emails, SSNs) requires special handling. Options: (1) redact before storage — use a PII detector (Presidio) to replace PII with tokens. (2) encrypt at rest with customer-managed keys (CMK) — store encrypted, decrypt on authorised access. (3) hash for identity without storing raw values. Never store API keys, passwords, or payment card data in audit logs.
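Option (3) can be sketched with the standard library: a keyed hash (HMAC) maps the same PII value to the same token across log entries, so per-user analysis still works, while the raw value is never stored and can't be recovered from the log alone. The key name and `pii_` prefix here are illustrative, not from the original text.

```python
import hashlib
import hmac

# Illustrative sketch of option (3): replace raw PII with a keyed hash so the
# same user maps to the same token across log entries, without storing the raw
# value. The key would live in a KMS/secret manager, not in source code.
PII_HASH_KEY = b"load-me-from-a-secret-manager"  # hypothetical key source

def pii_token(value: str) -> str:
    """Stable, non-reversible identifier for a PII value."""
    digest = hmac.new(PII_HASH_KEY, value.encode("utf-8"), hashlib.sha256)
    return "pii_" + digest.hexdigest()[:16]

# Same input always yields the same token; different inputs diverge.
a = pii_token("alice@example.com")
b = pii_token("alice@example.com")
c = pii_token("bob@example.com")
```

The keyed hash matters: a plain unkeyed SHA-256 of an email address is vulnerable to a dictionary attack (hash every plausible address and compare), while HMAC with a secret key is not.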
Index on: request_id (trace reconstruction), user_id (per-user cost), timestamp (time-range queries), event_type (filter by type), and cost_usd (top spenders). Use columnar storage (Parquet on S3 + Athena, or ClickHouse) for analytical queries. Example: 'show me all requests from user X in the past 7 days that cost more than $0.10'.
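The example query above can be sketched against an in-memory SQLite table for illustration; in production the same SQL shape would run against Parquet + Athena or ClickHouse. The table and column names mirror the event envelope, and the cutoff date is hardcoded as a stand-in for "7 days ago".

```python
import sqlite3

# Minimal sketch of the example query from the text, run against SQLite for
# illustration; the production target would be Athena or ClickHouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_events ("
    "request_id TEXT, user_id TEXT, timestamp TEXT, cost_usd REAL)"
)
rows = [
    ("req-1", "user-x", "2024-06-01T10:00:00Z", 0.25),
    ("req-2", "user-x", "2024-06-02T11:00:00Z", 0.05),  # below cost threshold
    ("req-3", "user-y", "2024-06-02T12:00:00Z", 0.40),  # different user
]
conn.executemany("INSERT INTO audit_events VALUES (?, ?, ?, ?)", rows)

# "All requests from user X in the past 7 days that cost more than $0.10."
expensive = conn.execute(
    "SELECT request_id, cost_usd FROM audit_events "
    "WHERE user_id = ? AND timestamp >= ? AND cost_usd > 0.10 "
    "ORDER BY cost_usd DESC",
    ("user-x", "2024-05-27T00:00:00Z"),  # stand-in for NOW() - 7 days
).fetchall()
# → [("req-1", 0.25)]
```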
GDPR: audit log proves when data was processed, by which model, for what purpose. SOC 2: audit trail demonstrates access controls and data handling policies. Healthcare (HIPAA): logs of PHI processing must be retained for 6 years. Financial services: full audit trail of AI-assisted decisions is often legally required. Design with compliance requirements in mind from day one — retrofitting is painful.
A production audit trail system requires append-only storage, immutable records, and fast retrieval. PostgreSQL with write-once semantics (enforced via revoked UPDATE/DELETE privileges and triggers) works well; specialized systems like ClickHouse or cloud audit services (AWS CloudTrail, GCP Audit Logs) handle massive scale automatically.
# PostgreSQL audit trail schema
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
event_type VARCHAR(64) NOT NULL,
actor_id UUID NOT NULL,
resource_id UUID,
change_before JSONB,
change_after JSONB,
timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT CURRENT_TIMESTAMP,
ip_address INET,
user_agent TEXT
);
-- Immutable: enforce no deletes/updates
CREATE OR REPLACE FUNCTION raise_immutable_error() RETURNS trigger AS $$
BEGIN
    RAISE EXCEPTION 'audit_log is append-only; UPDATE and DELETE are not permitted';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER audit_immutable BEFORE UPDATE OR DELETE ON audit_log
FOR EACH ROW EXECUTE FUNCTION raise_immutable_error();

Audit logs must be queryable by time range, actor, resource, and event type. Set up regular compliance reports that summarize deletions, permission changes, and sensitive-data access. Common queries include "who accessed customer data in the last 7 days?" and "what changed on this configuration file in Q4?"
# Query audit trail for compliance report
SELECT actor_id, COUNT(*) as action_count, MAX(timestamp) as last_action
FROM audit_log
WHERE event_type IN ('DELETE', 'UPDATE')
AND resource_id = 'prod-config'
AND timestamp > NOW() - INTERVAL '90 days'
GROUP BY actor_id
ORDER BY action_count DESC;

Compliance frameworks (SOC 2, HIPAA, PCI-DSS) mandate audit log retention: typically 1–7 years depending on industry and geography. Keep hot storage (the database) for recent logs and cold storage (S3/GCS) for archives. Legal holds can flag ranges of logs for indefinite retention while litigation or an investigation is underway.
| Retention Tier | Duration | Storage Medium | Access Latency | Cost/Month |
|---|---|---|---|---|
| Hot (Recent) | 90 days | PostgreSQL / CloudSQL | <10ms | $200–500 |
| Warm (Archive) | 1 year | S3 Standard / GCS Standard | 1–2s | $50–100 |
| Cold (Long-term) | 7 years | Glacier / Nearline (compressed) | 1–12h | $5–20 |
| Legal Hold | Indefinite | Immutable (e.g., WORM) | 1–2s | +50% |
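The tiering in the table above amounts to an age-based router with a legal-hold override. A minimal sketch, assuming the 90-day / 1-year boundaries from the table (tier names and the function itself are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Illustrative router for the retention table: choose a storage tier by record
# age, with legal hold overriding every age-based rule.
def retention_tier(event_time: datetime, now: datetime, legal_hold: bool = False) -> str:
    if legal_hold:
        return "legal_hold"      # WORM storage, indefinite retention
    age = now - event_time
    if age <= timedelta(days=90):
        return "hot"             # PostgreSQL, <10ms access
    if age <= timedelta(days=365):
        return "warm"            # S3 Standard / GCS Standard
    return "cold"                # Glacier / Nearline, hours to restore

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
tier_recent = retention_tier(now - timedelta(days=10), now)                    # "hot"
tier_old = retention_tier(now - timedelta(days=800), now)                      # "cold"
tier_hold = retention_tier(now - timedelta(days=800), now, legal_hold=True)    # "legal_hold"
```

A scheduled job would apply this routing during archival, moving records between tiers as they age.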
Critical Implementation Detail: Never trust client-supplied timestamps in audit logs. Use server-side, UTC-normalized timestamps only. Attackers can tamper with client time; server-side timestamps are authoritative and harder to spoof. Hash audit records and store hashes in append-only storage (blockchain-style) if you need cryptographic proof of tampering.
For compliance audits, prepare a monthly report showing new log retention policies, archival activities, and any purges (with legal justification). Many compliance failures occur not because logs aren't kept, but because the organization can't prove they were kept intact and were not altered. Document your entire audit pipeline clearly.
Audit Log Encryption & Tamper Detection: Audit logs are high-value targets for attackers; compromise an audit log, and you can hide your tracks. Always encrypt audit logs at rest (database-level encryption, S3 encryption, or filesystem encryption). For critical events (data deletion, permission changes), add cryptographic signatures: compute a hash of the event + previous event hash (blockchain-style), so any tampering is detectable. Use a hardware security module (HSM) to sign logs, making tampering even harder. For compliance, demonstrate that logs haven't been tampered with by publishing digests in external systems (e.g., send audit log hashes to a syslog server outside your infrastructure).
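The blockchain-style chaining described above can be sketched in a few lines: each record's hash covers its own payload plus the previous record's hash, so editing any earlier record invalidates every hash after it. This sketch uses a plain SHA-256 chain; a production system would additionally sign the chain head with an HSM-held key and publish digests externally, as noted.

```python
import hashlib
import json

# Hash-chain sketch: each record's hash covers its payload plus the previous
# record's hash, so tampering with any earlier record breaks the whole chain.
GENESIS = "0" * 64

def chain_hash(prev_hash: str, event: dict) -> str:
    body = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def verify_chain(events: list[dict], hashes: list[str]) -> bool:
    prev = GENESIS
    for event, expected in zip(events, hashes):
        if chain_hash(prev, event) != expected:
            return False
        prev = expected
    return True

events = [{"event_type": "delete", "actor_id": "u1"},
          {"event_type": "perm_change", "actor_id": "u2"}]
hashes = []
prev = GENESIS
for e in events:
    prev = chain_hash(prev, e)
    hashes.append(prev)

ok = verify_chain(events, hashes)           # True for the untampered chain
events[0]["actor_id"] = "attacker"          # tamper with an early record
tampered_ok = verify_chain(events, hashes)  # now False
```

Canonical JSON serialization (`sort_keys=True`, fixed separators) matters here: the same logical event must always hash to the same value.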
Retention policies vary by regulation: GDPR (right to erasure) conflicts with SOC 2 (retain audit logs). Implement a dual system: audit logs for compliance (never deleted), and personal data logs (redacted after retention period). Use PII masking in audit logs where possible: instead of logging "deleted user email@example.com", log "deleted user [MASKED_EMAIL]" or use hashed identifiers. This balances audit trail requirements with privacy obligations.
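The masking step described above can be sketched as a pre-write pass over the log message. A single regex stands in here for a real PII detector such as Presidio, which would cover names, SSNs, and other entity types:

```python
import re

# Illustrative masking pass: replace raw email addresses with a fixed token
# before the audit message is written. A real system would use a PII detector
# (e.g. Presidio) rather than one hand-rolled regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def mask_pii(message: str) -> str:
    return EMAIL_RE.sub("[MASKED_EMAIL]", message)

masked = mask_pii("deleted user email@example.com")
# → "deleted user [MASKED_EMAIL]"
```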
Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
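Structured JSON logging with the standard library can be sketched as a custom formatter that emits one JSON object per line; the `request_id` and `latency_ms` fields are illustrative examples of structured context passed via `extra=`:

```python
import io
import json
import logging

# Minimal structured-logging sketch: a formatter that emits one JSON object
# per log line, so fields can be filtered and aggregated downstream.
class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Attach structured fields passed via `extra=` (names are illustrative).
        for key in ("request_id", "latency_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

stream = io.StringIO()          # stand-in for stdout / a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("request served", extra={"request_id": "req-42", "latency_ms": 87.5})
line = json.loads(stream.getvalue())
```

Because every line is valid JSON, downstream tooling can filter on `request_id` or aggregate `latency_ms` percentiles without fragile text parsing.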
For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
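The gradual rollout described above is commonly implemented by hashing the user ID into a stable bucket in [0, 100), then raising the enabled threshold from 1% to 10% to 100%. A minimal sketch (function and flag names are illustrative):

```python
import hashlib

# Percentage-rollout sketch: hash (flag, user) into a stable bucket in [0, 100)
# so each user's decision is deterministic, then raise the threshold over time.
def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent

# A user's decision is stable for a given percentage (no flapping between
# model versions across requests)...
first = in_rollout("user-123", "new-model-v2", 10)
second = in_rollout("user-123", "new-model-v2", 10)

# ...and the enabled fraction tracks the target percentage.
enabled = sum(in_rollout(f"user-{i}", "new-model-v2", 10) for i in range(1000))
```

Hashing the flag name together with the user ID keeps bucket assignments independent across flags, so the same 10% of users are not always the guinea pigs for every experiment.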
Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.
The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.
Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.
Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.
Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.