AI Observability: What to Log, How Long to Keep It, and When to Alert
Observability is the difference between knowing your AI system failed and knowing why it failed, when it started, and which users were affected. PSF-4 defines the minimum logging, alerting, and retention requirements for production AI systems — this guide shows you how to implement them.
The Observability Gap in AI Systems
Traditional software observability (metrics, logs, traces) is necessary but not sufficient for AI systems. A model can be returning valid 200 responses while silently degrading in quality — drifting off-distribution, hallucinating more frequently, producing outputs that are technically correct but semantically wrong. Infrastructure monitoring doesn't catch this. PSF-4 requires you to monitor the intelligence layer, not just the infrastructure layer.
The PSF-4 minimum requirements are:
- Run logging: Every model invocation logged with input hash, output hash, model version, latency, token counts, and cost
- Trace retention: Full prompt+completion traces retained for a minimum of 30 days; 90 days for regulated use cases
- Alerting: Automated alerts on error rate, latency p99, cost anomaly, and output validation failure rate
- Incident detection: Ability to replay any production run within 24 hours of an incident report
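The retention requirement is simple to encode directly in code. Below is a minimal sketch assuming only the 30/90-day minimums listed above; the constant and function names are illustrative, not part of PSF-4.

```python
from datetime import datetime, timedelta, timezone

# Minimum retention windows for full prompt+completion traces (days).
TRACE_RETENTION_DAYS = {"default": 30, "regulated": 90}

def trace_past_minimum_retention(trace_timestamp: datetime, regulated: bool) -> bool:
    """True once a stored trace is past its minimum retention window
    and therefore eligible for purging. Expects a timezone-aware timestamp."""
    days = TRACE_RETENTION_DAYS["regulated" if regulated else "default"]
    return datetime.now(timezone.utc) - trace_timestamp > timedelta(days=days)
```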
The Minimum Logging Schema
Every model invocation should produce a structured log entry with at minimum:
{
"run_id": "uuid",
"timestamp": "2026-04-30T14:32:00Z",
"model": "gpt-4o-2024-08-06",
"model_version_hash": "sha256:abc123",
"input_hash": "sha256:def456", // hash only — not the raw input if PHI possible
"output_hash": "sha256:ghi789",
"latency_ms": 1243,
"input_tokens": 847,
"output_tokens": 312,
"total_cost_usd": 0.0043,
"validation_passed": true,
"confidence_score": 0.91,
"user_id_hash": "sha256:jkl012", // pseudonymised
"workflow_id": "invoice-approval-v3",
"environment": "production",
"tags": ["billing", "high-value"]
}
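The comments in the schema matter: when PHI or other sensitive data may appear in prompts, log hashes, not raw text. Here is a minimal sketch of a helper that emits an entry in this schema; the function name, parameters, and hashing helper are illustrative, and a real deployment would likely salt the user identifier hash.

```python
import hashlib
import time
import uuid

def _sha256(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_run_log(prompt: str, completion: str, *, model: str, model_version_hash: str,
                  user_id: str, workflow_id: str, latency_ms: int, input_tokens: int,
                  output_tokens: int, total_cost_usd: float, validation_passed: bool,
                  confidence_score: float, environment: str = "production",
                  tags=None) -> dict:
    """Assemble a structured run log entry matching the schema above.

    Prompt and completion text are hashed, never stored in this record,
    so the entry stays safe to retain even when inputs may contain PHI.
    """
    return {
        "run_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "model_version_hash": model_version_hash,
        "input_hash": _sha256(prompt),
        "output_hash": _sha256(completion),
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost_usd": total_cost_usd,
        "validation_passed": validation_passed,
        "confidence_score": confidence_score,
        "user_id_hash": _sha256(user_id),  # pseudonymised; use a salted hash in practice
        "workflow_id": workflow_id,
        "environment": environment,
        "tags": tags or [],
    }
```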
Alert Thresholds: The Four You Must Have
- Error rate: > 2% over a 5-minute window
- Output validation failure rate: > 5% over a 10-minute window
- Latency p99: > 3× baseline for 5 minutes
- Cost: > 200% of the 7-day rolling average
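A minimal sketch of evaluating these four checks against metrics aggregated from the run logs above; the MetricsWindow fields and the choice of baselines (trailing p99 for latency, hourly spend for cost) are assumptions for illustration, not PSF-4 definitions.

```python
from dataclasses import dataclass

@dataclass
class MetricsWindow:
    error_rate: float                    # fraction of runs that errored, last 5 minutes
    validation_failure_rate: float       # fraction failing output validation, last 10 minutes
    latency_p99_ms: float                # p99 latency over the last 5 minutes
    latency_p99_baseline_ms: float       # p99 latency baseline, e.g. trailing 24 hours
    hourly_cost_usd: float               # spend in the current hour
    rolling_avg_hourly_cost_usd: float   # 7-day rolling average hourly spend

def check_alerts(m: MetricsWindow) -> list[str]:
    """Evaluate the four alert thresholds; returns the alerts that fired."""
    alerts = []
    if m.error_rate > 0.02:
        alerts.append("error rate > 2% over 5-minute window")
    if m.validation_failure_rate > 0.05:
        alerts.append("validation failure rate > 5% over 10-minute window")
    if m.latency_p99_ms > 3 * m.latency_p99_baseline_ms:
        alerts.append("latency p99 > 3x baseline")
    if m.hourly_cost_usd > 2.0 * m.rolling_avg_hourly_cost_usd:
        alerts.append("cost > 200% of 7-day rolling average")
    return alerts
```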
Drift Detection: The Alert You Don't Have But Need
The most dangerous AI failure mode is silent drift — the model starts returning subtly worse outputs without triggering any infrastructure alert. Error rates stay low. Latency is normal. But quality is degrading.
Three practical drift detection approaches:
- Output distribution monitoring: Track the distribution of output classes over time. If a classifier that previously returned 60% positive/40% negative starts returning 90% positive, something has shifted (see the sketch after this list).
- Embedding drift: Compute embeddings of model outputs over time and monitor for distributional shift using tools like Arize Phoenix or Evidently AI. Drift in embedding space precedes observable quality degradation.
- Golden set evaluation: Run a fixed set of representative inputs through the model on every deployment. Track scores against a human-validated ground truth. Regression on the golden set = model degradation.
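As referenced above, output distribution monitoring can be implemented with a population stability index (PSI). This sketch assumes only that you can count output classes over a baseline window and a current window; the 0.25 threshold for "significant drift" is a commonly cited rule of thumb, not a PSF-4 value.

```python
import numpy as np

def population_stability_index(baseline_counts, current_counts, eps=1e-6):
    """PSI between two categorical output distributions.

    Rule of thumb: < 0.1 no meaningful shift, 0.1-0.25 moderate, > 0.25 significant drift.
    """
    p = np.asarray(baseline_counts, dtype=float)
    q = np.asarray(current_counts, dtype=float)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

# The classifier from the first bullet: 60/40 positive/negative drifting to 90/10.
baseline = [600, 400]
current = [900, 100]
psi = population_stability_index(baseline, current)
if psi > 0.25:
    print(f"Output distribution drift detected (PSI={psi:.2f})")  # fires: PSI is roughly 0.54
```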