AI Observability: What to Log, How Long to Keep It, and When to Alert
Observability is the difference between knowing your AI system failed and knowing why it failed, when it started, and which users were affected. PSF-4 defines the minimum logging, alerting, and retention requirements for production AI systems — this guide shows you how to implement them.
The Observability Gap in AI Systems
Traditional software observability (metrics, logs, traces) is necessary but not sufficient for AI systems. A model can be returning valid 200 responses while silently degrading in quality — drifting off-distribution, hallucinating more frequently, producing outputs that are technically correct but semantically wrong. Infrastructure monitoring doesn't catch this. PSF-4 requires you to monitor the intelligence layer, not just the infrastructure layer.
The PSF-4 minimum requirements are:
- Run logging: Every model invocation logged with input hash, output hash, model version, latency, token counts, and cost
- Trace retention: Full prompt+completion traces retained for a minimum of 30 days; 90 days for regulated use cases
- Alerting: Automated alerts on error rate, latency p99, cost anomaly, and output validation failure rate
- Incident detection: Ability to replay any production run within 24 hours of an incident report
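The retention requirement is simple to encode directly in code. Below is a minimal sketch assuming only the 30/90-day minimums listed above; the constant and function names are illustrative, not part of PSF-4.

```python
from datetime import datetime, timedelta, timezone

# Minimum retention windows for full prompt+completion traces (days).
TRACE_RETENTION_DAYS = {"default": 30, "regulated": 90}

def trace_past_minimum_retention(trace_timestamp: datetime, regulated: bool) -> bool:
    """True once a stored trace is past its minimum retention window
    and therefore eligible for purging. Expects a timezone-aware timestamp."""
    days = TRACE_RETENTION_DAYS["regulated" if regulated else "default"]
    return datetime.now(timezone.utc) - trace_timestamp > timedelta(days=days)
```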
The Minimum Logging Schema
Every model invocation should produce a structured log entry with at minimum:
{
"run_id": "uuid",
"timestamp": "2026-04-30T14:32:00Z",
"model": "gpt-4o-2024-08-06",
"model_version_hash": "sha256:abc123",
"input_hash": "sha256:def456", // hash only — not the raw input if PHI possible
"output_hash": "sha256:ghi789",
"latency_ms": 1243,
"input_tokens": 847,
"output_tokens": 312,
"total_cost_usd": 0.0043,
"validation_passed": true,
"confidence_score": 0.91,
"user_id_hash": "sha256:jkl012", // pseudonymised
"workflow_id": "invoice-approval-v3",
"environment": "production",
"tags": ["billing", "high-value"]
}
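The comments in the schema matter: when PHI or other sensitive data may appear in prompts, log hashes, not raw text. Here is a minimal sketch of a helper that emits an entry in this schema; the function name, parameters, and hashing helper are illustrative, and a real deployment would likely salt the user identifier hash.

```python
import hashlib
import time
import uuid

def _sha256(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_run_log(prompt: str, completion: str, *, model: str, model_version_hash: str,
                  user_id: str, workflow_id: str, latency_ms: int, input_tokens: int,
                  output_tokens: int, total_cost_usd: float, validation_passed: bool,
                  confidence_score: float, environment: str = "production",
                  tags=None) -> dict:
    """Assemble a structured run log entry matching the schema above.

    Prompt and completion text are hashed, never stored in this record,
    so the entry stays safe to retain even when inputs may contain PHI.
    """
    return {
        "run_id": str(uuid.uuid4()),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "model": model,
        "model_version_hash": model_version_hash,
        "input_hash": _sha256(prompt),
        "output_hash": _sha256(completion),
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost_usd": total_cost_usd,
        "validation_passed": validation_passed,
        "confidence_score": confidence_score,
        "user_id_hash": _sha256(user_id),  # pseudonymised; use a salted hash in practice
        "workflow_id": workflow_id,
        "environment": environment,
        "tags": tags or [],
    }
```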
Alert Thresholds: The Four You Must Have
- Error rate: > 2% over a 5-minute window
- Output validation failure rate: > 5% over a 10-minute window
- Latency p99: > 3× baseline for 5 minutes
- Cost: > 200% of the 7-day rolling average
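A minimal sketch of evaluating these four checks against metrics aggregated from the run logs above; the MetricsWindow fields and the choice of baselines (trailing p99 for latency, hourly spend for cost) are assumptions for illustration, not PSF-4 definitions.

```python
from dataclasses import dataclass

@dataclass
class MetricsWindow:
    error_rate: float                    # fraction of runs that errored, last 5 minutes
    validation_failure_rate: float       # fraction failing output validation, last 10 minutes
    latency_p99_ms: float                # p99 latency over the last 5 minutes
    latency_p99_baseline_ms: float       # p99 latency baseline, e.g. trailing 24 hours
    hourly_cost_usd: float               # spend in the current hour
    rolling_avg_hourly_cost_usd: float   # 7-day rolling average hourly spend

def check_alerts(m: MetricsWindow) -> list[str]:
    """Evaluate the four alert thresholds; returns the alerts that fired."""
    alerts = []
    if m.error_rate > 0.02:
        alerts.append("error rate > 2% over 5-minute window")
    if m.validation_failure_rate > 0.05:
        alerts.append("validation failure rate > 5% over 10-minute window")
    if m.latency_p99_ms > 3 * m.latency_p99_baseline_ms:
        alerts.append("latency p99 > 3x baseline")
    if m.hourly_cost_usd > 2.0 * m.rolling_avg_hourly_cost_usd:
        alerts.append("cost > 200% of 7-day rolling average")
    return alerts
```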
Drift Detection: The Alert You Don't Have But Need
The most dangerous AI failure mode is silent drift — the model starts returning subtly worse outputs without triggering any infrastructure alert. Error rates stay low. Latency is normal. But quality is degrading.
Three practical drift detection approaches:
- Output distribution monitoring: Track the distribution of output classes over time. If a classifier that previously returned 60% positive/40% negative starts returning 90% positive, something has shifted (see the sketch after this list).
- Embedding drift: Compute embeddings of model outputs over time and monitor for distributional shift using tools like Arize Phoenix or Evidently AI. Drift in embedding space precedes observable quality degradation.
- Golden set evaluation: Run a fixed set of representative inputs through the model on every deployment. Track scores against a human-validated ground truth. Regression on the golden set = model degradation.
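As referenced above, output distribution monitoring can be implemented with a population stability index (PSI). This sketch assumes only that you can count output classes over a baseline window and a current window; the 0.25 threshold for "significant drift" is a commonly cited rule of thumb, not a PSF-4 value.

```python
import numpy as np

def population_stability_index(baseline_counts, current_counts, eps=1e-6):
    """PSI between two categorical output distributions.

    Rule of thumb: < 0.1 no meaningful shift, 0.1-0.25 moderate, > 0.25 significant drift.
    """
    p = np.asarray(baseline_counts, dtype=float)
    q = np.asarray(current_counts, dtype=float)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum((q - p) * np.log(q / p)))

# The classifier from the first bullet: 60/40 positive/negative drifting to 90/10.
baseline = [600, 400]
current = [900, 100]
psi = population_stability_index(baseline, current)
if psi > 0.25:
    print(f"Output distribution drift detected (PSI={psi:.2f})")  # fires: PSI is roughly 0.54
```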