
Production AI Institute — PSF Domain Guide v1.0
Published: 2026-04-29 · License: CC BY 4.0
Domain: PSF-4 — Observability & Monitoring

Observability & Monitoring

Observability is the property of a system that allows you to understand its internal state from its external outputs. For AI systems in production, observability is not optional — it is the mechanism by which you know whether the system is working, degrading, drifting, or failing. Without it, you are operating blind.

What AI Observability Must Cover

Inference logging

Every model call should be logged: input (or a hash of it), output (or a sample), model version, latency, token counts, cost, and any routing or filtering decisions. This is the raw material for all other observability.
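
As a concrete shape, here is a minimal sketch of one such record written as a JSON line; the field names, the 500-character output sample, and the file-like sink are illustrative choices, not PSF-4 requirements.

    import hashlib
    import json
    import time
    import uuid


    def log_inference(prompt: str, output: str, model_version: str,
                      latency_ms: float, prompt_tokens: int,
                      completion_tokens: int, cost_usd: float,
                      route: str, sink) -> None:
        """Append one structured inference record as a JSON line."""
        record = {
            "request_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            # Hash the input so the record is linkable without storing raw text.
            "input_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
            # Store a bounded sample of the output rather than the full text.
            "output_sample": output[:500],
            "model_version": model_version,
            "latency_ms": latency_ms,
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "cost_usd": cost_usd,
            "route": route,  # routing/filtering decision, if any
        }
        sink.write(json.dumps(record) + "\n")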

Quality scoring

Define a quality metric for your system and score every output (or a statistically valid sample of outputs) against it. Quality scoring can be automated (LLM-as-judge, rule-based), human-reviewed, or a combination. The score must trend visibly over time.
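
Here is a minimal sketch of the sampling half of such a pipeline, with a toy rule-based scorer standing in for whatever judge you actually run (an LLM-as-judge call, task-specific checks, or a human-review queue); the sample rate and scoring rules are illustrative.

    import random
    from statistics import mean
    from typing import Callable, Sequence


    def rule_based_score(output: str) -> float:
        """Toy rule-based scorer; a real pipeline would swap in an
        LLM-as-judge call or task-specific checks with the same shape."""
        if not output.strip():
            return 0.0
        if "as an ai language model" in output.lower():
            return 0.5
        return 1.0


    def sampled_quality_score(outputs: Sequence[str],
                              score_fn: Callable[[str], float] = rule_based_score,
                              sample_rate: float = 0.05) -> float:
        """Score a random sample of outputs and return the mean."""
        if not outputs:
            return float("nan")
        k = max(1, int(len(outputs) * sample_rate))
        sample = random.sample(list(outputs), k)
        return mean(score_fn(o) for o in sample)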

Drift detection

Monitor the statistical properties of inputs and outputs over time. Population stability index (PSI), KL divergence, or simpler distribution statistics can detect when the data your system sees is shifting away from what it was designed for.
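
A minimal PSI sketch over one numeric feature of your traffic (prompt length, output length, an embedding statistic); it assumes the feature is continuous, so the baseline's quantile edges are distinct, and the 0.2 threshold in the usage below is the common "investigate" level echoed in the exam tips later in this guide.

    import numpy as np


    def population_stability_index(expected: np.ndarray,
                                   actual: np.ndarray,
                                   bins: int = 10) -> float:
        """PSI between a baseline sample and a current sample of one
        numeric feature. Bin edges come from the baseline's quantiles,
        so each expected bin holds roughly equal mass."""
        edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
        e_counts, _ = np.histogram(expected, bins=edges)
        a_counts, _ = np.histogram(actual, bins=edges)
        eps = 1e-6  # keep empty bins from producing log(0)
        e_pct = e_counts / e_counts.sum() + eps
        a_pct = a_counts / a_counts.sum() + eps
        return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))


    # Illustrative usage with synthetic prompt-length distributions.
    baseline = np.random.default_rng(0).lognormal(5.0, 0.5, 10_000)
    current = np.random.default_rng(1).lognormal(5.3, 0.5, 10_000)
    psi = population_stability_index(baseline, current)
    if psi > 0.2:  # common "investigate" threshold
        print(f"Input drift: PSI={psi:.3f}")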

Latency and cost dashboards

P50/P95/P99 latency, tokens per request, cost per request, and error rates should be visible in real time. Cost spikes often precede quality degradation — they are a leading indicator, not just an operations metric.
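
A sketch of the roll-up behind such a dashboard, assuming the record shape from the inference-logging sketch above plus an optional error flag on failed calls.

    import numpy as np


    def latency_cost_snapshot(records: list[dict]) -> dict:
        """Roll one dashboard window up from structured inference records."""
        latencies = np.array([r["latency_ms"] for r in records])
        costs = np.array([r["cost_usd"] for r in records])
        errors = sum(1 for r in records if r.get("error"))
        return {
            "p50_ms": float(np.percentile(latencies, 50)),
            "p95_ms": float(np.percentile(latencies, 95)),
            "p99_ms": float(np.percentile(latencies, 99)),
            "cost_per_request_usd": float(costs.mean()),
            "error_rate": errors / len(records),
        }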

Alert thresholds

Monitoring without alerting is archaeology. Define threshold-based alerts for: error rate, latency P99, quality score drop, drift PSI breach, and cost spike. Route alerts to named on-call owners, not generic inboxes.
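
One way to keep those rules auditable is to hold them as data, each carrying its named owner; the metric names below are keys into whatever snapshot dict you assemble (the dashboard roll-up above, the quality trend, PSI), and the thresholds and addresses are illustrative.

    from dataclasses import dataclass


    @dataclass
    class AlertRule:
        metric: str       # key in the metrics snapshot
        threshold: float  # alert fires when the metric exceeds this
        owner: str        # named on-call owner, not a generic inbox


    # Illustrative thresholds and addresses; tune per system and SLO.
    RULES = [
        AlertRule("error_rate", 0.02, "oncall-ml@example.com"),
        AlertRule("p99_ms", 4000.0, "oncall-ml@example.com"),
        AlertRule("quality_score_drop", 0.05, "quality-lead@example.com"),
        AlertRule("input_psi", 0.2, "quality-lead@example.com"),
        AlertRule("cost_per_request_usd", 0.05, "oncall-ml@example.com"),
    ]


    def evaluate(snapshot: dict, rules: list[AlertRule]) -> list[str]:
        """Return one alert line per breached rule, addressed to its owner."""
        return [
            f"ALERT {r.metric}={snapshot[r.metric]:.4f} > {r.threshold} -> {r.owner}"
            for r in rules
            if snapshot.get(r.metric, 0.0) > r.threshold
        ]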

Model version tracking

Every inference log must include the model version (including provider-side model versions, which can change silently for hosted models). Unexplained quality changes are almost always correlated with model version changes.
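
One cheap check this enables, sketched below: tally the provider-reported versions in any log window; more than one version under the same alias is worth correlating with any quality change in that window.

    from collections import Counter


    def model_versions_seen(records: list[dict]) -> Counter:
        """Tally provider-reported model versions in a window of
        inference records (the shape sketched under Inference logging)."""
        return Counter(r["model_version"] for r in records)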

The Silent Degradation Problem

The most dangerous AI production failure mode is silent degradation — the system continues to respond, error rates remain low, but output quality is declining. This is invisible without active quality monitoring. A system that returns plausible-sounding but increasingly incorrect outputs will not trigger any infrastructure alert. Only a quality scoring pipeline — running continuously and trending over time — will catch it. This is why PSF-4 mandates quality scoring, not just infrastructure health monitoring.
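
A minimal trend check, assuming the scoring pipeline above produces one mean quality score per day; the window and drop threshold are illustrative.

    from statistics import mean


    def silent_degradation(daily_scores: list[float],
                           window: int = 7,
                           drop_threshold: float = 0.05) -> bool:
        """Compare the last `window` days of mean quality against the
        prior window. Infrastructure metrics can stay green the whole
        time; only this trend catches the decline."""
        if len(daily_scores) < 2 * window:
            return False
        recent = mean(daily_scores[-window:])
        prior = mean(daily_scores[-2 * window:-window])
        return (prior - recent) > drop_threshold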

Logging Architecture for AI Systems

AI inference logs have different characteristics from application logs. They are larger (full prompt and response text), more sensitive (they may contain PII, confidential data, or proprietary system prompts), and more valuable for retrospective analysis. Log architecture for AI systems must therefore address:

  • Structured format: JSON, not free text.
  • PII handling: redaction at the logging layer, not just at the application layer (a sketch follows this list).
  • Retention policy: long enough for model comparison, short enough to meet data protection obligations.
  • Access controls: inference logs should not be universally readable.
  • Searchability: you need to find the exact prompt and response for any given incident.
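
For the PII-handling point, here is a minimal sketch of redaction at the logging layer using Python's standard logging filters; the two regex patterns are illustrative stand-ins, not a complete PII taxonomy.

    import logging
    import re

    # Illustrative patterns only; a production redactor needs a vetted
    # PII detection library, not two regexes.
    PII_PATTERNS = [
        (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
        (re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b"), "<SSN>"),
    ]


    class RedactingFilter(logging.Filter):
        """Redact PII in the logging layer itself, so nothing upstream
        can forget to scrub before the record reaches the sink."""

        def filter(self, record: logging.LogRecord) -> bool:
            msg = record.getMessage()
            for pattern, repl in PII_PATTERNS:
                msg = pattern.sub(repl, msg)
            record.msg, record.args = msg, None
            return True


    logger = logging.getLogger("inference")
    logger.addFilter(RedactingFilter())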

PSF-4 Compliance Checklist

Structured inference logging for every model call (input hash, output sample, model version, latency, cost)
Quality scoring pipeline running continuously with visible trends
Drift detection implemented on input and output distributions
Real-time latency, error rate, and cost dashboards
Alert thresholds defined and routed to named on-call owners
Model version tracked in every inference log (including provider-side versions)
Log retention policy defined — long enough for analysis, short enough for data protection
PII redaction applied at the logging layer
Access controls on inference logs (not universally readable)
Monthly review of quality trends, not just incident-triggered review

Provider-Side Model Changes

A specific PSF-4 risk that catches many teams off-guard: hosted model providers (OpenAI, Anthropic, Google, etc.) update their models without always providing prominent advance notice. A model that was GPT-4-turbo-2024-04-09 in your logs is a specific version; GPT-4-turbo without a version pinned means you may be running a different model today than you were last month. PSF-4 requires version pinning where the API supports it, and quality monitoring rigorous enough to detect silent version changes where it does not.
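
A hedged sketch of both halves, using the v1-style OpenAI Python SDK as the example provider: pin a dated snapshot, then compare the model the response reports against the one you requested.

    from openai import OpenAI  # assumes the v1.x OpenAI Python SDK

    PINNED = "gpt-4-turbo-2024-04-09"  # a dated snapshot, not a floating alias

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=PINNED,
        messages=[{"role": "user", "content": "ping"}],
    )
    # The response reports which model actually served the call; log it
    # alongside the version you requested and compare.
    if resp.model != PINNED:
        print(f"Provider served {resp.model}, not pinned {PINNED}")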

AIDA Exam Tips for PSF-4

  • PSF-4 questions test whether you know the difference between infrastructure monitoring (latency, errors) and AI-specific monitoring (quality scoring, drift). Infrastructure monitoring alone is not PSF-4 compliance.
  • Silent degradation scenario: a system looks healthy on all infrastructure metrics but users are complaining about quality. The PSF-4 answer is a quality scoring pipeline — not more infrastructure alerts.
  • Drift detection questions: know that PSI (Population Stability Index) is the standard metric for detecting input distribution shift. A PSI > 0.2 is a common threshold for requiring investigation.
  • Version tracking: if a scenario describes unexplained quality changes after a model provider update, the PSF-4 failure is lack of model version pinning + no version tracking in inference logs.
  • Cost spike questions: in PSF-4 context, a sudden cost spike is a signal to investigate quality, not just to optimise spend. It often indicates prompt injection (longer, adversarial prompts) or model routing failure.

Certifications that assess PSF-4

AIDA Examination · CLLO — LLM Operations · CPAP Portfolio