Production AI Institute · Independent certification for production AI practice


Production AI Institute — PSF Domain Guide v1.0
Published: 2026-04-29 · License: CC BY 4.0
Domain: PSF-5 — Deployment Safety

Deployment Safety

Deploying an AI system to production is not the end of the risk management process — it is the beginning of it. PSF-5 addresses the operational controls that sit between a model passing evaluation and that model serving real users with real consequences. These controls do not assume the model is perfect. They assume it is not.

Why AI Deployments Fail Differently

Traditional software deployments fail in ways that are usually deterministic and quickly observable: the service crashes, returns a 500, or produces obviously wrong output. AI system deployments fail in ways that are probabilistic, gradual, and often invisible to infrastructure monitoring. A new model version may perform worse on a specific input distribution that wasn't represented in the evaluation set. A fine-tuned model may have lost capability in a domain that wasn't tested. The deployment controls that work for conventional software need to be extended — not replaced — for AI systems.

Canary Releases for AI Systems

A canary release routes a small percentage of production traffic to the new model version while the majority of traffic continues on the existing version. For AI systems, canary releases require a quality scoring mechanism capable of comparing outputs from the canary and the control — not just error rates. A canary that produces slightly lower quality outputs at the same latency and error rate will look identical on infrastructure metrics. Define a canary success criterion that includes output quality, not just availability.

PSF-5 Deployment Controls

Canary/progressive release

Route new model versions to a small traffic percentage initially. Define promotion criteria (quality metrics, not just error rate) and automate rollback if criteria are not met within a defined observation window.
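A quality-based promotion criterion like the one described above can be sketched as a simple decision function. The function name, thresholds, and scoring inputs are illustrative assumptions, not part of the PSF framework itself:

```python
from statistics import mean

def promote_canary(control_scores, canary_scores, canary_error_rate,
                   max_error_rate=0.01, max_quality_drop=0.02):
    """Decide whether to promote a canary model version.

    Illustrative criteria: the canary's error rate must stay under a
    ceiling AND its mean quality score must not fall more than
    max_quality_drop below the control's. Thresholds are assumptions.
    """
    if canary_error_rate > max_error_rate:
        return False  # availability criterion failed
    if mean(canary_scores) < mean(control_scores) - max_quality_drop:
        return False  # quality criterion failed -> trigger rollback
    return True
```

Note that a canary passing only the error-rate check would be promoted under infrastructure-style criteria; the quality comparison is what makes this an AI-specific control.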

Rollback procedure

Every AI deployment must have a tested, documented rollback procedure with a defined RTO (recovery time objective). 'We can roll back' is not a procedure. 'Rollback takes 15 minutes, triggered by the on-call, restores model version X' is a procedure.
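The difference between 'we can roll back' and a procedure can be made concrete by recording the rollback as structured data. The version identifier, trigger, and steps below are hypothetical placeholders:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RollbackProcedure:
    """A documented rollback: a named target, a defined RTO, a named
    trigger, and concrete steps. All values here are illustrative."""
    target_version: str
    rto_minutes: int
    trigger: str
    steps: tuple

ROLLBACK = RollbackProcedure(
    target_version="model-v1.4.2",  # hypothetical known-good version
    rto_minutes=15,
    trigger="on-call engineer or automated canary failure",
    steps=(
        "repoint serving alias to target_version",
        "verify health checks on restored version",
        "announce rollback in incident channel",
    ),
)
```

A record like this can be linted in CI (does every deployed model have a rollback target and RTO?) and exercised in game days, which is what makes the procedure testable.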

Circuit breakers

Implement circuit breakers that automatically suspend AI system operation when error rate, latency, or quality score breaches a defined threshold. The circuit breaker should activate the fallback path, not leave the system in a degraded state.
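A minimal sketch of such a breaker, assuming a rolling window of per-request quality scores and error flags; the thresholds and window size are illustrative:

```python
class QualityCircuitBreaker:
    """Opens (suspends the AI path) when the rolling mean quality score
    or rolling error rate breaches a threshold. When open, requests are
    routed to the fallback handler, not left on a degraded AI path."""

    def __init__(self, min_quality=0.8, max_error_rate=0.05, window=100):
        self.min_quality = min_quality
        self.max_error_rate = max_error_rate
        self.window = window
        self.results = []   # recent (quality_score, was_error) pairs
        self.open = False   # open == AI system suspended

    def record(self, quality_score, was_error):
        self.results.append((quality_score, was_error))
        self.results = self.results[-self.window:]
        scores = [q for q, _ in self.results]
        errors = [e for _, e in self.results]
        if (sum(scores) / len(scores) < self.min_quality
                or sum(errors) / len(errors) > self.max_error_rate):
            self.open = True  # activate the fallback path

    def route(self, ai_handler, fallback_handler, request):
        handler = fallback_handler if self.open else ai_handler
        return handler(request)
```

The key design choice, per the text above: tripping the breaker changes routing to the designed fallback state rather than merely raising an alert.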

Environment parity

The evaluation environment should match production as closely as possible. Differences in data distribution, system prompts, integration context, and user behaviour between evaluation and production are common sources of deployment surprises.
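A parity assessment can start as a simple diff of environment descriptors. The keys below (system prompt, data source, and so on) are examples, not a prescribed schema:

```python
def parity_gaps(eval_env: dict, prod_env: dict) -> dict:
    """Return every key whose value differs between the evaluation and
    production environment descriptors, mapped to the (eval, prod)
    pair. Missing keys show up as None on one side."""
    keys = set(eval_env) | set(prod_env)
    return {k: (eval_env.get(k), prod_env.get(k))
            for k in keys
            if eval_env.get(k) != prod_env.get(k)}
```

An empty result does not prove parity (data distribution and user behaviour are harder to diff), but a non-empty result is a cheap, early warning before deployment.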

Shadow mode testing

Before promoting a new model version, run it in shadow mode: it receives production traffic and generates outputs, but those outputs are not served to users. Compare shadow outputs against the current model. This is the highest-fidelity pre-promotion evaluation.
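The shadow-mode comparison described above can be sketched as follows, assuming a `score(output, request)` quality-scoring callable (an assumption of this sketch, since the guide does not specify the scoring interface):

```python
from statistics import mean

def shadow_compare(requests, current_model, shadow_model, score):
    """Run the candidate in shadow mode: both models process the same
    production requests, but only current_model's outputs are served.
    Returns per-request score deltas (shadow minus current)."""
    deltas = []
    for req in requests:
        served = current_model(req)   # this output goes to the user
        shadowed = shadow_model(req)  # this output is logged only
        deltas.append(score(shadowed, req) - score(served, req))
    return deltas

def shadow_passes(deltas, min_mean_delta=0.0):
    # Promote only if the shadow model is at least as good on average.
    return mean(deltas) >= min_mean_delta
```

Because the shadow model sees real production traffic, this comparison is free of the distribution-mismatch problem that affects offline evaluation sets.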

Deployment freeze windows

Define periods during which AI model changes are prohibited: high-traffic periods, the run-up to major business events, and periods of reduced on-call coverage. AI deployment incidents during peak traffic have disproportionate impact.

The Fallback Path

Every production AI system must have a defined fallback path — what happens when the AI component is suspended or unavailable. The fallback path is not a failure mode; it is a designed operational state. Acceptable fallback paths include: routing to a simpler, more reliable model; returning a structured 'not available' response; queuing requests for batch processing; or routing to a human queue. The fallback path must be tested regularly — not just documented. A fallback that has never been exercised under load may not work when needed.
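A fallback chain using three of the options listed above (simpler model, human queue, structured 'not available' response) might be sketched like this; the handler names and response shape are illustrative:

```python
def handle_request(request, ai_available, ai_handler,
                   simpler_model, human_queue):
    """Route a request through a designed fallback chain.

    When the AI path is suspended, try a simpler, more reliable model
    first; if that also fails, queue the request for a human and return
    a structured 'not available' response. Names are assumptions."""
    if ai_available:
        return ai_handler(request)
    try:
        return simpler_model(request)   # first fallback: simpler model
    except Exception:
        human_queue.append(request)     # last resort: human queue
        return {"status": "queued", "detail": "AI assistance unavailable"}
```

Because this chain is ordinary code, it can be exercised regularly in tests and load drills, which is exactly the 'tested, not just documented' requirement above.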

PSF-5 Compliance Checklist

  • Canary release process defined with quality-based (not just availability-based) promotion criteria
  • Rollback procedure documented with tested RTO
  • Circuit breakers implemented with defined thresholds and tested activation
  • Fallback path defined, documented, and regularly tested under realistic conditions
  • Environment parity assessment completed before each major model deployment
  • Shadow mode testing capability available for high-risk model updates
  • Deployment freeze windows defined and enforced
  • On-call runbook includes AI-specific incident response steps
  • Post-deployment monitoring window defined (enhanced monitoring for 48–72 hours after any model change)
  • Deployment approval process includes sign-off from named technical owner

AIDA Exam Tips for PSF-5

  • PSF-5 is about the deployment process, not the model itself. Questions that describe what happens DURING a deployment (rollout, rollback, incidents) are PSF-5.
  • Canary release questions: the correct answer includes quality metric monitoring during the canary phase, not just error rate monitoring.
  • Rollback questions: 'we can roll back' is not sufficient. The exam tests whether you know rollback needs a defined RTO and a tested procedure.
  • Circuit breaker questions: the PSF-5 circuit breaker activates a fallback path. 'Alert the team and investigate' is not a circuit breaker response.
  • Environment parity is a common exam blind spot — differences between staging/evaluation and production environments are a leading cause of post-deployment surprises in AI systems.

Certifications that assess PSF-5

AIDA Examination · CLLO — LLM Operations · CPAP Portfolio
Full PSF Framework · Study Guide · Practice Exam