AI Agent Production Ready Checklist
A PSF-aligned sign-off checklist for autonomous agents: 32 controls across eight domains, with a ship, harden, or rebuild decision framework.
An agent is production ready when it can run unattended against real users, real data, and real tools without the silent failure modes that tutorials do not surface. Assessments indicate that most agent pilots pass demo evals but still fail on input governance, output validation, or post-deployment monitoring; the monitoring gaps are documented in NIST's 2025 work on deployed AI system monitoring (NIST.AI.800-4).
This checklist maps 32 concrete controls to the eight domains of the Production Safety Framework (PSF). It complements the narrative primer Your AI Agent Isn't Production Ready with an operational sign-off format that procurement and platform teams can reuse.
How to use this checklist
Assign a named owner per domain. Score each item yes, no, or not applicable, and attach evidence links (runbook, test output, dashboard) to every score. Require all applicable items to be yes before production traffic. For agents with tool access or write permissions, treat PSF-6 (Human Oversight) and PSF-7 (Security) as blocking domains regardless of demo quality.
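The scoring rule above can be expressed as a small gate. This is a minimal sketch, not part of the PSF itself: the `ChecklistItem` type, the `blocking_items` and `ready_for_production` helpers, and the rule that a "yes" without an evidence link still blocks are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal

Score = Literal["yes", "no", "n/a"]


@dataclass(frozen=True)
class ChecklistItem:
    domain: str         # e.g. "PSF-6" (label format assumed for illustration)
    control: str        # short description of the control
    score: Score
    evidence: str = ""  # link to a runbook, test output, or dashboard


def blocking_items(items: list[ChecklistItem]) -> list[ChecklistItem]:
    """Return the items that should block production traffic.

    Assumed rule: an applicable item blocks if it is scored "no",
    or scored "yes" without an evidence link.
    """
    blocked = []
    for item in items:
        if item.score == "n/a":
            continue  # not applicable items never block
        if item.score == "no" or not item.evidence.strip():
            blocked.append(item)
    return blocked


def ready_for_production(items: list[ChecklistItem]) -> bool:
    """True only when every applicable item is an evidenced "yes"."""
    return not blocking_items(items)
```

Running `blocking_items` in CI against the completed checklist gives each domain owner a concrete list of what still prevents sign-off.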
OpenAI's Agents SDK deployment guidance (May 2026) expects a /health endpoint, versioned prompts, runtime secret injection, and trace export via flush_traces() in long-running workers. Those requirements align with PSF-4 (Observability) and PSF-5 (Deployment Safety) below, and runtime secret injection also appears under PSF-7 (Security); they are necessary but not sufficient for full PSF coverage.
Input Governance
Practitioner guide: Input Governance implementation guide · Certification: CAIS
- Maximum input length enforced before any LLM call, with rejection logging
- Rate limiting at user and token level, not only at the API gateway
- PII detection on inbound user content before it enters the model context
- Prompt injection test suite run against direct and indirect attack patterns
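The first two controls in this domain can be sketched with the standard library alone. The example below is illustrative and assumes a character-based limit (`MAX_INPUT_CHARS`) and a per-user token bucket whose `cost` is an estimated LLM token count; the names `enforce_max_length`, `TokenBucket`, and `InputRejected` are hypothetical, and a real deployment would likely keep bucket state in a shared store rather than in process memory.

```python
import logging
import time
from dataclasses import dataclass, field
from typing import Optional

logger = logging.getLogger("input_governance")

MAX_INPUT_CHARS = 8_000  # assumed limit; tune to the model's context budget


class InputRejected(Exception):
    """Raised when inbound content fails a pre-LLM check."""


def enforce_max_length(user_id: str, text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Reject over-long input before any LLM call and log the rejection."""
    if len(text) > limit:
        logger.warning(
            "rejected input: user=%s length=%d limit=%d", user_id, len(text), limit
        )
        raise InputRejected(f"input exceeds {limit} characters")
    return text


@dataclass
class TokenBucket:
    """Per-user budget that refills continuously up to `capacity`."""

    capacity: float
    refill_per_second: float
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Spend `cost` if the bucket holds enough; otherwise refuse."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Keeping one bucket per user ID, with `cost` set to the estimated token count of the request, limits both request rate and token spend inside the application rather than only at the gateway.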
Output Validation
Practitioner guide: Output Validation implementation guide · Certification: CAOP
- Structured outputs validated against a schema before downstream writes
- Business logic checks reject impossible values (dates, amounts, IDs)
- Low-confidence outputs routed to human review, not auto-executed
- Golden eval set scored on every prompt or model change before rollout
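As an illustration of the schema and business-logic controls above, the sketch below validates a model's JSON output before any downstream write. The refund scenario, the `ORD-` ID prefix, the 500 ceiling, and the names `RefundRequest`, `parse_refund`, and `OutputRejected` are all hypothetical; a schema library could replace the hand-written checks.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation


class OutputRejected(ValueError):
    """Raised when model output fails schema or business-logic checks."""


@dataclass(frozen=True)
class RefundRequest:
    order_id: str
    amount: Decimal
    refund_date: date


def parse_refund(raw: str, today: date, max_amount: Decimal = Decimal("500")) -> RefundRequest:
    """Validate raw model output; raise OutputRejected instead of writing bad data."""
    # Schema checks: valid JSON object with the required fields and types.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputRejected(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise OutputRejected("expected a JSON object")
    missing = {"order_id", "amount", "refund_date"} - data.keys()
    if missing:
        raise OutputRejected(f"missing fields: {sorted(missing)}")
    order_id = data["order_id"]
    if not isinstance(order_id, str) or not order_id.startswith("ORD-"):
        raise OutputRejected("order_id must look like 'ORD-...'")
    try:
        amount = Decimal(str(data["amount"]))
        refund_date = date.fromisoformat(str(data["refund_date"]))
    except (InvalidOperation, ValueError) as exc:
        raise OutputRejected(f"bad amount or date: {exc}") from exc

    # Business-logic checks: reject impossible values.
    if not amount.is_finite() or not (Decimal("0") < amount <= max_amount):
        raise OutputRejected(f"amount {amount} outside (0, {max_amount}]")
    if refund_date > today:
        raise OutputRejected("refund_date cannot be in the future")
    return RefundRequest(order_id, amount, refund_date)
```

Only a successfully constructed `RefundRequest` should reach the write path; a rejected output is a natural candidate for the human review queue.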
Data Protection
Practitioner guide: Data Protection implementation guide · Certification: CAIA
- Data minimisation documented: each prompt field has a named lawful purpose
- PII masked or tokenised before transmission to third-party model APIs
- Data processor agreements in place with model and tool providers
- Erasure path verified: deleted user data leaves logs, caches, and vector stores
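One way to approach the masking and tokenisation control is keyed pseudonymisation before the prompt leaves your boundary. The sketch below covers email addresses only and is deliberately narrow: the regex, the `<EMAIL_...>` token format, and the helper names `tokenise` and `mask_emails` are assumptions, and pattern matching on its own will miss many kinds of PII.

```python
import hashlib
import hmac
import re

# Simplified email pattern for illustration; it is not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenise(value: str, key: bytes) -> str:
    """Replace a value with a stable keyed token.

    The same value and key always produce the same token, so the model
    can still tell two mentions of one address apart from two addresses.
    """
    digest = hmac.new(key, value.lower().encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<EMAIL_{digest[:12]}>"


def mask_emails(text: str, key: bytes) -> str:
    """Tokenise every email address before text is sent to a third-party API."""
    return EMAIL_RE.sub(lambda match: tokenise(match.group(0), key), text)
```

The key should come from a secret store at runtime. Rotating it breaks the link between old tokens and new ones, which can also help with the erasure control.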
Observability
Practitioner guide: Observability implementation guide · Certification: CAOP
- End-to-end traces capture LLM calls, tool invocations, and handoffs
- Latency tracked at P50, P95, and P99 with alerting on regression
- Token cost per request monitored with budget thresholds
- Conversation logs stored with PII redaction and searchable incident fields
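The latency control can be illustrated with a nearest-rank percentile and a baseline comparison. This is a sketch under stated assumptions: the 20% tolerance, the baseline mapping, and the function names `percentile` and `latency_regressions` are hypothetical, and most teams would compute these values in their metrics backend rather than in application code.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_regressions(
    samples_ms: list[float],
    baseline_ms: dict[int, float],
    tolerance: float = 1.2,  # assumed: alert when 20% slower than baseline
) -> dict[int, float]:
    """Return {percentile: current value} for each percentile that regressed."""
    alerts = {}
    for pct, baseline in baseline_ms.items():
        current = percentile(samples_ms, pct)
        if current > baseline * tolerance:
            alerts[pct] = current
    return alerts
```

Tracking P95 and P99 separately matters because a tail regression, for example one slow tool call, can leave the median unchanged.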
Deployment Safety
Practitioner guide: Deployment Safety implementation guide · Certification: CLOE
- Kill switch disables the agent in under 60 seconds without redeploying
- Prompt and model versions stored in version control with changelogs
- Canary rollout: new versions start at 1% to 5% of traffic with promotion gates
- Rollback tested: previous version restorable in under five minutes
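The kill switch and canary controls can share one routing decision. The sketch below assumes the flags live in an external flag store that can be changed without redeploying; the flag names (`agent_enabled`, `canary_percent`, `stable_version`, `canary_version`) and the helpers `canary_bucket` and `choose_version` are hypothetical.

```python
import hashlib
from typing import Optional


def canary_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in [0, 100) so routing is sticky per user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_version(user_id: str, flags: dict) -> Optional[str]:
    """Pick the agent version for this request, or None if the agent is disabled.

    `flags` is assumed to be read from a flag store at request time.
    """
    # Kill switch: fail closed when the flag is off or missing.
    if not flags.get("agent_enabled", False):
        return None
    # Canary: users whose bucket falls below the percentage get the new version.
    if canary_bucket(user_id) < flags.get("canary_percent", 0):
        return flags["canary_version"]
    return flags["stable_version"]
```

With this shape, rollback means setting `canary_percent` to 0 or pointing `stable_version` at the previous release, and both paths can be rehearsed against the time targets in the checklist.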
Human Oversight
Practitioner guide: Human Oversight implementation guide · Certification: CAIG
- Every agent action classified by reversibility and business consequence
- Irreversible or high-stakes actions require explicit human approval
- Escalation path defined for uncertainty, tool failure, and policy edge cases
- Scheduled human review of agent outputs, not only complaint-driven review
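The classification and approval controls above might be wired together as follows. This is an illustrative sketch: the action names, the two-level `Consequence` scale, and the choice to escalate unclassified actions are assumptions, not requirements stated by the PSF.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


class Consequence(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass(frozen=True)
class ActionPolicy:
    reversibility: Reversibility
    consequence: Consequence


# Hypothetical registry: every action the agent can take is classified here.
POLICIES = {
    "draft_reply": ActionPolicy(Reversibility.REVERSIBLE, Consequence.LOW),
    "send_refund": ActionPolicy(Reversibility.IRREVERSIBLE, Consequence.HIGH),
    "update_address": ActionPolicy(Reversibility.REVERSIBLE, Consequence.HIGH),
}


def requires_approval(action: str) -> bool:
    """Irreversible, high-stakes, or unclassified actions need a human."""
    policy = POLICIES.get(action)
    if policy is None:
        return True  # fail closed on anything not yet classified
    return (
        policy.reversibility is Reversibility.IRREVERSIBLE
        or policy.consequence is Consequence.HIGH
    )


def execute(action: str, run: Callable[[], None], approved_by: Optional[str] = None) -> str:
    """Run the action only if it is low-risk or a named human approved it."""
    if requires_approval(action) and not approved_by:
        return "escalated"
    run()
    return "executed"
```

Recording `approved_by` alongside the trace gives the scheduled review a sample of approved actions to audit.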
Security
Practitioner guide: Security implementation guide · Certification: CAIS
- Tool permissions scoped to least privilege per task or tenant
- Secrets injected at runtime; never embedded in prompts, manifests, or logs
- Multi-tenant isolation verified: one user context cannot leak to another
- Adversarial red-team results documented with remediation owners
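Least-privilege tool scoping and tenant isolation can both be enforced at the point where the agent dispatches a tool call. The sketch below is one possible shape, with hypothetical names (`ToolGrant`, `authorise_tool_call`, `ToolDenied`); it assumes the grant is issued per task by your own backend, never derived from model output.

```python
from dataclasses import dataclass


class ToolDenied(PermissionError):
    """Raised when a tool call falls outside the current grant."""


@dataclass(frozen=True)
class ToolGrant:
    """Tools a single task may call, bound to one tenant."""

    tenant_id: str
    allowed_tools: frozenset


def authorise_tool_call(grant: ToolGrant, tenant_id: str, tool_name: str) -> None:
    """Raise ToolDenied unless the call matches both the tenant and the tool scope."""
    if grant.tenant_id != tenant_id:
        raise ToolDenied(f"grant for {grant.tenant_id!r} used on tenant {tenant_id!r}")
    if tool_name not in grant.allowed_tools:
        raise ToolDenied(f"tool {tool_name!r} is not in scope for this task")
```

Denials raised here are useful red-team evidence: a prompt injection that reaches the dispatcher shows up as a `ToolDenied` event rather than as a completed action.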
Vendor Resilience
Practitioner guide: Vendor Resilience implementation guide · Certification: CVAE
- Model provider outages trigger graceful degradation, not hard failure
- Retry logic uses exponential backoff with circuit-breaker limits
- Model version pinned; upgrades are deliberate, not automatic
- Fallback provider or reduced-capability mode documented and tested
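The retry control above combines two mechanisms, sketched here with the standard library. The thresholds, the full-jitter delay, and the names `CircuitBreaker`, `CircuitOpen`, and `call_with_backoff` are assumptions for illustration. Retrying every `Exception` is a simplification; production code would retry only transient errors such as timeouts and rate limits.

```python
import random
import time


class CircuitOpen(RuntimeError):
    """Raised when calls are refused because the provider looks unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def before_call(self):
        if self.opened_at is None:
            return
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Half-open: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return
        raise CircuitOpen("provider circuit is open")

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_backoff(fn, breaker, max_attempts=4, base_delay_s=0.5,
                      max_delay_s=8.0, sleep=time.sleep):
    """Call fn with exponential backoff, stopping early if the circuit opens."""
    last_error = None
    for attempt in range(max_attempts):
        breaker.before_call()  # raises CircuitOpen instead of hammering the provider
        try:
            result = fn()
        except Exception as exc:  # simplification: treat every error as transient
            last_error = exc
            breaker.record_failure()
            if attempt == max_attempts - 1:
                break
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids synchronised retries
        else:
            breaker.record_success()
            return result
    raise last_error
```

A caller that catches `CircuitOpen` is the natural place to switch to the fallback provider or the reduced-capability mode, which turns an outage into graceful degradation instead of a hard failure.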
Ship, harden, or rebuild
If production sign-off must produce evidence for a regulated environment, map completed controls to PSF compliance and practitioner certifications (CAOP for operations, CAIS for security, CAIG for governance). MSPs deploying agents for clients should align sign-off to the MSP AI certification guide.
Sources and further reading
- NIST Center for AI Standards and Innovation, Challenges to the Monitoring of Deployed AI Systems (NIST.AI.800-4, 2025)
- NIST AI RMF Generative AI Profile (NIST.AI.600-1): go/no-go deployment thresholds and ongoing capability review
- OpenAI, Agents SDK Deployment Manager and tracing documentation (openai-agents-python, 2026)
- OpenAI API, Sandbox Agents guide: runtime secret injection and scoped workspace mounts (2026)
- Production AI Institute, Seven Failure Modes of Production AI