Production AI Institute · PSF v1.1 open standard

AI Right-To-Know AI Data Use Index Check My AI Tools Policy Change Watch Agent Readiness Public Benchmark Contact

CPAP · Practitioner

Assessment Guide: Certified Production AI Practitioner

This guide covers all 8 PSF domains assessed in the CPAP portfolio review at practitioner depth. Each domain includes key concepts, a worked scenario, and the reasoning approach assessors expect. CPAP evidence is more complex than AIDA — it tests your ability to navigate real production trade-offs, not just recall correct principles.

View CPAP pathway →Build evidence plan →Assessment rubric →

Assessment at a glance

Submission

Production deployment portfolio

Review

PAI assessor review

Turnaround

5 business days

Prerequisite

AIDA recommended

Fee

$297

Credential

Digital certificate + verifiable registry listing

How CPAP assessment differs from AIDA

AIDA questions ask: what is the right principle to apply here? CPAP assessment asks: you're in a real production situation with competing pressures — what did you build, why was that the correct decision, and what evidence proves it works?

CPAP reviewers look for structured reasoning, correct prioritisation, and concrete evidence: architecture choices, validation results, oversight design, incident readiness, observability, and measurable outcomes.

PSF-1

Input Governance & Prompt Security

15–20% of assessment focus

Key concepts

Prompt injection taxonomy: direct, indirect, payload injection
Input validation layers: schema, semantic, length, rate
Adversarial input detection without false positive overload
Content filtering policy design for multi-tenant systems
System prompt confidentiality and leak prevention
Input logging strategy: privacy vs. security trade-off

Worked scenario

A B2B SaaS product embeds customer-uploaded PDFs into LLM context for question answering. During red-teaming, you discover that a specially crafted PDF can override system instructions. The product team wants to strip all text over 500 tokens from uploaded documents.

Expert reasoning

Token-length stripping is a blunt mitigation that doesn't address the root cause and will break legitimate use cases. The correct approach is: (1) treat all document content as untrusted user input — never inject it into privileged instruction context; (2) use a retrieval architecture where document content is always in user/context position, never system position; (3) add semantic anomaly detection for instruction-like patterns in uploaded content. Stripping long content is a secondary defence, not a primary one.

PSF-2

Output Validation & Reliability

15–20% of assessment focus

Key concepts

Structured output enforcement: schemas, constrained decoding, retry loops
Hallucination detection: grounding checks, citation verification, factuality scores
Confidence thresholds and abstention strategies
Output length and format validation before downstream consumption
Multi-model validation for high-consequence decisions
Graceful degradation when validation fails

Worked scenario

Your AI system generates contract clause summaries that legal teams review. An audit finds that 3% of summaries contain subtle inaccuracies that passed human review because reviewers trusted the AI. Leadership wants to set a higher model temperature to get 'more natural' summaries.

Expert reasoning

This is a reliability problem, not a style problem — increasing temperature makes it worse. The audit finding suggests over-trust in AI output, not a quality gap in the writing style. Correct remediation: (1) add a grounding check that verifies each summary claim against the source clause text; (2) implement a confidence-based flagging system that highlights low-certainty summaries for closer human review; (3) brief the legal team on the specific failure mode found so they know what to look for. Temperature is a content quality lever, not a reliability lever.

PSF-3

Data Protection & Privacy

10–15% of assessment focus

Key concepts

PII detection and redaction in AI pipelines
Training data lineage and consent documentation
Data minimisation: what AI systems should not retain
Cross-border data transfer constraints (GDPR, SCCs, adequacy decisions)
Right to erasure: implications for models trained on personal data
Data retention policy for AI inference logs

Worked scenario

Your RAG system indexes internal HR documents including employee performance reviews. An employee requests deletion of their data under GDPR Article 17. Legal confirms the request is valid. What are the full scope of actions required?

Expert reasoning

GDPR erasure for RAG systems is substantially more complex than database deletion. Required actions: (1) delete the source document from the document store; (2) delete all associated chunks from the vector database — this requires knowing which vector embeddings map to that document; (3) invalidate any cached responses that may have included that employee's data; (4) if any fine-tuned models were trained on that data, assess whether the personal data is 'memorised' — if so, retraining or model deletion may be required; (5) document the erasure with a completion certificate. The vector store step is the one most teams miss — embeddings are derived data containing personal information and must be deleted.

PSF-4

Observability & Monitoring

10–15% of assessment focus

Key concepts

The four AI observability layers: request, model, outcome, system
Latency SLO definition for AI endpoints (p50/p95/p99)
Quality drift detection: embedding drift, output distribution shift
Alert design: what requires PagerDuty, what goes to a dashboard
Sampling strategy for high-volume AI logs
Feedback loop instrumentation: capturing ground truth for model quality

Worked scenario

Three months after deploying a document classification AI, accuracy has dropped from 94% to 87% with no model changes. Your monitoring setup tracks latency, error rate, and API costs. What was missing from your observability setup and how would you have caught this earlier?

Expert reasoning

The missing layer is output quality monitoring. Tracking infrastructure metrics (latency, errors, cost) tells you nothing about model quality degradation. The correct approach requires: (1) a ground truth feedback loop — a sample of classifications verified by humans each week, creating a quality time series; (2) input distribution monitoring — tracking embedding drift or feature statistics on incoming documents to detect when the document population has shifted from training data; (3) an output distribution monitor — tracking the proportion of each class over time to flag unexpected shifts. The 87% accuracy was invisible for 3 months because none of these were in place.

PSF-5

Deployment Safety & Incident Response

10% of assessment focus

Key concepts

AI incident taxonomy: quality failures, safety failures, capability failures, availability failures
Severity classification for AI-specific incidents
Containment options: rate limit, rollback, kill switch, human takeover
Post-mortem structure for AI incidents (distinct from software post-mortems)
Communication strategy: users, regulators, leadership
Preventing recurrence: feedback into model and governance, not just ops

Worked scenario

Your AI-powered loan pre-approval system has been running for 6 months. An internal audit discovers that it has been approving applications at a significantly higher rate for one demographic group, with no business justification. How do you respond?

Expert reasoning

This is a Severity 1 AI incident — it's a potential discriminatory outcome with regulatory and legal exposure. Immediate actions: (1) suspend automated pre-approvals within the hour — do not wait for root cause analysis; (2) notify legal and compliance leadership immediately; (3) pull the full 6-month decision log for forensic analysis — you need the demographic breakdown, approval rates, and the model features driving decisions; (4) do not delete or modify any data — this is potential evidence. Investigation: determine whether the disparity stems from training data, feature selection, or a proxy variable (postcode, loan size) correlating with demographic. Resolution: decisions made under the biased model may need to be reviewed and potentially reversed. Regulatory notification may be required under applicable law.

PSF-6

Human Oversight & Autonomy Design

10–15% of assessment focus

Key concepts

Autonomy level selection: classification framework (L0–L4)
Review trigger design: consequence-based, confidence-based, novelty-based
Override mechanism implementation and audit logging
HITL throughput planning: don't create review bottlenecks that bypass themselves
Staged automation: how to move from L1 to L3 safely
Disclosing AI nature to affected users (when and how)

Worked scenario

Your customer service AI handles 10,000 queries per day and escalates 8% to human agents. The head of operations proposes reducing escalations to 2% by raising the AI's confidence threshold. The head of compliance says this will reduce human oversight. Who is right?

Expert reasoning

Both are partially right, but the compliance concern is more important to address properly. Raising the confidence threshold doesn't reduce oversight — it changes the selection of what gets reviewed. If the 2% that gets escalated is the genuinely ambiguous 2%, that's arguably better oversight than reviewing a random 8%. The key questions are: (1) is confidence score actually a good proxy for cases that need human review, or are there systematic failure modes the score doesn't capture? (2) what happens to the escalation cases that are now auto-resolved — does the system have a feedback loop to catch errors? Oversight design should be consequence-driven and outcome-tested, not percentage-driven.

PSF-7

Security & Access Control

5–10% of assessment focus

Key concepts

Least-privilege tool and data access for AI agents
Tenant isolation in retrieval and context assembly
Secret management for prompts, keys, and integration credentials
Prompt injection testing as an AI-specific security control
Audit trails for tool use and privileged actions
Secure handling of model, vector, and orchestration dependencies

Worked scenario

Your multi-tenant AI support product retrieves account notes from a shared vector database. A red-team test shows that one tenant can craft prompts that sometimes surface another tenant's billing context. What is the correct security response?

Expert reasoning

Treat this as a tenant-isolation security incident, not just a prompt-quality problem. Immediate actions: (1) disable or restrict the affected retrieval path until isolation is proven; (2) enforce tenant ID filtering server-side in every retrieval query and context assembly step; (3) add tests that attempt cross-tenant retrieval through direct prompts, indirect prompt injection, and malformed metadata; (4) audit logs for prior exposure and determine notification obligations; (5) review least-privilege access for the vector store and any debugging tools. Prompt instructions alone cannot enforce access control.

PSF-8

Vendor Resilience & Model Risk

10% of assessment focus

Key concepts

Third-party model dependency mapping
Capability change risk: what happens when the model provider updates silently
Version pinning strategy for LLM APIs
Vendor exit planning: what would you need to migrate to an alternative?
Supply chain integrity: fine-tuning data, adapters, embeddings
SLA and uptime contractual requirements for production AI dependencies

Worked scenario

Your production system uses GPT-4 via the OpenAI API. OpenAI announces that GPT-4 will be deprecated in 90 days and recommends migrating to GPT-4o. Your system has 18 months of prompt engineering optimised specifically for GPT-4's output patterns. What is the correct production approach to this migration?

Expert reasoning

This is a vendor-driven forced migration — the 90-day timeline is aggressive for a production system with significant prompt investment. Correct approach: (1) do not migrate directly to production — run GPT-4o in parallel with GPT-4 across a sample of production traffic immediately; (2) build a regression test suite from your existing prompt evaluations — every known edge case and critical output pattern must be tested against GPT-4o; (3) identify which prompts are most sensitive to model behaviour changes (structured output prompts, few-shot examples, chain-of-thought chains) and prioritise those for re-optimisation; (4) plan for a phased rollout — not a cutover — with fallback capability. The 90-day window should be used for testing, not building. The system should be ready to traffic-shift at day 60 with 30 days of runway for issues.

Ready to submit your portfolio?

Portfolio submission · Assessor review · 5 business day turnaround · $297

Build evidence plan →Begin your portfolio →