This guide covers all 8 PSF domains assessed in the CPAP portfolio review at practitioner depth. Each domain includes key concepts, a worked scenario, and the reasoning approach assessors expect. CPAP evidence is more complex than AIDA — it tests your ability to navigate real production trade-offs, not just recall correct principles.
AIDA questions ask: what is the right principle to apply here? CPAP assessment asks: you're in a real production situation with competing pressures — what did you build, why was that the correct decision, and what evidence proves it works?
CPAP reviewers look for structured reasoning, correct prioritisation, and concrete evidence: architecture choices, validation results, oversight design, incident readiness, observability, and measurable outcomes.
A B2B SaaS product embeds customer-uploaded PDFs into LLM context for question answering. During red-teaming, you discover that a specially crafted PDF can override system instructions. The product team wants to strip all text over 500 tokens from uploaded documents.
Token-length stripping is a blunt mitigation that doesn't address the root cause and will break legitimate use cases. The correct approach is: (1) treat all document content as untrusted user input — never inject it into privileged instruction context; (2) use a retrieval architecture where document content is always in user/context position, never system position; (3) add semantic anomaly detection for instruction-like patterns in uploaded content. Stripping long content is a secondary defence, not a primary one.
Your AI system generates contract clause summaries that legal teams review. An audit finds that 3% of summaries contain subtle inaccuracies that passed human review because reviewers trusted the AI. Leadership wants to set a higher model temperature to get 'more natural' summaries.
This is a reliability problem, not a style problem — increasing temperature makes it worse. The audit finding suggests over-trust in AI output, not a quality gap in the writing style. Correct remediation: (1) add a grounding check that verifies each summary claim against the source clause text; (2) implement a confidence-based flagging system that highlights low-certainty summaries for closer human review; (3) brief the legal team on the specific failure mode found so they know what to look for. Temperature is a content quality lever, not a reliability lever.
Your RAG system indexes internal HR documents including employee performance reviews. An employee requests deletion of their data under GDPR Article 17. Legal confirms the request is valid. What are the full scope of actions required?
GDPR erasure for RAG systems is substantially more complex than database deletion. Required actions: (1) delete the source document from the document store; (2) delete all associated chunks from the vector database — this requires knowing which vector embeddings map to that document; (3) invalidate any cached responses that may have included that employee's data; (4) if any fine-tuned models were trained on that data, assess whether the personal data is 'memorised' — if so, retraining or model deletion may be required; (5) document the erasure with a completion certificate. The vector store step is the one most teams miss — embeddings are derived data containing personal information and must be deleted.
Three months after deploying a document classification AI, accuracy has dropped from 94% to 87% with no model changes. Your monitoring setup tracks latency, error rate, and API costs. What was missing from your observability setup and how would you have caught this earlier?
The missing layer is output quality monitoring. Tracking infrastructure metrics (latency, errors, cost) tells you nothing about model quality degradation. The correct approach requires: (1) a ground truth feedback loop — a sample of classifications verified by humans each week, creating a quality time series; (2) input distribution monitoring — tracking embedding drift or feature statistics on incoming documents to detect when the document population has shifted from training data; (3) an output distribution monitor — tracking the proportion of each class over time to flag unexpected shifts. The 87% accuracy was invisible for 3 months because none of these were in place.
Your AI-powered loan pre-approval system has been running for 6 months. An internal audit discovers that it has been approving applications at a significantly higher rate for one demographic group, with no business justification. How do you respond?
This is a Severity 1 AI incident — it's a potential discriminatory outcome with regulatory and legal exposure. Immediate actions: (1) suspend automated pre-approvals within the hour — do not wait for root cause analysis; (2) notify legal and compliance leadership immediately; (3) pull the full 6-month decision log for forensic analysis — you need the demographic breakdown, approval rates, and the model features driving decisions; (4) do not delete or modify any data — this is potential evidence. Investigation: determine whether the disparity stems from training data, feature selection, or a proxy variable (postcode, loan size) correlating with demographic. Resolution: decisions made under the biased model may need to be reviewed and potentially reversed. Regulatory notification may be required under applicable law.
Your customer service AI handles 10,000 queries per day and escalates 8% to human agents. The head of operations proposes reducing escalations to 2% by raising the AI's confidence threshold. The head of compliance says this will reduce human oversight. Who is right?
Both are partially right, but the compliance concern is more important to address properly. Raising the confidence threshold doesn't reduce oversight — it changes the selection of what gets reviewed. If the 2% that gets escalated is the genuinely ambiguous 2%, that's arguably better oversight than reviewing a random 8%. The key questions are: (1) is confidence score actually a good proxy for cases that need human review, or are there systematic failure modes the score doesn't capture? (2) what happens to the escalation cases that are now auto-resolved — does the system have a feedback loop to catch errors? Oversight design should be consequence-driven and outcome-tested, not percentage-driven.
Your multi-tenant AI support product retrieves account notes from a shared vector database. A red-team test shows that one tenant can craft prompts that sometimes surface another tenant's billing context. What is the correct security response?
Treat this as a tenant-isolation security incident, not just a prompt-quality problem. Immediate actions: (1) disable or restrict the affected retrieval path until isolation is proven; (2) enforce tenant ID filtering server-side in every retrieval query and context assembly step; (3) add tests that attempt cross-tenant retrieval through direct prompts, indirect prompt injection, and malformed metadata; (4) audit logs for prior exposure and determine notification obligations; (5) review least-privilege access for the vector store and any debugging tools. Prompt instructions alone cannot enforce access control.
Your production system uses GPT-4 via the OpenAI API. OpenAI announces that GPT-4 will be deprecated in 90 days and recommends migrating to GPT-4o. Your system has 18 months of prompt engineering optimised specifically for GPT-4's output patterns. What is the correct production approach to this migration?
This is a vendor-driven forced migration — the 90-day timeline is aggressive for a production system with significant prompt investment. Correct approach: (1) do not migrate directly to production — run GPT-4o in parallel with GPT-4 across a sample of production traffic immediately; (2) build a regression test suite from your existing prompt evaluations — every known edge case and critical output pattern must be tested against GPT-4o; (3) identify which prompts are most sensitive to model behaviour changes (structured output prompts, few-shot examples, chain-of-thought chains) and prioritise those for re-optimisation; (4) plan for a phased rollout — not a cutover — with fallback capability. The 90-day window should be used for testing, not building. The system should be ready to traffic-shift at day 60 with 30 days of runway for issues.
Portfolio submission · Assessor review · 5 business day turnaround · $297