Production AI Institute · PSF v1.1 open standard
AIDA Study Guide · PSF v1.1

Production AI Failure Cases

Eight real-world scenarios, one per PSF domain. Each case walks through what went wrong, why it went wrong, and exactly how to fix it. Read these instead of flashcards.

Go deeper on every domain

Each scenario below is backed by a complete PSF deep dive guide — implementation patterns, framework comparisons, and production case studies.

Ready to test yourself?

Practice exam with instant feedback, then take the real AIDA when you're ready.

CAIA Exam Prep — Audit Track

PAI-8 Control Audit Case Studies

Eight real-world audit vignettes — one per PAI-8 control. Each case presents an organisational scenario, the auditor's analysis, evidence requirements, and findings classification. These cases map directly to the 30-question CAIA exam.

C1
AI Governance
The CISO who owns AI risk (and the CTO who disagrees)
Scenario

An energy company has a published AI ethics policy, an AI Ethics Officer role on the org chart, and a quarterly AI risk meeting. During your interview, the CISO says they own all AI risk. The CTO says AI governance is a technical matter owned by the AI team. The Legal team says they signed off the policy but have not attended any risk meeting. You find no decision gate records showing AI governance applied to any deployment decision in the past 12 months.

Audit findings
  • Accountability is contested — no RACI documented for AI governance roles
  • The governance committee exists on paper but cannot evidence decisions made
  • No deployment was routed through a governance gate in 12 months
  • Policy exists but is not operationally embedded
Maturity classification

C1 maturity: L1 (Basic). Policy exists and a role is assigned, but governance is not applied to real decisions. L2 requires documented decision gate records.

Evidence required
  • RACI matrix for AI governance roles
  • AI risk committee meeting minutes with AI-specific agenda items
  • Deployment approval records showing governance gate was applied
  • Training records for staff on AI ethics policy
Auditor lesson

A governance framework that does not produce evidence of decisions being made is an L1 finding regardless of how many committees exist. Look for records, not org charts.
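
One way to make the RACI evidence item testable is to check that every governance activity has exactly one accountable role. The sketch below is illustrative only: the activity names, the role assignments, and the `find_raci_gaps` helper are assumptions made for this example and are not defined by PAI-8.

```python
from collections import Counter

# Hypothetical RACI matrix: activity -> {role: "R" | "A" | "C" | "I"}.
# The contested ownership from the scenario appears as two "A" entries.
raci = {
    "Approve AI deployment": {"CISO": "A", "CTO": "A", "Legal": "C"},
    "Maintain AI risk register": {"CISO": "A", "AI team": "R"},
    "Review AI ethics policy": {"Legal": "C", "AI Ethics Officer": "R"},
}


def find_raci_gaps(matrix):
    """Return activities that do not have exactly one Accountable role."""
    gaps = {}
    for activity, assignments in matrix.items():
        # Counter returns 0 for a missing key, so "no owner" is caught too.
        accountable = Counter(assignments.values())["A"]
        if accountable != 1:
            gaps[activity] = accountable
    return gaps


gaps = find_raci_gaps(raci)
for activity, count in gaps.items():
    print(f"{activity}: {count} accountable roles (expected 1)")
```

A check like this only shows that ownership is documented unambiguously. It does not replace the decision gate records that the finding turns on.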

C2
Risk Assessment
Annual assessment, quarterly model updates
Scenario

A healthcare provider completes a formal AI risk assessment each January covering all deployed AI systems. In March they upgrade their patient triage model to a new foundation model. In July they expand its use from emergency triage to routine appointment prioritisation. In October you audit them. They present the January risk assessment as evidence of C2 compliance.

Audit findings
  • No trigger-based reassessment was conducted after the March model upgrade
  • Scope expansion in July (new use case) was not treated as a reassessment trigger
  • The July use case expansion is materially higher risk than the original deployment
  • An annual-only cadence is L1; L2 requires documented trigger-based reassessment
Maturity classification

C2 maturity: L1. Annual assessment exists but trigger-based reassessment is absent. Model upgrade and use case expansion are both explicit PAI-8 C2 triggers.

Evidence required
  • Risk assessment dated after each material model change
  • Trigger log: list of events that prompted reassessments and the outcome
  • Risk register showing each deployment's current risk tier
  • Sign-off records for use case expansions
Auditor lesson

Annual cadence is the minimum baseline. Trigger-based reassessment is the differentiator between L1 and L2. Always check whether model changes or use case expansions occurred after the last assessment.
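
The check described in this lesson can be sketched as a comparison between a change log and the dates of completed risk assessments. The year, the exact dates, and the names `ChangeEvent` and `unassessed_triggers` are assumptions for illustration; the scenario gives only the months.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChangeEvent:
    when: date
    kind: str  # e.g. "model_upgrade" or "use_case_expansion"
    description: str


def unassessed_triggers(events, assessment_dates):
    """Return change events with no risk assessment dated on or after them."""
    return [
        event for event in events
        if not any(assessed >= event.when for assessed in assessment_dates)
    ]


# Hypothetical dates matching the scenario's months.
events = [
    ChangeEvent(date(2024, 3, 1), "model_upgrade",
                "Triage model moved to a new foundation model"),
    ChangeEvent(date(2024, 7, 1), "use_case_expansion",
                "Extended to routine appointment prioritisation"),
]
assessments = [date(2024, 1, 15)]

missing = unassessed_triggers(events, assessments)
for event in missing:
    print(f"{event.when}: {event.kind} has no later risk assessment")
```

With only the January assessment on file, both events are returned, which mirrors the first two audit findings.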

C3
Data Stewardship
Fine-tuned on customer emails nobody read
Scenario

A SaaS company fine-tuned a support model on 3 years of customer support emails. The emails contained PII, occasional health disclosures, and financial information. The data team confirms no consent was obtained for model training use. The model is now in production. When asked for a data lineage document, the data team provides a pipeline diagram showing how data flows into training — but no record of what data was included, excluded, or reviewed.

Audit findings
  • No consent or lawful basis documented for using customer data in model training
  • No data lineage record identifying what was included in training set
  • PII present in training data with no evidence of masking or exclusion process
  • No data retention or deletion policy covering training artefacts
Maturity classification

C3 maturity: L1. Training data exists and was used, but provenance, consent, and documentation are absent. L2 requires documented provenance and consent/lawful basis.

Evidence required
  • Data provenance document for each training dataset
  • Consent mechanism or lawful basis assessment for training data use
  • PII handling procedure for model training pipelines
  • Training data retention and deletion policy
Auditor lesson

A pipeline diagram is not a data stewardship record. Look for consent documentation, exclusion records, and evidence that the organisation actually reviewed what went into training.
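
As a rough illustration of what a provenance record adds beyond a pipeline diagram, the sketch below models one record per training dataset and lists the fields left empty. The field names and the `missing_fields` helper are hypothetical; PAI-8 does not prescribe this schema.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    dataset: str
    source: str
    lawful_basis: str = ""      # consent mechanism or lawful basis assessment reference
    pii_handling: str = ""      # masking or exclusion procedure that was applied
    reviewed_by: str = ""       # who reviewed what went into the training set
    retention_policy: str = ""  # retention and deletion rule for training artefacts
    excluded: list = field(default_factory=list)  # categories removed before training


REQUIRED = ("lawful_basis", "pii_handling", "reviewed_by", "retention_policy")


def missing_fields(record):
    """Return the required provenance fields that are still empty."""
    return [name for name in REQUIRED if not getattr(record, name)]


# The scenario's dataset: the data flow is known, nothing else is documented.
record = ProvenanceRecord(dataset="support-emails", source="support inbox export")
print(missing_fields(record))
```

For the scenario's dataset every required field comes back empty, which corresponds to the four audit findings above.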

C4
Model Validation
The model that passed every test except the real one
Scenario

A financial services firm deploys a credit scoring model. They present benchmark results from three academic datasets and an internal A/B test showing 94% accuracy. In production, the model disproportionately denies credit to applicants from two postcodes, and residence in those postcodes correlates strongly with ethnicity. No bias testing was conducted pre-deployment. The model was approved by the AI team without an independent review.

Audit findings
  • Pre-deployment evaluation used academic benchmarks not representative of production population
  • No bias testing or fairness evaluation conducted on protected characteristics
  • No independent review — model was approved by the team that built it
  • No performance monitoring in production to detect post-deployment drift
Maturity classification

C4 maturity: L1. Pre-deployment testing occurred but lacked bias evaluation and independent review. Deployment without fairness testing in a high-risk use case is a C4 critical finding.

Evidence required
  • Pre-deployment evaluation report covering bias and fairness metrics
  • Independent review sign-off (not from the development team)
  • Production monitoring dashboard with fairness metrics tracked over time
  • Documented approval gate criteria for high-risk AI deployments
Auditor lesson

Accuracy metrics do not substitute for fairness evaluation. For high-risk use cases, look for evidence of protected characteristic testing and independent review.
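
To show why accuracy alone can hide the problem, the sketch below computes per-group approval rates and the ratio of the lowest rate to the highest (often called a disparate impact ratio). The sample data is invented, and the 0.8 threshold is the commonly cited "four-fifths" heuristic, used here as an assumption rather than a PAI-8 requirement.

```python
from collections import defaultdict


def approval_rates(decisions):
    """decisions: iterable of (group, approved) pairs -> {group: approval rate}."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, ok in decisions:
        totals[group] += 1
        approved[group] += int(ok)
    return {group: approved[group] / totals[group] for group in totals}


def disparate_impact_ratio(rates):
    """Lowest group approval rate divided by the highest."""
    return min(rates.values()) / max(rates.values())


# Invented sample: group A approved 80 of 100, group B approved 50 of 100.
decisions = (
    [("A", True)] * 80 + [("A", False)] * 20
    + [("B", True)] * 50 + [("B", False)] * 50
)

rates = approval_rates(decisions)
ratio = disparate_impact_ratio(rates)
flagged = ratio < 0.8  # assumed threshold, see lead-in
print(rates, round(ratio, 3), flagged)
```

A single ratio is a screening signal, not a full fairness evaluation. A real pre-deployment report would cover more metrics and would test proxies such as postcode as well as the protected characteristic itself.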

C5
Human Oversight
Override that nobody knows how to use
Scenario

A logistics company deploys a route optimisation agent that makes autonomous delivery reassignment decisions. The system has a documented human override procedure in the operations manual. During stakeholder interviews, three of four operations managers are unaware an override is possible. The fourth found it 'by accident'. No override has ever been logged. The system has been live for 18 months.

Audit findings
  • Override mechanism exists but operational staff are not trained on its use
  • No override event has been logged in 18 months, which more likely reflects an undiscovered mechanism than an absence of need
  • No escalation path documented for situations where the model's decisions appear anomalous
  • Autonomy limits are not defined — the system's decision scope has never been formally bounded
Maturity classification

C5 maturity: L1. Human oversight mechanism exists on paper but is not operational. L2 requires staff awareness, training records, and logged override events.

Evidence required
  • Training records showing operations staff completed human oversight training
  • Override usage logs (zero logged events with no documented rationale is itself a finding)
  • Escalation runbook for anomalous model behaviour
  • Documented autonomy scope defining what decisions the model can and cannot make autonomously
Auditor lesson

An override nobody knows about is not human oversight — it is documentation. The test is whether a human can actually intervene, not whether an override exists in a manual.
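
The two operational tests in this case (are staff trained, and is the override log credibly empty) can be expressed as a small check. The manager identifiers and the `oversight_findings` helper are hypothetical, and the empty training list is an assumption, since the scenario does not mention any training records.

```python
def oversight_findings(managers, trained, override_events, zero_event_rationale=None):
    """Return findings on training coverage and override log evidence."""
    findings = []
    untrained = sorted(set(managers) - set(trained))
    if untrained:
        findings.append(
            f"{len(untrained)} of {len(managers)} managers lack override training"
        )
    # An empty override log is only acceptable with a documented rationale.
    if not override_events and not zero_event_rationale:
        findings.append("no override events logged and no documented rationale")
    return findings


managers = ["manager_1", "manager_2", "manager_3", "manager_4"]
findings = oversight_findings(managers, trained=[], override_events=[])
for finding in findings:
    print(finding)
```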

C6
Incident Response
AI incident classified as a software bug
Scenario

A retailer's recommendation engine begins surfacing inappropriate product combinations to users — recommending items frequently associated with self-harm alongside unrelated products. The issue persisted for 6 hours before being noticed. It was logged in the IT incident management system as a 'software defect — recommendation algorithm' with a 3-day response SLA. No post-incident review was conducted. No regulators were notified.

Audit findings
  • AI incidents are not classified separately from software defects — no AI incident taxonomy exists
  • SLA applied (3 days) was appropriate for software defects, not for an AI harm incident
  • No post-incident review conducted — root cause not established
  • No assessment of regulatory notification obligation was made (GDPR, sector rules)
Maturity classification

C6 maturity: L0–L1. No AI-specific incident classification exists. The incident was handled under a generic IT process that is structurally inadequate for AI harm events.

Evidence required
  • AI incident classification taxonomy with severity tiers for AI-specific harms
  • Incident response runbook with AI-specific triggers and response steps
  • Post-incident review records for significant AI incidents
  • Regulatory notification assessment process for AI-related incidents
Auditor lesson

AI incidents do not map cleanly to software defect categories. Without an AI-specific incident taxonomy, high-severity AI harm events will be systematically mis-classified and under-responded to.
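
A minimal sketch of an AI incident taxonomy follows, assuming three severity tiers and response targets chosen purely for illustration. The tier names, the `classify` rules, and every SLA value except the scenario's 3-day software defect SLA are assumptions.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    AI_HARM_CRITICAL = "ai_harm_critical"
    AI_DEGRADED = "ai_degraded"
    SOFTWARE_DEFECT = "software_defect"


# Hypothetical response targets; only the 3-day figure comes from the scenario.
RESPONSE_SLA = {
    Severity.AI_HARM_CRITICAL: timedelta(hours=1),
    Severity.AI_DEGRADED: timedelta(hours=24),
    Severity.SOFTWARE_DEFECT: timedelta(days=3),
}


def classify(involves_ai_output, potential_user_harm):
    """Route an incident to a severity tier before any SLA is applied."""
    if involves_ai_output and potential_user_harm:
        return Severity.AI_HARM_CRITICAL
    if involves_ai_output:
        return Severity.AI_DEGRADED
    return Severity.SOFTWARE_DEFECT


# The recommendation engine incident: AI output with potential harm to users.
tier = classify(involves_ai_output=True, potential_user_harm=True)
print(tier.value, RESPONSE_SLA[tier])
```

The point of the separate tier is that the response target is decided by the nature of the harm, not by the system component that produced it.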

C7
Audit Trail
The decision that cannot be reconstructed
Scenario

A benefits agency uses an AI to assess eligibility for housing assistance. An applicant challenges a denial. The agency's legal team requests the reasoning behind the specific decision. The AI team can produce the model version number but cannot reconstruct: which data was used for this specific applicant, what the model's intermediate reasoning was, or why this decision differed from similar applicants. Logs are retained for 30 days; the decision was made 45 days ago.

Audit findings
  • Decision-level logging does not capture input data used for each specific decision
  • Model reasoning / intermediate steps are not logged
  • Log retention policy (30 days) is shorter than the period in which decisions can be challenged
  • No explainability artefacts are produced for high-risk decisions
Maturity classification

C7 maturity: L1. Some logging exists (model version) but decision-level audit trail is inadequate for the use case. In regulated contexts, this is a critical finding.

Evidence required
  • Decision log schema showing what is recorded per decision (inputs, model version, output, confidence, timestamp)
  • Log retention policy aligned to the legal challenge window for the use case
  • Explainability output for high-risk decisions (LIME, SHAP, or structured reasoning trace)
  • Immutable audit trail — evidence that logs cannot be altered after the fact
Auditor lesson

A model version number is not an audit trail. Auditable AI requires decision-level logging: who was assessed, with what data, by which model, producing what output. Retention must outlast the challenge window.
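
The sketch below illustrates two of the evidence items: a per-decision log record and a simple hash chain that makes later alteration detectable. The record fields follow the schema listed above, but the values, the helper names, and the 45-day challenge window are assumptions. A hash chain detects tampering; it does not prevent it, so it is one possible mechanism rather than a complete immutability control.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry in the chain


def append_entry(log, record):
    """Append a decision record, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(record, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"record": record, "prev_hash": prev_hash, "hash": entry_hash})


def verify_chain(log):
    """Return True if every entry still matches its recorded hash and link."""
    prev_hash = GENESIS
    for entry in log:
        payload = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True


def retention_covers(retention_days, challenge_window_days):
    """Logs must be kept at least as long as decisions can be challenged."""
    return retention_days >= challenge_window_days


log = []
append_entry(log, {
    "applicant_id": "A-001",  # hypothetical identifier
    "inputs": {"household_size": 3},
    "model_version": "v1",
    "output": "denied",
    "confidence": 0.71,
    "timestamp": "2024-01-10T09:30:00Z",
})
append_entry(log, {
    "applicant_id": "A-002",
    "inputs": {"household_size": 1},
    "model_version": "v1",
    "output": "approved",
    "confidence": 0.88,
    "timestamp": "2024-01-10T09:45:00Z",
})

intact_before = verify_chain(log)
log[0]["record"]["output"] = "approved"  # simulate an after-the-fact edit
intact_after = verify_chain(log)
print(intact_before, intact_after, retention_covers(30, 45))
```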

C8
Vendor & Supply Chain
The API that disappeared overnight
Scenario

A legal tech company's contract analysis product is built on a third-party model API. The API provider announces a 30-day deprecation of the model version in use. The legal tech company has no contractual minimum notice period, no alternative model evaluated, and no documented continuity plan. Their vendor contract requires 99.9% uptime but has no specific provision for model version deprecation. Their largest customer's contract requires 90-day service continuity notice.

Audit findings
  • No contractual minimum notice period for model deprecation negotiated with vendor
  • No alternative model or fallback capability identified or tested
  • Vendor SLA covers infrastructure uptime but not model version continuity
  • Customer-facing SLA commitment (90-day notice) cannot be met with 30-day vendor notice
Maturity classification

C8 maturity: L1. A vendor exists and is being used, but vendor risk is unmanaged. No continuity planning, no contractual protections, no inventory of dependencies.

Evidence required
  • Third-party AI inventory listing all external model dependencies with version, vendor, and criticality
  • Vendor contracts including AI-specific SLA terms (deprecation notice, version support period)
  • Alternative model or fallback evaluation documented
  • Continuity plan for each critical AI dependency with tested recovery procedures
Auditor lesson

API availability SLAs and model version continuity SLAs are different things. An AI vendor can meet its uptime SLA in full while simultaneously withdrawing the model you depend on. C8 requires specific contractual protections for model continuity.
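
The notice mismatch in this case is simple arithmetic, and it can be checked for every entry in a third-party AI inventory. The sketch below assumes a hypothetical `Dependency` record and placeholder vendor and model names; the 30-day and 90-day figures come from the scenario.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    name: str
    vendor: str
    model_version: str
    critical: bool
    vendor_notice_days: int    # deprecation notice actually available from the vendor
    customer_notice_days: int  # longest continuity notice promised to customers
    fallback_tested: bool


def continuity_gaps(dep):
    """Return continuity gaps for one external model dependency."""
    gaps = []
    shortfall = dep.customer_notice_days - dep.vendor_notice_days
    if shortfall > 0:
        gaps.append(f"notice shortfall of {shortfall} days")
    if dep.critical and not dep.fallback_tested:
        gaps.append("no tested fallback for critical dependency")
    return gaps


contract_analysis = Dependency(
    name="contract-analysis",
    vendor="example-vendor",     # placeholder
    model_version="model-v1",    # placeholder
    critical=True,
    vendor_notice_days=30,
    customer_notice_days=90,
    fallback_tested=False,
)
gaps = continuity_gaps(contract_analysis)
print(gaps)
```

Running the same check across the whole inventory gives an auditor a quick list of dependencies where customer commitments exceed what the vendor contract can support.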

Ready for the CAIA exam?

30 scenario-based questions covering all 8 PAI-8 controls. Pass threshold: 22/30. Exam fee: $97. Credential valid on the PAI registry upon passing.