CAIA · Specialist

Study Guide: Certified AI Auditor

This guide covers all domains tested in the CAIA examination — assessed against the PAI-8 AI Safety Standard. Each domain includes key concepts, worked audit scenarios, and the evidence-gathering approach required at each maturity level.

Take the exam — $97 →

Exam at a glance

Questions: 30, drawn from a 40-question bank
Pass mark: 22 correct (73%)
Standard: PAI-8 AI Safety Standard
Domains: C1–C8 + Audit Methodology
Fee: $97
Credential: Digital certificate + registry listing

Understanding the PAI-8 Maturity Model

Every PAI-8 control is assessed against a four-level maturity scale. Knowing which level requires which evidence is central to the CAIA exam.

L0 — Absent: No control in place; the policy or process does not exist.
L1 — Basic: Control exists on paper and is communicated, with an accountable owner.
L2 — Managed: Control is operational, applied to real decisions on a regular cadence.
L3 — Optimised: Control is continuously improved, with evidence of learning and measurable effectiveness.
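
Because exam questions hinge on which artefacts support which level, it can help to treat the scale as data. A minimal sketch in Python (the level names are PAI-8's; the evidence keywords and the function are our own illustration, not normative text):

```python
# Hypothetical encoding of the PAI-8 maturity scale. The evidence keywords
# are illustrative shorthand, not normative PAI-8 text.
LEVEL_EVIDENCE = {
    "L1": {"policy document", "staff communication", "named accountable owner"},
    "L2": {"operational application", "regular cadence", "decision records"},
    "L3": {"improvement actions", "effectiveness measurements"},
}

def highest_supported_level(artefacts_on_file: set) -> str:
    """Levels are cumulative: L2 presupposes all L1 evidence, and so on."""
    supported, required = "L0", set()
    for level in ("L1", "L2", "L3"):
        required |= LEVEL_EVIDENCE[level]
        if not required <= artefacts_on_file:
            break
        supported = level
    return supported

# A policy document alone supports only L0 (exactly Scenario C1.1 below):
print(highest_supported_level({"policy document"}))  # -> "L0"
```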

PAI-8 Control Domains

C1 — AI Governance
C2 — Risk Assessment
C3 — Data Stewardship
C4 — Model Validation
C5 — Human Oversight
C6 — Incident Response
C7 — Audit Trail
C8 — Vendor & Supply Chain
+ Audit Methodology

C1: AI Governance

Key Concepts

  • Published AI ethics policy with active governance
  • Single accountable owner with defined RACI
  • Governance committee with documented meeting cadence
  • Decision gate records showing governance applied to real deployments
  • AI use case registry with risk classifications
  • Contested accountability = C1 gap finding
  • L1 requires: policy + communication + accountable owner
  • L2 requires: committee + cadence + real decision integration
WORKED SCENARIO C1.1

Organisation has a published AI ethics policy — what maturity level?

An organisation has a published AI ethics policy but no evidence it has been communicated to staff or embedded in any decision-making process. What PAI-8 maturity level does this represent?

Expert Analysis
  • L0 — Absent. A policy that exists only on paper with no operational embedding is not governance — it is documentation. L0 means no active control.
  • L1 requires three things: the policy exists AND is communicated to staff AND has an accountable owner. One without the others is L0.
  • The common mistake is treating publication as governance. Publication is a necessary but not sufficient condition for L1.
  • Evidence for L1: policy document + staff awareness training records + named AI governance owner in org structure.
Key Lesson: A policy with no operational embedding is L0. The maturity model rewards active governance, not documentation compliance. Always ask: is this being applied?
WORKED SCENARIO C1.2

CISO and CTO give conflicting answers on AI governance ownership

During a C1 audit interview, the CISO says they own all AI risk. The CTO says AI governance is an ethics matter owned by Legal. What is your finding?

Expert Analysis
  • This is a C1 gap finding. Unclear or contested accountability for AI risk indicates the governance framework is not operational — it is theoretical.
  • Effective governance requires a single named accountable owner with a defined RACI (Responsible, Accountable, Consulted, Informed). Multiple people claiming or deflecting responsibility is the same problem with different framing.
  • This finding should be documented as: "No single accountable owner for AI governance risk identified. C1 accountability control is not satisfied at any maturity level."
  • It is not "effective cross-functional governance" — distributed accountability without a single owner creates accountability gaps at decision points.
Key Lesson: Contested accountability is a red flag. Governance that cannot name a single owner with a clear RACI is governance that will not function under pressure.
📋 Exam Tips for C1
  • L2 evidence must include decision gate records showing governance was actually applied to real deployment decisions — committee existence alone is not L2.
  • Questions often describe a scenario that sounds like governance exists — look for evidence that it is operational, not just documented.
  • An AI use case registry with risk classifications is a characteristic L2 artefact — it shows governance is applied to individual systems, not just stated at policy level.
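
Since the use case registry is itself the characteristic L2 artefact, it is worth seeing what one entry might look like in practice. A minimal sketch with illustrative field names (PAI-8 does not prescribe a schema):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Illustrative registry entry; field names are our own, not a PAI-8 schema.
@dataclass
class AIUseCase:
    system_name: str
    accountable_owner: str                      # the single RACI "A"
    risk_classification: str                    # e.g. "low" / "limited" / "high"
    decision_gate_date: Optional[date] = None   # None = never through governance
    gate_record_ref: Optional[str] = None       # link to minutes / decision record

def l2_evidence_gaps(uc: AIUseCase) -> list:
    """What is missing for this entry to count as L2 governance evidence?"""
    gaps = []
    if uc.decision_gate_date is None:
        gaps.append("no decision gate applied to this deployment")
    if uc.gate_record_ref is None:
        gaps.append("no artefact reference: the claim cannot be corroborated")
    return gaps
```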

C2: Risk Assessment

Key Concepts

  • Trigger-based reassessment (model changes, use case expansion, regulation)
  • Annual baseline + trigger events = L2+ requirement
  • High-risk criteria: impact + regulatory domain + failure consequence
  • Risk register: identified risks linked to controls with owners
  • Treatment plans required for managed risk (L2)
  • EU AI Act Annex III domains for high-risk classification
  • Risk rating without controls = incomplete risk management
  • A risk register entry needs: risk + rating + control + owner + treatment
WORKED SCENARIO C2.1

Annual risk assessment — model upgrade occurred six months ago with no reassessment

An organisation conducts AI risk assessments annually, but a major model upgrade occurred six months ago without triggering a reassessment. What is your C2 finding?

Expert Analysis
  • C2 gap — PAI-8 C2 requires trigger-based reassessment in addition to the annual baseline. Significant model changes, use case expansions, and regulatory updates are all triggers. Annual frequency alone is only L1.
  • A major model upgrade changes the risk profile of the system. The risk assessment performed before the upgrade is now stale relative to the current deployment.
  • This is not a minor observation — waiting for the annual cycle after a material change is a managed-risk gap finding, not a timing issue.
  • Correct assessment: C2 is operating at L1 (annual baseline only). L2 requires trigger-based reassessment applied after this upgrade.
Key Lesson: Annual-only risk assessment is L1. L2 requires a trigger catalogue — model changes, scope expansion, regulatory updates — each prompting reassessment before the next annual cycle.
📋 Exam Tips for C2
  • High-risk classification is multi-criteria: individual impact significance + regulatory domain + potential harm on failure. Personal data processing alone does not trigger high-risk.
  • A risk register entry without controls, an owner, or a treatment plan is an incomplete finding — flag it as L0/L1 gap regardless of how well the risk is described or rated.
  • Know the distinction: a risk in a register is not the same as a managed risk. Management requires controls + accountability + treatment plans.
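
The last two tips reduce to a completeness check over the register schema. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative register entry; PAI-8 does not prescribe this schema.
@dataclass
class RiskEntry:
    description: str
    rating: str                           # e.g. "high" on the register's scale
    control_ref: Optional[str] = None     # linked control
    owner: Optional[str] = None           # accountable owner
    treatment_plan: Optional[str] = None  # required for a managed (L2) risk

def register_gaps(entry: RiskEntry) -> list:
    """A well-described, well-rated risk still fails L2 without all three."""
    gaps = []
    if not entry.control_ref:
        gaps.append("risk rated but linked to no control")
    if not entry.owner:
        gaps.append("no accountable owner")
    if not entry.treatment_plan:
        gaps.append("no treatment plan, so not a managed (L2) risk")
    return gaps
```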

C3: Data Stewardship

Key Concepts

  • Training data provenance and lineage documentation
  • Data quality validation before training or RAG ingestion
  • GDPR lawful basis for AI training data
  • Consent scope: data collected for X ≠ consent to train AI
  • Data minimisation in AI context
  • Bias auditing of training and evaluation datasets
  • Data access controls for model inputs and outputs
  • Retention and deletion obligations for AI-processed data
WORKED SCENARIO C3.1

Customer service data used to fine-tune a model — consent issue?

An organisation uses historical customer service chat logs to fine-tune their LLM. The original privacy notice said data was collected "to provide customer service." Is this a C3 finding?

Expert Analysis
  • Yes — a C3 finding on data provenance and lawful basis. Data collected for "providing customer service" does not automatically cover using that data for AI model training. These are materially different purposes.
  • Under GDPR, secondary use of personal data requires either: (a) compatible purpose assessment, (b) fresh consent for the new purpose, or (c) a new lawful basis.
  • C3 maturity requires documented lineage showing the legal basis for each training data source. If lineage documentation does not address the secondary use question, it is a gap.
  • The finding: C3 data provenance control not satisfied. Recommend legal review of purpose compatibility before continuing fine-tuning on this dataset.
Key Lesson: Consent or lawful basis for data collection does not automatically extend to AI training. Purpose scope must be explicitly assessed and documented for every training data source.
📋 Exam Tips for C3
  • The key C3 audit question is always: can you show the legal basis for using this data in this AI system? If the answer is "we always had it," that is not sufficient.
  • Bias in training data is a C3 finding, not just a fairness issue. The audit question is: was the training dataset assessed for representativeness and bias before use?
  • Data quality validation before ingestion (into training or RAG) is a C3 control. Ingesting unvalidated data is a gap even if the data is lawfully held.
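
The core C3 evidence question ("can you show the legal basis for this data in this system?") implies a lineage record per training source. A minimal sketch of what such a record might track (illustrative fields, not a PAI-8 schema):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative lineage record for one training data source.
@dataclass
class TrainingDataSource:
    name: str
    collected_for: str             # purpose stated at collection
    used_for: str                  # actual AI use, e.g. "fine-tuning"
    lawful_basis: Optional[str]    # GDPR basis documented for *this* use
    compatibility_assessed: bool   # purpose-compatibility assessment on file?
    bias_assessed: bool            # representativeness reviewed before use?

def c3_findings(src: TrainingDataSource) -> list:
    findings = []
    if src.used_for != src.collected_for and not src.compatibility_assessed:
        findings.append("secondary use with no purpose-compatibility assessment")
    if src.lawful_basis is None:
        findings.append("no documented lawful basis for this use")
    if not src.bias_assessed:
        findings.append("dataset not assessed for representativeness or bias")
    return findings

# Scenario C3.1's chat logs would trip the first two checks.
```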

C4: Model Validation

Key Concepts

  • Pre-deployment validation against curated eval suite
  • Benchmark selection: task-representative, adversarial, edge cases
  • Out-of-distribution detection during validation
  • Hallucination rate measurement methodology
  • Comparing model versions on same eval suite
  • L2 validation: documented, repeatable process with defined pass threshold
  • L3 validation: continuous post-deployment measurement feeding back to development
  • Validation artefacts: eval set, methodology, results, pass/fail criteria
WORKED SCENARIO C4.1

Organisation validates models with ad-hoc prompts before deployment

Before deploying a new LLM, the team tests it with 10–15 ad-hoc prompts chosen by the lead engineer. There is no defined eval set, no pass threshold, and no documentation. What C4 maturity level?

Expert Analysis
  • L1 at best — informal validation exists, but it is not a managed process. L2 C4 requires: a documented, curated eval set, a defined pass threshold, a repeatable methodology, and documented results.
  • Ad-hoc testing is selection-biased — the lead engineer will test prompts they expect to pass. A curated eval set includes adversarial cases and edge cases specifically designed to find failure modes.
  • Without documentation, validation is non-reproducible. You cannot compare results across versions without consistent methodology.
  • C4 finding: "Pre-deployment validation exists informally (L1). Documented, curated eval suite with defined pass criteria required to achieve L2."
Key Lesson: Informal testing is L1. L2 validation requires a defined, documented, repeatable process with an explicit pass threshold — not a vibe check before deployment.
📋 Exam Tips for C4
  • The four required L2 C4 artefacts: (1) curated eval set, (2) documented methodology, (3) defined pass threshold, (4) recorded results. Missing any one = L1.
  • L3 C4 requires a feedback loop from post-deployment performance back to the eval suite — the eval suite must evolve based on real-world failures.
  • Questions about comparing model versions always require the same eval set used consistently — ad-hoc comparison between versions is not valid evidence.
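
The four L2 artefacts map directly onto a small eval harness. A minimal sketch, where `generate` stands in for whatever callable invokes the model under test and the threshold is a placeholder policy value:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # graded check for this case's output
    tag: str = "task"              # "task" / "adversarial" / "edge"

PASS_THRESHOLD = 0.90  # illustrative; the real value is a documented policy choice

def run_eval(generate: Callable[[str], str], suite: List[EvalCase]) -> dict:
    """Curated suite in, recorded results out: the four L2 artefacts in one loop."""
    results = [(case, case.passes(generate(case.prompt))) for case in suite]
    pass_rate = sum(ok for _, ok in results) / len(results)
    return {
        "pass_rate": pass_rate,
        "deploy": pass_rate >= PASS_THRESHOLD,  # explicit pass/fail criterion
        "failures": [f"{c.tag}: {c.prompt}" for c, ok in results if not ok],
    }

# Version comparison is only valid on the same suite:
#   run_eval(model_v1, SUITE)  vs  run_eval(model_v2, SUITE)
```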

C5: Human Oversight

Key Concepts

  • Oversight process definition: who reviews, what triggers review, how often
  • Reviewer calibration: consistent application of oversight criteria
  • Evidence of human review catching AI errors before harm
  • Feedback loop from oversight findings to model improvement
  • L2: documented oversight process applied regularly
  • L3: continuous improvement, measurable effectiveness data
  • Escalation criteria for high-stakes AI outputs
  • Sampling strategy for output review at scale
WORKED SCENARIO C5.1

What evidence distinguishes L2 from L3 in C5 Human Oversight?

An organisation has a documented AI output review process that is applied regularly, with defined reviewers and escalation criteria. What additional evidence is needed to achieve C5 L3?

Expert Analysis
  • L2 is demonstrated by what they already have: documented process, regular cadence, defined reviewers, escalation criteria. This is managed oversight.
  • L3 (Optimised) requires continuous improvement evidence: documented cases where human review caught AI errors that would have caused harm, reviewer calibration data showing consistent application of criteria, and a feedback loop from oversight findings back to model or prompt improvement.
  • The key L3 distinguisher is the feedback loop — L2 oversight catches errors; L3 oversight uses those catches to make the system better over time.
  • L3 evidence artefacts: error catch log, calibration session records, improvement actions taken in response to oversight findings.
Key Lesson: L2 oversight runs the process. L3 oversight learns from it. The feedback loop from findings to improvement is the distinguishing evidence for L3.
📋 Exam Tips for C5
  • C5 questions often describe an oversight process and ask you to identify the maturity level. Focus on: is there a feedback loop? Is there calibration data? These push from L2 to L3.
  • 100% human review is operationally impractical at scale. L2 C5 typically uses a sampling strategy — the question is whether the sample is statistically representative and consistently applied.
  • Reviewer calibration (ensuring consistent criteria application across reviewers) is a C5 control, not a training exercise. Lack of calibration is a C5 gap even if reviews occur regularly.
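
A sampling strategy only counts as "consistently applied" if it is defined somewhere other than a reviewer's head. A minimal sketch of a seeded, stratified sampler (strata and rates are illustrative):

```python
import random

# Illustrative review-sampling policy: all high-stakes outputs, plus a
# fixed-rate random sample of routine ones. Rates are placeholders.
REVIEW_RATES = {"high_stakes": 1.0, "routine": 0.05}

def select_for_review(outputs: list, seed: int = 0) -> list:
    """outputs: dicts with at least 'id' and 'stratum' keys."""
    rng = random.Random(seed)  # seeded, so the sample is reproducible for audit
    return [
        record for record in outputs
        if rng.random() < REVIEW_RATES.get(record["stratum"], 0.05)
    ]
```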

C6: Incident Response

Key Concepts

  • AI-specific incident definition (not just infrastructure incidents)
  • Incident response plan: detection, containment, investigation, remediation, review
  • Rollback criteria: when does an incident warrant immediate model shutdown
  • Post-incident review with root cause and preventive actions
  • Responsible disclosure policy and timeline
  • Notifiable AI incidents: who must be told and when
  • Incident log with AI-specific categorisation
  • Tabletop exercises to test IR plan readiness
WORKED SCENARIO C6.1

Organisation uses their IT incident response process for AI incidents — C6 finding?

An organisation has a mature IT incident response process but no AI-specific incident definitions, categorisation, or response procedures. They handle AI incidents as IT incidents. What is your C6 finding?

Expert Analysis
  • C6 gap — AI incidents have characteristics that standard IT IR processes do not address: model behaviour as a root cause, prompt injection as an attack vector, data poisoning, output-level harm that may not trigger infrastructure alerts, and responsible disclosure to regulators or affected individuals.
  • Handling AI incidents as IT incidents means the investigation process may not check AI-specific root causes (model version change, prompt change, retrieval degradation, training data issue).
  • The finding is L1 at most — a process exists, but it is not AI-specific. L2 C6 requires: AI incident definition, AI-specific categorisation, and response steps tailored to AI failure modes.
  • Recommendation: extend the IR plan with an AI-specific annex covering: detection signals, AI root causes, rollback criteria, and notification obligations.
Key Lesson: A general IT IR process adapted for AI is L1, not L2. AI incidents require AI-specific definitions, root cause investigation steps, and notification pathways.
📋 Exam Tips for C6
  • The key C6 audit test: does the IR process include AI-specific failure modes? If it only covers infrastructure failure, it is not C6-compliant at L2.
  • Post-incident review with root cause and documented preventive actions is the L2 bar. L3 requires evidence that previous incident learnings actually changed the system.
  • Rollback criteria must be documented in advance — not decided during an incident. The C6 finding for organisations without rollback criteria is a gap regardless of IR process maturity elsewhere.
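
"Documented in advance" can be as lightweight as a version-controlled criteria table that the on-call runbook consults. A minimal sketch (signal names and decisions are illustrative):

```python
# Illustrative pre-agreed rollback criteria. Signal names and decisions are
# placeholders, agreed and version-controlled *before* any incident.
ROLLBACK_CRITERIA = {
    "prompt_injection_confirmed_exploited": True,         # immediate rollback
    "harmful_output_reached_end_user": True,
    "hallucination_rate_above_validated_baseline": True,  # ties back to C4
    "single_low_impact_complaint": False,  # investigate, don't roll back
}

def requires_rollback(observed_signals: set) -> bool:
    """Read the decision from the pre-agreed table; never improvise mid-incident."""
    return any(ROLLBACK_CRITERIA.get(s, False) for s in observed_signals)
```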

C7: Audit Trail

Key Concepts

  • Minimum log contents: input, output, model version, timestamp, user ID
  • System prompt version as required audit log field
  • Retention periods: compliance floor + GDPR data minimisation ceiling
  • Tamper-evident log storage
  • Access controls on audit logs
  • Completeness testing: can you reconstruct an incident from the log?
  • Log coverage: all AI touchpoints, not just user-facing outputs
  • GDPR: retention must be bounded — indefinite is not compliant
WORKED SCENARIO C7.1

Audit log captures input and output but not model version or system prompt version

An organisation logs all AI interactions: user input, AI output, and timestamp. Model version and system prompt version are not logged. What is your C7 finding?

Expert Analysis
  • C7 gap — input + output + timestamp is the minimum operational log, but insufficient for investigation. Without model version and system prompt version, you cannot reproduce the conditions that produced any given output.
  • If an incident occurs, the investigation question "what model version and system prompt produced this output?" cannot be answered from the existing logs. This makes incident investigation and root cause analysis unreliable.
  • C7 L2 requires a completeness test: "Can we reconstruct the full conditions of any interaction from the audit log?" Missing model and prompt version fails this test.
  • Finding: C7 is operating at L1 (basic logging). L2 requires complete reproduction-capable logs including model identifier, system prompt version, and all tool calls if agents are in use.
Key Lesson: Audit logs for AI must support incident investigation, not just operational visibility. Model version and system prompt version are the most commonly missing fields.
📋 Exam Tips for C7
  • The completeness test is the key C7 audit technique: given only the log, can you fully reconstruct the conditions of an interaction? If not, what is missing?
  • Retention: GDPR data minimisation means indefinite retention is a liability, not a best practice. Sector regulation sets the minimum retention floor; GDPR data minimisation sets the ceiling.
  • Tamper-evidence and access controls on the audit log are C7 security controls. Logs that can be altered or accessed without restriction undermine the integrity of the audit trail.
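
The completeness test and the tamper-evidence tip can be combined in one record shape: every reproduction-relevant field, hash-chained to its predecessor. A minimal sketch (field names are illustrative; a real deployment would also need an append-only store and access controls):

```python
import hashlib
import json
import time

def append_record(log: list, *, user_id: str, user_input: str, output: str,
                  model_version: str, system_prompt_version: str) -> dict:
    """Append a hash-chained audit record; altering any entry breaks the chain."""
    record = {
        "timestamp": time.time(),
        "user_id": user_id,
        "input": user_input,
        "output": output,
        "model_version": model_version,                  # needed for reproduction
        "system_prompt_version": system_prompt_version,  # most commonly missing
        "prev_hash": log[-1]["hash"] if log else "genesis",
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)
    return record
```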

C8: Vendor & Supply Chain

Key Concepts

  • AI vendor due diligence: safety practices, model cards, incident history
  • Third-party model risk: inherited hallucination rates, bias, safety gaps
  • Data processing agreements covering AI-processed personal data
  • Model card review as part of vendor assessment
  • Supply chain AI risk: AI embedded in SaaS tools not directly procured
  • Provider-side silent model updates: detection and response
  • Contractual obligations: notification of material model changes
  • Vendor concentration risk in AI supply chain
WORKED SCENARIO C8.1

Procurement signs contract with LLM provider — no model change notification clause

An organisation has signed a contract with an LLM API provider. There is no clause requiring the provider to notify the organisation of material model changes. What is the C8 finding?

Expert Analysis
  • C8 gap — vendor risk management at L2 requires contractual obligations for material events including model changes, data processing practices, and security incidents. Absence of a notification clause leaves the organisation unable to plan for or detect provider-side changes.
  • Silent model updates from providers are a documented operational risk (covered in CAOP Domain 2). Without contractual notification, the organisation may not discover a change until it causes a production incident.
  • The C8 recommendation: contract amendment to include: (a) advance notice of material model changes, (b) security incident notification timeline, (c) data sub-processing obligations.
  • Also review: does the data processing agreement cover AI inference? Personal data sent to an LLM API may require explicit DPA coverage for GDPR compliance.
Key Lesson: AI vendor contracts must address model change notification as a material contractual obligation. Operational risk management of AI depends on knowing when the tools change.
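
Contractual notification and operational detection are complementary; even with the clause in place, you want to notice a change yourself. A minimal sketch that pins an expected model identifier, where `reported_model_id` stands in for whatever identifier your provider returns in response metadata:

```python
# Illustrative silent-update detector. `reported_model_id` is a placeholder for
# the model identifier your provider includes in each response's metadata.
EXPECTED_MODEL_ID = "vendor-model-2024-06-01"  # pinned at validation time

def check_model_drift(reported_model_id: str):
    """Return an alert string if the provider-side model has changed."""
    if reported_model_id != EXPECTED_MODEL_ID:
        return (f"model changed: expected {EXPECTED_MODEL_ID!r}, "
                f"got {reported_model_id!r}; trigger C2 reassessment "
                "and C4 re-validation before continued use")
    return None
```
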
📋 Exam Tips for C8
  • C8 covers both direct AI vendors (LLM APIs, AI SaaS) and indirect AI supply chain (AI embedded in tools not procured as AI). Ask: where else does AI enter the organisation's systems?
  • Model card review is a C8 due diligence artefact. Evidence request: "What model cards did you review before deploying this third-party model?"
  • Data processing agreement (DPA) coverage for AI inference is a GDPR obligation. Sending personal data to an LLM API without a DPA is both a C8 finding and a regulatory risk.

Audit Methodology

Key Concepts

  • Evidence-based maturity assessment (not self-reported)
  • Interview techniques: corroborate claims with artefacts
  • The audit finding ladder: observation → finding → gap
  • Drafting findings: specific, evidenced, actionable
  • Audit report structure for AI safety assessments
  • Rating disputes: how to handle when auditee disagrees
  • Scope definition: which AI systems are in scope
  • Independence requirements: assessor must not assess own work
WORKED SCENARIO M.1

Auditee claims L2 governance — what evidence do you request?

During a C1 interview, the Head of AI says "we have L2 governance — we have a committee, we meet monthly, and we apply it to all AI deployments." What evidence do you request to substantiate this claim?

Expert Analysis
  • Never accept a self-assessed maturity rating without corroborating evidence. The audit role is evidence-based assessment, not interview-based endorsement.
  • For the L2 C1 claim, request: (1) committee charter or terms of reference, (2) meeting minutes from the last 12 months, (3) decision gate records showing governance applied to specific deployments, (4) AI use case registry. The minutes should reference actual deployment decisions, not just be standing agenda items.
  • If the organisation provides meeting minutes that discuss AI governance generally but contain no records of specific deployment decisions being reviewed, this is evidence for L1, not L2.
  • The audit finding must be supported by evidence on file — not the auditor's impression of the interview.
Key Lesson: A self-reported maturity level is a claim, not a finding. Every maturity assessment must be backed by specific, named evidence artefacts. The evidence determines the rating, not the claim.
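
"The evidence determines the rating" can be made mechanical. A minimal sketch that rates the C1 claim from Scenario M.1 using only artefacts actually on file (artefact names are our own shorthand, not PAI-8 terminology):

```python
# Illustrative evidence-to-rating check for the C1 governance claim.
L1_EVIDENCE = {"policy_document", "staff_awareness_records", "named_owner"}
L2_EVIDENCE = L1_EVIDENCE | {"committee_charter", "meeting_minutes",
                             "decision_gate_records", "use_case_registry"}

def supported_c1_level(on_file: set) -> str:
    if L2_EVIDENCE <= on_file:
        return "L2"
    if L1_EVIDENCE <= on_file:
        return "L1"
    return "L0"

# The Head of AI's verbal claim is an input to scoping, not a rating:
print(supported_c1_level({"policy_document", "committee_charter"}))  # -> "L0"
```
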
📋 Exam Tips for Audit Methodology
  • Audit questions often present a scenario where the auditee claims a high maturity level — your job is to identify what evidence is missing or contradicted by the scenario.
  • The finding ladder: an observation becomes a finding when it represents a gap against the standard. A gap requires a recommendation. Know the difference between noting something and finding something.
  • Independence is a PAI audit requirement. An assessor cannot evaluate controls they designed or implemented. This is tested in scenarios where the internal AI team proposes to self-certify.

Ready to sit the examination?

You now have the PAI-8 framework and audit methodology foundation. The CAIA exam tests applied reasoning — read each scenario carefully, identify which control domain applies, assess the evidence against the maturity level criteria, and select the most precise finding.

Purchase Exam Access — $97 →