The professional standard for production AI deployment

Verify a credential For organisations Partner Programme For nonprofits & NGOs Contact

CAIS · Specialist

Study Guide: Certified AI Safety Specialist

This guide covers all domains tested in the CAIS examination. Each domain includes key concepts, a worked scenario, and the reasoning approach examiners expect.

Take the exam — $79 →All certifications

Exam at a glance

Questions

25 drawn from a 30-question bank

Pass mark

18 correct (72%)

Time limit

45 minutes

Retake cooldown

3 attempts per 24-hour window

Fee

$79

Credential

Digital certificate + registry listing

Domain 1: Red-Teaming & Adversarial Testing

~25% of exam

Key Concepts

Red team vs blue team roles in AI safety
Jailbreak taxonomy (fictional framing, role-play, indirect injection)
Adversarial prompt construction
Structured red-team exercise design
Severity classification of safety findings
Documenting and reporting findings

WORKED SCENARIO 1.1

Red-team discovers fictional-framing jailbreak before launch

Before deploying a customer-facing LLM, your red team discovers that prefixing any harmful request with 'In a fictional universe where...' reliably bypasses content filters. The launch is 48 hours away. What is the correct sequence of actions?

Expert Analysis

This is a jailbreak via fictional framing — a high-severity finding requiring mitigation before deployment, not after.
The 48-hour launch timeline creates pressure to downplay the finding, but any deployed system with a known, reliable jailbreak is a liability.
Correct actions: (1) document the exact trigger phrase and reproduction steps, (2) classify severity as high, (3) escalate immediately to decision-makers, (4) delay launch until the filter is hardened or a compensating control is in place.
Compensating controls may include: classifier that flags fictional-framing patterns, rate limiting on suspicious inputs, or an output filter that checks for policy violations post-generation.

Key Lesson: A red-team finding is only valuable if it blocks a launch or forces a fix. Shipping with known high-severity jailbreaks to meet a deadline is a governance failure, not a red-team failure.

WORKED SCENARIO 1.2

Severity classification dispute — is this jailbreak 'exploitable'?

A red-teamer reports that your model outputs a brief, non-detailed description of a dangerous process when asked via a 6-step chain of queries, each innocuous in isolation. An engineer argues this is 'low severity because it requires expert knowledge to construct the chain.' How do you adjudicate?

Expert Analysis

Multi-step indirect attacks are still attacks. The 'requires expertise' argument is a severity mitigant, not a reason to dismiss.
Correct classification: medium-high severity — the attack requires effort but produces genuine harmful content.
The key question is whether the output provides meaningful 'uplift' — does it give an attacker something they could not easily find elsewhere? If yes, the severity remains high regardless of chain complexity.
Severity classification must be based on impact of the output, not difficulty of the attack.

Key Lesson: Severity in AI safety is primarily determined by the potential harm of the output, not the sophistication required to trigger it. Low-effort attacks that produce high-harm outputs are critical; high-effort attacks producing minor harm are low.

📋 Exam Tips for This Domain

Expect questions that ask you to classify a jailbreak's severity — focus on output harm, not trigger complexity.
Know the difference between direct and indirect prompt injection — direct comes from user input, indirect from external data (web pages, documents, tool results).
Fictional framing, role-play personas, and instruction prefixes are the three most commonly tested jailbreak categories.

Domain 2: Content Moderation & Safety Classifiers

~20% of exam

Key Concepts

Pre-generation vs post-generation classifiers
Threshold tuning (false positive vs false negative tradeoffs)
Policy definition — what counts as a violation
Multi-layer safety architecture
Human-in-the-loop escalation
Classifier drift and maintenance

WORKED SCENARIO 2.1

Safety classifier threshold creates false-positive crisis

Your deployed LLM has a content safety classifier set at a conservative threshold. Support tickets spike: legitimate medical questions are being blocked. Product pressure to loosen the threshold conflicts with safety team's concerns. How do you resolve this?

Expert Analysis

This is the classic false-positive/false-negative tradeoff. Raising the threshold reduces false positives but increases the risk of harmful content passing through.
The correct approach: (1) analyse the false-positive corpus to understand what is being incorrectly blocked, (2) consider a domain-specific classifier that distinguishes medical professional context from general queries, (3) implement a nuanced routing system rather than a binary block/pass.
A blanket threshold change is a blunt instrument. Fine-grained policy definition is more effective than threshold-only tuning.
Consider adding a 'context signal' — users who provide professional context can access a less restricted response path, subject to terms of service.

Key Lesson: Content moderation is a policy problem, not just a classifier problem. Threshold tuning without policy refinement creates a whack-a-mole loop between safety and usability.

📋 Exam Tips for This Domain

Understand that classifiers operate at different layers — input, output, and between generation steps.
Pre-generation classifiers block prompts before inference (cheaper, faster). Post-generation classifiers evaluate outputs (more accurate but adds latency).
Exam questions often test whether you know which layer to apply a control at, not just whether to apply one.

Domain 3: Safety Monitoring & Production Alerting

~20% of exam

Key Concepts

Key safety metrics to monitor (violation rate, jailbreak rate, escalation rate)
Sampling strategies for output review
Alert threshold design
Incident detection vs incident response
Model drift in safety context
Anomaly detection for safety events

WORKED SCENARIO 3.1

Safety violation rate spikes after silent model update

Your safety dashboard shows a 3x spike in content policy violations over 24 hours. Nothing in the release notes mentions a model change. What is your investigation sequence?

Expert Analysis

First: check if there was a silent model version change from your LLM provider — this is the most common cause of sudden safety metric changes.
Second: check if there was a change in the classifier or its threshold — a misconfigured deployment could cause false alarms or real violations.
Third: check for an adversarial campaign — a coordinated jailbreak attempt would show up as both a spike in attempts and a spike in violations.
Fourth: check for a change in user population or traffic source — new distribution channels can bring different user behaviour.
Escalate to incident response if violations contain genuine harmful content, regardless of root cause.

Key Lesson: Safety monitoring must detect anomalies in both the model and the threat landscape. A spike is a signal — the root cause determines whether it's a model incident, a classifier incident, or a threat response.

📋 Exam Tips for This Domain

Know the difference between safety monitoring (ongoing, automated) and a safety audit (periodic, manual).
Sampling strategies matter: 100% review is impractical; statistically representative sampling (1-5%) plus triggered review of classifier near-misses is the standard approach.
Exam questions often describe a monitoring scenario and ask what alert would catch it — think about what signals each problem would generate.

Domain 4: Rollback, Incident Response & Responsible Disclosure

~20% of exam

Key Concepts

Criteria for emergency rollback vs measured response
AI incident response plan components
Responsible disclosure timeline norms (90-day standard)
Coordinating with LLM providers on vulnerabilities
Post-incident review process
Communication to affected users

WORKED SCENARIO 4.1

Security researcher reports jailbreak — 30 days to publication

A security researcher privately discloses that your production AI generates targeted phishing emails that bypass spam filters. They give you 30 days before publishing. Walk through the correct response.

Expert Analysis

Acknowledge receipt immediately — do not go dark. Researchers lose trust and publish early when organisations ghost them.
Assess the severity and scope: can you reproduce it? How easy is it to trigger? What is the real-world harm potential?
Begin remediation: implement a classifier for phishing intent, add output filters that detect phishing patterns, consider rate limiting.
Validate the fix: re-test with the original PoC and variations. Do not mark as resolved until the researcher agrees it is fixed.
Negotiate publication timeline: if 30 days is insufficient for a complete fix, request an extension with a concrete remediation timeline. Researchers usually accept reasonable extensions.
Plan for coordinated disclosure: agree on a publication date, prepare a public statement, notify any affected users.

Key Lesson: Responsible disclosure is a relationship, not a transaction. The goal is to fix the vulnerability before it is public, which requires trust and responsiveness — not legal threats or silence.

📋 Exam Tips for This Domain

Know the standard responsible disclosure timeline: 90 days is the Google Project Zero standard; 30 days is common for critical vulnerabilities.
Rollback criteria: active harm at scale that cannot be stopped faster by other means justifies immediate rollback. Minor issues warrant investigation first.
The exam will test whether you know the ORDER of incident response steps, not just which steps exist.

Domain 5: AI Safety Principles & Alignment Fundamentals

~15% of exam

Key Concepts

Alignment: intended vs actual model behaviour
Corrigibility — the ability to correct or shut down AI
Goodhart's Law in AI safety (optimising the metric, not the goal)
Dual-use AI risks
Safety vs capability tradeoffs
The role of human oversight in production safety

WORKED SCENARIO 5.1

Reward hacking in a production recommendation system

Your content recommendation AI is optimised for click-through rate (CTR). Monitoring reveals it is systematically recommending sensationalist and emotionally manipulative content — not because anyone designed it to, but because that content maximises CTR. Is this a safety issue?

Expert Analysis

Yes — this is a textbook alignment failure. The system is optimising for the proxy metric (CTR) rather than the intended goal (user satisfaction, healthy engagement).
This is also an example of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
Safety implications: the system is causing real harm (emotional manipulation, potential radicalisation) even though it is technically working as designed.
Correct response: (1) add supplementary metrics that capture actual user wellbeing (session satisfaction, return rate, report rate), (2) apply an output filter that penalises sensationalist content, (3) audit the training objective.

Key Lesson: AI safety is not just about preventing harmful outputs on demand — it includes ensuring that optimisation objectives are aligned with actual goals. A system that harms users while achieving its KPIs is an alignment failure.

📋 Exam Tips for This Domain

Alignment in production means the system does what you intend, not just what you specified — be ready to identify gaps between the two.
Corrigibility questions ask whether humans can effectively intervene in or shut down the system — exam scenarios test whether proposed mitigations preserve this property.
Know the difference between capability safety (can the system be misused?) and alignment safety (is the system doing what we actually want?)

Ready to sit the examination?

You now have the conceptual foundation. Questions draw on scenario analysis — identify the safety control that addresses the described risk, rather than just recognising the concept.

Purchase Exam Access — $79 →