The professional standard for production AI deployment
CAAE · Specialist

Study Guide: Certified Applied AI Engineer

This guide covers all domains tested in the CAAE examination. Each domain includes key concepts, a worked scenario, and the reasoning approach examiners expect.

Take the exam — $79 →

Exam at a glance

Questions: 25 drawn from a 30-question bank
Pass mark: 18 correct (72%)
Time limit: 45 minutes
Retake limit: 3 attempts per 24-hour window
Fee: $79
Credential: Digital certificate + registry listing

Domain 1: Prompting Techniques & Reasoning Patterns

~25% of exam

Key Concepts

  • Zero-shot vs few-shot: when each applies
  • Chain-of-thought (CoT): step-by-step reasoning
  • Self-consistency: majority-vote over multiple CoT paths
  • Tree-of-thought: branching reasoning exploration
  • ReAct: interleaved reasoning and action
  • Meta-prompting: prompts that generate prompts
WORKED SCENARIO 1.1

Chain-of-thought improves complex reasoning but fails on simple tasks

Your team is evaluating prompting strategies for a financial document analysis tool. On complex multi-step calculations (e.g., calculating IRR across multiple cash flows), CoT dramatically improves accuracy. But on simple extraction tasks (e.g., 'What is the stated revenue on line 12?'), CoT actually decreases accuracy and increases latency. How do you reconcile this?

Expert Analysis
  • This is an expected finding: CoT adds value proportional to task complexity. For tasks that require multi-step reasoning, CoT provides a structured path. For tasks that are essentially lookup/extraction, CoT adds irrelevant intermediate steps that can introduce errors.
  • The solution is task routing: classify queries by complexity, then apply CoT only to complex reasoning tasks. Simple extraction tasks use direct prompting.
  • Self-consistency (running CoT multiple times and majority-voting) is even more expensive — apply it only to the highest-stakes calculations where accuracy is critical.
  • The lesson generalises: prompting techniques are not universally better — they are tools suited to specific task characteristics.
Key Lesson: Prompting techniques are tools with appropriate use cases, not magic that always improves performance. The right technique depends on task complexity, error cost, and latency budget.
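The routing described in the analysis above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical call_llm helper and a crude keyword-based complexity classifier; in practice the router might be a small classifier model, and the prompts and keywords would be tuned to your workload.

```python
from collections import Counter

def call_llm(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical provider call returning n sampled completions."""
    raise NotImplementedError  # replace with your provider's SDK

def classify_complexity(query: str) -> str:
    """Crude keyword router; in practice this could be a small classifier model."""
    multi_step_markers = ("irr", "npv", "compare", "reconcile", "across")
    return "complex" if any(m in query.lower() for m in multi_step_markers) else "simple"

def answer(query: str, high_stakes: bool = False) -> str:
    if classify_complexity(query) == "simple":
        # Simple extraction: direct prompting, no intermediate steps to go wrong.
        return call_llm(f"Answer concisely: {query}")[0]

    cot_prompt = f"Think step by step, then give only the final answer on the last line.\n\n{query}"
    if not high_stakes:
        return call_llm(cot_prompt)[0]

    # Self-consistency: sample several CoT paths and majority-vote the final lines.
    finals = [c.strip().splitlines()[-1] for c in call_llm(cot_prompt, n=5)]
    return Counter(finals).most_common(1)[0][0]
```

Note that self-consistency is reserved for high-stakes queries only, because it multiplies cost and latency by the number of sampled reasoning paths.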
📋 Exam Tips for This Domain
  • Expect questions asking you to choose between zero-shot, few-shot, and CoT for a given scenario — match the technique to the task complexity.
  • Self-consistency is best for tasks with a single correct answer (math, logic) where multiple reasoning paths should converge.
  • ReAct (Reason + Act) is best for agentic tasks where the model needs to take actions and observe results between reasoning steps.

Domain 2: Structured Output & Output Parsing

~20% of exam

Key Concepts

  • JSON mode vs schema-constrained generation
  • Pydantic and Zod schemas for output validation
  • Output parsing robustness: retry logic, fallback handling
  • Few-shot examples for output format
  • Extracting structured data from unstructured text
  • Type coercion and validation after parsing
WORKED SCENARIO 2.1

Structured output intermittently produces malformed JSON

Your application requests JSON output via a system prompt instruction ('Always respond in valid JSON'). In production, approximately 0.3% of responses are malformed JSON, causing 500 errors. The errors cluster around complex nested schemas. Describe the fix.

Expert Analysis
  • Prompt-based JSON requests are statistically unreliable — especially for complex nested schemas. 0.3% malformed rate means 1 in 333 requests fails, which is unacceptable for production.
  • Fix 1 (immediate): Add try/catch around JSON.parse with a retry — on malformed output, call the API again with the malformed response appended and 'The above JSON is malformed. Please fix it and respond with only valid JSON.'
  • Fix 2 (robust): Switch to the provider's structured output mode (json_schema) which constrains generation at the token level and guarantees valid JSON matching the schema.
  • Fix 3 (validation): Even with structured output mode, validate against the Pydantic/Zod schema — the model might omit optional fields or use wrong types.
Key Lesson: Schema-constrained generation (structured output mode) is categorically more reliable than prompt-based JSON requests. Use it as the primary approach; reserve retry logic as a safety net for providers that do not support it.
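A minimal sketch of Fix 1 and Fix 3 combined, assuming a hypothetical call_llm helper and an illustrative Invoice schema (Pydantic v2). With a provider that supports schema-constrained output (Fix 2), you would pass the schema at request time and keep the validation step as a safety net.

```python
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):            # illustrative schema, not part of the exam material
    vendor: str
    total: float
    line_items: list[str]

def call_llm(messages: list[dict]) -> str:
    """Hypothetical provider call returning the raw completion text."""
    raise NotImplementedError

def get_invoice(document: str, max_retries: int = 2) -> Invoice:
    messages = [
        {"role": "system", "content": "Respond with only valid JSON matching the Invoice schema."},
        {"role": "user", "content": document},
    ]
    for _ in range(max_retries + 1):
        raw = call_llm(messages)
        try:
            # Validate structure *and* types, even when the JSON parses cleanly.
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the malformed output back and ask the model to repair it.
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"That output was invalid ({err.__class__.__name__}). "
                                            "Respond again with only valid JSON."},
            ]
    raise RuntimeError("Could not obtain valid structured output")
```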
📋 Exam Tips for This Domain
  • Know the difference between JSON mode (valid JSON, but any structure) and schema-constrained generation (valid JSON conforming to a specific schema).
  • Retry logic for malformed output: include the malformed response in the retry prompt and ask the model to fix it — this is more reliable than a bare retry.
  • Type coercion danger: the model may return a number as a string ('42' instead of 42). Always validate types after parsing, even with structured output.

Domain 3: Agent Architectures & Reliability

~20% of exam

Key Concepts

  • Agent loop: observe-reason-act cycle
  • Tool/function selection and argument generation
  • Explicit state management vs context-window-only state
  • Iteration limits and timeout budgets
  • Multi-agent coordination patterns
  • Agent failure modes: hallucinated tool calls, infinite loops, goal abandonment
WORKED SCENARIO 3.1

Agent enters infinite loop on ambiguous goal

You deploy a research agent tasked with 'Find the most recent and comprehensive information about X.' The agent begins iterating — searching, reading, searching again with refined terms — and never terminates. After 47 tool calls and 12 minutes, it is still running. What failed and how do you prevent this?

Expert Analysis
  • The agent lacks a termination condition. 'Most recent and comprehensive' is inherently unsatisfiable — there is always more information to find. The agent is optimising for an unbounded goal.
  • What failed: (1) no iteration limit, (2) no time budget, (3) goal specification did not include a 'good enough' termination criterion.
  • Fixes: (1) hard iteration cap (e.g., max 10 tool calls), (2) wall-clock timeout (e.g., 2 minutes), (3) rephrase the goal with a concrete termination condition ('find the top 5 most recent sources from the last 30 days'), (4) add a meta-reasoning step where the agent assesses whether it has sufficient information to answer before continuing.
Key Lesson: Agent termination conditions must be explicitly defined — the model cannot reliably infer when 'enough' information has been gathered for an open-ended goal. Every agentic loop needs an iteration cap and a timeout budget.
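A minimal sketch of an agent loop with the three guardrails from the analysis above: an iteration cap, a wall-clock budget, and an explicit 'good enough' termination check. The helper functions are hypothetical placeholders for the LLM and tool calls.

```python
import time

MAX_ITERATIONS = 10       # hard cap on tool calls
TIME_BUDGET_S = 120       # wall-clock timeout in seconds

# Hypothetical helpers standing in for LLM and tool calls.
def plan_next_action(state): raise NotImplementedError
def run_tool(action): raise NotImplementedError
def has_enough_information(state) -> bool: raise NotImplementedError
def synthesise_answer(state) -> str: raise NotImplementedError
def summarise_best_effort(state) -> str: raise NotImplementedError

def run_agent(goal: str) -> str:
    start = time.monotonic()
    # Explicit state object rather than relying on the context window alone.
    state = {"goal": goal, "observations": [], "steps_taken": 0}
    while True:
        if state["steps_taken"] >= MAX_ITERATIONS:        # guardrail 1: iteration cap
            return summarise_best_effort(state)
        if time.monotonic() - start > TIME_BUDGET_S:       # guardrail 2: time budget
            return summarise_best_effort(state)
        if has_enough_information(state):                  # guardrail 3: termination check
            return synthesise_answer(state)
        action = plan_next_action(state)                   # LLM chooses the next tool call
        state["observations"].append(run_tool(action))
        state["steps_taken"] += 1
```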
📋 Exam Tips for This Domain
  • The most common agent failure modes: hallucinating tool calls (calling functions that do not exist), incorrect argument types, and infinite loops. Know all three.
  • Explicit state management (storing completed steps, remaining steps, intermediate results in a structured object) is more reliable than relying on the context window alone for multi-step tasks (see the sketch after these tips).
  • Multi-agent patterns: orchestrator-subagent (one agent coordinates others), peer-to-peer (agents collaborate), and parallel-then-synthesise (agents work simultaneously, results merged).
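A minimal sketch of the explicit-state idea from the second tip; the field names are illustrative rather than a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Explicit task state, serialised back into each prompt instead of
    trusting the raw conversation history to carry it."""
    goal: str
    completed_steps: list[str] = field(default_factory=list)
    remaining_steps: list[str] = field(default_factory=list)
    intermediate_results: dict[str, str] = field(default_factory=dict)

    def as_prompt_block(self) -> str:
        return (
            f"Goal: {self.goal}\n"
            f"Completed: {self.completed_steps}\n"
            f"Remaining: {self.remaining_steps}\n"
            f"Results so far: {self.intermediate_results}"
        )
```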

Domain 4: Evaluation Frameworks & Quality Measurement

~20% of exam

Key Concepts

  • BLEU, ROUGE-L: what they measure and when they fail
  • Human evaluation: gold standard with rubric design
  • LLM-as-Judge: strengths and known biases
  • Task-specific metrics: F1 for extraction, exact match for Q&A
  • Benchmark construction: representative, adversarial, regression sets
  • A/B testing quality signals in production
WORKED SCENARIO 4.1

BLEU score is high but human evaluators rate quality as poor

You have been using BLEU score to evaluate your document summarisation model. BLEU scores average 0.42 (high for the task). However, in a human evaluation study, 35% of summaries are rated as 'poor quality' by domain experts. How do you explain this discrepancy and fix your evaluation approach?

Expert Analysis
  • BLEU measures n-gram overlap between the generated summary and a reference summary. High BLEU means the generated text shares many word sequences with the reference — but does not measure: coherence, factual accuracy, completeness, or domain correctness.
  • The most likely explanation: your model is generating text that uses similar words to the reference but in a way that is semantically poor — perhaps extracting literal phrases without understanding context.
  • The fix: replace or supplement BLEU with: (1) human evaluation on a representative sample with a domain expert rubric, (2) LLM-as-Judge with explicit criteria (accuracy, coherence, completeness, conciseness), (3) task-specific metrics like factual consistency checking.
  • BLEU is a useful development-time proxy but is not a substitute for human evaluation in production quality assurance.
Key Lesson: Automated metrics like BLEU correlate poorly with human quality judgements for open-ended generation. They are useful for detecting regressions between versions but cannot replace human or LLM-as-Judge evaluation for absolute quality assessment.
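A minimal sketch of a pairwise LLM-as-Judge comparison that mitigates position bias by scoring both answer orders and averaging, assuming a hypothetical call_llm helper and an illustrative 1-10 rubric.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical provider call returning the completion text."""
    raise NotImplementedError

RUBRIC = "accuracy, coherence, completeness, conciseness"   # illustrative criteria

def judge_once(source: str, first: str, second: str) -> float:
    """Score the FIRST summary relative to the SECOND on a 1-10 scale."""
    prompt = (
        f"Source document:\n{source}\n\n"
        f"Summary A:\n{first}\n\nSummary B:\n{second}\n\n"
        f"Considering {RUBRIC}, rate Summary A relative to Summary B "
        "from 1 (much worse) to 10 (much better). Respond with only the number."
    )
    return float(call_llm(prompt).strip())

def judge(source: str, answer_a: str, answer_b: str) -> float:
    # Swap the answer order and average to cancel out position bias.
    score_ab = judge_once(source, answer_a, answer_b)
    score_ba = judge_once(source, answer_b, answer_a)
    return (score_ab + (11 - score_ba)) / 2   # map the swapped score back onto A's scale
```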
📋 Exam Tips for This Domain
  • BLEU: measures n-gram overlap. ROUGE-L: measures longest common subsequence. Both measure surface similarity, not semantic quality.
  • LLM-as-Judge biases: verbosity bias, position bias, self-enhancement bias. Mitigate by swapping answer order and averaging, using multiple judges, and calibrating against human judgements.
  • Benchmark sets should include: representative examples (typical user queries), adversarial examples (known failure modes), and regression examples (bugs that were fixed).

Domain 5: Context Management & Prompt Architecture

~15% of exam

Key Concepts

  • Context window: maximising usage without silent truncation
  • Rolling conversation history: summarisation strategies
  • Retrieval-augmented context vs full-document context
  • System prompt design: persona, constraints, output format, examples
  • Prompt compression techniques
  • Temperature, top-p, top-k: effects on output diversity
WORKED SCENARIO 5.1

Long conversation silently truncates critical instructions

A customer service bot works perfectly for the first 10 turns of a conversation. In turn 15, it starts ignoring its system prompt instructions (e.g., never offering refunds without manager approval). Investigation reveals the conversation history has grown to fill the entire context window, truncating the system prompt. What should have been done architecturally?

Expert Analysis
  • This is a context window overflow failure. The application was appending conversation turns without managing total token count, and eventually the system prompt was pushed out of the context window by the growing conversation history.
  • What should have been done: (1) implement a rolling context budget that reserves a fixed allocation for the system prompt (e.g., 1000 tokens always reserved at the start), (2) summarise conversation history when it approaches the context limit — replace verbose turn-by-turn history with a compact summary of key facts and decisions, (3) monitor context window utilisation as a production metric.
  • Alternative: use a model with a longer context window, but this is more expensive and does not solve the root cause of unmanaged context growth.
Key Lesson: Context windows fill up. Applications must actively manage context — reserving space for critical instructions, summarising history, and monitoring utilisation. Silent truncation of system prompts is one of the most dangerous production LLM failures.
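A minimal sketch of the rolling-budget approach from the analysis above, assuming hypothetical count_tokens and summarise_turns helpers and illustrative budget numbers. The system prompt gets a fixed reservation and is always included; older turns are compressed into a summary once the remaining budget is spent.

```python
CONTEXT_LIMIT = 8000          # model context window, illustrative
SYSTEM_RESERVE = 1000         # always reserved for the system prompt
RESPONSE_RESERVE = 1000       # headroom for the model's reply

def count_tokens(text: str) -> int:
    """Hypothetical tokenizer call (e.g. the provider's token-counting utility)."""
    raise NotImplementedError

def summarise_turns(turns: list[dict]) -> dict:
    """Hypothetical LLM call that compresses old turns into one summary message."""
    raise NotImplementedError

def build_messages(system_prompt: str, history: list[dict], user_msg: dict) -> list[dict]:
    budget = CONTEXT_LIMIT - SYSTEM_RESERVE - RESPONSE_RESERVE
    used = count_tokens(user_msg["content"])
    kept: list[dict] = []

    # Walk history newest-first, keeping turns until the budget is spent.
    for turn in reversed(history):
        used += count_tokens(turn["content"])
        if used > budget:
            older = history[: len(history) - len(kept)]
            kept.insert(0, summarise_turns(older))   # compress everything older
            break
        kept.insert(0, turn)

    # The system prompt is always included first, never truncated.
    return [{"role": "system", "content": system_prompt}, *kept, user_msg]
```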
📋 Exam Tips for This Domain
  • Temperature = 0 is deterministic (always highest-probability token). Temperature = 1 samples from the distribution. Above 1 is typically too random for production.
  • Top-p = 0.9 samples from the 90% probability mass — a good default for creative tasks. Top-p = 0.1 is very conservative (near-deterministic).
  • Prompt compression techniques (like LLMLingua) can reduce token count by 30-50% with minimal quality impact — useful when context is tight and cannot grow.

Ready to sit the examination?

You now have the conceptual foundation. Expect applied-reasoning questions — read each scenario and identify which technique or safeguard prevents the described failure.

Purchase Exam Access — $79 →