The professional standard for production AI deployment
CAAE · Specialist

Study Guide: Certified Applied AI Engineer

This guide covers all domains tested in the CAAE examination. Each domain includes key concepts, a worked scenario, and the reasoning approach examiners expect.

Take the exam — $79 →

Exam at a glance

Questions: 25 drawn from a 30-question bank
Pass mark: 18 correct (72%)
Time limit: 45 minutes
Retake limit: 3 attempts per 24-hour window
Fee: $79
Credential: Digital certificate + registry listing

Domain 1: Prompting Techniques & Reasoning Patterns

~25% of exam

Key Concepts

  • Zero-shot vs few-shot: when each applies
  • Chain-of-thought (CoT): step-by-step reasoning
  • Self-consistency: majority-vote over multiple CoT paths
  • Tree-of-thought: branching reasoning exploration
  • ReAct: interleaved reasoning and action
  • Meta-prompting: prompts that generate prompts
WORKED SCENARIO 1.1

Chain-of-thought improves complex reasoning but fails on simple tasks

Your team is evaluating prompting strategies for a financial document analysis tool. On complex multi-step calculations (e.g., calculating IRR across multiple cash flows), CoT dramatically improves accuracy. But on simple extraction tasks (e.g., 'What is the stated revenue on line 12?'), CoT actually decreases accuracy and increases latency. How do you reconcile this?

Expert Analysis
  • This is an expected finding: CoT adds value proportional to task complexity. For tasks that require multi-step reasoning, CoT provides a structured path. For tasks that are essentially lookup/extraction, CoT adds irrelevant intermediate steps that can introduce errors.
  • The solution is task routing: classify queries by complexity, then apply CoT only to complex reasoning tasks. Simple extraction tasks use direct prompting.
  • Self-consistency (running CoT multiple times and majority-voting) is even more expensive — apply it only to the highest-stakes calculations where accuracy is critical.
  • The lesson generalises: prompting techniques are not universally better — they are tools suited to specific task characteristics.
Key Lesson: Prompting techniques are tools with appropriate use cases, not magic that always improves performance. The right technique depends on task complexity, error cost, and latency budget.
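The routing described in the analysis above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical call_llm helper and a crude keyword-based complexity classifier; in practice the router might be a small classifier model, and the prompts and keywords would be tuned to your workload.

```python
from collections import Counter

def call_llm(prompt: str, n: int = 1) -> list[str]:
    """Hypothetical provider call returning n sampled completions."""
    raise NotImplementedError  # replace with your provider's SDK

def classify_complexity(query: str) -> str:
    """Crude keyword router; in practice this could be a small classifier model."""
    multi_step_markers = ("irr", "npv", "compare", "reconcile", "across")
    return "complex" if any(m in query.lower() for m in multi_step_markers) else "simple"

def answer(query: str, high_stakes: bool = False) -> str:
    if classify_complexity(query) == "simple":
        # Simple extraction: direct prompting, no intermediate steps to go wrong.
        return call_llm(f"Answer concisely: {query}")[0]

    cot_prompt = f"Think step by step, then give only the final answer on the last line.\n\n{query}"
    if not high_stakes:
        return call_llm(cot_prompt)[0]

    # Self-consistency: sample several CoT paths and majority-vote the final lines.
    finals = [c.strip().splitlines()[-1] for c in call_llm(cot_prompt, n=5)]
    return Counter(finals).most_common(1)[0][0]
```

Note that self-consistency is reserved for high-stakes queries only, because it multiplies cost and latency by the number of sampled reasoning paths.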
📋 Exam Tips for This Domain
  • Expect questions asking you to choose between zero-shot, few-shot, and CoT for a given scenario — match the technique to the task complexity.
  • Self-consistency is best for tasks with a single correct answer (math, logic) where multiple reasoning paths should converge.
  • ReAct (Reason + Act) is best for agentic tasks where the model needs to take actions and observe results between reasoning steps.

Domain 2: Structured Output & Output Parsing

~20% of exam

Key Concepts

  • JSON mode vs schema-constrained generation
  • Pydantic and Zod schemas for output validation
  • Output parsing robustness: retry logic, fallback handling
  • Few-shot examples for output format
  • Extracting structured data from unstructured text
  • Type coercion and validation after parsing
WORKED SCENARIO 2.1

Structured output intermittently produces malformed JSON

Your application requests JSON output via a system prompt instruction ('Always respond in valid JSON'). In production, approximately 0.3% of responses are malformed JSON, causing 500 errors. The errors cluster around complex nested schemas. Describe the fix.

Expert Analysis
  • Prompt-based JSON requests are statistically unreliable — especially for complex nested schemas. 0.3% malformed rate means 1 in 333 requests fails, which is unacceptable for production.
  • Fix 1 (immediate): Add try/catch around JSON.parse with a retry — on malformed output, call the API again with the malformed response appended and 'The above JSON is malformed. Please fix it and respond with only valid JSON.'
  • Fix 2 (robust): Switch to the provider's structured output mode (json_schema) which constrains generation at the token level and guarantees valid JSON matching the schema.
  • Fix 3 (validation): Even with structured output mode, validate against the Pydantic/Zod schema — the model might omit optional fields or use wrong types.
Key Lesson: Schema-constrained generation (structured output mode) is categorically more reliable than prompt-based JSON requests. Use it as the primary approach; reserve retry logic as a safety net for providers that do not support it.
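A minimal sketch of Fix 1 and Fix 3 combined, assuming a hypothetical call_llm helper and an illustrative Invoice schema (Pydantic v2). With a provider that supports schema-constrained output (Fix 2), you would pass the schema at request time and keep the validation step as a safety net.

```python
import json
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):            # illustrative schema, not part of the exam material
    vendor: str
    total: float
    line_items: list[str]

def call_llm(messages: list[dict]) -> str:
    """Hypothetical provider call returning the raw completion text."""
    raise NotImplementedError

def get_invoice(document: str, max_retries: int = 2) -> Invoice:
    messages = [
        {"role": "system", "content": "Respond with only valid JSON matching the Invoice schema."},
        {"role": "user", "content": document},
    ]
    for _ in range(max_retries + 1):
        raw = call_llm(messages)
        try:
            # Validate structure *and* types, even when the JSON parses cleanly.
            return Invoice.model_validate(json.loads(raw))
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the malformed output back and ask the model to repair it.
            messages += [
                {"role": "assistant", "content": raw},
                {"role": "user", "content": f"That output was invalid ({err.__class__.__name__}). "
                                            "Respond again with only valid JSON."},
            ]
    raise RuntimeError("Could not obtain valid structured output")
```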
📋 Exam Tips for This Domain
  • Know the difference between JSON mode (valid JSON, but any structure) and schema-constrained generation (valid JSON conforming to a specific schema).
  • Retry logic for malformed output: include the malformed response in the retry prompt and ask the model to fix it — this is more reliable than a bare retry.
  • Type coercion danger: the model may return a number as a string ('42' instead of 42). Always validate types after parsing, even with structured output.

Domain 3: Agent Architectures & Reliability

~20% of exam

Key Concepts

  • Agent loop: observe-reason-act cycle
  • Tool/function selection and argument generation
  • Explicit state management vs context-window-only state
  • Iteration limits and timeout budgets
  • Multi-agent coordination patterns
  • Agent failure modes: hallucinated tool calls, infinite loops, goal abandonment
WORKED SCENARIO 3.1

Agent enters infinite loop on ambiguous goal

You deploy a research agent tasked with 'Find the most recent and comprehensive information about X.' The agent begins iterating — searching, reading, searching again with refined terms — and never terminates. After 47 tool calls and 12 minutes, it is still running. What failed and how do you prevent this?

Expert Analysis
  • The agent lacks a termination condition. 'Most recent and comprehensive' is inherently unsatisfiable — there is always more information to find. The agent is optimising for an unbounded goal.
  • What failed: (1) no iteration limit, (2) no time budget, (3) goal specification did not include a 'good enough' termination criterion.
  • Fixes: (1) hard iteration cap (e.g., max 10 tool calls), (2) wall-clock timeout (e.g., 2 minutes), (3) rephrase the goal with a concrete termination condition ('find the top 5 most recent sources from the last 30 days'), (4) add a meta-reasoning step where the agent assesses whether it has sufficient information to answer before continuing.
Key Lesson: Agent termination conditions must be explicitly defined — the model cannot reliably infer when 'enough' information has been gathered for an open-ended goal. Every agentic loop needs an iteration cap and a timeout budget.
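A minimal sketch of an agent loop with the three guardrails from the analysis above: an iteration cap, a wall-clock budget, and an explicit 'good enough' termination check. The helper functions are hypothetical placeholders for the LLM and tool calls.

```python
import time

MAX_ITERATIONS = 10       # hard cap on tool calls
TIME_BUDGET_S = 120       # wall-clock timeout in seconds

# Hypothetical helpers standing in for LLM and tool calls.
def plan_next_action(state): raise NotImplementedError
def run_tool(action): raise NotImplementedError
def has_enough_information(state) -> bool: raise NotImplementedError
def synthesise_answer(state) -> str: raise NotImplementedError
def summarise_best_effort(state) -> str: raise NotImplementedError

def run_agent(goal: str) -> str:
    start = time.monotonic()
    # Explicit state object rather than relying on the context window alone.
    state = {"goal": goal, "observations": [], "steps_taken": 0}
    while True:
        if state["steps_taken"] >= MAX_ITERATIONS:        # guardrail 1: iteration cap
            return summarise_best_effort(state)
        if time.monotonic() - start > TIME_BUDGET_S:       # guardrail 2: time budget
            return summarise_best_effort(state)
        if has_enough_information(state):                  # guardrail 3: termination check
            return synthesise_answer(state)
        action = plan_next_action(state)                   # LLM chooses the next tool call
        state["observations"].append(run_tool(action))
        state["steps_taken"] += 1
```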
📋 Exam Tips for This Domain
  • The most common agent failure modes: hallucinating tool calls (calling functions that do not exist), incorrect argument types, and infinite loops. Know all three.
  • Explicit state management (storing completed steps, remaining steps, intermediate results in a structured object) is more reliable than relying on the context window alone for multi-step tasks (see the sketch after these tips).
  • Multi-agent patterns: orchestrator-subagent (one agent coordinates others), peer-to-peer (agents collaborate), and parallel-then-synthesise (agents work simultaneously, results merged).
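A minimal sketch of the explicit-state idea from the second tip; the field names are illustrative rather than a prescribed format.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Explicit task state, serialised back into each prompt instead of
    trusting the raw conversation history to carry it."""
    goal: str
    completed_steps: list[str] = field(default_factory=list)
    remaining_steps: list[str] = field(default_factory=list)
    intermediate_results: dict[str, str] = field(default_factory=dict)

    def as_prompt_block(self) -> str:
        return (
            f"Goal: {self.goal}\n"
            f"Completed: {self.completed_steps}\n"
            f"Remaining: {self.remaining_steps}\n"
            f"Results so far: {self.intermediate_results}"
        )
```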

Domain 4: Evaluation Frameworks & Quality Measurement

~20% of exam

Key Concepts

  • BLEU, ROUGE-L: what they measure and when they fail
  • Human evaluation: gold standard with rubric design
  • LLM-as-Judge: strengths and known biases
  • Task-specific metrics: F1 for extraction, exact match for Q&A
  • Benchmark construction: representative, adversarial, regression sets
  • A/B testing quality signals in production
WORKED SCENARIO 4.1

BLEU score is high but human evaluators rate quality as poor

You have been using BLEU score to evaluate your document summarisation model. BLEU scores average 0.42 (high for the task). However, in a human evaluation study, 35% of summaries are rated as 'poor quality' by domain experts. How do you explain this discrepancy and fix your evaluation approach?

Expert Analysis
  • BLEU measures n-gram overlap between the generated summary and a reference summary. High BLEU means the generated text shares many word sequences with the reference — but does not measure: coherence, factual accuracy, completeness, or domain correctness.
  • The most likely explanation: your model is generating text that uses similar words to the reference but in a way that is semantically poor — perhaps extracting literal phrases without understanding context.
  • The fix: replace or supplement BLEU with: (1) human evaluation on a representative sample with a domain expert rubric, (2) LLM-as-Judge with explicit criteria (accuracy, coherence, completeness, conciseness), (3) task-specific metrics like factual consistency checking.
  • BLEU is a useful development-time proxy but is not a substitute for human evaluation in production quality assurance.
Key Lesson: Automated metrics like BLEU correlate poorly with human quality judgements for open-ended generation. They are useful for detecting regressions between versions but cannot replace human or LLM-as-Judge evaluation for absolute quality assessment.
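A minimal sketch of a pairwise LLM-as-Judge comparison that mitigates position bias by scoring both answer orders and averaging, assuming a hypothetical call_llm helper and an illustrative 1-10 rubric.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical provider call returning the completion text."""
    raise NotImplementedError

RUBRIC = "accuracy, coherence, completeness, conciseness"   # illustrative criteria

def judge_once(source: str, first: str, second: str) -> float:
    """Score the FIRST summary relative to the SECOND on a 1-10 scale."""
    prompt = (
        f"Source document:\n{source}\n\n"
        f"Summary A:\n{first}\n\nSummary B:\n{second}\n\n"
        f"Considering {RUBRIC}, rate Summary A relative to Summary B "
        "from 1 (much worse) to 10 (much better). Respond with only the number."
    )
    return float(call_llm(prompt).strip())

def judge(source: str, answer_a: str, answer_b: str) -> float:
    # Swap the answer order and average to cancel out position bias.
    score_ab = judge_once(source, answer_a, answer_b)
    score_ba = judge_once(source, answer_b, answer_a)
    return (score_ab + (11 - score_ba)) / 2   # map the swapped score back onto A's scale
```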
📋 Exam Tips for This Domain
  • BLEU: measures n-gram overlap. ROUGE-L: measures longest common subsequence. Both measure surface similarity, not semantic quality.
  • LLM-as-Judge biases: verbosity bias, position bias, self-enhancement bias. Mitigate by swapping answer order and averaging, using multiple judges, and calibrating against human judgements.
  • Benchmark sets should include: representative examples (typical user queries), adversarial examples (known failure modes), and regression examples (bugs that were fixed).

Domain 5: Context Management & Prompt Architecture

~15% of exam

Key Concepts

  • Context window: maximising usage without silent truncation
  • Rolling conversation history: summarisation strategies
  • Retrieval-augmented context vs full-document context
  • System prompt design: persona, constraints, output format, examples
  • Prompt compression techniques
  • Temperature, top-p, top-k: effects on output diversity
WORKED SCENARIO 5.1

Long conversation silently truncates critical instructions

A customer service bot works perfectly for the first 10 turns of a conversation. In turn 15, it starts ignoring its system prompt instructions (e.g., never offering refunds without manager approval). Investigation reveals the conversation history has grown to fill the entire context window, truncating the system prompt. What should have been done architecturally?

Expert Analysis
  • This is a context window overflow failure. The application was appending conversation turns without managing total token count, and eventually the system prompt was pushed out of the context window by the growing conversation history.
  • What should have been done: (1) implement a rolling context budget that reserves a fixed allocation for the system prompt (e.g., 1000 tokens always reserved at the start), (2) summarise conversation history when it approaches the context limit — replace verbose turn-by-turn history with a compact summary of key facts and decisions, (3) monitor context window utilisation as a production metric.
  • Alternative: use a model with a longer context window, but this is more expensive and does not solve the root cause of unmanaged context growth.
Key Lesson: Context windows fill up. Applications must actively manage context — reserving space for critical instructions, summarising history, and monitoring utilisation. Silent truncation of system prompts is one of the most dangerous production LLM failures.
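A minimal sketch of the rolling-budget approach from the analysis above, assuming hypothetical count_tokens and summarise_turns helpers and illustrative budget numbers. The system prompt gets a fixed reservation and is always included; older turns are compressed into a summary once the remaining budget is spent.

```python
CONTEXT_LIMIT = 8000          # model context window, illustrative
SYSTEM_RESERVE = 1000         # always reserved for the system prompt
RESPONSE_RESERVE = 1000       # headroom for the model's reply

def count_tokens(text: str) -> int:
    """Hypothetical tokenizer call (e.g. the provider's token-counting utility)."""
    raise NotImplementedError

def summarise_turns(turns: list[dict]) -> dict:
    """Hypothetical LLM call that compresses old turns into one summary message."""
    raise NotImplementedError

def build_messages(system_prompt: str, history: list[dict], user_msg: dict) -> list[dict]:
    budget = CONTEXT_LIMIT - SYSTEM_RESERVE - RESPONSE_RESERVE
    used = count_tokens(user_msg["content"])
    kept: list[dict] = []

    # Walk history newest-first, keeping turns until the budget is spent.
    for turn in reversed(history):
        used += count_tokens(turn["content"])
        if used > budget:
            older = history[: len(history) - len(kept)]
            kept.insert(0, summarise_turns(older))   # compress everything older
            break
        kept.insert(0, turn)

    # The system prompt is always included first, never truncated.
    return [{"role": "system", "content": system_prompt}, *kept, user_msg]
```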
📋 Exam Tips for This Domain
  • Temperature = 0 is deterministic (always highest-probability token). Temperature = 1 samples from the distribution. Above 1 is typically too random for production.
  • Top-p = 0.9 samples from the 90% probability mass — a good default for creative tasks. Top-p = 0.1 is very conservative (near-deterministic).
  • Prompt compression techniques (like LLMLingua) can reduce token count by 30-50% with minimal quality impact — useful when context is tight and cannot grow.

Ready to sit the examination?

You now have the conceptual foundation. Expect applied-reasoning questions — read each scenario and identify which technique or safeguard prevents the described failure.

Purchase Exam Access — $79 →