The professional standard for production AI deployment
CVAE · Specialist

Study Guide: Certified Voice AI Engineer

This guide covers all domains tested in the CVAE examination. Each domain includes key concepts, a worked scenario, and the reasoning approach examiners expect.


Exam at a glance

Questions
25 drawn from a 30-question bank
Pass mark
18 correct (72%)
Time limit
45 minutes
Retake limit
3 attempts per 24-hour window
Fee
$97
Credential
Digital certificate + registry listing

Domain 1: ASR/TTS Fundamentals & Provider Selection

~25% of exam

Key Concepts

  • Word Error Rate (WER): calculation, benchmarks, acceptable thresholds
  • ASR providers: Deepgram, OpenAI Whisper, Google STT, Azure Cognitive Speech
  • TTS providers: ElevenLabs (including Eleven Turbo), OpenAI TTS, Azure Neural TTS
  • Streaming TTS: sentence-boundary chunking for natural audio
  • Custom vocabulary / keyword boosting for domain terminology
  • WER benchmarks by domain: general (5%), call center (8-12%), medical (15%+)
WORKED SCENARIO 1.1

Voice AI bot consistently mishears medical drug names

A healthcare voice AI is being used by nurses to log patient medications. The ASR is mishearing drug names like 'lisinopril' as 'listen a pill' and 'metformin' as 'met for men'. What is the correct technical solution?

Expert Analysis
  • Drug names are low-frequency, domain-specific vocabulary that general ASR models underperform on. The base model has rarely seen these words in training and does not have strong priors for their correct phoneme sequences.
  • Solution 1 (quickest): Use the ASR provider's custom vocabulary or keyword boosting feature. Deepgram's custom vocabulary, for example, lets you specify words with phonetic hints and boosting weights. This is the fastest fix.
  • Solution 2 (robust): Fine-tune an ASR model on a dataset of medical audio with correct transcriptions of drug names. This is more powerful but requires labelled audio data.
  • Additional fix: add a post-processing step that uses a medical terminology dictionary to correct common ASR errors for known drug names — a deterministic fallback.
  • Do not try to fix this at the TTS or LLM layer — the problem is in the input transcription pipeline.
Key Lesson: Domain-specific vocabulary is the primary failure mode for production ASR. Custom vocabulary boosting is the fastest and most cost-effective first fix; fine-tuning is the most accurate but requires labelled data.
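The dictionary-based post-correction described above can be sketched with standard-library fuzzy matching. The drug list and the 0.6 cutoff are illustrative; a production system would load a full formulary (e.g. RxNorm) and tune the threshold against real error data.

```python
import difflib

# Illustrative dictionary; a real deployment would load a formulary such as RxNorm.
DRUG_NAMES = ["lisinopril", "metformin", "atorvastatin", "amoxicillin"]

def correct_drug_names(transcript: str, cutoff: float = 0.6) -> str:
    """Replace word spans that fuzzily match a known drug name."""
    words = transcript.split()
    out = []
    i = 0
    while i < len(words):
        replaced = False
        # Try spans of up to 3 words so 'listen a pill' can collapse to 'lisinopril'.
        for span in (3, 2, 1):
            candidate = "".join(words[i:i + span]).lower()
            match = difflib.get_close_matches(candidate, DRUG_NAMES, n=1, cutoff=cutoff)
            if match:
                out.append(match[0])
                i += span
                replaced = True
                break
        if not replaced:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Scanning multi-word spans greedily is what lets a mishearing spread across several ASR tokens be mapped back to a single drug name deterministically.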
📋 Exam Tips for This Domain
  • WER = (Substitutions + Insertions + Deletions) / Total reference words. Know this formula and what each error type means.
  • ElevenLabs is the standard for high-fidelity expressive TTS and voice cloning. OpenAI TTS is simpler and cheaper but less expressive. Know when to use each.
  • Streaming TTS: accumulate until sentence boundary (period, question mark, newline), then synthesise and stream that chunk. This balances perceived latency vs naturalness.
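The WER formula in the first tip is equivalent to word-level edit distance divided by the reference word count; a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (S + I + D) / N, via word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / len(ref)
```

For the Scenario 1.1 failure, `wer("take lisinopril at night", "take listen a pill at night")` gives 0.75: one substitution plus two insertions against a four-word reference.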

Domain 2: Voice Agent Architecture (Vapi, Bland, Retell)

~25% of exam

Key Concepts

  • Vapi: orchestration platform — telephony + ASR/TTS routing + LLM configuration
  • Bland AI: outbound call automation at scale — voicemail detection, retry logic
  • Retell AI: voice agent platform with built-in conversation state
  • Architecture layers: telephony → media server → ASR → LLM → TTS → telephony
  • Turn detection: VAD + semantic models
  • Barge-in (interruption handling): interrupt buffer, TTS cancellation
WORKED SCENARIO 2.1

Voice agent fails to handle user interruptions gracefully

Your Vapi-based customer service agent plays a 15-second greeting. Users frequently interrupt mid-greeting with their query, but the agent ignores the interruption and continues playing the full greeting before responding. Users are frustrated. What needs to be changed?

Expert Analysis
  • This is a barge-in failure. The agent is not detecting user speech during its own output, or it is detecting it but not cancelling the TTS playback.
  • Check 1: Is barge-in enabled in the Vapi configuration? By default, some platforms disable barge-in to prevent accidental interruptions from background noise.
  • Check 2: What is the interrupt buffer threshold? If set too high, only sustained speech triggers barge-in. Reduce the minimum barge-in duration.
  • Check 3: Is the greeting too long? A 15-second greeting is inherently friction-heavy. Reduce to 5-8 seconds maximum. Users interrupt long greetings because the greeting itself is the problem.
  • Optimal barge-in: detect speech within 300ms of user speaking, cancel TTS within 100ms of detection, process the user input as the primary intent.
Key Lesson: Barge-in is not optional for production voice AI — it is a fundamental UX requirement. Long greetings that cannot be interrupted are a design failure, not just a technical configuration issue.
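The interrupt-buffer behaviour in Check 2 can be sketched as a small state machine. The frame size and threshold here are illustrative values, not Vapi defaults:

```python
from dataclasses import dataclass

FRAME_MS = 20          # illustrative 20ms audio frames
MIN_BARGE_IN_MS = 300  # sustained speech required before interrupting

@dataclass
class BargeInDetector:
    """Decides when detected user speech should cancel in-progress TTS."""
    speech_ms: int = 0
    tts_playing: bool = True

    def on_vad_frame(self, is_speech: bool) -> bool:
        """Feed one VAD frame; returns True the moment TTS should be cancelled."""
        if not self.tts_playing:
            return False
        # Reset the interrupt buffer on silence so noise blips don't accumulate.
        self.speech_ms = self.speech_ms + FRAME_MS if is_speech else 0
        if self.speech_ms >= MIN_BARGE_IN_MS:
            self.tts_playing = False  # cancel playback, route audio to ASR
            return True
        return False
```

Lowering `MIN_BARGE_IN_MS` makes the agent more interruptible but more vulnerable to background noise; raising it produces exactly the "agent ignores me" failure in the scenario.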
WORKED SCENARIO 2.2

Designing outbound appointment reminder campaign with Bland AI

A dental clinic wants to use Bland AI to call patients 48 hours before appointments. Design the architecture for: voicemail detection, successful connection handling, and rescheduling requests.

Expert Analysis
  • Voicemail detection: Bland AI has built-in AMD (Answering Machine Detection). Configure the AMD sensitivity and the voicemail message separately from the live-call script. Voicemail should leave a brief reminder; live calls should engage interactively.
  • Successful connection: the agent should greet, state the purpose, confirm the appointment, and offer rescheduling if needed. Keep it under 3 exchanges for a reminder call.
  • Rescheduling: use function calling to check calendar availability and book a new slot. The agent should confirm the new time before hanging up.
  • Retry logic: if no answer, retry up to 2 times with 2-hour intervals. After 3 attempts, flag for human follow-up in the CRM.
  • Compliance: configure explicit call-recording disclosure per TCPA requirements. Obtain consent before calling.
Key Lesson: Outbound voice AI requires more careful workflow design than inbound — voicemail handling, retry logic, and compliance disclosures are as important as the conversation quality itself.
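The retry policy above reduces to a small scheduling function. Names here are hypothetical; in practice Bland AI's own retry configuration would enforce this, but the decision logic is worth being able to state precisely:

```python
from datetime import datetime, timedelta

MAX_ATTEMPTS = 3                      # initial call + 2 retries
RETRY_INTERVAL = timedelta(hours=2)

def next_action(attempts_made: int, last_attempt: datetime):
    """Decide what to do after an unanswered reminder call.

    Returns ('retry', when) while attempts remain, else ('flag_for_human', None)
    so the CRM can queue manual follow-up.
    """
    if attempts_made < MAX_ATTEMPTS:
        return "retry", last_attempt + RETRY_INTERVAL
    return "flag_for_human", None
```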
📋 Exam Tips for This Domain
  • Vapi handles orchestration: telephony, audio routing, ASR/TTS provider selection. The developer configures the LLM system prompt and tools.
  • Bland AI is optimised for outbound calling at scale — know its key differentiators: voicemail detection, call scheduling, retry logic, CRM integration.
  • Turn detection uses VAD (Voice Activity Detection) for speech/silence classification, plus semantic models that assess whether the user's sentence feels complete.
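A naive version of the VAD-plus-semantic turn detection in the last tip can be sketched as follows. The silence threshold and the "looks finished" heuristic are illustrative stand-ins for a trained semantic endpoint model:

```python
ENDPOINT_SILENCE_MS = 700  # illustrative silence needed to close a turn

def turn_complete(silence_ms: int, partial_transcript: str) -> bool:
    """Combine VAD silence duration with a crude completeness heuristic.

    A real semantic endpoint model would score sentence completeness; here we
    just refuse to end the turn on obvious trailing connectors and fillers.
    """
    looks_finished = not partial_transcript.rstrip().endswith(
        ("and", "but", "so", "um", "uh", ","))
    return silence_ms >= ENDPOINT_SILENCE_MS and looks_finished
```

The point of the semantic check is that a user who pauses after "I'd like to book a cleaning and..." has not finished their turn, even though the VAD sees silence.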

Domain 3: Latency Optimisation & Real-Time Architecture

~20% of exam

Key Concepts

  • End-to-end latency budget: ASR + LLM + TTS + network
  • Time-to-first-audio-byte (TTFAB): the perceived latency metric
  • Streaming pipeline: concurrent ASR decoding, LLM generation, TTS synthesis
  • Codec selection: Opus for WebRTC, G.711 for PSTN
  • Audio buffering strategies
  • Network topology: edge deployment for reduced RTT
WORKED SCENARIO 3.1

Voice agent latency is 4.2 seconds — optimise to under 1.5 seconds

Your voice AI agent has an end-to-end latency (user stops speaking → agent starts speaking) of 4.2 seconds. Users report it feels unnatural. Break down the latency budget and design an optimisation plan to reach under 1.5 seconds.

Expert Analysis
  • Measure each component: typical breakdown for a 4.2s total: ASR (0.8s), LLM first token (2.0s), TTS first audio (1.0s), network round trips (0.4s).
  • ASR optimisation: switch to streaming ASR (Deepgram Nova-2 streaming), which transcribes while the user is still speaking, so the final transcript is available within 200-300ms of detected end of speech rather than after a full-utterance pass. Target: 0.3-0.5s.
  • LLM optimisation: first-token latency is often the largest component. Switch to a faster model tier (GPT-4o-mini vs GPT-4o reduces first-token from ~600ms to ~200ms). Reduce system prompt length. Target: 0.6-0.8s.
  • TTS optimisation: switch to streaming TTS (ElevenLabs Turbo v2 or OpenAI TTS streaming). Start audio playback after the first sentence is synthesised, not after the full response. Target: 0.2-0.4s.
  • With these changes: 0.4 + 0.7 + 0.3 + 0.1 = 1.5s total — at the target.
Key Lesson: Voice AI latency is a pipeline problem. Each component adds to the end-to-end total, and streaming at each stage is the primary optimisation lever. Streaming ASR, streaming LLM tokens, and streaming TTS are independent wins that compound.
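The streaming-TTS lever above depends on sentence-boundary chunking: a generator that sits between the LLM token stream and the TTS engine. A minimal sketch (a production version would also handle abbreviations like "Dr." that a naive boundary check splits incorrectly):

```python
SENTENCE_ENDINGS = (".", "?", "!", "\n")

def sentence_chunks(token_stream):
    """Accumulate streamed LLM tokens and yield a chunk at each sentence
    boundary, so TTS synthesis can start before the full reply exists."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip(" ").endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()
```

Each yielded chunk goes straight to TTS, so the first audio byte depends only on the LLM's first sentence, not its full response length.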
📋 Exam Tips for This Domain
  • TTFAB (Time-to-First-Audio-Byte) is more important than total generation latency — users perceive the wait from their last word to the first audio byte, not from their last word to the last audio byte.
  • The target latency for natural-feeling voice AI: under 1.5s TTFAB. Under 1s is excellent. Over 2s feels unnatural. Over 3s is commercially unacceptable for most use cases.
  • Opus is the codec for WebRTC/WebSocket voice AI. G.711 µ-law/A-law is the standard for PSTN (phone network) audio. Know both.

Domain 4: Telephony Integration & Compliance

~15% of exam

Key Concepts

  • Twilio TwiML: media stream setup for WebSocket audio
  • SIP: Session Initiation Protocol for enterprise telephony
  • DTMF: keypad input as ASR fallback
  • Call recording consent: TCPA (US), GDPR (EU)
  • Two-party consent states in the US
  • PCI-DSS considerations for voice payment handling
WORKED SCENARIO 4.1

Building a compliant outbound voice AI in multi-jurisdiction deployment

You are deploying an outbound voice AI for a financial services firm. Calls will go to US customers across all 50 states. The agent discusses account information. List the key compliance requirements and how to implement them technically.

Expert Analysis
  • Recording consent: the US is a patchwork of federal law (one-party consent under ECPA) and state law; all-party ('two-party') consent states include CA, CT, FL, IL, MD, MA, MI, MT, NV, NH, OR, PA, and WA. Implement a consent disclosure at the start of every call: 'This call may be recorded for quality and compliance purposes.' Continuing the call after the disclosure satisfies all-party consent requirements.
  • TCPA compliance: outbound calls to mobile phones require prior express written consent. Maintain a do-not-call list, honour opt-out requests within 30 days, and do not call before 8am or after 9pm local time.
  • Financial disclosure: any discussion of account balances, trades, or financial advice must comply with applicable financial regulations (FINRA, SEC). The voice AI must have disclaimers and must not provide investment advice.
  • Data handling: account numbers spoken on calls are PCI-sensitive. Do not log audio or transcripts containing full account numbers. Implement PII/PAN redaction in the transcript pipeline.
Key Lesson: Compliance for voice AI in regulated industries is multi-layered: federal law, state law, and industry-specific regulations all apply simultaneously. Technical implementation must include consent recording, call time enforcement, PII redaction, and a do-not-call list integration.
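The PAN-redaction step from the analysis above can be sketched as a regex filter applied before transcripts are logged. The pattern is a deliberately broad illustration (13-19 digit runs with optional separators), not a production-grade detector with Luhn validation:

```python
import re

# Candidate card/account numbers: 13-19 digits, optionally space- or dash-separated.
PAN_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")

def redact_pan(transcript: str) -> str:
    """Mask candidate account/card numbers before the transcript is stored."""
    return PAN_RE.sub("[REDACTED]", transcript)
```

Redaction must run in the pipeline before any persistence layer; audio containing the spoken number should be dropped or masked, not merely the transcript.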
📋 Exam Tips for This Domain
  • TCPA (US): prior express written consent for outbound marketing calls to mobile phones. Violations carry $500-$1,500 per call in penalties.
  • DTMF should always be available as a fallback to speech — critical for noisy environments, accessibility, and legacy IVR compatibility.
  • Warm transfer = routing call to human WITH context (transcript + AI summary). Cold transfer = routing with no context. Always implement warm transfer for escalation paths.
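The TCPA calling-window rule (no calls before 8am or after 9pm in the called party's local time) reduces to a small timezone check:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def within_tcpa_window(utc_now: datetime, callee_tz: str) -> bool:
    """True if a call placed now lands in the callee's 8am-9pm local window."""
    local = utc_now.astimezone(ZoneInfo(callee_tz))
    return 8 <= local.hour < 21
```

The key detail is that the check uses the callee's timezone, which must be inferred from the customer record or area code, never the server's clock.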

Domain 5: Voice AI Safety, Quality & Production Operations

~15% of exam

Key Concepts

  • Hallucination risk in voice AI: why it is higher-stakes than text
  • Confirmation patterns: explicit confirmation before consequential actions
  • Sentiment analysis: audio-level (paralinguistic) vs text-level
  • Speaker diarisation: multi-party call attribution
  • Acoustic echo cancellation (AEC): preventing self-feedback loops
  • Production runbook: degradation, fallback, escalation
WORKED SCENARIO 5.1

Voice agent books wrong appointment time due to ASR error

A voice AI appointment booking agent has been booking appointments at incorrect times — users say '3 PM' but ASR transcribes '3 AM' and the agent books without confirmation. Three patients have missed appointments. Implement a fix.

Expert Analysis
  • Root cause: the agent is executing consequential actions (bookings) based on ASR output without explicit confirmation. ASR errors on time/date are common and high-impact.
  • Fix 1 (immediate): add explicit read-back confirmation before booking. After the user states a time, the agent must say: 'I have 3 AM on Tuesday the 14th — is that correct?' and require a 'yes/correct/right' before booking.
  • Fix 2 (structural): implement an entity extraction step with validation. Extract time from the ASR transcript using a structured parser. If the extracted time is outside business hours (3 AM), flag it as likely an ASR error and ask the user to confirm or repeat.
  • Fix 3 (monitoring): track booking-time distribution. 3 AM appointments for a dental clinic are outliers that should trigger an alert.
  • Do not rely on ASR accuracy alone for consequential actions — always implement explicit confirmation as a standard pattern.
Key Lesson: In voice AI, consequential actions (bookings, orders, transfers, cancellations) must always have an explicit confirmation step. ASR is probabilistic — a confirmation loop is the safety net that catches transcription errors before they cause real harm.
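Fixes 1 and 2 above can be sketched together: validate the extracted time against business hours, and generate the read-back line the agent must get a 'yes' to before booking. Clinic hours and prompt wording are illustrative:

```python
BUSINESS_HOURS = range(8, 18)  # illustrative clinic hours, 8:00-17:59

def needs_reconfirmation(hour_24: int) -> bool:
    """Flag times outside business hours as likely ASR errors (e.g. 3 AM for 3 PM)."""
    return hour_24 not in BUSINESS_HOURS

def confirmation_prompt(hour_24: int, minute: int, day: str) -> str:
    """Read-back line requiring explicit user confirmation before booking."""
    suffix = "AM" if hour_24 < 12 else "PM"
    hour_12 = hour_24 % 12 or 12
    return f"I have {hour_12}:{minute:02d} {suffix} on {day} - is that correct?"
```

In the scenario, the 3 AM transcript fails the business-hours check, so the agent asks the user to repeat the time instead of silently booking the outlier.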
📋 Exam Tips for This Domain
  • Voice AI hallucinations are higher-stakes than text: users cannot re-read the output, conversational trust is higher, and consequential actions happen in real-time.
  • AEC (Acoustic Echo Cancellation) prevents the ASR from transcribing the agent's own TTS output as user input — it is required in any full-duplex system.
  • Sentiment analysis at the audio level (pitch, energy, rate) can detect caller frustration before the text content does — this is a leading indicator for escalation triggers.

Ready to sit the examination?

You now have the conceptual foundation. Past sittings have focused on scenario reasoning — identify the failure mode described, select the control that addresses root cause, and eliminate answers that treat symptoms.

Purchase Exam Access — $97 →