PSF Deep Dive · D2 Output Validation

Output Validation: From Schema Enforcement to Semantic Contracts

PSF-2 is the domain that separates systems that occasionally fail from systems that fail safely. Every model output is a hypothesis — PSF-2 is the test suite that rejects the ones that would break downstream systems, harm users, or expose your organisation to liability.

Why Output Validation Fails in Practice

Most teams add output validation after an incident. The model returns a hallucinated API call, a malformed JSON structure, or a response that passes to a downstream system and causes a cascade failure. The fix is always "we should have validated that." PSF-2 says: validate first, ship second.

The three failure modes we see most often, each mapping to one layer of the output contract below:

- Structural failures: the output is malformed (invalid JSON, a schema mismatch) and breaks parsers downstream.
- Semantic failures: the output is well-formed but wrong: hallucinated API calls, references to entities that don't exist, contradictions of the input context.
- Business failures: the output is well-formed and plausible but violates policy, and propagates into downstream systems where it causes cascade failures.

The Output Contract

An output contract is a machine-verifiable specification of what a valid model output looks like. It has three layers:

L1 — Structural
Is the output the right format? Valid JSON/YAML/Markdown? Does it match the declared schema?
Tools: Pydantic, JSON Schema, zod

L2 — Semantic
Is the output factually coherent? Are referenced entities real? Does the output contradict the input context?
Tools: LLM-as-judge, embedding similarity, fact-check APIs

L3 — Business
Does the output comply with business rules? Is the recommended action within policy? Are values in acceptable ranges?
Tools: Custom validators, rules engine, allow/deny lists
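
The sections below work through each layer; schematically, they compose into a single gate (the names and signatures here are illustrative, not a library API):

from typing import Any, Callable

Validator = Callable[[Any], None]  # each check raises ValueError on failure

def enforce_contract(
    parse: Callable[[str], Any],        # L1: structural (schema/format)
    semantic_checks: list[Validator],   # L2: coherence, entity existence
    business_checks: list[Validator],   # L3: policy, ranges, allow/deny
    raw: str,
) -> Any:
    obj = parse(raw)  # a structural failure stops everything else
    for check in semantic_checks + business_checks:
        check(obj)
    return obj  # only contract-clean outputs reach downstream systems

Running the cheap structural layer first gates the more expensive semantic checks, which often involve a second model call.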

Structural Validation: The Pydantic Pattern

For any system that expects structured output, define a Pydantic model and validate every model response against it. This is the minimum acceptable standard for PSF-2 compliance:

from enum import Enum

from pydantic import BaseModel, ValidationError, field_validator

class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"

class SentimentOutput(BaseModel):
    sentiment: Sentiment
    confidence: float
    rationale: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

    @field_validator("rationale")
    @classmethod
    def rationale_not_empty(cls, v: str) -> str:
        if len(v.strip()) < 10:
            raise ValueError("rationale too short to be meaningful")
        return v

# Wrap every model call
def classify_sentiment(text: str) -> SentimentOutput | None:
    raw = call_model(text)  # your model call
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(text, raw, str(e))  # your logging hook
        return None  # route to fallback

Confidence Thresholds

If your model can express uncertainty (logprobs, an explicit confidence field, or structured output), use it. Define three zones:

≥ 0.85      High confidence      Surface to user / take action
0.60–0.84   Medium confidence    Surface with caveat / log for review
< 0.60      Low confidence       Route to human review / do not act

The threshold calibration problem: your thresholds should be calibrated to your specific domain and model, not set arbitrarily. Measure the false positive and false negative rates at different thresholds on a representative sample of your production data. Revisit quarterly — model behaviour drifts.
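
As a sketch, reusing the SentimentOutput model above (the zone boundaries come from the table; the handler labels are illustrative):

HIGH_CONFIDENCE = 0.85  # calibrate on your own data, revisit quarterly
LOW_CONFIDENCE = 0.60

def route(output: SentimentOutput) -> str:
    # Map confidence zones to actions; callers dispatch on the label.
    if output.confidence >= HIGH_CONFIDENCE:
        return "act"              # surface to user / take action
    if output.confidence >= LOW_CONFIDENCE:
        return "act_with_caveat"  # surface with caveat / log for review
    return "human_review"         # do not act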

Framework-Specific Implementation

LangChain / LangGraph

LangChain's output parsers (PydanticOutputParser, StructuredOutputParser) are the primary D2 primitives. For LangGraph, add validation nodes between agent steps: an output validation node receives agent output and raises a NodeInterrupt on failure, as sketched below. Use LangSmith to log validation failures for analysis.
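
A minimal validation-node sketch, assuming the SentimentOutput model above and a dict-shaped graph state with raw_output and parsed keys (both assumptions, not LangGraph requirements):

from langchain_core.output_parsers import PydanticOutputParser
from langgraph.errors import NodeInterrupt

parser = PydanticOutputParser(pydantic_object=SentimentOutput)

def validation_node(state: dict) -> dict:
    # Sits between agent steps: nothing unvalidated reaches the next node.
    try:
        state["parsed"] = parser.parse(state["raw_output"])
    except Exception as e:
        # Halts the graph here; log to LangSmith and resume after review.
        raise NodeInterrupt(f"output validation failed: {e}")
    return state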

Pydantic AI

Pydantic AI has the strongest native D2 of any framework assessed. result_type enforces schema at the API boundary. output_validators allow semantic validation. The framework retries on validation failure by default — configure max_retries carefully in production to avoid runaway costs.
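
A sketch against pydantic-ai's Agent API, reusing the SentimentOutput model above (recent versions rename result_type and result_validator to output_type and output_validator; the model string and retry cap are illustrative):

from pydantic_ai import Agent, ModelRetry

agent = Agent(
    "openai:gpt-4o",              # illustrative model choice
    result_type=SentimentOutput,  # schema enforced at the API boundary
    retries=2,                    # cap retries to bound runaway costs
)

@agent.result_validator
def confidence_sanity_check(result: SentimentOutput) -> SentimentOutput:
    # Semantic check: push implausible outputs back to the model
    # instead of failing hard; each ModelRetry consumes one retry.
    if result.confidence >= 0.99 and len(result.rationale) < 30:
        raise ModelRetry("near-certain verdict needs a fuller rationale")
    return result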

AutoGen

AutoGen's native output validation is the weakest of the frameworks assessed. Implement a validate_response() wrapper around all agent calls, as sketched below. For structured output, add a dedicated ValidatorAgent that reviews responses before they are passed downstream. UserProxyAgent can be configured to reject responses that fail validation.
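
A framework-agnostic sketch of that validate_response() wrapper, reusing the SentimentOutput contract (the ValidatorAgent and UserProxyAgent wiring is omitted):

from pydantic import ValidationError

def validate_response(raw: str) -> SentimentOutput:
    # Gate every agent response before it is passed downstream.
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        # Reject rather than forward; the caller decides retry vs. fallback.
        raise ValueError(f"agent response rejected: {e}") from e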

Semantic Kernel

KernelFunction return types provide schema enforcement. For complex validation, implement an IFunctionInvocationFilter as a post-processing step. Azure AI Content Safety can be wired as a validation layer for content policy compliance.

Haystack

OutputAdapter components are the native D2 primitive. Chain a validation component after any generation component. The pipeline architecture makes it clean to add OutputValidator as a standard pipeline node that halts execution and routes to fallback on validation failure.
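
A Haystack 2.x custom-component sketch of that OutputValidator idea, reusing the SentimentOutput model above (the valid and invalid branch names are illustrative):

from haystack import component
from pydantic import ValidationError

@component
class OutputValidator:
    # Connect a generator's replies here; route the "invalid" branch
    # to your fallback path (e.g. via a ConditionalRouter).
    @component.output_types(valid=SentimentOutput, invalid=str)
    def run(self, replies: list[str]):
        try:
            return {"valid": SentimentOutput.model_validate_json(replies[0])}
        except ValidationError as e:
            return {"invalid": f"validation failed: {e}"}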

Related guides

- PSF D2: Output Validation canonical guide
- Guardrails tools comparison (NeMo, Guardrails AI, LLM Guard)
- PSF-compliant stack recipes — output validation tooling
- D1 Input Governance deep dive
- Framework comparison — D2 output validation scores