Output Validation: From Schema Enforcement to Semantic Contracts
PSF-2 is the domain that separates systems that occasionally fail from systems that fail safely. Every model output is a hypothesis — PSF-2 is the test suite that rejects the ones that would break downstream systems, harm users, or expose your organisation to liability.
Why Output Validation Fails in Practice
Most teams add output validation after an incident. The model returns a hallucinated API call, a malformed JSON structure, or a response that passes to a downstream system and causes a cascade failure. The fix is always "we should have validated that." PSF-2 says: validate first, ship second.
The three failure modes we see most often:
- Schema drift: The model was prompted with a JSON schema. It returns something close but not quite right — an extra field, a wrong type, a null where a string was expected. The consuming code crashes or silently uses wrong data.
- Semantic drift: The output is schema-valid but semantically wrong. A sentiment classifier returns "positive" for a complaint. A summariser returns a summary of the wrong document. Schema validation wouldn't catch this (see the sketch after this list).
- Confidence blindness: The model is uncertain but returns an answer anyway. Without a confidence mechanism, you have no way to route uncertain outputs to human review.
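Structural checks stop the first failure mode; the second needs a check on meaning. A minimal sketch of a semantic sanity check for the sentiment example, where COMPLAINT_MARKERS and the review routing are illustrative assumptions rather than a prescribed implementation:

```python
# Illustrative semantic sanity check: a schema-valid label can still be implausible.
# COMPLAINT_MARKERS is a hypothetical, deliberately crude heuristic; in practice this
# check might be a second model call or a rules engine.
COMPLAINT_MARKERS = ("refund", "broken", "never again", "disappointed")

def sentiment_is_plausible(text: str, label: str) -> bool:
    """Return False when a classification looks semantically wrong despite a valid schema."""
    looks_like_complaint = any(marker in text.lower() for marker in COMPLAINT_MARKERS)
    if looks_like_complaint and label == "positive":
        return False  # schema-valid but suspicious: route to human review
    return True
```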
The Output Contract
An output contract is a machine-verifiable specification of what a valid model output looks like. It has three layers: structural validation (the output parses and matches the expected schema), semantic validation (the output means what it should), and confidence handling (uncertain outputs are routed rather than trusted).
Structural Validation: The Pydantic Pattern
For any system that expects structured output, define a Pydantic model and validate every model response against it. This is the minimum acceptable standard for PSF-2 compliance:
```python
from enum import Enum

from pydantic import BaseModel, ValidationError, field_validator


class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"


class SentimentOutput(BaseModel):
    sentiment: Sentiment
    confidence: float
    rationale: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

    @field_validator("rationale")
    @classmethod
    def rationale_not_empty(cls, v: str) -> str:
        if len(v.strip()) < 10:
            raise ValueError("rationale too short to be meaningful")
        return v


# Wrap every model call so nothing unvalidated reaches downstream code
def classify_sentiment(text: str) -> SentimentOutput | None:
    raw = call_model(text)  # your model call
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(text, raw, str(e))  # your logging hook
        return None  # route to fallback
```

Confidence Thresholds
If your model can express uncertainty (logprobs, an explicit confidence field, or a confidence value in the structured output), use it. Define three zones:
- Accept: confidence is high enough to pass the output downstream automatically.
- Review: confidence is middling; route the output to human review.
- Reject: confidence is too low; discard the output and take the fallback path.
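A minimal routing sketch that reuses the SentimentOutput model above; the threshold values and zone names are illustrative assumptions to be tuned against your own evaluation data:

```python
# Illustrative thresholds: calibrate against held-out labelled data, not intuition.
ACCEPT_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.60

def route_by_confidence(result: SentimentOutput) -> str:
    """Route a schema-valid output into one of three zones by its confidence."""
    if result.confidence >= ACCEPT_THRESHOLD:
        return "accept"   # pass downstream automatically
    if result.confidence >= REVIEW_THRESHOLD:
        return "review"   # queue for human review
    return "reject"       # discard and take the fallback path
```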
Framework-Specific Implementation
In LangChain, Pydantic output parsers (PydanticOutputParser, StructuredOutputParser) are the primary D2 primitive. For LangGraph, add validation nodes between agent steps: an output validation node that receives agent output and raises a NodeInterrupt on failure. Use LangSmith to log validation failures for analysis.
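A sketch of both pieces, reusing SentimentOutput; the prompt wording, the state key names, and the commented-out chain wiring are assumptions rather than LangChain's prescribed setup:

```python
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langgraph.errors import NodeInterrupt
from pydantic import ValidationError

# LangChain: the parser enforces the schema and supplies format instructions to the prompt.
parser = PydanticOutputParser(pydantic_object=SentimentOutput)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Classify the sentiment of the user's text.\n{format_instructions}"),
    ("human", "{text}"),
]).partial(format_instructions=parser.get_format_instructions())
# chain = prompt | llm | parser  # `llm` is your chat model; the parser raises on invalid output

# LangGraph: validation node between agent steps; state keys are illustrative.
def validate_output_node(state: dict) -> dict:
    try:
        validated = SentimentOutput.model_validate_json(state["agent_output"])
    except ValidationError as e:
        raise NodeInterrupt(f"output failed validation: {e}")
    return {"validated_output": validated}
```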
Pydantic AI has the strongest native D2 of any framework assessed. result_type enforces schema at the API boundary. output_validators allow semantic validation. The framework retries on validation failure by default — configure max_retries carefully in production to avoid runaway costs.
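A minimal Pydantic AI sketch along those lines; the exact names (result_type, result_validator, retries) have shifted across Pydantic AI releases, so treat them as illustrative and check your installed version:

```python
from pydantic_ai import Agent, ModelRetry

agent = Agent(
    "openai:gpt-4o",              # illustrative model identifier
    result_type=SentimentOutput,  # schema enforced at the API boundary
    retries=2,                    # cap retries to avoid runaway costs
)

@agent.result_validator
def rationale_is_substantive(result: SentimentOutput) -> SentimentOutput:
    """Semantic check on top of the schema: force a retry when the rationale is thin."""
    if len(result.rationale.split()) < 3:
        raise ModelRetry("rationale is too thin; explain the classification")
    return result
```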
AutoGen's native output validation is the weakest of the frameworks assessed. Implement a validate_response() wrapper around all agent calls. For structured output, add a dedicated ValidatorAgent that reviews responses before they are passed downstream. A UserProxyAgent can be configured to reject responses that fail validation.
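A sketch of such a wrapper, reusing SentimentOutput; the generate_reply() usage and reply handling reflect classic AutoGen ConversableAgent behaviour and are assumptions about your setup rather than a fixed recipe:

```python
from pydantic import ValidationError

def validate_response(agent, messages: list[dict]) -> SentimentOutput | None:
    """Wrap an AutoGen agent call and reject replies that fail the output contract."""
    reply = agent.generate_reply(messages=messages)
    raw = reply if isinstance(reply, str) else (reply or {}).get("content", "")
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(messages, raw, str(e))  # your logging hook
        return None  # caller routes to fallback or a ValidatorAgent review pass
```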
In Semantic Kernel, KernelFunction return types provide schema enforcement. For complex validation, implement an IFunctionInvocationFilter as a post-processing step. Azure AI Content Safety can be wired in as a validation layer for content policy compliance.
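The filter interface is .NET-first; in Python, a plain post-processing wrapper around the kernel invocation gives a similar effect. A hedged sketch, assuming the invoked function returns a result that stringifies to the JSON for SentimentOutput:

```python
from pydantic import ValidationError

async def invoke_with_validation(kernel, function, **arguments) -> SentimentOutput | None:
    """Post-process a Semantic Kernel function result against the output contract."""
    result = await kernel.invoke(function, **arguments)
    raw = str(result)  # assumes the FunctionResult stringifies to the model's JSON output
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(arguments, raw, str(e))  # your logging hook
        return None  # route to fallback or a content-safety review step
```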
In Haystack, OutputAdapter components are the native D2 primitive. Chain a validation component after any generation component. The pipeline architecture makes it clean to add an OutputValidator as a standard pipeline node that halts execution and routes to fallback on validation failure.
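A sketch of such a node as a custom Haystack 2.x component, reusing SentimentOutput; the socket names (validated, failed) and the single-reply handling are illustrative assumptions:

```python
from haystack import component
from pydantic import ValidationError

@component
class OutputValidator:
    """Pipeline node that checks generated replies against the output contract."""

    @component.output_types(validated=dict, failed=str)
    def run(self, replies: list[str]):
        raw = replies[0] if replies else ""
        try:
            result = SentimentOutput.model_validate_json(raw)
            return {"validated": result.model_dump(), "failed": ""}
        except ValidationError as e:
            # Connect the `failed` socket to a fallback branch of the pipeline
            return {"validated": {}, "failed": str(e)}
```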