Output Validation: From Schema Enforcement to Semantic Contracts
PSF-2 is the domain that separates systems that occasionally fail from systems that fail safely. Every model output is a hypothesis — PSF-2 is the test suite that rejects the ones that would break downstream systems, harm users, or expose your organisation to liability.
Why Output Validation Fails in Practice
Most teams add output validation after an incident. The model returns a hallucinated API call, a malformed JSON structure, or a response that passes to a downstream system and causes a cascade failure. The fix is always "we should have validated that." PSF-2 says: validate first, ship second.
The three failure modes we see most often:
- Schema drift: The model was prompted with a JSON schema. It returns something close but not quite right — an extra field, a wrong type, a null where a string was expected. The consuming code crashes or silently uses wrong data.
- Semantic drift: The output is schema-valid but semantically wrong. A sentiment classifier returns "positive" for a complaint. A summariser returns a summary of the wrong document. Schema validation wouldn't catch this (see the sketch after this list).
- Confidence blindness: The model is uncertain but returns an answer anyway. Without a confidence mechanism, you have no way to route uncertain outputs to human review.
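Structural checks stop the first failure mode; the second needs a check on meaning. A minimal sketch of a semantic sanity check for the sentiment example, where COMPLAINT_MARKERS and the review routing are illustrative assumptions rather than a prescribed implementation:

```python
# Illustrative semantic sanity check: a schema-valid label can still be implausible.
# COMPLAINT_MARKERS is a hypothetical, deliberately crude heuristic; in practice this
# check might be a second model call or a rules engine.
COMPLAINT_MARKERS = ("refund", "broken", "never again", "disappointed")

def sentiment_is_plausible(text: str, label: str) -> bool:
    """Return False when a classification looks semantically wrong despite a valid schema."""
    looks_like_complaint = any(marker in text.lower() for marker in COMPLAINT_MARKERS)
    if looks_like_complaint and label == "positive":
        return False  # schema-valid but suspicious: route to human review
    return True
```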
The Output Contract
An output contract is a machine-verifiable specification of what a valid model output looks like. It has three layers: structural validation (the output parses and matches the expected schema), semantic validation (the output means what it should), and confidence handling (uncertain outputs are routed rather than trusted).
Structural Validation: The Pydantic Pattern
For any system that expects structured output, define a Pydantic model and validate every model response against it. This is the minimum acceptable standard for PSF-2 compliance:
```python
from enum import Enum

from pydantic import BaseModel, ValidationError, field_validator


class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"


class SentimentOutput(BaseModel):
    sentiment: Sentiment
    confidence: float
    rationale: str

    @field_validator("confidence")
    @classmethod
    def confidence_in_range(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        return v

    @field_validator("rationale")
    @classmethod
    def rationale_not_empty(cls, v: str) -> str:
        if len(v.strip()) < 10:
            raise ValueError("rationale too short to be meaningful")
        return v


# Wrap every model call so nothing unvalidated reaches downstream code
def classify_sentiment(text: str) -> SentimentOutput | None:
    raw = call_model(text)  # your model call
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(text, raw, str(e))  # your logging hook
        return None  # route to fallback
```

Confidence Thresholds
If your model can express uncertainty (logprobs, an explicit confidence field, or a confidence value in the structured output), use it. Define three zones:
- Accept: confidence is high enough to pass the output downstream automatically.
- Review: confidence is middling; route the output to human review.
- Reject: confidence is too low; discard the output and take the fallback path.
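A minimal routing sketch that reuses the SentimentOutput model above; the threshold values and zone names are illustrative assumptions to be tuned against your own evaluation data:

```python
# Illustrative thresholds: calibrate against held-out labelled data, not intuition.
ACCEPT_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.60

def route_by_confidence(result: SentimentOutput) -> str:
    """Route a schema-valid output into one of three zones by its confidence."""
    if result.confidence >= ACCEPT_THRESHOLD:
        return "accept"   # pass downstream automatically
    if result.confidence >= REVIEW_THRESHOLD:
        return "review"   # queue for human review
    return "reject"       # discard and take the fallback path
```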
Framework-Specific Implementation
In LangChain, Pydantic output parsers (PydanticOutputParser, StructuredOutputParser) are the primary D2 primitive. For LangGraph, add validation nodes between agent steps: an output validation node that receives agent output and raises a NodeInterrupt on failure. Use LangSmith to log validation failures for analysis.
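A sketch of both pieces, reusing SentimentOutput; the prompt wording, the state key names, and the commented-out chain wiring are assumptions rather than LangChain's prescribed setup:

```python
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langgraph.errors import NodeInterrupt
from pydantic import ValidationError

# LangChain: the parser enforces the schema and supplies format instructions to the prompt.
parser = PydanticOutputParser(pydantic_object=SentimentOutput)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Classify the sentiment of the user's text.\n{format_instructions}"),
    ("human", "{text}"),
]).partial(format_instructions=parser.get_format_instructions())
# chain = prompt | llm | parser  # `llm` is your chat model; the parser raises on invalid output

# LangGraph: validation node between agent steps; state keys are illustrative.
def validate_output_node(state: dict) -> dict:
    try:
        validated = SentimentOutput.model_validate_json(state["agent_output"])
    except ValidationError as e:
        raise NodeInterrupt(f"output failed validation: {e}")
    return {"validated_output": validated}
```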
Pydantic AI has the strongest native D2 of any framework assessed. result_type enforces schema at the API boundary. output_validators allow semantic validation. The framework retries on validation failure by default — configure max_retries carefully in production to avoid runaway costs.
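A minimal Pydantic AI sketch along those lines; the exact names (result_type, result_validator, retries) have shifted across Pydantic AI releases, so treat them as illustrative and check your installed version:

```python
from pydantic_ai import Agent, ModelRetry

agent = Agent(
    "openai:gpt-4o",              # illustrative model identifier
    result_type=SentimentOutput,  # schema enforced at the API boundary
    retries=2,                    # cap retries to avoid runaway costs
)

@agent.result_validator
def rationale_is_substantive(result: SentimentOutput) -> SentimentOutput:
    """Semantic check on top of the schema: force a retry when the rationale is thin."""
    if len(result.rationale.split()) < 3:
        raise ModelRetry("rationale is too thin; explain the classification")
    return result
```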
AutoGen's native output validation is the weakest of the frameworks assessed. Implement a validate_response() wrapper around all agent calls. For structured output, add a dedicated ValidatorAgent that reviews responses before they are passed downstream. A UserProxyAgent can be configured to reject responses that fail validation.
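A sketch of such a wrapper, reusing SentimentOutput; the generate_reply() usage and reply handling reflect classic AutoGen ConversableAgent behaviour and are assumptions about your setup rather than a fixed recipe:

```python
from pydantic import ValidationError

def validate_response(agent, messages: list[dict]) -> SentimentOutput | None:
    """Wrap an AutoGen agent call and reject replies that fail the output contract."""
    reply = agent.generate_reply(messages=messages)
    raw = reply if isinstance(reply, str) else (reply or {}).get("content", "")
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(messages, raw, str(e))  # your logging hook
        return None  # caller routes to fallback or a ValidatorAgent review pass
```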
In Semantic Kernel, KernelFunction return types provide schema enforcement. For complex validation, implement an IFunctionInvocationFilter as a post-processing step. Azure AI Content Safety can be wired in as a validation layer for content policy compliance.
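The filter interface is .NET-first; in Python, a plain post-processing wrapper around the kernel invocation gives a similar effect. A hedged sketch, assuming the invoked function returns a result that stringifies to the JSON for SentimentOutput:

```python
from pydantic import ValidationError

async def invoke_with_validation(kernel, function, **arguments) -> SentimentOutput | None:
    """Post-process a Semantic Kernel function result against the output contract."""
    result = await kernel.invoke(function, **arguments)
    raw = str(result)  # assumes the FunctionResult stringifies to the model's JSON output
    try:
        return SentimentOutput.model_validate_json(raw)
    except ValidationError as e:
        log_validation_failure(arguments, raw, str(e))  # your logging hook
        return None  # route to fallback or a content-safety review step
```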
In Haystack, OutputAdapter components are the native D2 primitive. Chain a validation component after any generation component. The pipeline architecture makes it clean to add an OutputValidator as a standard pipeline node that halts execution and routes to fallback on validation failure.
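A sketch of such a node as a custom Haystack 2.x component, reusing SentimentOutput; the socket names (validated, failed) and the single-reply handling are illustrative assumptions:

```python
from haystack import component
from pydantic import ValidationError

@component
class OutputValidator:
    """Pipeline node that checks generated replies against the output contract."""

    @component.output_types(validated=dict, failed=str)
    def run(self, replies: list[str]):
        raw = replies[0] if replies else ""
        try:
            result = SentimentOutput.model_validate_json(raw)
            return {"validated": result.model_dump(), "failed": ""}
        except ValidationError as e:
            # Connect the `failed` socket to a fallback branch of the pipeline
            return {"validated": {}, "failed": str(e)}
```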