PSF Domain 4 Deep Dive · Independent · April 2026
LangSmith vs Langfuse vs Arize Phoenix
Observability for Production AI
PSF Domain 4 requires trace-level visibility into every production AI system — prompts, completions, tool calls, latency, cost, and quality. This is the comparison question every production AI team faces: which observability platform should you use, and what does each one actually give you?
The short answer: All three satisfy PSF D4 core requirements. The choice comes down to data residency, self-hosting needs, framework affinity, and existing MLOps infrastructure.
Independence disclosure: PAI has no commercial relationship with LangChain Inc., Langfuse GmbH, or Arize AI. Assessment conducted independently against PSF v1.1 criteria. CC BY 4.0.
Why PSF D4 Is Not Optional
PSF Domain 4 (Observability) requires that every production AI system emit structured, queryable records of its behaviour. Not logs. Not console output. Traces — span-level records that let you answer the question "exactly what happened, in what order, with what inputs and outputs, at what cost, and why did it fail?" after the fact.
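To make that concrete, here is a minimal sketch of span-level tracing using the OpenTelemetry Python SDK, the trace model all three platforms either emit natively or can ingest. The attribute names are illustrative only, not any platform's schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Illustrative setup: print spans to the console instead of a real backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-app")

# One request = one trace; each step is a span nested under it.
with tracer.start_as_current_span("llm.request") as span:
    span.set_attribute("llm.model", "gpt-4o")  # hypothetical model name
    span.set_attribute("llm.prompt", "<sanitised prompt text>")
    with tracer.start_as_current_span("tool.call") as tool:
        tool.set_attribute("tool.name", "search")  # hypothetical tool
        tool.set_attribute("tool.result", "<tool output>")
    span.set_attribute("llm.completion", "<model response>")
    span.set_attribute("llm.usage.total_tokens", 512)
```

Each span records its own inputs, outputs, and timing, and the parent-child nesting preserves order, which is exactly what "what happened, in what order, and why did it fail" requires after the fact.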
Without this, production incidents are investigated by guessing. A user reports an incorrect output. You have no record of what the model was given, what the model responded, which tool was called, or what the tool returned. You cannot reproduce it. You cannot fix it confidently. You cannot prove to a client or regulator that the problem was isolated.
D4 compliance has a direct relationship with CPAP portfolio submissions — assessors look for evidence of structured observability. A deployment with no trace infrastructure is not PSF-compliant regardless of how well the other seven domains are addressed.
Side-by-Side Comparison
| Capability | What it covers | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|---|
| Trace granularity | Span-level visibility into prompts, completions, tool calls, and agent reasoning steps | Strong | Strong | Strong |
| Evaluation framework | Built-in tooling for scoring outputs, running evals, and tracking quality over time | Strong | Strong | Strong |
| Data residency control | Ability to keep trace data within a defined region or on your own infrastructure | Partial | Strong | Strong |
| Integration breadth | Coverage across frameworks: LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, custom Python | Strong | Partial | Partial |
| Self-hosting option | Ability to run the platform entirely on your own infrastructure without SaaS dependency | Gap | Strong | Strong |
| Cost & token tracking | Per-trace token usage, model cost attribution, and budget alerting | Strong | Strong | Partial |
| Alerting & monitoring | Automated alerts on latency, error rates, cost anomalies, and quality degradation | Partial | Partial | Strong |
| OpenTelemetry support | Native support for the OpenTelemetry trace standard, enabling vendor portability | Partial | Strong | Strong |
LangSmith is LangChain's purpose-built observability product. If you are using LangChain or LangGraph, LangSmith is the natural first choice — it integrates at the SDK level with zero configuration and provides trace granularity that other platforms cannot match for LangChain-based workflows.
STANDOUT CAPABILITIES
✓ Automatic tracing of every LangChain component — chain steps, LLM calls, retriever queries, tool invocations — with no manual instrumentation (see the sketch after this list)
✓ Dataset management and evaluation framework: run evals against a golden dataset and track quality regressions over time
✓ Prompt hub for version-controlled prompt management — a genuine production workflow feature
✓ Token usage and cost attribution per trace — essential for managing LLM spend in production
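A minimal sketch of what "zero configuration" looks like in practice, based on LangSmith's documented environment-variable setup and its @traceable decorator; the API key and function body are placeholders:

```python
import os

# With these set, LangChain / LangGraph components are traced automatically.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder

from langsmith import traceable

# Non-LangChain code can opt in explicitly with the @traceable decorator.
@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    # ... call your model and tools here (placeholder) ...
    return "answer"
```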
PRODUCTION CONCERNS
Data residency
LangSmith is SaaS-only (US-based). All trace data — including full prompt and completion text — leaves your infrastructure. For GDPR-regulated deployments or any deployment handling sensitive PII, this is a D3 risk, not just a D4 concern. LangSmith Enterprise offers single-tenant options, but self-hosting is not available.
Framework dependency
LangSmith's deepest integrations are with LangChain. If you are not using LangChain — or if you migrate away from it — your observability tooling must migrate too. The OpenTelemetry support has improved, but native LangChain experience remains the standout.
Alerting
LangSmith's alerting is less mature than Arize Phoenix's for production monitoring use cases. Custom alert rules and anomaly detection are available but less polished than in a dedicated monitoring platform.
PSF D4 VERDICT
LangSmith satisfies PSF D4 comprehensively for LangChain deployments. The data residency concern is a D3 consideration that teams must address separately. For teams outside LangChain or with strict data residency requirements, an alternative should be evaluated.
Langfuse is an open-source observability platform that can be self-hosted on your own infrastructure or used as a managed cloud service. It is framework-agnostic, with SDKs for Python and TypeScript, and native integrations for LangChain, LlamaIndex, OpenAI, and others. For teams where data residency is a hard requirement, Langfuse is the most credible path to PSF D4 compliance.
STANDOUT CAPABILITIES
✓ Full self-hosting: deploy Langfuse on your own Kubernetes cluster, AWS, or GCP. Trace data never leaves your infrastructure (see the sketch after this list)
✓ OpenTelemetry-native: traces are emitted as standard OTEL spans, enabling backend portability and integration with existing observability infrastructure
✓ Framework-agnostic: genuine support for LangChain, LangGraph, LlamaIndex, AutoGen, Semantic Kernel, and custom Python — not a secondary concern
✓ Evaluation and scoring: built-in LLM-based evaluators, human annotation workflows, and dataset management comparable to LangSmith
✓ Cost tracking across all major model providers — per-trace token attribution and cost roll-ups
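As a sketch of the self-hosted flow, assuming the Langfuse Python SDK's v2-style @observe decorator and its standard environment variables; the host and keys are placeholders pointing at your own deployment:

```python
import os

# Point the SDK at your own Langfuse deployment, not the SaaS endpoint.
os.environ["LANGFUSE_HOST"] = "https://langfuse.internal.example"  # placeholder
os.environ["LANGFUSE_PUBLIC_KEY"] = "<public-key>"                 # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "<secret-key>"                 # placeholder

from langfuse.decorators import observe

# Each decorated call becomes a trace; nested decorated calls become spans.
@observe()
def handle_request(user_input: str) -> str:
    # ... call your model and tools here (placeholder) ...
    return "response"
```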
PRODUCTION CONCERNS
Integration depth vs LangSmith
For pure LangChain deployments, LangSmith's auto-instrumentation captures more detail with less configuration than Langfuse. The gap has narrowed significantly in recent releases, but LangSmith's LangChain integration remains marginally richer.
Alerting
Like LangSmith's, Langfuse's alerting is less mature than Arize Phoenix's for production monitoring. Custom threshold alerts exist; sophisticated anomaly detection is limited.
Self-hosting overhead
Self-hosting Langfuse requires infrastructure management — database, Redis, and the application itself. For small teams, the managed cloud option removes this overhead while still providing data residency configuration options.
PSF D4 VERDICT
Langfuse is the recommended platform for teams with GDPR, HIPAA, or other data residency requirements that preclude SaaS-only tooling. It satisfies D4 comprehensively and addresses D3 concerns through its self-hosting model. The best all-round choice for teams not primarily using LangChain.
Arize Phoenix is the open-source observability product from Arize AI, a company with deep roots in ML observability (model drift detection, feature monitoring, dataset analysis). Phoenix brings that MLOps heritage to the LLM and agent space — making it the strongest choice for teams that are bridging traditional machine learning infrastructure and new agentic AI deployments.
STANDOUT CAPABILITIES
✓ Strongest alerting and monitoring capabilities of the three: anomaly detection, threshold-based alerts, and drift monitoring informed by Arize's MLOps product experience
✓ Self-hostable and open-source — same data residency benefits as Langfuse
✓ Native OpenTelemetry support — spans, metrics, and traces in standard OTEL format (see the sketch after this list)
✓ Evals framework with built-in evaluators for hallucination detection, relevance scoring, and toxicity classification
✓ Notebook-friendly: designed to work in Jupyter / Colab workflows as well as production environments
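A minimal sketch of Phoenix's OTEL-native setup, assuming the arize-phoenix and arize-phoenix-otel packages; the project name is a placeholder:

```python
import phoenix as px
from phoenix.otel import register

# Launch a local Phoenix instance (also deployable as a standalone server).
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(project_name="prod-agent")  # placeholder name

# From here, any OTEL-instrumented code (or an OpenInference instrumentor,
# e.g. the LangChain one) emits spans into Phoenix under that project.
```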
PRODUCTION CONCERNS
Cost tracking
Cost and token attribution is less central to Phoenix than to LangSmith or Langfuse. It is available but requires more configuration. For teams where LLM cost management is a primary concern, LangSmith or Langfuse are more polished.
LangChain integration depth
Like Langfuse, Phoenix's LangChain integration is good but does not match LangSmith's auto-instrumentation depth for LangChain-specific components.
Community size
Phoenix has a smaller community than LangSmith. Fewer Stack Overflow answers, fewer blog posts, less third-party documentation. Teams that rely on community resources for troubleshooting will find LangSmith easier.
PSF D4 VERDICT
Arize Phoenix satisfies D4 comprehensively with the strongest monitoring and alerting story of the three platforms. The right choice for teams with existing Arize MLOps infrastructure, teams that need sophisticated production alerting, or teams bridging ML and LLM observability into a single platform.
Which One Should You Choose?
| Your situation | Choose | Why |
|---|---|---|
| Using LangChain / LangGraph, no strict data residency requirements | LangSmith | Deepest auto-instrumentation, best eval framework for LangChain. Zero-configuration observability. |
| GDPR, HIPAA, or hard data residency: trace data must not leave your infrastructure | Langfuse | Full self-hosting on your own infrastructure. Open source. All trace data stays where you put it. |
| You already have Arize for ML model monitoring and want to extend to LLMs | Arize Phoenix | Single platform for ML and LLM observability. Alerting and drift detection carry over. |
| Framework-agnostic stack (AutoGen, Semantic Kernel, custom Python) | Langfuse or Arize Phoenix | Both provide strong OTEL-native integrations that work across frameworks without a LangChain dependency. |
| Production alerting and anomaly detection are a primary requirement | Arize Phoenix | Strongest alerting story; its MLOps heritage in production monitoring shows. |
| Small team, fast start, willing to use managed SaaS | LangSmith or Langfuse Cloud | Both have generous free tiers. LangSmith if LangChain-native; Langfuse Cloud if you want future self-hosting flexibility. |
| Enterprise with strict vendor assessment requirements | Langfuse (self-hosted) | Open source, self-hosted, OpenTelemetry-native. Passes most enterprise vendor reviews without SaaS data flow. |
PSF D4 Minimum Requirements
Regardless of which platform you choose, your D4 implementation must satisfy these requirements for PSF compliance:
D4 · Every production request generates a structured trace with a unique trace ID
D4 · Traces capture: input prompt (sanitised), model response, tool calls and their results, latency per span, and token usage (a minimal example record follows this list)
D4 · Traces are queryable by trace ID, time range, and at minimum one business-relevant attribute (user ID, session, workflow type)
D4 · Traces are retained for a minimum of 90 days (or longer if required by applicable regulation)
D4 · A process exists to retrieve and review a specific trace for incident investigation, and has been tested
D4 · Cost and token usage are tracked at the deployment level, with alerting on anomalous spend
D4 · Quality degradation (error rates, latency spikes) triggers an operational alert within 15 minutes
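As a shape check, here is a hypothetical minimal trace record that would satisfy the first two requirements above. Every field name is illustrative, not a PSF-mandated schema:

```python
# Hypothetical minimal trace record; all field names are illustrative.
trace_record = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "timestamp": "2026-04-02T14:31:07Z",
    "user_id": "u-1042",                       # business-relevant attribute
    "spans": [
        {
            "name": "llm.request",
            "latency_ms": 1840,
            "prompt": "<sanitised prompt>",     # PII removed before storage
            "completion": "<model response>",
            "usage": {"input_tokens": 412, "output_tokens": 100},
        },
        {
            "name": "tool.call",
            "latency_ms": 230,
            "tool": "search",
            "result": "<tool output>",
        },
    ],
}
```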
Can You Use Multiple Platforms?
Yes, and it is sometimes the right call. LangSmith for LangChain workflow development and quality evaluation, plus Langfuse self-hosted for production trace storage (data residency compliance) is a common and defensible architecture. Arize Phoenix for production monitoring and alerting, with Langfuse for the evaluation workflow, is another viable combination.
The risk with multiple platforms is trace fragmentation — an incident investigation that requires correlating traces across two systems adds friction. If you use multiple tools, ensure there is a single trace ID that links records across them. OpenTelemetry's propagation headers make this straightforward if both platforms are OTEL-native.
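One way to avoid fragmentation entirely is to fan the same spans out to both backends from a single TracerProvider, so every record shares one trace ID. A sketch assuming both platforms expose OTLP/HTTP ingest endpoints; the URLs below are placeholders for your own deployments:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Same spans, same trace IDs, two backends. Endpoints are placeholders for
# wherever your Langfuse and Phoenix deployments accept OTLP traffic.
for endpoint in (
    "https://langfuse.internal.example/api/public/otel/v1/traces",
    "https://phoenix.internal.example/v1/traces",
):
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint))
    )

trace.set_tracer_provider(provider)
```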