PSF Domain 4 Deep Dive · Independent · April 2026
LangSmith vs Langfuse vs Arize Phoenix
Observability for Production AI
PSF Domain 4 requires trace-level visibility into every production AI system — prompts, completions, tool calls, latency, cost, and quality. This is the comparison question every production AI team faces: which observability platform should you use, and what does each one actually give you?
The short answer: All three satisfy PSF D4 core requirements. The choice comes down to data residency, self-hosting needs, framework affinity, and existing MLOps infrastructure.
Independence disclosure: PAI has no commercial relationship with LangChain Inc., Langfuse GmbH, or Arize AI. Assessment conducted independently against PSF v1.1 criteria. CC BY 4.0.
Why PSF D4 Is Not Optional
PSF Domain 4 (Observability) requires that every production AI system emit structured, queryable records of its behaviour. Not logs. Not console output. Traces — span-level records that let you answer the question "exactly what happened, in what order, with what inputs and outputs, at what cost, and why did it fail?" after the fact.
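To make that concrete, here is a minimal sketch of span-level tracing using the OpenTelemetry Python SDK, the trace model all three platforms either emit natively or can ingest. The attribute names are illustrative only, not any platform's schema:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Illustrative setup: print spans to the console instead of a real backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-app")

# One request = one trace; each step is a span nested under it.
with tracer.start_as_current_span("llm.request") as span:
    span.set_attribute("llm.model", "gpt-4o")  # hypothetical model name
    span.set_attribute("llm.prompt", "<sanitised prompt text>")
    with tracer.start_as_current_span("tool.call") as tool:
        tool.set_attribute("tool.name", "search")  # hypothetical tool
        tool.set_attribute("tool.result", "<tool output>")
    span.set_attribute("llm.completion", "<model response>")
    span.set_attribute("llm.usage.total_tokens", 512)
```

Each span records its own inputs, outputs, and timing, and the parent-child nesting preserves order, which is exactly what "what happened, in what order, and why did it fail" requires after the fact.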
Without this, production incidents are investigated by guessing. A user reports an incorrect output. You have no record of what the model was given, what the model responded, which tool was called, or what the tool returned. You cannot reproduce it. You cannot fix it confidently. You cannot prove to a client or regulator that the problem was isolated.
D4 compliance has a direct relationship with CPAP portfolio submissions — assessors look for evidence of structured observability. A deployment with no trace infrastructure is not PSF-compliant regardless of how well the other seven domains are addressed.
Side-by-Side Comparison
| Capability | What it covers | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|---|
| Trace granularity | Span-level visibility into prompts, completions, tool calls, and agent reasoning steps | Strong | Strong | Strong |
| Evaluation framework | Built-in tooling for scoring outputs, running evals, and tracking quality over time | Strong | Strong | Strong |
| Data residency control | Ability to keep trace data within a defined region or on your own infrastructure | Partial | Strong | Strong |
| Integration breadth | Coverage across frameworks: LangChain, LangGraph, CrewAI, AutoGen, Semantic Kernel, custom Python | Strong | Partial | Partial |
| Self-hosting option | Ability to run the platform entirely on your own infrastructure without SaaS dependency | Gap | Strong | Strong |
| Cost & token tracking | Per-trace token usage, model cost attribution, and budget alerting | Strong | Strong | Partial |
| Alerting & monitoring | Automated alerts on latency, error rates, cost anomalies, and quality degradation | Partial | Partial | Strong |
| OpenTelemetry support | Native support for the OpenTelemetry trace standard, enabling vendor portability | Partial | Strong | Strong |
LangSmith is LangChain's purpose-built observability product. If you are using LangChain or LangGraph, LangSmith is the natural first choice — it integrates at the SDK level with zero configuration and provides trace granularity that other platforms cannot match for LangChain-based workflows.
STANDOUT CAPABILITIES
✓ Automatic tracing of every LangChain component — chain steps, LLM calls, retriever queries, tool invocations — with no manual instrumentation (see the sketch after this list)
✓ Dataset management and evaluation framework: run evals against a golden dataset and track quality regressions over time
✓ Prompt hub for version-controlled prompt management — a genuine production workflow feature
✓ Token usage and cost attribution per trace — essential for managing LLM spend in production
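A minimal sketch of what "zero configuration" looks like in practice, based on LangSmith's documented environment-variable setup and its @traceable decorator; the API key and function body are placeholders:

```python
import os

# With these set, LangChain / LangGraph components are traced automatically.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder

from langsmith import traceable

# Non-LangChain code can opt in explicitly with the @traceable decorator.
@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    # ... call your model and tools here (placeholder) ...
    return "answer"
```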
PRODUCTION CONCERNS
Data residency
LangSmith is SaaS-only (US-based). All trace data — including full prompt and completion text — leaves your infrastructure. For GDPR-regulated deployments or any deployment handling sensitive PII, this is a D3 risk, not just a D4 concern. LangSmith Enterprise offers single-tenant options, but self-hosting is not available.
Framework dependency
LangSmith's deepest integrations are with LangChain. If you are not using LangChain — or if you migrate away from it — your observability tooling must migrate too. The OpenTelemetry support has improved, but native LangChain experience remains the standout.
Alerting
LangSmith's alerting is less mature than Arize Phoenix's for production monitoring use cases. Custom alert rules and anomaly detection are available but less polished than in a dedicated monitoring platform.
PSF D4 VERDICT
LangSmith satisfies PSF D4 comprehensively for LangChain deployments. The data residency concern is a D3 consideration that teams must address separately. For teams outside LangChain or with strict data residency requirements, an alternative should be evaluated.
Langfuse is an open-source observability platform that can be self-hosted on your own infrastructure or used as a managed cloud service. It is framework-agnostic, with SDKs for Python and TypeScript, and native integrations for LangChain, LlamaIndex, OpenAI, and others. For teams where data residency is a hard requirement, Langfuse is the most credible path to PSF D4 compliance.
STANDOUT CAPABILITIES
✓ Full self-hosting: deploy Langfuse on your own Kubernetes cluster, AWS, or GCP. Trace data never leaves your infrastructure (see the sketch after this list)
✓ OpenTelemetry-native: traces are emitted as standard OTEL spans, enabling backend portability and integration with existing observability infrastructure
✓ Framework-agnostic: genuine support for LangChain, LangGraph, LlamaIndex, AutoGen, Semantic Kernel, and custom Python — not a secondary concern
✓ Evaluation and scoring: built-in LLM-based evaluators, human annotation workflows, and dataset management comparable to LangSmith
✓ Cost tracking across all major model providers — per-trace token attribution and cost roll-ups
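As a sketch of the self-hosted flow, assuming the Langfuse Python SDK's v2-style @observe decorator and its standard environment variables; the host and keys are placeholders pointing at your own deployment:

```python
import os

# Point the SDK at your own Langfuse deployment, not the SaaS endpoint.
os.environ["LANGFUSE_HOST"] = "https://langfuse.internal.example"  # placeholder
os.environ["LANGFUSE_PUBLIC_KEY"] = "<public-key>"                 # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "<secret-key>"                 # placeholder

from langfuse.decorators import observe

# Each decorated call becomes a trace; nested decorated calls become spans.
@observe()
def handle_request(user_input: str) -> str:
    # ... call your model and tools here (placeholder) ...
    return "response"
```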
PRODUCTION CONCERNS
Integration depth vs LangSmith
For pure LangChain deployments, LangSmith's auto-instrumentation captures more detail with less configuration than Langfuse. The gap has narrowed significantly in recent releases, but LangSmith's LangChain integration remains marginally richer.
Alerting
Like LangSmith's, Langfuse's alerting is less mature than Arize Phoenix's for production monitoring. Custom threshold alerts exist; sophisticated anomaly detection is limited.
Self-hosting overhead
Self-hosting Langfuse requires infrastructure management — database, Redis, and the application itself. For small teams, the managed cloud option removes this overhead while still providing data residency configuration options.
PSF D4 VERDICT
Langfuse is the recommended platform for teams with GDPR, HIPAA, or other data residency requirements that preclude SaaS-only tooling. It satisfies D4 comprehensively and addresses D3 concerns through its self-hosting model. The best all-round choice for teams not primarily using LangChain.
Arize Phoenix is the open-source observability product from Arize AI, a company with deep roots in ML observability (model drift detection, feature monitoring, dataset analysis). Phoenix brings that MLOps heritage to the LLM and agent space — making it the strongest choice for teams that are bridging traditional machine learning infrastructure and new agentic AI deployments.
STANDOUT CAPABILITIES
✓ Strongest alerting and monitoring capabilities of the three: anomaly detection, threshold-based alerts, and drift monitoring informed by Arize's MLOps product experience
✓ Self-hostable and open-source — same data residency benefits as Langfuse
✓ Native OpenTelemetry support — spans, metrics, and traces in standard OTEL format (see the sketch after this list)
✓ Evals framework with built-in evaluators for hallucination detection, relevance scoring, and toxicity classification
✓ Notebook-friendly: designed to work in Jupyter / Colab workflows as well as production environments
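A minimal sketch of Phoenix's OTEL-native setup, assuming the arize-phoenix and arize-phoenix-otel packages; the project name is a placeholder:

```python
import phoenix as px
from phoenix.otel import register

# Launch a local Phoenix instance (also deployable as a standalone server).
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
tracer_provider = register(project_name="prod-agent")  # placeholder name

# From here, any OTEL-instrumented code (or an OpenInference instrumentor,
# e.g. the LangChain one) emits spans into Phoenix under that project.
```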
PRODUCTION CONCERNS
Cost tracking
Cost and token attribution is less central to Phoenix than to LangSmith or Langfuse. It is available but requires more configuration. For teams where LLM cost management is a primary concern, LangSmith or Langfuse are more polished.
LangChain integration depth
Like Langfuse, Phoenix's LangChain integration is good but does not match LangSmith's auto-instrumentation depth for LangChain-specific components.
Community size
Phoenix has a smaller community than LangSmith. Fewer Stack Overflow answers, fewer blog posts, less third-party documentation. Teams that rely on community resources for troubleshooting will find LangSmith easier.
PSF D4 VERDICT
Arize Phoenix satisfies D4 comprehensively with the strongest monitoring and alerting story of the three platforms. The right choice for teams with existing Arize MLOps infrastructure, teams that need sophisticated production alerting, or teams bridging ML and LLM observability into a single platform.
Which One Should You Choose?
| Your situation | Choose | Why |
|---|---|---|
| Using LangChain / LangGraph, no strict data residency requirements | LangSmith | Deepest auto-instrumentation, best eval framework for LangChain. Zero-configuration observability. |
| GDPR, HIPAA, or hard data residency: trace data must not leave your infrastructure | Langfuse | Full self-hosting on your own infrastructure. Open source. All trace data stays where you put it. |
| You already have Arize for ML model monitoring and want to extend to LLMs | Arize Phoenix | Single platform for ML and LLM observability. Alerting and drift detection carry over. |
| Framework-agnostic stack (AutoGen, Semantic Kernel, custom Python) | Langfuse or Arize Phoenix | Both provide strong OTEL-native integrations that work across frameworks without a LangChain dependency. |
| Production alerting and anomaly detection are a primary requirement | Arize Phoenix | Strongest alerting story; its MLOps heritage in production monitoring shows. |
| Small team, fast start, willing to use managed SaaS | LangSmith or Langfuse Cloud | Both have generous free tiers. LangSmith if LangChain-native; Langfuse Cloud if you want future self-hosting flexibility. |
| Enterprise with strict vendor assessment requirements | Langfuse (self-hosted) | Open source, self-hosted, OpenTelemetry-native. Passes most enterprise vendor reviews without SaaS data flow. |
PSF D4 Minimum Requirements
Regardless of which platform you choose, your D4 implementation must satisfy these requirements for PSF compliance:
D4 · Every production request generates a structured trace with a unique trace ID
D4 · Traces capture: input prompt (sanitised), model response, tool calls and their results, latency per span, and token usage (a minimal example record follows this list)
D4 · Traces are queryable by trace ID, time range, and at minimum one business-relevant attribute (user ID, session, workflow type)
D4 · Traces are retained for a minimum of 90 days (or longer if required by applicable regulation)
D4 · A process exists to retrieve and review a specific trace for incident investigation, and has been tested
D4 · Cost and token usage are tracked at the deployment level, with alerting on anomalous spend
D4 · Quality degradation (error rates, latency spikes) triggers an operational alert within 15 minutes
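As a shape check, here is a hypothetical minimal trace record that would satisfy the first two requirements above. Every field name is illustrative, not a PSF-mandated schema:

```python
# Hypothetical minimal trace record; all field names are illustrative.
trace_record = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "timestamp": "2026-04-02T14:31:07Z",
    "user_id": "u-1042",                       # business-relevant attribute
    "spans": [
        {
            "name": "llm.request",
            "latency_ms": 1840,
            "prompt": "<sanitised prompt>",     # PII removed before storage
            "completion": "<model response>",
            "usage": {"input_tokens": 412, "output_tokens": 100},
        },
        {
            "name": "tool.call",
            "latency_ms": 230,
            "tool": "search",
            "result": "<tool output>",
        },
    ],
}
```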
Can You Use Multiple Platforms?
Yes, and it is sometimes the right call. LangSmith for LangChain workflow development and quality evaluation, plus Langfuse self-hosted for production trace storage (data residency compliance) is a common and defensible architecture. Arize Phoenix for production monitoring and alerting, with Langfuse for the evaluation workflow, is another viable combination.
The risk with multiple platforms is trace fragmentation — an incident investigation that requires correlating traces across two systems adds friction. If you use multiple tools, ensure there is a single trace ID that links records across them. OpenTelemetry's propagation headers make this straightforward if both platforms are OTEL-native.
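One way to avoid fragmentation entirely is to fan the same spans out to both backends from a single TracerProvider, so every record shares one trace ID. A sketch assuming both platforms expose OTLP/HTTP ingest endpoints; the URLs below are placeholders for your own deployments:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Same spans, same trace IDs, two backends. Endpoints are placeholders for
# wherever your Langfuse and Phoenix deployments accept OTLP traffic.
for endpoint in (
    "https://langfuse.internal.example/api/public/otel/v1/traces",
    "https://phoenix.internal.example/v1/traces",
):
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint))
    )

trace.set_tracer_provider(provider)
```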