Production AI Institute — vendor-neutral certification for AI practitioners

Verify a credential For organisations Nonprofits & NGOs Contact

PAI Lab

Active · Q2 2026

Next update due: August 2026

We test production AI systems
so you know exactly what you're deploying

The PAI Lab runs structured reliability tests on frontier AI models and agent frameworks against Production Safety Framework criteria. 113 test tasks across 6 domains. Quarterly scorecards. Fully open methodology. No vendor sponsorship.

View model scorecards →Read the methodology

113

Test tasks

Task domains

Models tested Q2 2026

PSF dimensions scored

Open

Methodology

Quarterly

Cadence

What PAI Lab measures: PSF compliance — production safety dimensions. Not general capability, speed, or cost. For latency, throughput, and quality benchmarks, see the external benchmark references below.

Scores last updated: May 2026

Q2 2026 key findings

Based on 4 models × 113 tasks × 5 runs each

⚠ Critical finding

Human oversight triggers are the most variable dimension

Across all models tested in Q2 2026, PSF-05 (Human Oversight) showed the widest variance — a 24-point spread between lowest and highest performer. Models that score well here tend to under-claim certainty; those that score poorly tend to complete tasks confidently rather than escalate.

Implication: Practitioners deploying AI in high-stakes decision contexts should run PSF-05 specific testing before deployment regardless of vendor claims.

Advisory

Self-hosted models outperform on data protection, underperform on observability

Llama 3.1 70B in a self-hosted configuration achieved the highest PSF-03 (Data Protection) score in the cohort — no data leaves the deployment environment. However, PSF-04 (Observability) scores were 12–18 points lower than API-hosted equivalents, reflecting the absence of vendor-provided logging infrastructure.

Implication: Data-sensitive deployments should consider self-hosted models, but must budget for observability tooling investment.

⚠ Critical finding

Prompt injection via code comments is consistently underguarded

In TDL-05 (Code Generation), all models in the current cohort allowed prompt injection through carefully crafted code comments in 3–7% of test cases. The attack vector is consistent and reproducible.

Implication: Any production system that processes externally-sourced code as LLM input requires an input sanitisation layer at the application level, not the model level.

Advisory

Context window stress reveals silent truncation in 2 of 4 models

In TDL-06 (Observability), two models silently truncated context under load without emitting warnings or degraded-mode signals. The other two either raised explicit errors or downgraded to a declared reduced-context mode.

Implication: Silent context truncation is a PSF-04 failure. Production systems should implement explicit context budget management at the application layer.

Tested Workflows

Production workflows we've run through the Lab

Each workflow below has been executed against the relevant PAI Lab test domain — real tasks, real inputs, real failure detection. The diagrams show the PSF-compliant architecture we tested. Click any to build or adapt it in Studio.

TDL-01 · Document Processing✓ Lab tested Q2 2026

Invoice Processing Pipeline

Email intake → document classification → field extraction → PO matching → conditional human gate. Tested across 24 invoice variants including edge cases (missing PO, partial match, duplicate detection).

PSF-01Input Governance

PSF-02Output Validation

PSF-05Human Oversight

Key finding: Field hallucination in 2.1% of edge-case invoices — caught by PSF-D2 output validation gate. Human escalation triggered correctly on all exception paths.

Build in Studio →

TDL-02 · High-Stakes Decision✓ Lab tested Q2 2026

Loan Assessment — Decision Support

AI provides risk score and rationale. Human underwriter makes every final decision — no autonomous approvals. Parallel explainability track generates audit trail for every assessment.

PSF-03Data Protection

PSF-05Human Oversight

PSF-04Observability

Advisory: Confidence miscalibration in 4.8% of borderline cases — model over-confident on inputs near the decision boundary. Observability logging missing on 11% of edge paths.

Build in Studio →

TDL-04 · Customer-Facing✓ Lab tested Q2 2026

Customer Support Triage + Auto-Resolution

Priority routing with P1 escalation to human agents. P2/P3 tickets auto-resolved. Tested with adversarial inputs including indirect guardrail bypass attempts and off-topic injections.

PSF-01Input Governance

PSF-02Output Validation

PSF-05Human Oversight

Critical: Guardrail bypass via indirect phrasing succeeded in 3.2% of adversarial runs. Mitigation: add semantic similarity check at input classification layer.

Build in Studio →

TDL-06 · Observability✓ Lab tested Q2 2026

API Failure Recovery + Graceful Degradation

Continuous health monitoring with automatic failover to secondary provider and degraded-mode fallback. Tests PSF-D4 (Observability) and PSF-D8 (Vendor Resilience) under simulated provider outage.

PSF-04Observability

PSF-08Vendor Resilience

PSF-06Deployment Safety

Advisory: Silent context truncation observed on 2 of 4 tested model providers under maximum context load. PSF-D4 requires explicit warning emission — add application-layer context budget management.

Build in Studio →

Adapt any of these workflows in PAI Studio →

Model Scorecards

Q2 2026 — PSF reliability index

Every model is tested on identical infrastructure with identical task suites. Scores reflect median performance across 5 runs per task. No model has been given advance access to the test library.

OpenAI

GPT-4.1

Tested Q2 2026

PSF Index / 100

Input GovernancePSF-01

Output ValidationPSF-02

Data ProtectionPSF-03

ObservabilityPSF-04

Human Oversight TriggersPSF-05

Deployment SafetyPSF-06

Security PosturePSF-07

Vendor ResiliencePSF-08

Lab note: Strong on structured output adherence. Notable gap: PII handling in summarisation tasks (PSF-03). Escalation trigger reliability above average.

Anthropic

Claude Sonnet 4.6

Tested Q2 2026

PSF Index / 100

Input GovernancePSF-01

Output ValidationPSF-02

Data ProtectionPSF-03

ObservabilityPSF-04

Human Oversight TriggersPSF-05

Deployment SafetyPSF-06

Security PosturePSF-07

Vendor ResiliencePSF-08

Lab note: Highest human oversight trigger accuracy in the current cohort. Observability logging incomplete under high-load simulation. Consistent refusal behaviour.

Google

Gemini 1.5 Pro

Tested Q2 2026

PSF Index / 100

Input GovernancePSF-01

Output ValidationPSF-02

Data ProtectionPSF-03

ObservabilityPSF-04

Human Oversight TriggersPSF-05

Deployment SafetyPSF-06

Security PosturePSF-07

Vendor ResiliencePSF-08

Lab note: Consistent mid-range performer. Weakest in security posture (PSF-07) — code generation tasks showed higher prompt injection susceptibility. Context window handling needs attention.

Meta (self-hosted)

Llama 3.1 70B

Tested Q2 2026

PSF Index / 100

Input GovernancePSF-01

Output ValidationPSF-02

Data ProtectionPSF-03

ObservabilityPSF-04

Human Oversight TriggersPSF-05

Deployment SafetyPSF-06

Security PosturePSF-07

Vendor ResiliencePSF-08

Lab note: Data protection (PSF-03) outperforms proprietary models in self-hosted configuration — no third-party data egress. Observability and security posture require significant investment at the deployment layer.

Scheduled — next quarter

Mistral

Mistral Large 2

Testing begins Q3 2026

Cohere

Command R+

Testing begins Q3 2026

External References

General capability & performance benchmarks

PAI Lab measures PSF compliance — production safety dimensions. For speed, cost, and general capability, these are the references we recommend. They measure different things and are complementary to our scores.

Artificial Analysis LLM Leaderboard

Continuous

Speed, cost, and quality benchmarks across frontier models. Best source for latency and throughput comparison.

Latency · Throughput · Cost per token · Quality index

Visit leaderboard →

LMSYS Chatbot Arena

Continuous

Human preference ELO rankings from millions of blind pairwise comparisons. Best for real-world conversational quality.

ELO ranking · Human preference · Conversational quality

Visit leaderboard →

Open LLM Leaderboard (HuggingFace)

Weekly

Academic benchmark suite (MMLU, HellaSwag, ARC, WinoGrande, GSM8K, HumanEval) for open-weight models.

MMLU · Reasoning · Coding · Math

Visit leaderboard →

Test Task Library

113 tasks across 6 production domains

Tasks are drawn from real production use cases. Each has deterministic ground truth — a correct answer we can score mechanically. The library is versioned and updated annually.

TDL-01

Document Processing

24 tasks

Extraction, classification, and summarisation tasks across contracts, invoices, clinical notes, and regulatory filings. Tests input governance, schema adherence, and output fidelity under document variation.

PSF coverage

PSF-01, PSF-02

Common failure types

Hallucinated field values
Schema drift under format variation
PII leakage in summarisation output

TDL-02

High-Stakes Decision Support

18 tasks

Loan assessment, triage classification, hiring screening, and risk scoring tasks. Tests output consistency, confidence calibration, and human escalation trigger reliability.

PSF coverage

PSF-03, PSF-05

Common failure types

Confidence miscalibration
Inconsistent outputs across equivalent inputs
Missing escalation triggers on edge cases

TDL-03

Multi-Agent Orchestration

21 tasks

Agent-to-agent handoff sequences across research, code generation, and workflow automation pipelines. Tests state propagation, error containment, and loop termination under adversarial inputs.

PSF coverage

PSF-04, PSF-06

Common failure types

Runaway agent loops
State corruption across handoffs
Silent task abandonment

TDL-04

Customer-Facing Interaction

16 tasks

Support triage, FAQ response, complaint handling, and guided process tasks. Tests guardrail reliability, off-topic containment, and graceful degradation under unusual user inputs.

PSF coverage

PSF-01, PSF-02, PSF-05

Common failure types

Guardrail bypass via indirect phrasing
Off-topic drift without containment
False confidence in ambiguous queries

TDL-05

Code Generation & Review

20 tasks

Production code generation, security review, and refactoring tasks across Python, TypeScript, and SQL. Tests output correctness, security awareness, and behaviour under adversarial prompts.

PSF coverage

PSF-02, PSF-07

Common failure types

Plausible but incorrect logic
Omitted security considerations
Prompt injection via code comments

TDL-06

Observability & Graceful Degradation

14 tasks

Load variation, API failure injection, and context window stress tests. Tests logging completeness, fallback activation, and behaviour under vendor outage simulation.

PSF coverage

PSF-04, PSF-08

Common failure types

Silent failures without log emission
No fallback on provider timeout
Context truncation without warning

Methodology

How we test

The full methodology is published openly. There is nothing proprietary about our test approach — the value is in the disciplined, consistent execution.

Task library construction

Each task domain contains between 14 and 24 test tasks drawn from real production use cases. Tasks have deterministic ground truth — we know what correct output looks like and can score mechanically.

PSF dimension mapping

Every task is tagged to one or more PSF dimensions. Scoring rubrics define what constitutes a pass, partial pass, or fail for each dimension per task type.

Controlled execution

Fixed temperature (0.2), fixed system prompts, no chain-of-thought scaffolding unless scaffolding is under test. Each task runs 5 times; we report median and flag variance.

Adversarial pass

After the standard pass, a subset of tasks is re-run with adversarial inputs — prompt injection attempts, boundary-edge inputs, malformed schemas, and unusual phrasing.

Scoring and aggregation

Per-task scores aggregate to per-domain scores, then to an overall PSF Reliability Index (0–100). Each domain counts equally. Methodology and raw data are published.

Open publication

Findings, methodology, task library structure, and scoring rubrics are published openly. No payment from vendors. All models assessed on identical infrastructure.

Independence guarantee

PAI Lab does not accept payment from vendors for assessments. Models are not given advance access to the task library. Test infrastructure is PAI-owned and not shared with any vendor. Scoring rubrics are published before testing begins. Findings are published regardless of outcome. We have no financial relationship with any model provider.

Framework Stress Tests

Live testing behind the ecosystem assessments

Every framework in our ecosystem assessment programme is stress-tested against TDL-03 (Multi-Agent Orchestration) under production-like conditions.

LangChain

71/100

Agent loop termination failures under adversarial inputs in 4.2% of runs

Full assessment →

LangGraph

76/100

Strong state management. Edge case: graph cycle detection missed in 2.1% of stress runs

Full assessment →

CrewAI

68/100

Role delegation reliability gaps under high-task-count orchestration

Full assessment →

AutoGen

73/100

Human-in-loop interrupt handling reliable. Memory persistence inconsistent across agents

Full assessment →

Composio

79/100

Best-in-cohort for tool call error handling. Logging coverage below average

Full assessment →

Cursor SDK

66/100

Designed for development context — production deployment safety requires additional instrumentation

Full assessment →

Pydantic AI

in_progress

Testing Q3 2026

DSPy

scheduled

Scheduled Q3 2026

Submit a task for the library

Have a production AI task that belongs in the test library? Tasks must have deterministic ground truth. We review submissions quarterly.

Submit a task proposal →

Request a private assessment

Organisations can commission a private Lab assessment of their specific deployment stack against PSF criteria.

Contact the Lab team →

Cite our findings

Lab findings are free to cite. Link to the specific scorecard and note the testing period — scores change across quarters.

Citation guidance →

Know what your AI stack actually does before you deploy it

Read the full methodology, review the Q2 2026 scorecards, and use the PSF self-assessment to benchmark your own deployment.

Run the PSF self-assessment →Back to Research hub

We test production AI systemsso you know exactly what you're deploying

Q2 2026 key findings

Production workflows we've run through the Lab

Invoice Processing Pipeline

Loan Assessment — Decision Support

Customer Support Triage + Auto-Resolution

API Failure Recovery + Graceful Degradation

Q2 2026 — PSF reliability index

General capability & performance benchmarks

Artificial Analysis LLM Leaderboard

LMSYS Chatbot Arena

Open LLM Leaderboard (HuggingFace)

113 tasks across 6 production domains

Document Processing

High-Stakes Decision Support

Multi-Agent Orchestration

Customer-Facing Interaction

Code Generation & Review

Observability & Graceful Degradation

How we test

Task library construction

PSF dimension mapping

Controlled execution

Adversarial pass

Scoring and aggregation

Open publication

Independence guarantee

Live testing behind the ecosystem assessments

LangChain

LangGraph

CrewAI

AutoGen

Composio

Cursor SDK

Pydantic AI

DSPy

Submit a task for the library

Request a private assessment

Cite our findings

Know what your AI stack actually does before you deploy it

We test production AI systems
so you know exactly what you're deploying