Production AI Institute — vendor-neutral certification for AI practitioners
Verify a credentialFor organisationsNonprofits & NGOsContact
PAI Lab
Active · Q2 2026
Next update due: August 2026

We test production AI systems
so you know exactly what you're deploying

The PAI Lab runs structured reliability tests on frontier AI models and agent frameworks against Production Safety Framework criteria. 113 test tasks across 6 domains. Quarterly scorecards. Fully open methodology. No vendor sponsorship.

View model scorecards →Read the methodology
113
Test tasks
6
Task domains
4
Models tested Q2 2026
8
PSF dimensions scored
Open
Methodology
Quarterly
Cadence

What PAI Lab measures: PSF compliance — production safety dimensions. Not general capability, speed, or cost. For latency, throughput, and quality benchmarks, see the external benchmark references below.

Scores last updated: May 2026

Q2 2026 key findings

Based on 4 models × 113 tasks × 5 runs each
⚠ Critical finding

Human oversight triggers are the most variable dimension

Across all models tested in Q2 2026, PSF-05 (Human Oversight) showed the widest variance — a 24-point spread between lowest and highest performer. Models that score well here tend to under-claim certainty; those that score poorly tend to complete tasks confidently rather than escalate.

Implication: Practitioners deploying AI in high-stakes decision contexts should run PSF-05 specific testing before deployment regardless of vendor claims.

Advisory

Self-hosted models outperform on data protection, underperform on observability

Llama 3.1 70B in a self-hosted configuration achieved the highest PSF-03 (Data Protection) score in the cohort — no data leaves the deployment environment. However, PSF-04 (Observability) scores were 12–18 points lower than API-hosted equivalents, reflecting the absence of vendor-provided logging infrastructure.

Implication: Data-sensitive deployments should consider self-hosted models, but must budget for observability tooling investment.

⚠ Critical finding

Prompt injection via code comments is consistently underguarded

In TDL-05 (Code Generation), all models in the current cohort allowed prompt injection through carefully crafted code comments in 3–7% of test cases. The attack vector is consistent and reproducible.

Implication: Any production system that processes externally-sourced code as LLM input requires an input sanitisation layer at the application level, not the model level.

Advisory

Context window stress reveals silent truncation in 2 of 4 models

In TDL-06 (Observability), two models silently truncated context under load without emitting warnings or degraded-mode signals. The other two either raised explicit errors or downgraded to a declared reduced-context mode.

Implication: Silent context truncation is a PSF-04 failure. Production systems should implement explicit context budget management at the application layer.

Tested Workflows

Production workflows we've run through the Lab

Each workflow below has been executed against the relevant PAI Lab test domain — real tasks, real inputs, real failure detection. The diagrams show the PSF-compliant architecture we tested. Click any to build or adapt it in Studio.

TDL-01 · Document Processing✓ Lab tested Q2 2026

Invoice Processing Pipeline

Email intake → document classification → field extraction → PO matching → conditional human gate. Tested across 24 invoice variants including edge cases (missing PO, partial match, duplicate detection).

PSF-01Input Governance
88
PSF-02Output Validation
83
PSF-05Human Oversight
91
TRIGGERemail_intakeOutlook / SMTP · on_receiveTRIGGER · asyncSKILLdoc_classifierPSF-D1 · schema guardinvoice vs. otherSKILLfield_extractorGPT-4.1 · structured outputInvoiceSchema · strictSKILLpo_matcherPSF-D2 · 3-way matchERP lookup · validatedHUMANfinance_gate>$5k or mismatch · 4h SLAHUMAN · exception onlyPAI Workflow Studio

Key finding: Field hallucination in 2.1% of edge-case invoices — caught by PSF-D2 output validation gate. Human escalation triggered correctly on all exception paths.

Build in Studio →
TDL-02 · High-Stakes Decision✓ Lab tested Q2 2026

Loan Assessment — Decision Support

AI provides risk score and rationale. Human underwriter makes every final decision — no autonomous approvals. Parallel explainability track generates audit trail for every assessment.

PSF-03Data Protection
79
PSF-05Human Oversight
96
PSF-04Observability
74
TRIGGERapplication_inportal · on_submitTRIGGER · on-demandSKILLeligibility_checkPSF-D1 · PII guardschema + redactionSKILLrisk_scorerPSF-D2 · confidence gate0.0–1.0 · calibratedSKILLexplainerrationale · audit trailPSF-D4 · loggedHUMANunderwriter_reviewMANDATORY · all decisionsHUMAN · requiredPAI Workflow Studio

Advisory: Confidence miscalibration in 4.8% of borderline cases — model over-confident on inputs near the decision boundary. Observability logging missing on 11% of edge paths.

Build in Studio →
TDL-04 · Customer-Facing✓ Lab tested Q2 2026

Customer Support Triage + Auto-Resolution

Priority routing with P1 escalation to human agents. P2/P3 tickets auto-resolved. Tested with adversarial inputs including indirect guardrail bypass attempts and off-topic injections.

PSF-01Input Governance
84
PSF-02Output Validation
81
PSF-05Human Oversight
88
TRIGGERticket_createdwebhook · Zendesk / LinearTRIGGER · webhookSKILLintent_classifierL2 autonomy · monitorsClaude Haiku · fastCONDITIONpriority_routerP1 / P2 / P3 branchCONDITION · gateHUMANescalation_gate1h SLA · P1 critical onlyHUMAN · P1 pathSKILLauto_resolveL3 autonomy · override OKP2-P3 pathPAI Workflow Studio

Critical: Guardrail bypass via indirect phrasing succeeded in 3.2% of adversarial runs. Mitigation: add semantic similarity check at input classification layer.

Build in Studio →
TDL-06 · Observability✓ Lab tested Q2 2026

API Failure Recovery + Graceful Degradation

Continuous health monitoring with automatic failover to secondary provider and degraded-mode fallback. Tests PSF-D4 (Observability) and PSF-D8 (Vendor Resilience) under simulated provider outage.

PSF-04Observability
71
PSF-08Vendor Resilience
77
PSF-06Deployment Safety
82
TRIGGERhealth_monitor30s interval · PSF-D4TRIGGER · continuousCONDITIONfailure_detectortimeout / 5xx / nullCONDITION · thresholdSKILLfallback_routerPSF-D8 · vendor failoversecondary providerSKILLdegraded_modecached · reduced scopePSF-D6 · safe stateINTEGRATIONalert_dispatchPagerDuty · Slack · loggedPSF-D4 · audit emitPAI Workflow Studio

Advisory: Silent context truncation observed on 2 of 4 tested model providers under maximum context load. PSF-D4 requires explicit warning emission — add application-layer context budget management.

Build in Studio →
Adapt any of these workflows in PAI Studio →
Model Scorecards

Q2 2026 — PSF reliability index

Every model is tested on identical infrastructure with identical task suites. Scores reflect median performance across 5 runs per task. No model has been given advance access to the test library.

OpenAI
GPT-4.1
Tested Q2 2026
74
PSF Index / 100
Input GovernancePSF-01
81
Output ValidationPSF-02
76
Data ProtectionPSF-03
68
ObservabilityPSF-04
71
Human Oversight TriggersPSF-05
79
Deployment SafetyPSF-06
72
Security PosturePSF-07
69
Vendor ResiliencePSF-08
77

Lab note: Strong on structured output adherence. Notable gap: PII handling in summarisation tasks (PSF-03). Escalation trigger reliability above average.

Anthropic
Claude Sonnet 4.6
Tested Q2 2026
79
PSF Index / 100
Input GovernancePSF-01
84
Output ValidationPSF-02
82
Data ProtectionPSF-03
77
ObservabilityPSF-04
73
Human Oversight TriggersPSF-05
85
Deployment SafetyPSF-06
78
Security PosturePSF-07
74
Vendor ResiliencePSF-08
79

Lab note: Highest human oversight trigger accuracy in the current cohort. Observability logging incomplete under high-load simulation. Consistent refusal behaviour.

Google
Gemini 1.5 Pro
Tested Q2 2026
71
PSF Index / 100
Input GovernancePSF-01
75
Output ValidationPSF-02
73
Data ProtectionPSF-03
72
ObservabilityPSF-04
68
Human Oversight TriggersPSF-05
70
Deployment SafetyPSF-06
71
Security PosturePSF-07
66
Vendor ResiliencePSF-08
73

Lab note: Consistent mid-range performer. Weakest in security posture (PSF-07) — code generation tasks showed higher prompt injection susceptibility. Context window handling needs attention.

Meta (self-hosted)
Llama 3.1 70B
Tested Q2 2026
63
PSF Index / 100
Input GovernancePSF-01
67
Output ValidationPSF-02
64
Data ProtectionPSF-03
71
ObservabilityPSF-04
59
Human Oversight TriggersPSF-05
61
Deployment SafetyPSF-06
62
Security PosturePSF-07
58
Vendor ResiliencePSF-08
62

Lab note: Data protection (PSF-03) outperforms proprietary models in self-hosted configuration — no third-party data egress. Observability and security posture require significant investment at the deployment layer.

Scheduled — next quarter
Mistral
Mistral Large 2
Testing begins Q3 2026
Cohere
Command R+
Testing begins Q3 2026
External References

General capability & performance benchmarks

PAI Lab measures PSF compliance — production safety dimensions. For speed, cost, and general capability, these are the references we recommend. They measure different things and are complementary to our scores.

Artificial Analysis LLM Leaderboard

Continuous

Speed, cost, and quality benchmarks across frontier models. Best source for latency and throughput comparison.

Latency · Throughput · Cost per token · Quality index

Visit leaderboard →

LMSYS Chatbot Arena

Continuous

Human preference ELO rankings from millions of blind pairwise comparisons. Best for real-world conversational quality.

ELO ranking · Human preference · Conversational quality

Visit leaderboard →

Open LLM Leaderboard (HuggingFace)

Weekly

Academic benchmark suite (MMLU, HellaSwag, ARC, WinoGrande, GSM8K, HumanEval) for open-weight models.

MMLU · Reasoning · Coding · Math

Visit leaderboard →

Test Task Library

113 tasks across 6 production domains

Tasks are drawn from real production use cases. Each has deterministic ground truth — a correct answer we can score mechanically. The library is versioned and updated annually.

TDL-01

Document Processing

24 tasks

Extraction, classification, and summarisation tasks across contracts, invoices, clinical notes, and regulatory filings. Tests input governance, schema adherence, and output fidelity under document variation.

PSF coverage

PSF-01, PSF-02

Common failure types
  • Hallucinated field values
  • Schema drift under format variation
  • PII leakage in summarisation output
TDL-02

High-Stakes Decision Support

18 tasks

Loan assessment, triage classification, hiring screening, and risk scoring tasks. Tests output consistency, confidence calibration, and human escalation trigger reliability.

PSF coverage

PSF-03, PSF-05

Common failure types
  • Confidence miscalibration
  • Inconsistent outputs across equivalent inputs
  • Missing escalation triggers on edge cases
TDL-03

Multi-Agent Orchestration

21 tasks

Agent-to-agent handoff sequences across research, code generation, and workflow automation pipelines. Tests state propagation, error containment, and loop termination under adversarial inputs.

PSF coverage

PSF-04, PSF-06

Common failure types
  • Runaway agent loops
  • State corruption across handoffs
  • Silent task abandonment
TDL-04

Customer-Facing Interaction

16 tasks

Support triage, FAQ response, complaint handling, and guided process tasks. Tests guardrail reliability, off-topic containment, and graceful degradation under unusual user inputs.

PSF coverage

PSF-01, PSF-02, PSF-05

Common failure types
  • Guardrail bypass via indirect phrasing
  • Off-topic drift without containment
  • False confidence in ambiguous queries
TDL-05

Code Generation & Review

20 tasks

Production code generation, security review, and refactoring tasks across Python, TypeScript, and SQL. Tests output correctness, security awareness, and behaviour under adversarial prompts.

PSF coverage

PSF-02, PSF-07

Common failure types
  • Plausible but incorrect logic
  • Omitted security considerations
  • Prompt injection via code comments
TDL-06

Observability & Graceful Degradation

14 tasks

Load variation, API failure injection, and context window stress tests. Tests logging completeness, fallback activation, and behaviour under vendor outage simulation.

PSF coverage

PSF-04, PSF-08

Common failure types
  • Silent failures without log emission
  • No fallback on provider timeout
  • Context truncation without warning
Methodology

How we test

The full methodology is published openly. There is nothing proprietary about our test approach — the value is in the disciplined, consistent execution.

01

Task library construction

Each task domain contains between 14 and 24 test tasks drawn from real production use cases. Tasks have deterministic ground truth — we know what correct output looks like and can score mechanically.

02

PSF dimension mapping

Every task is tagged to one or more PSF dimensions. Scoring rubrics define what constitutes a pass, partial pass, or fail for each dimension per task type.

03

Controlled execution

Fixed temperature (0.2), fixed system prompts, no chain-of-thought scaffolding unless scaffolding is under test. Each task runs 5 times; we report median and flag variance.

04

Adversarial pass

After the standard pass, a subset of tasks is re-run with adversarial inputs — prompt injection attempts, boundary-edge inputs, malformed schemas, and unusual phrasing.

05

Scoring and aggregation

Per-task scores aggregate to per-domain scores, then to an overall PSF Reliability Index (0–100). Each domain counts equally. Methodology and raw data are published.

06

Open publication

Findings, methodology, task library structure, and scoring rubrics are published openly. No payment from vendors. All models assessed on identical infrastructure.

Independence guarantee

PAI Lab does not accept payment from vendors for assessments. Models are not given advance access to the task library. Test infrastructure is PAI-owned and not shared with any vendor. Scoring rubrics are published before testing begins. Findings are published regardless of outcome. We have no financial relationship with any model provider.

Framework Stress Tests

Live testing behind the ecosystem assessments

Every framework in our ecosystem assessment programme is stress-tested against TDL-03 (Multi-Agent Orchestration) under production-like conditions.

LangChain

71/100

Agent loop termination failures under adversarial inputs in 4.2% of runs

Full assessment →

LangGraph

76/100

Strong state management. Edge case: graph cycle detection missed in 2.1% of stress runs

Full assessment →

CrewAI

68/100

Role delegation reliability gaps under high-task-count orchestration

Full assessment →

AutoGen

73/100

Human-in-loop interrupt handling reliable. Memory persistence inconsistent across agents

Full assessment →

Composio

79/100

Best-in-cohort for tool call error handling. Logging coverage below average

Full assessment →

Cursor SDK

66/100

Designed for development context — production deployment safety requires additional instrumentation

Full assessment →

Pydantic AI

in_progress

Testing Q3 2026

DSPy

scheduled

Scheduled Q3 2026

Submit a task for the library

Have a production AI task that belongs in the test library? Tasks must have deterministic ground truth. We review submissions quarterly.

Submit a task proposal →

Request a private assessment

Organisations can commission a private Lab assessment of their specific deployment stack against PSF criteria.

Contact the Lab team →

Cite our findings

Lab findings are free to cite. Link to the specific scorecard and note the testing period — scores change across quarters.

Citation guidance →

Know what your AI stack actually does before you deploy it

Read the full methodology, review the Q2 2026 scorecards, and use the PSF self-assessment to benchmark your own deployment.

Run the PSF self-assessment →Back to Research hub