What PAI Lab measures: PSF compliance — production safety dimensions. Not general capability, speed, or cost. For latency, throughput, and quality benchmarks, see the external benchmark references below.
Human oversight triggers are the most variable dimension
Across all models tested in Q2 2026, PSF-05 (Human Oversight) showed the widest variance — a 24-point spread between lowest and highest performer. Models that score well here tend to under-claim certainty; those that score poorly tend to complete tasks confidently rather than escalate.
Implication: Practitioners deploying AI in high-stakes decision contexts should run PSF-05 specific testing before deployment regardless of vendor claims.
Self-hosted models outperform on data protection, underperform on observability
Llama 3.1 70B in a self-hosted configuration achieved the highest PSF-03 (Data Protection) score in the cohort — no data leaves the deployment environment. However, PSF-04 (Observability) scores were 12–18 points lower than API-hosted equivalents, reflecting the absence of vendor-provided logging infrastructure.
Implication: Data-sensitive deployments should consider self-hosted models, but must budget for observability tooling investment.
Prompt injection via code comments is consistently underguarded
In TDL-05 (Code Generation), all models in the current cohort allowed prompt injection through carefully crafted code comments in 3–7% of test cases. The attack vector is consistent and reproducible.
Implication: Any production system that processes externally-sourced code as LLM input requires an input sanitisation layer at the application level, not the model level.
Context window stress reveals silent truncation in 2 of 4 models
In TDL-06 (Observability), two models silently truncated context under load without emitting warnings or degraded-mode signals. The other two either raised explicit errors or downgraded to a declared reduced-context mode.
Implication: Silent context truncation is a PSF-04 failure. Production systems should implement explicit context budget management at the application layer.
Each workflow below has been executed against the relevant PAI Lab test domain — real tasks, real inputs, real failure detection. The diagrams show the PSF-compliant architecture we tested. Click any to build or adapt it in Studio.
Every model is tested on identical infrastructure with identical task suites. Scores reflect median performance across 5 runs per task. No model has been given advance access to the test library.
PAI Lab measures PSF compliance — production safety dimensions. For speed, cost, and general capability, these are the references we recommend. They measure different things and are complementary to our scores.
Speed, cost, and quality benchmarks across frontier models. Best source for latency and throughput comparison.
Latency · Throughput · Cost per token · Quality index
Visit leaderboard →
Human preference ELO rankings from millions of blind pairwise comparisons. Best for real-world conversational quality.
ELO ranking · Human preference · Conversational quality
Visit leaderboard →
Academic benchmark suite (MMLU, HellaSwag, ARC, WinoGrande, GSM8K, HumanEval) for open-weight models.
MMLU · Reasoning · Coding · Math
Visit leaderboard →
Tasks are drawn from real production use cases. Each has deterministic ground truth — a correct answer we can score mechanically. The library is versioned and updated annually.
Extraction, classification, and summarisation tasks across contracts, invoices, clinical notes, and regulatory filings. Tests input governance, schema adherence, and output fidelity under document variation.
PSF-01, PSF-02
Loan assessment, triage classification, hiring screening, and risk scoring tasks. Tests output consistency, confidence calibration, and human escalation trigger reliability.
PSF-03, PSF-05
Agent-to-agent handoff sequences across research, code generation, and workflow automation pipelines. Tests state propagation, error containment, and loop termination under adversarial inputs.
PSF-04, PSF-06
Support triage, FAQ response, complaint handling, and guided process tasks. Tests guardrail reliability, off-topic containment, and graceful degradation under unusual user inputs.
PSF-01, PSF-02, PSF-05
Production code generation, security review, and refactoring tasks across Python, TypeScript, and SQL. Tests output correctness, security awareness, and behaviour under adversarial prompts.
PSF-02, PSF-07
Load variation, API failure injection, and context window stress tests. Tests logging completeness, fallback activation, and behaviour under vendor outage simulation.
PSF-04, PSF-08
The full methodology is published openly. There is nothing proprietary about our test approach — the value is in the disciplined, consistent execution.
Each task domain contains between 14 and 24 test tasks drawn from real production use cases. Tasks have deterministic ground truth — we know what correct output looks like and can score mechanically.
Every task is tagged to one or more PSF dimensions. Scoring rubrics define what constitutes a pass, partial pass, or fail for each dimension per task type.
Fixed temperature (0.2), fixed system prompts, no chain-of-thought scaffolding unless scaffolding is under test. Each task runs 5 times; we report median and flag variance.
After the standard pass, a subset of tasks is re-run with adversarial inputs — prompt injection attempts, boundary-edge inputs, malformed schemas, and unusual phrasing.
Per-task scores aggregate to per-domain scores, then to an overall PSF Reliability Index (0–100). Each domain counts equally. Methodology and raw data are published.
Findings, methodology, task library structure, and scoring rubrics are published openly. No payment from vendors. All models assessed on identical infrastructure.
PAI Lab does not accept payment from vendors for assessments. Models are not given advance access to the task library. Test infrastructure is PAI-owned and not shared with any vendor. Scoring rubrics are published before testing begins. Findings are published regardless of outcome. We have no financial relationship with any model provider.
Every framework in our ecosystem assessment programme is stress-tested against TDL-03 (Multi-Agent Orchestration) under production-like conditions.
Agent loop termination failures under adversarial inputs in 4.2% of runs
Full assessment →Strong state management. Edge case: graph cycle detection missed in 2.1% of stress runs
Full assessment →Human-in-loop interrupt handling reliable. Memory persistence inconsistent across agents
Full assessment →Best-in-cohort for tool call error handling. Logging coverage below average
Full assessment →Designed for development context — production deployment safety requires additional instrumentation
Full assessment →Testing Q3 2026
Scheduled Q3 2026
Have a production AI task that belongs in the test library? Tasks must have deterministic ground truth. We review submissions quarterly.
Submit a task proposal →Organisations can commission a private Lab assessment of their specific deployment stack against PSF criteria.
Contact the Lab team →Lab findings are free to cite. Link to the specific scorecard and note the testing period — scores change across quarters.
Citation guidance →Read the full methodology, review the Q2 2026 scorecards, and use the PSF self-assessment to benchmark your own deployment.