PAI Lab scorecards — Q2 2026
This source is cited as evidence by graph records. This page shows which records use it, its trust tier, and when it was last checked in the seed.
Records using this source
- Anthropic
entity | 15 Jun 2026 | 70%
Anthropic — vendor tracked in the Production AI Institute AI Data Use Index.
- Google
entity | 15 Jun 2026 | 70%
Google — vendor tracked in the Production AI Institute AI Data Use Index.
- OpenAI
entity | 15 Jun 2026 | 70%
OpenAI — vendor tracked in the Production AI Institute AI Data Use Index.
- Claude Sonnet 4.6
entity | 30 Apr 2026 | 82%
Highest human oversight trigger accuracy in the current cohort. Observability logging incomplete under high-load simulation. Consistent refusal behaviour.
- Gemini 1.5 Pro
entity | 30 Apr 2026 | 82%
Consistent mid-range performer. Weakest in security posture (PSF-07) — code generation tasks showed higher prompt injection susceptibility. Context window handling needs attention.
- GPT-4.1
entity | 30 Apr 2026 | 82%
Strong on structured output adherence. Notable gap: PII handling in summarisation tasks (PSF-03). Escalation trigger reliability above average.
- Llama 3.1 70B
entity | 30 Apr 2026 | 82%
Data protection (PSF-03) outperforms proprietary models in self-hosted configuration — no third-party data egress. Observability and security posture require significant investment at the deployment layer.
- Meta (self-hosted)
entity | 30 Apr 2026 | 72%
Meta (self-hosted) — model provider named in the PAI Lab scorecard registry.
- Claude Sonnet 4.6 — Q2 2026 Lab benchmark
event | 30 Apr 2026 | 82%
Claude Sonnet 4.6 scored 79/100 overall in the Q2 2026 PAI Lab PSF reliability index. Highest human oversight trigger accuracy in the current cohort. Observability logging incomplete under high-load simulation. Consistent refusal behaviour.
- Gemini 1.5 Pro — Q2 2026 Lab benchmark
event | 30 Apr 2026 | 82%
Gemini 1.5 Pro scored 71/100 overall in the Q2 2026 PAI Lab PSF reliability index. Consistent mid-range performer. Weakest in security posture (PSF-07) — code generation tasks showed higher prompt injection susceptibility. Context window handling needs attention.
- GPT-4.1 — Q2 2026 Lab benchmark
event | 30 Apr 2026 | 82%
GPT-4.1 scored 74/100 overall in the Q2 2026 PAI Lab PSF reliability index. Strong on structured output adherence. Notable gap: PII handling in summarisation tasks (PSF-03). Escalation trigger reliability above average.
- Llama 3.1 70B — Q2 2026 Lab benchmark
event | 30 Apr 2026 | 82%
Llama 3.1 70B scored 63/100 overall in the Q2 2026 PAI Lab PSF reliability index. Data protection (PSF-03) outperforms proprietary models in self-hosted configuration — no third-party data egress. Observability and security posture require significant investment at the deployment layer.
- Claude Sonnet 4.6 — PSF scorecard
entity | 30 Apr 2026 | 82%
79/100 overall. Highest human oversight trigger accuracy in the current cohort. Observability logging incomplete under high-load simulation. Consistent refusal behaviour.
- Gemini 1.5 Pro — PSF scorecard
entity | 30 Apr 2026 | 82%
71/100 overall. Consistent mid-range performer. Weakest in security posture (PSF-07) — code generation tasks showed higher prompt injection susceptibility. Context window handling needs attention.
- GPT-4.1 — PSF scorecard
entity | 30 Apr 2026 | 82%
74/100 overall. Strong on structured output adherence. Notable gap: PII handling in summarisation tasks (PSF-03). Escalation trigger reliability above average.
- Llama 3.1 70B — PSF scorecard
entity | 30 Apr 2026 | 82%
63/100 overall. Data protection (PSF-03) outperforms proprietary models in self-hosted configuration — no third-party data egress. Observability and security posture require significant investment at the deployment layer.
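For readers who want to work with these records programmatically, the following is a minimal sketch of how the listing above could be parsed. It assumes the three pipe-separated fields are record type, last-checked date, and a percentage whose meaning the page does not label; the names `RecordMeta`, `parse_meta`, and `overall_score` are hypothetical and are not part of any PAI Lab tooling.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class RecordMeta(NamedTuple):
    kind: str      # "entity" or "event", as shown on the page
    checked: date  # assumed to be the last-checked date
    percent: int   # unlabeled percentage on the page; meaning is an assumption


def parse_meta(line: str) -> RecordMeta:
    """Parse a metadata line such as 'entity | 30 Apr 2026 | 82%'."""
    kind, raw_date, raw_pct = (part.strip() for part in line.split("|"))
    # Accept both full and abbreviated English month names ("June", "Apr").
    for fmt in ("%d %B %Y", "%d %b %Y"):
        try:
            checked = datetime.strptime(raw_date, fmt).date()
            break
        except ValueError:
            continue
    else:
        raise ValueError(f"unrecognised date: {raw_date!r}")
    return RecordMeta(kind, checked, int(raw_pct.rstrip("%")))


def overall_score(summary: str) -> Optional[int]:
    """Extract N from 'N/100 overall' in a record summary, if present."""
    match = re.search(r"(\d+)/100 overall", summary)
    return int(match.group(1)) if match else None


# Summaries truncated from the PSF scorecard records listed above.
scorecards = {
    "Claude Sonnet 4.6": "79/100 overall.",
    "Gemini 1.5 Pro": "71/100 overall.",
    "GPT-4.1": "74/100 overall.",
    "Llama 3.1 70B": "63/100 overall.",
}

# Rank models by overall score, highest first.
ranked = sorted(
    scorecards,
    key=lambda name: overall_score(scorecards[name]),
    reverse=True,
)
```

With the four scorecard summaries above, `ranked` orders the models as Claude Sonnet 4.6, GPT-4.1, Gemini 1.5 Pro, Llama 3.1 70B. The overall score alone hides the per-criterion differences noted in the records (for example PSF-03 and PSF-07), so a ranking like this is best read alongside the summaries.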