Scope disclosure: This report uses the Agent Readiness Index public-repository scanner (methodology PAI-ARI-2026.1). It inspects public metadata and file-tree paths only. It is not the PAI Lab 113-task multi-run model battery (scheduled Q3 2026 per /lab methodology v1.0). Scores here measure documented evidence visibility, not runtime reliability under adversarial tasks.
The PAI Lab publishes empirical work in two complementary tracks: quarterly model scorecards on the PSF Reliability Index (see assessments for GPT-4.1, Claude Sonnet 4.6, and peers) and cohort studies of real deployments. This edition documents the May 2026 public GitHub sample. The immutable dataset lives at /agent-readiness/benchmark/2026-05.
Methodology
Instrument: Agent Readiness Index scanner, version PAI-ARI-2026.1 (full methodology).
Source: GitHub REST repository search (search API documentation), executed 13 May 2026 UTC. No private clones; no endorsement of listed projects.
Selection queries (union, de-duplicated):
- topic:ai-agent archived:false fork:false stars:>=5
- topic:agentic-ai archived:false fork:false stars:>=5
- topic:llm-agent archived:false fork:false stars:>=5
- topic:mcp-server archived:false fork:false stars:>=5
- "ai agent" in:name,description,readme archived:false fork:false stars:>=5
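The union and de-duplication over these queries could be sketched as follows. This is a minimal illustration, not the scanner's implementation: the function names (`build_url`, `fetch_json`, `collect`), the injectable `fetch` parameter, and the choice to truncate the union to the requested limit in first-seen order are all assumptions. Only the endpoint `https://api.github.com/search/repositories`, its `q` and `per_page` parameters, and the `items` / `full_name` response fields come from the GitHub REST API.

```python
import json
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.github.com/search/repositories"

# The five selection queries listed in the methodology.
QUERIES = [
    "topic:ai-agent archived:false fork:false stars:>=5",
    "topic:agentic-ai archived:false fork:false stars:>=5",
    "topic:llm-agent archived:false fork:false stars:>=5",
    "topic:mcp-server archived:false fork:false stars:>=5",
    '"ai agent" in:name,description,readme archived:false fork:false stars:>=5',
]


def build_url(query, per_page=20):
    """Return the repository-search URL for one query string."""
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return f"{SEARCH_URL}?{params}"


def fetch_json(url):
    """Fetch one search page (unauthenticated, so subject to rate limits)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect(queries, limit=20, fetch=fetch_json):
    """Union the results of all queries, de-duplicated by full_name.

    Assumption: the first sighting of a repository wins and the union is
    cut to `limit` in first-seen order; the real scanner may order differently.
    """
    seen = {}
    for query in queries:
        for item in fetch(build_url(query)).get("items", []):
            seen.setdefault(item["full_name"], item)
    return list(seen.values())[:limit]
```

Passing a stub as `fetch` allows the de-duplication logic to be exercised offline, which also avoids depending on live search results that change over time.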
Sample: 20 repositories after query union (requested limit 20).
Scoring: eight PSF-aligned domains; each domain is marked present when public artifacts match predefined signal patterns (AGENTS.md, eval folders, CI gates, approval workflows, etc.).
Aggregation: coverage percent = (domains with at least one matched signal) / 8; the cohort mean is reported below.
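The scoring and aggregation rule stated above can be illustrated with a short sketch. It assumes every domain counts equally, exactly as the formula reads; the pattern table is hypothetical and stands in for the predefined PAI-ARI-2026.1 signal patterns, which are not reproduced here. The names `DOMAIN_PATTERNS`, `matched_domains`, `coverage_percent`, and `cohort_mean` are illustrative only.

```python
from fnmatch import fnmatchcase

# Hypothetical signal patterns, one list per domain (NOT the real scanner's).
DOMAIN_PATTERNS = {
    "D1": ["AGENTS.md"],
    "D2": ["schemas/*"],
    "D3": ["docs/data-*.md"],
    "D4": ["docs/runbook*"],
    "D5": [".github/workflows/*", "evals/*"],
    "D6": ["docs/approval*"],
    "D7": ["SECURITY.md"],
    "D8": ["docs/fallback*"],
}


def matched_domains(paths):
    """Return the set of domains with at least one matching file path."""
    return {
        domain
        for domain, patterns in DOMAIN_PATTERNS.items()
        if any(fnmatchcase(path, pat) for path in paths for pat in patterns)
    }


def coverage_percent(paths):
    """Coverage = matched domains / total domains, as a percentage."""
    return 100 * len(matched_domains(paths)) / len(DOMAIN_PATTERNS)


def cohort_mean(trees):
    """Unweighted mean of per-repository coverage across the cohort."""
    return sum(coverage_percent(tree) for tree in trees) / len(trees)


repo_a = ["AGENTS.md", ".github/workflows/ci.yml", "SECURITY.md", "src/main.py"]
repo_b = ["README.md"]
print(coverage_percent(repo_a))       # 37.5 (3 of 8 domains matched)
print(cohort_mean([repo_a, repo_b]))  # 18.75
```

Because matching is purely path-based, a repository that keeps equivalent controls under custom filenames would score lower in this sketch, which mirrors the false-negative limit noted for the scanner.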
Reproducibility: This edition was frozen at publication. Re-running the live benchmark may yield different results as repositories change; cite the May 2026 snapshot URL for this cohort.
Limits
- Visibility bias: projects that document controls score higher than equally safe projects that do not publish evidence.
- Signal patterns miss custom filenames; false negatives are expected.
- Star threshold (>=5) skews toward newer or promoted repos, not enterprise internal agents.
- No runtime execution: prompt injection, loop termination, and model-level PSF-06 behaviour are out of scope (see Lab model testing protocol).
- Not a certification, legal opinion, or security audit.
Results
Cohort summary (n=20): mean evidence coverage 38%; projects with eval-pattern evidence 0 of 20; projects with human-oversight evidence 4 of 20; projects with incident or observability evidence 11 of 20.
PSF domain coverage (visible evidence)
| Domain | Label | Repos with evidence | Share |
|---|---|---|---|
| D1 | Input boundary | 13 / 20 | 65% |
| D2 | Output validation | 8 / 20 | 40% |
| D3 | Data stewardship | 6 / 20 | 30% |
| D4 | Observability | 11 / 20 | 55% |
| D5 | Deployment control | 17 / 20 | 85% |
| D6 | Human oversight | 4 / 20 | 20% |
| D7 | Security posture | 14 / 20 | 70% |
| D8 | Ecosystem resilience | 12 / 20 | 60% |
Interpretation: Deployment control (D5) was the most visible dimension (85%), usually via CI workflows or release automation. Human oversight (D6) was the scarcest (20%), aligning with the Q2 2026 Lab finding that PSF-06 is the most variable dimension across frontier models on /lab. Engineering controls (D7 security, D4 observability) outpaced data stewardship (D3, 30%).
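The Share column and the highest and lowest domains can be recomputed directly from the counts in the table. The snippet below is a small check using only those published counts; the variable names are illustrative.

```python
# Repositories with visible evidence per domain, from the table (n = 20).
COUNTS = {
    "D1": 13, "D2": 8, "D3": 6, "D4": 11,
    "D5": 17, "D6": 4, "D7": 14, "D8": 12,
}
N_REPOS = 20

# Share = repositories with evidence / cohort size, as a whole percentage.
shares = {domain: round(100 * count / N_REPOS) for domain, count in COUNTS.items()}

most_visible = max(shares, key=shares.get)
least_visible = min(shares, key=shares.get)

print(f"most visible: {most_visible} ({shares[most_visible]}%)")     # D5 (85%)
print(f"least visible: {least_visible} ({shares[least_visible]}%)")  # D6 (20%)
```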
Highest-coverage repositories (public sample)
| Repository | Coverage | Grade | Top signals |
|---|---|---|---|
| serac-labs/serac | 84% | A | Observability, human approval gates, security hygiene |
| vm0-ai/vm0 | 77% | A | Observability, security hygiene, provider fallback |
| Icarus603/claude-code | 76% | A | Observability, human approval gates, security hygiene |
| HankHuang0516/EClaw | 63% | A | Observability, human approval gates, security hygiene |
| holaboss-ai/holaOS | 59% | A | Observability, security hygiene, provider fallback |
PSF mapping and Lab cross-links
Domain labels map to the eight PSF domains described in PSF compliance explained. Where public repos lack D6 evidence, practitioners should apply controls from PSF Domain 6: Human Oversight and test escalation behaviour before production, consistent with Lab scorecard gaps on human oversight triggers for API-hosted models.
The zero eval-evidence result does not prove teams skip evaluation; it means the scanner did not find standard eval artifacts (evals/, golden tests, scorecards) in public trees. Teams should treat this as a documentation and operability gap: eval harnesses belong in version control with pinned model versions, as described in PSF Domain 5: Deployment Safety.
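As one possible shape for a version-controlled eval harness with a pinned model version, consider the sketch below. Everything in it is hypothetical: the `EvalSuite` structure, the substring-match pass criterion, the model identifier string, and the `generate` callable that stands in for whatever model client a team actually uses. It is meant only to show how regression cases, a threshold, and the version under test can live together in one committed file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSuite:
    model_version: str    # pinned identifier of the model under test
    min_pass_rate: float  # threshold below which the suite fails
    cases: tuple          # (prompt, expected substring) pairs


def run_suite(suite, generate):
    """Run every case against `generate` and compare to the threshold.

    `generate(model_version, prompt)` is a hypothetical stand-in for a
    real model client; it must return the model's text output.
    """
    passed = sum(
        1
        for prompt, expected in suite.cases
        if expected in generate(suite.model_version, prompt)
    )
    pass_rate = passed / len(suite.cases)
    return {
        "model_version": suite.model_version,
        "pass_rate": pass_rate,
        "ok": pass_rate >= suite.min_pass_rate,
    }


# Example suite; the model identifier is a placeholder, not a real model name.
SUITE = EvalSuite(
    model_version="example-model-2026-01-01",
    min_pass_rate=0.9,
    cases=(
        ("Refund order 123?", "requires approval"),
        ("What is 2 + 2?", "4"),
    ),
)
```

A CI job could call `run_suite` and fail the build when `ok` is false, which would make the eval both a public artifact and a deployment gate.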
For failure-mode context when oversight and validation are weak, see seven failure modes of production AI and the multi-agent amplification analysis.
Practitioner actions
- Publish an evidence map: Add docs/production-ai-readiness.md linking each PSF domain to concrete repo artifacts (policies, workflows, runbooks).
- Make evals visible: Commit an evals/ directory with regression cases, thresholds, and the model or prompt version under test.
- Document human gates: State who approves high-impact tool calls, what is logged, and what fails closed (human-in-the-loop design).
- Run your own scan: Use the Agent Readiness Index on your repository; compare to the May 2026 public cohort.
- Close model-level gaps separately: Pair documentation work with model scorecard review on /lab before selecting a frontier model for high-stakes workflows.
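For the human-gate action above, a fail-closed approval wrapper might look like the following sketch. The tool names, the `approver` callback, and the `call_tool` function are hypothetical and not taken from any scanned repository or from the PSF text. The point illustrated is that a missing approver, a denial, or an error in the approval path all block the high-impact call, and that each decision is logged.

```python
import logging

logger = logging.getLogger("approval_gate")

# Hypothetical set of tools that require explicit human approval.
HIGH_IMPACT_TOOLS = {"delete_records", "send_payment"}


def call_tool(name, args, tools, approver=None):
    """Run a tool, requiring explicit approval for high-impact ones.

    `approver(name, args)` is a hypothetical callback that must return
    exactly True to approve. Anything else fails closed.
    """
    if name in HIGH_IMPACT_TOOLS:
        try:
            approved = approver is not None and approver(name, args) is True
        except Exception:
            # An error in the approval path is treated as a denial.
            logger.exception("approval check failed for tool=%s", name)
            approved = False
        logger.info("tool=%s args=%s approved=%s", name, args, approved)
        if not approved:
            raise PermissionError(f"{name} blocked: no explicit approval")
    return tools[name](**args)


# Example tool registry with one low-impact and one high-impact tool.
TOOLS = {
    "lookup": lambda key: key.upper(),
    "send_payment": lambda amount: f"paid {amount}",
}
```

Calling `call_tool("send_payment", {"amount": 5}, TOOLS)` without an approver raises `PermissionError`, while `call_tool("lookup", {"key": "a"}, TOOLS)` runs unguarded because the tool is not in the high-impact set.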
Sources
- PAI Agent Readiness snapshot 2026-05 (frozen 13 May 2026): /agent-readiness/benchmark/2026-05
- PAI Agent Readiness methodology PAI-ARI-2026.1: /agent-readiness/methodology
- GitHub REST search API: Repository search
- PAI Lab scorecards and Q2 2026 cohort: /lab
- Production Safety Framework: /standard