PAI Lab Report: Public GitHub Agent Readiness, May 2026

Scope disclosure: This report uses the Agent Readiness Index public-repository scanner (methodology PAI-ARI-2026.1). It inspects public metadata and file-tree paths only. It is not the PAI Lab 113-task multi-run model battery (scheduled Q3 2026 per /lab methodology v1.0). Scores here measure documented evidence visibility, not runtime reliability under adversarial tasks.

The PAI Lab publishes empirical work in two complementary tracks: quarterly model scorecards on the PSF Reliability Index (see assessments for GPT-4.1, Claude Sonnet 4.6, and peers), and cohort studies of real deployments. This edition documents the May 2026 public GitHub sample. The immutable dataset lives at /agent-readiness/benchmark/2026-05.

Methodology

Instrument: Agent Readiness Index scanner, version PAI-ARI-2026.1 (full methodology).

Source: GitHub REST repository search (search API documentation), executed 13 May 2026 UTC. No private clones; no endorsement of listed projects.

Selection queries (union, de-duplicated):

topic:ai-agent archived:false fork:false stars:>=5
topic:agentic-ai archived:false fork:false stars:>=5
topic:llm-agent archived:false fork:false stars:>=5
topic:mcp-server archived:false fork:false stars:>=5
"ai agent" in:name,description,readme archived:false fork:false stars:>=5

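As a rough illustration of the query union, the sketch below builds request URLs for the GitHub REST repository search endpoint and merges per-query results while dropping repeats. The endpoint, the `q` and `per_page` parameters, and the `full_name` field are part of the public GitHub API. The helper names, the case-insensitive de-duplication key, and the point at which the limit of 20 is applied are assumptions, since the report does not describe them.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"

# The five selection queries listed in the report.
QUERIES = [
    "topic:ai-agent archived:false fork:false stars:>=5",
    "topic:agentic-ai archived:false fork:false stars:>=5",
    "topic:llm-agent archived:false fork:false stars:>=5",
    "topic:mcp-server archived:false fork:false stars:>=5",
    '"ai agent" in:name,description,readme archived:false fork:false stars:>=5',
]


def build_search_url(query, per_page=20):
    """Return the search URL for one query (hypothetical helper)."""
    params = {"q": query, "per_page": per_page}
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def union_deduplicated(result_pages, limit=20):
    """Merge per-query item lists, keeping the first occurrence of each repo.

    Assumption: repositories are keyed by lower-cased full_name and the
    limit is applied after the union; the report does not state either.
    """
    seen = set()
    merged = []
    for items in result_pages:
        for repo in items:
            key = repo["full_name"].lower()
            if key not in seen:
                seen.add(key)
                merged.append(repo)
    return merged[:limit]
```

Each element of `result_pages` would be the `items` array from one search response. Fetching is left out so that the merge step can be checked offline.
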
Sample: 20 repositories after query union (requested limit 20).

Scoring: eight PSF-aligned domains; each domain marked present when public artifacts match predefined signal patterns (AGENTS.md, eval folders, CI gates, approval workflows, etc.).

Aggregation: coverage percent = domains with at least one matched signal / 8; cohort mean reported below.
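
The aggregation rule stated above can be sketched as follows. The pattern table is entirely hypothetical: the report names AGENTS.md, eval folders, CI gates, and approval workflows as example signals but does not publish which pattern belongs to which domain, so the mapping here is an illustration only and is not the PAI-ARI-2026.1 rule set.

```python
from fnmatch import fnmatchcase

# Hypothetical signal patterns per domain (illustrative, not the scanner's).
SIGNAL_PATTERNS = {
    "D1": ["AGENTS.md"],
    "D2": ["schemas/*"],
    "D3": ["docs/data-policy.md"],
    "D4": ["docs/runbook*"],
    "D5": [".github/workflows/*", "evals/*"],
    "D6": ["docs/approval*"],
    "D7": ["SECURITY.md"],
    "D8": ["docs/fallback*"],
}


def matched_domains(paths):
    """Return the set of domains with at least one matching file path."""
    return {
        domain
        for domain, patterns in SIGNAL_PATTERNS.items()
        if any(fnmatchcase(path, pat) for path in paths for pat in patterns)
    }


def coverage_percent(paths):
    """Coverage = domains with at least one matched signal / 8, as a percent."""
    return 100.0 * len(matched_domains(paths)) / len(SIGNAL_PATTERNS)


def cohort_mean(trees):
    """Unweighted mean of per-repository coverage across the cohort."""
    return sum(coverage_percent(tree) for tree in trees) / len(trees)
```

For a file tree containing only `AGENTS.md`, `.github/workflows/ci.yml`, and `SECURITY.md`, this sketch marks three of eight domains present, so its coverage is 37.5%.
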

Reproducibility: Edition frozen at publication. Re-running the live benchmark may yield different results as repositories change; cite the May 2026 snapshot URL for this cohort.

Limits

Visibility bias: projects that document controls score higher than equally safe projects that do not publish evidence.
Signal patterns miss custom filenames; false negatives are expected.
Star threshold (>=5) skews toward newer or promoted repos, not enterprise internal agents.
No runtime execution: prompt injection, loop termination, and model-level PSF-06 behaviour are out of scope (see Lab model testing protocol).
Not a certification, legal opinion, or security audit.

Results

Cohort summary (n=20): mean evidence coverage 38%; projects with eval-pattern evidence 0 of 20; projects with human-oversight evidence 4 of 20; projects with incident or observability evidence 11 of 20.

PSF domain coverage (visible evidence)

Domain	Label	Repos with evidence	Share
D1	Input boundary	13 / 20	65%
D2	Output validation	8 / 20	40%
D3	Data stewardship	6 / 20	30%
D4	Observability	11 / 20	55%
D5	Deployment control	17 / 20	85%
D6	Human oversight	4 / 20	20%
D7	Security posture	14 / 20	70%
D8	Ecosystem resilience	12 / 20	60%

Interpretation: Deployment control (D5) was the most visible domain (85%), usually via CI workflows or release automation. Human oversight (D6) was the scarcest (20%), aligning with the Q2 2026 Lab finding that PSF-06 is the most variable dimension across frontier models on /lab. Engineering controls (D7 security, 70%; D4 observability, 55%) outpaced data stewardship (D3, 30%).
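
The Share column and the ranking used in the interpretation follow directly from the counts in the table. A minimal sketch of that arithmetic, using only the published counts (the variable and function names are illustrative):

```python
# Repos with evidence per domain, copied from the table above (n = 20).
COHORT_SIZE = 20
REPOS_WITH_EVIDENCE = {
    "D1 Input boundary": 13,
    "D2 Output validation": 8,
    "D3 Data stewardship": 6,
    "D4 Observability": 11,
    "D5 Deployment control": 17,
    "D6 Human oversight": 4,
    "D7 Security posture": 14,
    "D8 Ecosystem resilience": 12,
}


def share_percent(count, cohort_size=COHORT_SIZE):
    """Share of the cohort with evidence, as a whole-number percent."""
    return round(100 * count / cohort_size)


shares = {label: share_percent(n) for label, n in REPOS_WITH_EVIDENCE.items()}

# Domains ordered from most to least visible.
ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)

for label, pct in ranked:
    print(f"{label}: {pct}%")
```
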

Highest-coverage repositories (public sample)

Repository	Coverage	Grade	Top signals
`serac-labs/serac`	84%	A	Observability, human approval gates, security hygiene
`vm0-ai/vm0`	77%	A	Observability, security hygiene, provider fallback
`Icarus603/claude-code`	76%	A	Observability, human approval gates, security hygiene
`HankHuang0516/EClaw`	63%	A	Observability, human approval gates, security hygiene
`holaboss-ai/holaOS`	59%	A	Observability, security hygiene, provider fallback

PSF mapping and Lab cross-links

Domain labels map to the eight PSF domains described in PSF compliance explained. Where public repos lack D6 evidence, practitioners should apply controls from PSF Domain 6: Human Oversight and test escalation behaviour before production, consistent with Lab scorecard gaps on human oversight triggers for API-hosted models.

The zero eval-evidence result does not prove teams skip evaluation; it means the scanner did not find standard eval artifacts (evals/, golden tests, scorecards) in public trees. Teams should treat this as a documentation and operability gap: eval harnesses belong in version control with pinned model versions, as described in PSF Domain 5: Deployment Safety.
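
One way to keep an eval harness in version control with a pinned model version is sketched below. Everything here is an assumption for illustration: the class names, the substring-match pass criterion, and the model identifier are hypothetical and are not drawn from the PSF or from any scanned repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected_substring: str  # simplistic pass criterion, for illustration


@dataclass(frozen=True)
class EvalSuite:
    model_version: str     # pinned identifier, committed with the cases
    pass_threshold: float  # minimum pass rate for the suite to succeed
    cases: tuple


def run_suite(suite, generate):
    """Run every case through `generate` (a caller-supplied function that
    maps a prompt string to an output string) and compare to the threshold."""
    passed = sum(
        1 for case in suite.cases
        if case.expected_substring in generate(case.prompt)
    )
    pass_rate = passed / len(suite.cases)
    return {
        "model_version": suite.model_version,
        "pass_rate": pass_rate,
        "ok": pass_rate >= suite.pass_threshold,
    }


# Hypothetical suite; in a repository this would live under evals/.
SUITE = EvalSuite(
    model_version="example-model-2026-05-01",
    pass_threshold=0.5,
    cases=(
        EvalCase("refund-01", "Can I get a refund?", "refund policy"),
        EvalCase("escalate-01", "Delete my account now", "human agent"),
    ),
)
```

Recording `model_version` in the result ties each pass rate to the exact model under test, which is the property the paragraph above asks teams to make visible.
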

For failure-mode context when oversight and validation are weak, see seven failure modes of production AI and the multi-agent amplification analysis.

Practitioner actions

Publish an evidence map: Add docs/production-ai-readiness.md linking each PSF domain to concrete repo artifacts (policies, workflows, runbooks).
Make evals visible: Commit an evals/ directory with regression cases, thresholds, and the model or prompt version under test.
Document human gates: State who approves high-impact tool calls, what is logged, and what fails closed (human-in-the-loop design).
Run your own scan: Use the Agent Readiness Index on your repository; compare to the May 2026 public cohort.
Close model-level gaps separately: Pair documentation work with model scorecard review on /lab before selecting a frontier model for high-stakes workflows.
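
To make the "Document human gates" action concrete, the sketch below shows one possible fail-closed approval gate: a high-impact tool call runs only on an explicit approval, and a missing or failed approval blocks it. The tool names, the `ask_approver` callback, and the logger name are hypothetical and stand in for whatever approval channel a team actually uses.

```python
import logging

logger = logging.getLogger("approval_gate")

# Hypothetical list of tools that require human approval.
HIGH_IMPACT_TOOLS = {"delete_records", "send_payment"}


def gated_call(tool_name, tool_fn, args, ask_approver):
    """Run tool_fn(**args), requiring approval for high-impact tools.

    ask_approver(tool_name, args) must return True to approve. Any other
    return value, or any exception, blocks the call (fail closed).
    """
    if tool_name not in HIGH_IMPACT_TOOLS:
        return tool_fn(**args)

    try:
        decision = ask_approver(tool_name, args)
    except Exception:
        # An unreachable approver counts as a denial, not an approval.
        logger.exception("approval request failed for %s", tool_name)
        decision = None

    # Log every decision so the gate leaves an audit trail.
    logger.info("tool=%s args=%r approved=%r", tool_name, args, decision)

    if decision is not True:
        raise PermissionError(f"{tool_name} blocked: no explicit approval")
    return tool_fn(**args)
```

The design choice worth documenting is the default: the gate treats timeouts, errors, and ambiguous answers as denials, so a broken approval channel cannot silently turn into unattended execution.
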

Sources

PAI Agent Readiness snapshot 2026-05 (frozen 13 May 2026): /agent-readiness/benchmark/2026-05
PAI Agent Readiness methodology PAI-ARI-2026.1: /agent-readiness/methodology
GitHub REST search API: Repository search
PAI Lab scorecards and Q2 2026 cohort: /lab
Production Safety Framework: /standard

Public record

This record is maintained by PAI and free to cite. If something is wrong or missing, tell us. Corrections and source suggestions keep the record honest.

Records are free to cite; see the citation guidance.
