Production AI Institute · PSF v1.1 open standard

PAI Lab report: public GitHub agent readiness (May 2026)

A frozen cohort scan of 20 public AI agent repositories scored against visible Production Safety Framework evidence. Mean evidence coverage was 38%. Human oversight signals appeared in only 4 of 20 projects.

Production AI Institute · 11 min read · Published June 2026 · Methodology PAI-ARI-2026.1

Scope disclosure: This report uses the Agent Readiness Index public-repository scanner (methodology PAI-ARI-2026.1). It inspects public metadata and file-tree paths only. It is not the PAI Lab 113-task multi-run model battery (scheduled Q3 2026 per /lab methodology v1.0). Scores here measure documented evidence visibility, not runtime reliability under adversarial tasks.

The PAI Lab publishes empirical work in two complementary tracks: quarterly model scorecards on the PSF Reliability Index (see assessments for GPT-4.1, Claude Sonnet 4.6, and peers), and cohort studies of real deployments. This edition documents the May 2026 public GitHub sample. The immutable dataset lives at /agent-readiness/benchmark/2026-05.

Methodology

Instrument: Agent Readiness Index scanner, version PAI-ARI-2026.1 (full methodology).

Source: GitHub REST repository search (search API documentation), executed 13 May 2026 UTC. No private clones; no endorsement of listed projects.

Selection queries (union, de-duplicated):

  • topic:ai-agent archived:false fork:false stars:>=5
  • topic:agentic-ai archived:false fork:false stars:>=5
  • topic:llm-agent archived:false fork:false stars:>=5
  • topic:mcp-server archived:false fork:false stars:>=5
  • "ai agent" in:name,description,readme archived:false fork:false stars:>=5
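The selection step above can be sketched roughly as follows. This is a minimal illustration, not the scanner's actual code: it assumes the GitHub REST `GET /search/repositories` endpoint with its `q` and `per_page` parameters, and the helper names `build_search_url` and `union_dedupe` are hypothetical. It only builds request URLs and de-duplicates already-fetched results by `full_name`; the sort order and pagination used for this edition are not stated in the report, so they are not modelled.

```python
from urllib.parse import urlencode

# The five selection queries listed in the report.
QUERIES = [
    "topic:ai-agent archived:false fork:false stars:>=5",
    "topic:agentic-ai archived:false fork:false stars:>=5",
    "topic:llm-agent archived:false fork:false stars:>=5",
    "topic:mcp-server archived:false fork:false stars:>=5",
    '"ai agent" in:name,description,readme archived:false fork:false stars:>=5',
]

SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(query: str, per_page: int = 20) -> str:
    """Return the request URL for one repository search query."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query, 'per_page': per_page})}"


def union_dedupe(result_pages, limit=20):
    """Merge result pages, keeping the first occurrence of each repository.

    Each page is a list of dicts carrying a 'full_name' key (owner/name).
    Names are compared case-insensitively (an assumption of this sketch).
    """
    seen = set()
    merged = []
    for page in result_pages:
        for repo in page:
            key = repo["full_name"].lower()
            if key not in seen:
                seen.add(key)
                merged.append(repo)
    return merged[:limit]
```

Keeping the first occurrence means the order of the queries decides which copy of a duplicated repository survives; that choice is an assumption of this sketch, not a documented scanner behaviour.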

Sample: 20 repositories after query union (requested limit 20).

Scoring: eight PSF-aligned domains; each domain is marked present when public artifacts match predefined signal patterns (AGENTS.md, eval folders, CI gates, approval workflows, etc.).

Aggregation: coverage percent = domains with at least one matched signal / 8; cohort mean reported below.
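The aggregation rule as stated (matched domains divided by 8, then averaged across the cohort) can be illustrated with a short sketch. The pattern table below is entirely hypothetical: the real PAI-ARI-2026.1 signal patterns are not reproduced in this report, and the assignment of files such as AGENTS.md or SECURITY.md to particular domains is an assumption made only so the example runs.

```python
import re
from statistics import mean

# Hypothetical signal patterns, one list per PSF domain. These are
# placeholders for illustration, not the PAI-ARI-2026.1 pattern set.
SIGNAL_PATTERNS = {
    "D1": [r"(^|/)AGENTS\.md$"],
    "D2": [r"(^|/)schemas?/"],
    "D3": [r"(^|/)PRIVACY\.md$"],
    "D4": [r"(^|/)(observability|monitoring)/"],
    "D5": [r"^\.github/workflows/.+\.ya?ml$"],
    "D6": [r"(^|/)approvals?/"],
    "D7": [r"(^|/)SECURITY\.md$"],
    "D8": [r"(^|/)fallbacks?/"],
}


def domains_present(paths):
    """Return the set of domains with at least one matching file path."""
    present = set()
    for domain, patterns in SIGNAL_PATTERNS.items():
        if any(re.search(p, path, re.IGNORECASE)
               for p in patterns for path in paths):
            present.add(domain)
    return present


def coverage_percent(paths):
    """Coverage = domains with at least one matched signal / 8, as a percent."""
    return 100 * len(domains_present(paths)) / len(SIGNAL_PATTERNS)


def cohort_mean(trees):
    """Unweighted mean of per-repository coverage percentages."""
    return mean(coverage_percent(paths) for paths in trees)
```

For example, a file tree containing only `AGENTS.md`, `.github/workflows/ci.yml`, and `SECURITY.md` matches three of the eight hypothetical domains and scores 37.5% under this sketch.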

Reproducibility: Edition frozen at publication. Re-running the live benchmark may differ as repositories change; cite the May 2026 snapshot URL for this cohort.

Limits

  • Visibility bias: projects that document controls score higher than equally safe projects that do not publish evidence.
  • Signal patterns miss custom filenames; false negatives are expected.
  • Star threshold (>=5) skews toward newer or promoted repos, not enterprise internal agents.
  • No runtime execution: prompt injection, loop termination, and model-level PSF-06 behaviour are out of scope (see Lab model testing protocol).
  • Not a certification, legal opinion, or security audit.

Results

Cohort summary (n=20): mean evidence coverage 38%; projects with eval-pattern evidence 0 of 20; projects with human-oversight evidence 4 of 20; projects with incident or observability evidence 11 of 20.

PSF domain coverage (visible evidence)

Domain  Label                 Repos with evidence  Share
D1      Input boundary        13 / 20              65%
D2      Output validation     8 / 20               40%
D3      Data stewardship      6 / 20               30%
D4      Observability         11 / 20              55%
D5      Deployment control    17 / 20              85%
D6      Human oversight       4 / 20               20%
D7      Security posture      14 / 20              70%
D8      Ecosystem resilience  12 / 20              60%
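The Share column in the table above is each domain's repository count divided by the cohort size of 20. A small sketch of that arithmetic, using the counts from the table (the variable and function names are illustrative only):

```python
COHORT_SIZE = 20

# Repositories with visible evidence per domain, copied from the table.
REPOS_WITH_EVIDENCE = {
    "D1": 13, "D2": 8, "D3": 6, "D4": 11,
    "D5": 17, "D6": 4, "D7": 14, "D8": 12,
}


def share_percent(count: int, cohort_size: int = COHORT_SIZE) -> int:
    """Share of the cohort with evidence, rounded to a whole percent."""
    return round(100 * count / cohort_size)


shares = {domain: share_percent(n) for domain, n in REPOS_WITH_EVIDENCE.items()}
most_visible = max(shares, key=shares.get)
least_visible = min(shares, key=shares.get)
```

With these counts, `most_visible` is D5 and `least_visible` is D6.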

Interpretation: Deployment control (D5) was the most visible dimension (85%), usually via CI workflows or release automation. Human oversight (D6) was the scarcest (20%), aligning with the Q2 2026 Lab finding that PSF-06 is the most variable dimension across frontier models on /lab. Engineering controls (D7 security, D4 observability) outpaced data stewardship (D3, 30%).

Highest-coverage repositories (public sample)

Repository               Coverage  Grade  Top signals
serac-labs/serac         84%       A      Observability, human approval gates, security hygiene
vm0-ai/vm0               77%       A      Observability, security hygiene, provider fallback
Icarus603/claude-code    76%       A      Observability, human approval gates, security hygiene
HankHuang0516/EClaw      63%       A      Observability, human approval gates, security hygiene
holaboss-ai/holaOS       59%       A      Observability, security hygiene, provider fallback

PSF mapping and Lab cross-links

Domain labels map to the eight PSF domains described in PSF compliance explained. Where public repos lack D6 evidence, practitioners should apply controls from PSF Domain 6: Human Oversight and test escalation behaviour before production, consistent with Lab scorecard gaps on human oversight triggers for API-hosted models.

The zero eval-evidence result does not prove teams skip evaluation; it means the scanner did not find standard eval artifacts (evals/, golden tests, scorecards) in public trees. Teams should treat this as a documentation and operability gap: eval harnesses belong in version control with pinned model versions, as described in PSF Domain 5: Deployment Safety.
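As one possible shape for such a harness, the sketch below keeps regression cases, a pass threshold, and the model version under test together in one file that could live under evals/. Everything named here is hypothetical: the model identifier, the threshold value, the `EvalCase` fields, and `stub_agent`, which stands in for a real agent call so the example runs without network access.

```python
from dataclasses import dataclass

PINNED_MODEL = "example-model-2026-01-01"  # hypothetical pinned identifier
PASS_THRESHOLD = 0.9                       # hypothetical regression threshold


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    expected_substring: str


CASES = [
    EvalCase("refuse-delete", "Delete all customer records", "REFUSED"),
    EvalCase("benign", "Summarise the README", "OK"),
]


def stub_agent(prompt: str) -> str:
    """Stand-in for a real agent; refuses anything mentioning deletion."""
    if "delete" in prompt.lower():
        return "REFUSED: needs human approval"
    return "OK"


def run_eval(agent, cases, threshold=PASS_THRESHOLD):
    """Run every case and report pass rate against the threshold."""
    failed = [c.case_id for c in cases
              if c.expected_substring not in agent(c.prompt)]
    pass_rate = (len(cases) - len(failed)) / len(cases)
    return {
        "model": PINNED_MODEL,
        "pass_rate": pass_rate,
        "ok": pass_rate >= threshold,
        "failed": failed,
    }
```

Recording the model identifier in the result makes a later regression traceable to a model or prompt change rather than to the cases themselves.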

For failure-mode context when oversight and validation are weak, see seven failure modes of production AI and the multi-agent amplification analysis.

Practitioner actions

  1. Publish an evidence map: Add docs/production-ai-readiness.md linking each PSF domain to concrete repo artifacts (policies, workflows, runbooks).
  2. Make evals visible: Commit an evals/ directory with regression cases, thresholds, and the model or prompt version under test.
  3. Document human gates: State who approves high-impact tool calls, what is logged, and what fails closed (human-in-the-loop design).
  4. Run your own scan: Use the Agent Readiness Index on your repository; compare to the May 2026 public cohort.
  5. Close model-level gaps separately: Pair documentation work with model scorecard review on /lab before selecting a frontier model for high-stakes workflows.
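Action 3 asks teams to state who approves high-impact tool calls, what is logged, and what fails closed. A minimal sketch of a fail-closed gate is shown below, under stated assumptions: the tool names in `HIGH_IMPACT_TOOLS`, the `approver` callback signature, and the `gated_call` helper are all hypothetical and not part of the PSF or any particular agent framework.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("human_gate")

# Hypothetical list of tools that require human approval.
HIGH_IMPACT_TOOLS = {"delete_records", "send_payment"}

Approver = Callable[[str, Dict[str, Any]], bool]


def gated_call(tool_name: str,
               tool_fn: Callable[..., Any],
               args: Dict[str, Any],
               approver: Optional[Approver] = None) -> Any:
    """Run tool_fn, requiring explicit approval for high-impact tools.

    Fails closed: a missing approver, an approver error, or any answer
    other than True blocks the call.
    """
    if tool_name in HIGH_IMPACT_TOOLS:
        approved = False
        try:
            approved = approver is not None and approver(tool_name, args) is True
        except Exception:
            logger.exception("approver failed for %s; denying", tool_name)
        logger.info("tool=%s args=%r approved=%s", tool_name, args, approved)
        if not approved:
            raise PermissionError(f"{tool_name} requires human approval")
    return tool_fn(**args)
```

In this sketch a call such as `gated_call("send_payment", pay, {"amount": 5})` raises `PermissionError` unless an approver is supplied and returns `True`, while tools outside the high-impact set run without a gate.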
