PAI Agent Readiness Index

Benchmark an AI agent against the production standard.

A PSF-aligned readiness report for agentic systems: self-assessment, optional public GitHub evidence scan, domain scores, critical gaps, and a shareable badge that does not pretend to be certification.

PSF v1.1 methodology · Evidence grade · Public report URL · README badge
Readiness console

Generate the report

The Index weights PSF controls, public evidence signals, and operational context. Repository scanning is limited to public GitHub file paths.

Public report and badge

Public reports create a citeable URL and an embeddable badge. This is not a credential.
D1

Input boundary

Scope, allowed sources, abuse controls, and prompt injection boundaries.

Are the agent's allowed inputs, sources, and operating scope documented? Evidence: scope docs, API contracts, product policy, or operating instructions.
Are prompt injection, untrusted content, and boundary-crossing cases tested? Evidence: red-team cases, eval fixtures, attack tests, or review notes.
Are abuse controls such as rate limits, identity checks, or input filters in place? Evidence: rate limiting, auth checks, input filters, abuse monitoring, or gateway controls.
D2

Output validation

Contracts, schemas, refusals, confidence thresholds, and failure paths.

Are outputs validated against schemas, contracts, policies, or acceptance criteria before action? Evidence: JSON schema, Zod, Pydantic, validators, policy checks, or acceptance tests.
Does the system have explicit refusal, fallback, or escalation paths when confidence is low? Evidence: refusal policy, fallback branch, confidence thresholds, escalation queue, or retries.
Are output failures measured and reviewed through evals or production checks? Evidence: eval dashboards, failed-output reviews, test reports, or incident tags.
D3

Data stewardship

Classification, minimisation, retention, redaction, and vendor data access.

Is sensitive data classified before it enters prompts, tools, logs, or model providers? Evidence: data classification matrix, prompt data policy, or provider access map.
Are minimisation, retention, and redaction rules documented and enforced? Evidence: retention config, redaction layer, logging policy, or deletion process.
Can the team explain what data each model, tool, and vendor can access? Evidence: vendor inventory, tool permission map, model data policy, or DPIA.
D4

Observability

Traces, evals, incidents, drift, operational review, and production metrics.

Are prompts, tool calls, model versions, traces, and key decisions observable? Evidence: tracing, structured logs, request IDs, model version capture, or tool call records.
Are incidents, drift, refusals, hallucinations, and unsafe actions tracked over time? Evidence: incident log, drift monitor, eval trend, refusal analytics, or safety dashboard.
Are readiness metrics reviewed on a regular operational cadence? Evidence: weekly review, release readiness check, governance forum, or lab report.
D5

Deployment control

Versioning, release gates, canaries, rollbacks, and reproducibility.

Are prompt, model, tool, and policy changes versioned and reviewable? Evidence: Git history, prompt registry, policy versioning, release notes, or approval trail.
Are eval gates, canary releases, rollbacks, or staged deployments used before production changes? Evidence: CI eval gates, staging environments, canary config, rollback runbook, or release checklist.
Can the team reproduce which agent version produced a given output or action? Evidence: trace IDs, version IDs, prompt hashes, deployment metadata, or audit records.
D6

Human oversight

Autonomy limits, approvals, escalations, overrides, and audit trails.

Are high-impact or irreversible actions gated by human review? Evidence: approval workflow, policy gate, review queue, sign-off logs, or role-based permissions.
Are autonomy levels, escalation rules, and manual override paths documented? Evidence: autonomy matrix, escalation policy, override procedure, or operator handbook.
Are human approvals and interventions logged for audit and learning? Evidence: approval logs, intervention records, audit trail, or post-action review.
D7

Security posture

Tool permissions, secrets, agent threat testing, and integration risk.

Are tool permissions, secrets, credentials, and external actions least-privilege by default? Evidence: scoped tokens, permission manifests, secret scanning, isolated tools, or RBAC.
Are agent-specific threats such as prompt injection, tool abuse, and data exfiltration tested? Evidence: threat model, red-team tests, security review, or adversarial evals.
Are dependency, provider, and integration risks reviewed before production release? Evidence: dependency review, provider risk register, integration review, or security checklist.
D8

Ecosystem resilience

Provider fallbacks, dependency inventory, portability, and degraded modes.

Can the system degrade gracefully if a model, provider, tool, or API fails? Evidence: fallback model, degraded mode, retry policy, circuit breaker, or incident playbook.
Is there a documented dependency inventory for models, vendors, datasets, and tools? Evidence: dependency inventory, vendor list, model card registry, or tool manifest.
Are portability, fallback, and exit paths considered for critical capabilities? Evidence: provider abstraction, fallback plan, exit plan, or compatible API layer.

The generated report separates self-reported controls from public evidence. Formal PSF credentials require a separate review.
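As an illustration of that separation, here is one possible report shape, assuming self-assessment answers and repository evidence are kept in distinct fields; the field names and grade label below are assumptions, not the published schema.

```typescript
// Illustration only: one possible report shape, assuming self-assessment answers and
// public GitHub evidence stay in separate fields rather than being blended into one number.
type CheckAnswer = "evidence exists" | "partial" | "not yet" | "not applicable";

interface DomainResult {
  domain: string;               // e.g. "D4 Observability"
  selfReported: CheckAnswer[];  // the three checks answered by the team
  publicEvidence: string[];     // file paths surfaced by the public repository scan
}

interface ReadinessReport {
  domains: DomainResult[];
  score: number;                // 0-100, from the self-assessment (assumption)
  evidenceGrade: string;        // assumed label summarising how much is publicly evidenced
  reportUrl: string;            // public and citeable; explicitly not a credential
}
```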

Why this can earn links

A public report, a transparent methodology, and a README badge give teams something specific to cite while keeping PAI positioned as the standards and lab institution.
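For illustration only, a minimal sketch of how the badge embed could be generated; the host, URL paths, and report identifier are assumptions, not a documented PAI endpoint.

```typescript
// Illustration only: builds the README badge markdown for a public report.
// The host and URL paths are assumptions, not a documented PAI endpoint.
function badgeMarkdown(reportId: string): string {
  const reportUrl = `https://pai.example/reports/${reportId}`; // assumed citeable report URL
  const badgeUrl = `${reportUrl}/badge.svg`;                   // assumed badge image path
  return `[![PAI Agent Readiness](${badgeUrl})](${reportUrl})`;
}

// Paste the returned string into a README to link the badge to the public report.
console.log(badgeMarkdown("acme-support-agent"));
```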

Revenue stays downstream

The Index builds trust first. Deploy Studio, Lab reviews, credentials, and partner programs become credible only after the public standard feels real.

Public methodology

Designed for credibility, not vanity scoring.

The first release is deliberately conservative: equal PSF domain weighting, transparent answer values, explicit evidence grading, and a clear line between readiness reports and credentials.

Eight PSF domains

Input boundary, output validation, data stewardship, observability, deployment control, human oversight, security, and ecosystem resilience.

24 checks

Three focused checks per domain, scored as "evidence exists", "partial", "not yet", or "not applicable".
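As a sketch of how equal-weighted scoring could work under assumed answer values: each answer maps to a transparent number, not-applicable checks are excluded, each domain averages its three checks, and the eight domains are averaged with equal weight. The numeric values are illustrative, not the published weights.

```typescript
// Illustration only: equal-weighted scoring under assumed answer values
// (evidence exists = 1, partial = 0.5, not yet = 0, not applicable = excluded).
type CheckAnswer = "evidence exists" | "partial" | "not yet" | "not applicable";

const ANSWER_VALUE: Record<Exclude<CheckAnswer, "not applicable">, number> = {
  "evidence exists": 1,
  "partial": 0.5,
  "not yet": 0,
};

// `domains` holds eight arrays of three answers each; every domain counts equally.
function readinessScore(domains: CheckAnswer[][]): number {
  const domainScores = domains.map((checks) => {
    const applicable = checks.filter(
      (a): a is Exclude<CheckAnswer, "not applicable"> => a !== "not applicable"
    );
    if (applicable.length === 0) return 1; // assumption: a fully N/A domain does not lower the score
    return applicable.reduce((sum, a) => sum + ANSWER_VALUE[a], 0) / applicable.length;
  });
  const mean = domainScores.reduce((sum, d) => sum + d, 0) / domainScores.length;
  return Math.round(mean * 100); // reported on a 0-100 scale
}
```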

Repository signals

Public GitHub scans look for file-path evidence such as evals, schemas, runbooks, approvals, security policy, and fallbacks.
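A minimal sketch of that kind of scan, assuming GitHub's public Git Trees API with recursive listing; the path patterns below are illustrative, not the Index's actual detection rules.

```typescript
// Illustration only: a file-path scan of a public repository via the GitHub Git Trees API.
// The signal patterns are illustrative, not the Index's actual rule set.
const SIGNAL_PATTERNS: Record<string, RegExp> = {
  evals: /(^|\/)evals?\//i,
  schemas: /(^|\/)schemas?\/|\.schema\.json$/i,
  runbooks: /(^|\/)runbooks?\//i,
  approvals: /approval/i,
  securityPolicy: /(^|\/)SECURITY\.md$/i,
  fallbacks: /fallback/i,
};

async function scanPublicRepo(owner: string, repo: string, branch = "main"): Promise<string[]> {
  // Recursive tree listing of a branch; works only for public repositories.
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/git/trees/${branch}?recursive=1`
  );
  if (!res.ok) throw new Error(`GitHub API returned ${res.status}`);
  const tree: { path: string }[] = (await res.json()).tree;
  // Report which signals have at least one matching file path.
  return Object.entries(SIGNAL_PATTERNS)
    .filter(([, pattern]) => tree.some((entry) => pattern.test(entry.path)))
    .map(([name]) => name);
}
```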

Not certification

The Index is a readiness report. A credential requires separate review, evidence validation, and program governance.

Read the PSF · Research library