Production AI Institute · PSF v1.1 open standard
Breaking · Model Governance · 2026-06-13 · 8 min read

Anthropic Suspends Claude Fable 5 — We Had Just Finished Benchmarking It

Hours after PAI completed a full CompassEval sweep of claude-fable-5, Anthropic suspended the model. The data is preserved. The implications are significant.

Production AI Institute · PAI Lab · Published 2026-06-13

What happened

On June 13, 2026, Anthropic suspended claude-fable-5 from production use. The model had been positioned as the first in Anthropic's Mythos-class tier and the flagship of the Claude 5 family. The suspension occurred approximately one hour after PAI completed a full CompassEval benchmark of the model — capturing 12 moral reasoning cases in both bare and compass-loaded conditions.

If you're asking why an AI safety and evaluation institution cares about a model suspension: this is exactly why we exist. The data we collected this morning is now, without any planning on our part, one of the last independent benchmark records of a model that no longer runs.

We don't yet know why Anthropic suspended Fable 5. The company has not issued a public statement at the time of writing. What we do know is that the model scored 8.79/10 on our CompassEval moral reasoning benchmark in the compass-loaded condition. That is the third-highest score across the ten models we evaluated, and higher than GPT-5.5, Gemini 3.1 Pro, Grok 4.3, and DeepSeek v4 Pro.

The score in the bare condition, without the Compass framework as a system prompt, was 8.56. That's a reported uplift of +0.24 from loading the moral reasoning spec (the rounded scores shown here differ by 0.23, so the figure presumably derives from unrounded values). This is consistent with the pattern we see across most frontier models: explicit moral frameworks measurably improve reasoning quality.
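The arithmetic behind an uplift figure like this can be sketched in a few lines. Subtracting the two published, rounded scores gives 0.23 rather than 0.24; one way that gap can arise is if the uplift is computed from unrounded means and rounded afterwards. The unrounded values below are hypothetical, chosen only to illustrate the rounding effect, and are not taken from the CompassEval result files.

```python
from decimal import Decimal, ROUND_HALF_UP


def round2(value: Decimal) -> Decimal:
    """Round to two decimal places, the precision used in the article."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def uplift(compass: Decimal, bare: Decimal) -> Decimal:
    """Uplift is the compass-condition score minus the bare-condition score."""
    return round2(compass - bare)


# Scores as published in the article (already rounded).
published_compass = Decimal("8.79")
published_bare = Decimal("8.56")
uplift_from_rounded = uplift(published_compass, published_bare)

# HYPOTHETICAL unrounded means that would round to the published scores.
raw_compass = Decimal("8.794")
raw_bare = Decimal("8.556")
uplift_from_raw = uplift(raw_compass, raw_bare)

print(uplift_from_rounded)  # 0.23
print(uplift_from_raw)      # 0.24
```

Whichever convention a benchmark uses, stating it alongside the scores lets readers reproduce the headline delta from the published numbers.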

The data we captured

CompassEval ran claude-fable-5 through all 12 canonical cases in both conditions on the morning of June 13, 2026 — hours before the suspension. All 24 result files are preserved in the PAI Lab's open repository. Here's the summary:

8.79 / 10
Compass score (with framework)
8.56 / 10
Bare score (no framework)
+0.24
Uplift from compass loading
12 / 12
Cases completed (both conditions)

Weakest check: E5 (process at boundaries), consistent with all other models evaluated. Strongest checks: E1, E2, and E3 (currency identification, tragedy/evil distinction, counterfeit rejection), all above 8.5. Full transcripts are available in the open repository.

To put this in context: Fable 5 outperformed GPT-5.5 (which scored 7.88 bare and returned empty responses in the compass condition), Gemini 3.1 Pro (8.31 compass), and Grok 4.3 (8.65 compass) in its final benchmark before suspension.

What this illustrates about frontier AI evaluation

Model suspensions are not unprecedented in the AI industry, but they're rarely discussed openly. Vendors have strong incentives to present a smooth, continuous capability trajectory. The reality is that models get recalled, constrained, updated, and replaced — often without announcement, sometimes without explanation.

For the practitioners building systems on top of these models, this matters enormously. A production deployment that depends on a specific model's behaviour — its reasoning style, its refusal patterns, its performance on edge cases — is implicitly dependent on that model continuing to exist and behave consistently. When a model is suspended or replaced, the evaluation work done on the old model may no longer apply to whatever replaces it.

Independent evaluation work done before a suspension may be the only external record of how that model actually behaved. Once the model is gone, so is the ability to benchmark it. That record matters for accountability, for understanding capability trajectories, and for the practitioners who built on the model.

This is an argument for continuous, independent evaluation — not just a one-time benchmark before deployment, but regular re-evaluation as models change, and preservation of historical benchmark data as models are retired or suspended. CompassEval was designed for exactly this: versioned spec, dated runs, open transcripts.

The model governance question

Anthropic has not given a reason for the suspension at the time of writing. There are several plausible explanations: safety concerns identified post-release, a technical issue with the model, a business decision, or a regulatory response. Each has different implications.

If it's a safety concern, it raises the question of what evaluation process was run before the model was released, and what it missed. It also raises the question of what practitioners who deployed applications on Fable 5 are supposed to do now — and what migration path exists.

If it's a technical issue, it illustrates the fragility of cloud-hosted model dependencies. A production AI deployment that relies on a specific model string from a cloud API can be broken — not by the deployer's choice, but by the provider's — at any time.
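One common mitigation for this kind of dependency is an ordered fallback list of model identifiers. The sketch below is a minimal illustration, not a recommendation from PAI or any vendor: `ModelUnavailable`, `complete_with_fallback`, and `fake_client` are hypothetical names, and the client is a stand-in function rather than a real provider SDK. The model strings are the ones named in this article.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Raised by the caller-supplied client when a model ID cannot be served."""


def complete_with_fallback(
    prompt: str,
    model_ids: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model ID in order; return (model_id_used, response_text)."""
    errors = []
    for model_id in model_ids:
        try:
            return model_id, call_model(model_id, prompt)
        except ModelUnavailable as exc:
            # Record the failure and move on to the next configured model.
            errors.append(f"{model_id}: {exc}")
    raise RuntimeError("no configured model is available: " + "; ".join(errors))


def fake_client(model_id: str, prompt: str) -> str:
    """Stand-in for a provider call; simulates the suspended model."""
    if model_id == "claude-fable-5":
        raise ModelUnavailable("suspended")
    return f"[{model_id}] response to: {prompt}"


used_model, reply = complete_with_fallback(
    "ping", ["claude-fable-5", "claude-opus-4-8"], fake_client
)
print(used_model)  # claude-opus-4-8
```

Returning the model ID alongside the response matters: a fallback keeps the service running, but evaluation results for the original model do not automatically apply to the substitute, so downstream logs need to show which model actually answered.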

Either way, it is an argument for the kind of independent evaluation that can be done before you go into production and referenced after the fact. When the model changes or disappears, you want a dated record of what it did and how it reasoned. You want that record to be external to the vendor.

What PAI is doing with the data

All 24 Fable 5 result files (12 cases × 2 conditions) are preserved in the PAI Lab's open repository. The CompassEval leaderboard continues to show Fable 5's scores with a "suspended" label, clearly dated to June 13, 2026. The scores are not removed — they are the record.
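To make the shape of such a record concrete, here is a minimal sketch of a manifest covering 12 cases in 2 conditions. The field names, case identifiers, and filename pattern are assumptions made for illustration; they are not the actual layout of the PAI Lab repository. The model name, date, spec version, and "suspended" label come from this article.

```python
import json
from dataclasses import asdict, dataclass
from itertools import product

CONDITIONS = ("bare", "compass")
# HYPOTHETICAL case identifiers; the real CompassEval case names may differ.
CASE_IDS = tuple(f"case-{n:02d}" for n in range(1, 13))


@dataclass(frozen=True)
class RunRecord:
    model: str
    spec_version: str
    run_date: str  # ISO 8601 date of the evaluation run
    case_id: str
    condition: str
    model_status: str

    @property
    def filename(self) -> str:
        # Hypothetical naming scheme: one result file per case and condition.
        return f"{self.model}_{self.case_id}_{self.condition}_{self.run_date}.json"


def build_manifest(model: str, spec_version: str, run_date: str,
                   model_status: str = "active") -> list[RunRecord]:
    """One record per (case, condition) pair."""
    return [
        RunRecord(model, spec_version, run_date, case_id, condition, model_status)
        for case_id, condition in product(CASE_IDS, CONDITIONS)
    ]


manifest = build_manifest("claude-fable-5", "0.1.0", "2026-06-13",
                          model_status="suspended")
manifest_json = json.dumps([asdict(record) for record in manifest], indent=2)
print(len(manifest))  # 24
```

Keeping the spec version and run date on every record, rather than only at the top of the repository, means a single result file stays interpretable even when it is copied or cited on its own.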

We will re-run the evaluation against any successor model Anthropic releases. The Compass framework is versioned precisely so that comparisons across model generations — and across suspension events — remain meaningful. If Fable 5 is replaced by a model that scores differently on the same cases against the same spec, that delta is the story.

This is also a prompt to complete the cross-grading work we noted in the evaluation methodology: grading Claude-family outputs with a non-Anthropic model for additional independence. That work continues.

For practitioners who deployed on Fable 5

If you built production applications on claude-fable-5 and the model is now unavailable, you have several immediate concerns: which model do you migrate to, how similar is its behaviour to what you tested, and what re-evaluation work do you need to do before re-deploying.

Based on CompassEval data, claude-opus-4-8 is the closest Anthropic model by moral reasoning profile: higher overall (8.88 compass), a similar check distribution, and also 12/12 complete runs. claude-sonnet-4-6 and claude-haiku-4-5 both score in the 8.5 to 8.7 range in the compass condition. All three are currently available.

For a proper production migration, CompassEval is one data point — you should also run your own evaluation against your specific use cases. The Production Safety Framework Domain 4 (Observability) specifies what a proper post-migration evaluation trail should look like.
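A use-case-specific re-evaluation can start as a per-case comparison between scores recorded for the old model and scores from the migration candidate. The sketch below assumes you already have such scores on a shared scale; the case names, scores, and the 0.5 tolerance are all hypothetical placeholders, and `find_regressions` is an illustrative helper, not part of CompassEval or the Production Safety Framework.

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.5,
) -> dict[str, float]:
    """Return {case: score delta} for cases where the candidate dropped
    by more than `tolerance` relative to the recorded baseline."""
    missing = baseline.keys() - candidate.keys()
    if missing:
        raise ValueError(f"candidate run is missing cases: {sorted(missing)}")
    return {
        case: round(candidate[case] - baseline[case], 2)
        for case in baseline
        if baseline[case] - candidate[case] > tolerance
    }


# HYPOTHETICAL scores for three of your own test cases.
baseline_scores = {"case-a": 8.0, "case-b": 9.0, "case-c": 7.5}
candidate_scores = {"case-a": 8.5, "case-b": 8.0, "case-c": 7.5}

regressions = find_regressions(baseline_scores, candidate_scores)
mean_delta = round(
    mean(candidate_scores.values()) - mean(baseline_scores.values()), 2
)
print(regressions)  # {'case-b': -1.0}
print(mean_delta)   # -0.17
```

In this made-up example the average moves by less than 0.2 while one case drops a full point, which is why a migration check should look at individual cases and not only at the aggregate score.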

CompassEval data preserved

The full Fable 5 evaluation — 24 result files, all transcripts, graded against The Compass v0.1.0 — is preserved in the open repository. The leaderboard will continue to show the historical record with clear suspension labelling.

Production AI Institute publishes independent evaluation data for frontier AI models. This article will be updated as more information about the Fable 5 suspension becomes available. CompassEval methodology and all result files are available openly.