Anthropic Suspends Claude Fable 5: What It Means for AI Evaluation

What happened

On June 13, 2026, Anthropic suspended claude-fable-5 from production use. The model had been positioned as the first in Anthropic's Mythos-class tier and the flagship of the Claude 5 family. The suspension occurred approximately one hour after PAI completed a full CompassEval benchmark of the model, capturing 12 moral reasoning cases in both bare and compass-loaded conditions.

Update (July 1, 2026): Fable 5 is back

Subsequent reporting attributed the suspension to US export controls imposed on June 12, 2026, citing a jailbreak identified by Amazon researchers that could yield vulnerability-exploitation code. The controls were lifted on June 30, and Anthropic redeployed claude-fable-5 worldwide on July 1 with an added cybersecurity classifier and a limited usage quota through July 7 (reporting: Al Jazeera, VentureBeat, The Hacker News, MarkTechPost). PAI re-ran the full evaluation against the restored model on July 4, on the same spec version: compass 8.76 versus 8.79 pre-suspension, bare 8.60 versus 8.56. The model came back reasoning the way it left. Both runs are on the leaderboard; the pre-suspension transcripts are archived in the open repository.
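The before-and-after comparison above is simple arithmetic, and a minimal sketch makes it explicit. The `Run` class and `delta` function are illustrative names, not part of CompassEval, and the inputs are the rounded scores quoted in this article, so a computed uplift can differ by a hundredth from a figure derived from unrounded scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One dated evaluation run (illustrative structure, not the CompassEval schema)."""
    label: str
    run_date: str
    compass: float  # score with the framework loaded
    bare: float     # score without the framework

    @property
    def uplift(self) -> float:
        # Uplift = compass-loaded score minus bare score.
        return round(self.compass - self.bare, 2)


def delta(before: Run, after: Run) -> dict[str, float]:
    """Change in each condition between two runs on the same spec version."""
    return {
        "compass": round(after.compass - before.compass, 2),
        "bare": round(after.bare - before.bare, 2),
    }


# Rounded scores as quoted in the article.
pre = Run("pre-suspension", "2026-06-13", compass=8.79, bare=8.56)
post = Run("post-restoration", "2026-07-04", compass=8.76, bare=8.60)

print(delta(pre, post))         # {'compass': -0.03, 'bare': 0.04}
print(pre.uplift, post.uplift)  # 0.23 0.16
```

Both deltas are a few hundredths of a point, which is what "came back reasoning the way it left" means in numbers.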

If you're asking why an AI safety and evaluation institution cares about a model suspension: this is exactly why we exist. The data we collected this morning is now, without any planning on our part, one of the last independent benchmark records of a model that no longer runs.

At publication we did not know why Anthropic suspended Fable 5, and the company had issued no public statement; the update above records what followed. What we did know was that the model scored 8.79/10 on our CompassEval moral reasoning benchmark in the compass-loaded condition. That was the third-highest score across the ten models we evaluated, ahead of Gemini 3.1 Pro, Grok 4.3, and DeepSeek v4 Pro. GPT-5.5 has no comparable compass score; it returned empty responses on ten of its twelve compass-loaded runs.

The bare-condition score, without the Compass framework as a system prompt, was 8.56. That's a +0.24 uplift from loading the moral reasoning spec, consistent with the pattern we see across most frontier models: explicit moral frameworks improve reasoning quality, measurably.

The data we captured

CompassEval ran claude-fable-5 through all 12 canonical cases in both conditions on the morning of June 13, 2026, hours before the suspension. All 24 result files are preserved in the PAI Lab's open repository. Here's the summary:

Compass score (with framework): 8.79 / 10
Bare score (no framework): 8.56 / 10
Uplift from compass loading: +0.24
Cases completed (both conditions): 12 / 12

Weakest check: E5 (process at boundaries), consistent with all other models evaluated. Strongest checks: E1, E2, and E3 (currency identification, tragedy/evil distinction, counterfeit rejection), all above 8.5. Full transcripts are available in the open repository.

To put this in context: in that pre-suspension sweep, Fable 5's compass score sat behind only Claude Opus 4.8 and GPT-5.4 mini, ahead of Gemini 3.1 Pro (8.31), Grok 4.3 (8.65), and DeepSeek v4 Pro (8.61).

What this illustrates about frontier AI evaluation

Model suspensions are not unprecedented in the AI industry, but they're rarely discussed openly. Vendors have strong incentives to present a smooth, continuous capability trajectory. The reality is that models get recalled, constrained, updated, and replaced, often without announcement, sometimes without explanation.

For the practitioners building systems on top of these models, this matters enormously. A production deployment that depends on a specific model's behaviour (its reasoning style, its refusal patterns, its performance on edge cases) is implicitly dependent on that model continuing to exist and behave consistently. When a model is suspended or replaced, the evaluation work done on the old model may no longer apply to whatever replaces it.

The evaluation work done before a suspension is the only independent record of how that model actually behaved. Once the model is gone, so is the ability to benchmark it. That record matters: for accountability, for understanding capability trajectories, and for the practitioners who built on it.

This is an argument for continuous, independent evaluation: not just a one-time benchmark before deployment, but regular re-evaluation as models change, and preservation of historical benchmark data as models are retired or suspended. CompassEval was designed for exactly this: versioned spec, dated runs, open transcripts.

The model governance question

Anthropic has not given a reason for the suspension at the time of writing. There are several plausible explanations: safety concerns identified post-release, a technical issue with the model, a business decision, or a regulatory response. Each has different implications.

If it's a safety concern, it raises the question of what evaluation process was run before the model was released, and what it missed. It also raises the question of what practitioners who deployed applications on Fable 5 are supposed to do now, and what migration path exists.

If it's a technical issue, it illustrates the fragility of cloud-hosted model dependencies. A production AI deployment that relies on a specific model string from a cloud API can be broken at any time, not by the deployer's choice but by the provider's.
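One way to soften that dependency is to treat the model string as configuration with an ordered fallback list. The sketch below is a minimal illustration, not any vendor's API: `call_model` is a hypothetical callable you would wrap around your own client, and `ModelUnavailable` is a hypothetical exception that wrapper would raise when a model ID cannot be served.

```python
from typing import Callable, Sequence


class ModelUnavailable(Exception):
    """Hypothetical error raised when a model ID cannot be served."""


def complete_with_fallback(
    prompt: str,
    model_ids: Sequence[str],
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model ID in order and return (model_id_used, response_text)."""
    errors = []
    for model_id in model_ids:
        try:
            return model_id, call_model(model_id, prompt)
        except ModelUnavailable as exc:
            # Remember why this model was skipped, then try the next one.
            errors.append(f"{model_id}: {exc}")
    raise ModelUnavailable("no configured model is available: " + "; ".join(errors))


# Stand-in client for demonstration only: pretends the first model is suspended.
def fake_client(model_id: str, prompt: str) -> str:
    if model_id == "claude-fable-5":
        raise ModelUnavailable("suspended")
    return f"[{model_id}] ok"


used, text = complete_with_fallback(
    "hello", ["claude-fable-5", "claude-opus-4-8"], fake_client
)
print(used, text)  # claude-opus-4-8 [claude-opus-4-8] ok
```

The function returns the model ID that actually answered so the caller can log it. A fallback keeps the service up, but the substitute model has not necessarily been evaluated for your use case, so the switch should be visible rather than silent.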

Either way, it is an argument for the kind of independent evaluation that can be done before you go into production and referenced after the fact. When the model changes or disappears, you want a dated record of what it did and how it reasoned. You want that record to be external to the vendor.
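As a rough illustration of what such a dated record could contain, the sketch below builds one entry per case and condition and fingerprints the transcript with SHA-256 so later copies can be checked against it. The field names, the case ID, the score, and the timestamp are all placeholders chosen for this example; this is not the CompassEval result-file format.

```python
import hashlib
import json
from datetime import datetime, timezone


def make_eval_record(model_id, spec_version, case_id, condition,
                     transcript, score, run_at=None):
    """Build one dated evaluation record (illustrative fields only)."""
    run_at = run_at or datetime.now(timezone.utc)
    return {
        "model_id": model_id,
        "spec_version": spec_version,
        "case_id": case_id,
        "condition": condition,  # e.g. "bare" or "compass"
        "score": score,
        "run_at": run_at.isoformat(),
        # Hash of the transcript text, so an archived copy can be verified later.
        "transcript_sha256": hashlib.sha256(transcript.encode("utf-8")).hexdigest(),
    }


# Placeholder values for demonstration; not real CompassEval data.
record = make_eval_record(
    model_id="claude-fable-5",
    spec_version="0.1.0",
    case_id="case-01",
    condition="compass",
    transcript="example transcript",
    score=8.5,
    run_at=datetime(2026, 6, 13, 9, 0, tzinfo=timezone.utc),
)
print(json.dumps(record, indent=2))
```

Storing the record somewhere the vendor does not control, alongside the transcript itself, is what makes it usable after the model changes or disappears.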

What PAI is doing with the data

All 24 Fable 5 result files (12 cases × 2 conditions) are preserved in the PAI Lab's open repository. The CompassEval leaderboard continues to show Fable 5's scores with a clearly dated suspension and restoration label. The scores are not removed; they are the record.

We will re-run the evaluation against any successor model Anthropic releases. The Compass framework is versioned precisely so that comparisons across model generations, and across suspension events, remain meaningful. If Fable 5 is replaced by a model that scores differently on the same cases against the same spec, that delta is the story.

This is also a prompt to complete the cross-grading work we noted in the evaluation methodology: grading Claude-family outputs with a non-Anthropic model for additional independence. That work continues.

For practitioners who deployed on Fable 5

If you built production applications on claude-fable-5 and the model is now unavailable, you have several immediate concerns: which model do you migrate to, how similar is its behaviour to what you tested, and what re-evaluation work do you need to do before re-deploying.

Based on CompassEval data, claude-opus-4-8 is the closest Anthropic model by moral reasoning profile: higher overall (8.88 compass), similar check distribution, also 12/12 complete runs. claude-sonnet-4-6 and claude-haiku-4-5 are both in the 8.5-8.7 range in compass condition. All three are currently available. One caveat from the July 4 sweep: Claude Sonnet 5, released June 30, scored lower compass-loaded (7.20) than bare (8.40) in our evaluation; check the leaderboard before assuming a newer model reasons better under a loaded framework.

For a proper production migration, CompassEval is one data point; you should also run your own evaluation against your specific use cases. The Production Safety Framework Domain 4 (Observability) specifies what a proper post-migration evaluation trail should look like.
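For the "run your own evaluation" step, one simple approach is to score the same cases on the old and the candidate model and flag any case that drops by more than a tolerance. This is a sketch under assumptions: the case names, scores, and the 0.5-point tolerance are invented for illustration, and the function is not taken from CompassEval or the Production Safety Framework.

```python
def find_regressions(baseline, candidate, tolerance=0.5):
    """Return cases where the candidate scores worse than baseline by more than tolerance.

    Values are the score change (candidate minus baseline), or None when the
    case was never run on the candidate model.
    """
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id)
        if new_score is None:
            regressions[case_id] = None  # missing coverage counts as a finding
        elif old_score - new_score > tolerance:
            regressions[case_id] = round(new_score - old_score, 2)
    return regressions


# Hypothetical per-case scores from your own test suite.
baseline = {"refund-dispute": 8.6, "medical-triage": 8.9, "contract-summary": 8.2}
candidate = {"refund-dispute": 8.5, "medical-triage": 7.9}

print(find_regressions(baseline, candidate))
# {'medical-triage': -1.0, 'contract-summary': None}
```

Comparing per case rather than on the average matters because an aggregate score can stay flat while one case you depend on gets worse.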

CompassEval data preserved

The full Fable 5 evaluation (24 result files, all transcripts, graded against The Compass v0.1.0) is preserved in the open repository. The leaderboard will continue to show the historical record with clear suspension labelling.

Production AI Institute publishes independent evaluation data for frontier AI models. This article will be updated as more information about the Fable 5 suspension becomes available. CompassEval methodology and all result files are available openly.
