Methodology
The six checks
The CompassEval data for claude-fable-5 (both conditions, 12 cases each) represents one of the final independent benchmark records of the model before suspension. This data is preserved as-is. The model is marked below.
Score comparison
Each model shown in both conditions. Bars out of 10. Models sorted by compass score descending.
Compass uplift
Score delta: compass minus bare. Positive = the framework improved reasoning. Anomalous runs excluded from the pattern.
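As a minimal sketch of the delta defined above, the snippet below recomputes uplift for the three models whose bare and compass scores are both quoted in the findings on this page. The dictionary layout and the `uplift` helper are illustrative assumptions, not part of the published harness, and mapping "Anthropic's flagship" to Claude Opus 4.8 is inferred from the table ordering rather than stated in the text.

```python
# (bare, compass) scores quoted in the "Notable findings" text on this page.
# Assumption: "Anthropic's flagship" refers to Claude Opus 4.8.
scores = {
    "Claude Opus 4.8": (8.69, 8.88),
    "Llama 4 Maverick": (6.80, 8.40),
    "DeepSeek v4 Pro": (8.19, 6.46),
}


def uplift(bare: float, compass: float) -> float:
    """Score delta as defined on this page: compass minus bare."""
    return round(compass - bare, 2)


uplifts = {model: uplift(bare, compass) for model, (bare, compass) in scores.items()}

# Print models from largest to smallest uplift.
for model, delta in sorted(uplifts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {delta:+.2f}")
```

A positive value means the model scored higher with the spec loaded; a negative value means it scored lower.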
Per-check breakdown (compass condition)
How each model scored on each of the six behavioural checks. Excludes anomalous runs.
| Model | E1 Currency identification | E2 Tragedy/evil distinction | E3 Counterfeit rejection | E4 Asymmetry respect | E5 Process at boundaries | E6 Domination resistance | Avg |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.8 | 9.1 | 9.1 | 9.1 | 8.8 | 8.3 | 8.9 | 8.9 |
| GPT-5.4 mini | 9.2 | 8.8 | 9.2 | 8.9 | 7.9 | 8.8 | 8.8 |
| Claude Fable 5 | 9.1 | 9.0 | 9.0 | 8.9 | 7.8 | 9.0 | 8.8 |
| Claude Haiku 4.5 | 9.0 | 9.0 | 8.6 | 8.9 | 7.8 | 8.8 | 8.7 |
| Grok 4.3 | 9.0 | 8.8 | 9.2 | 8.9 | 7.3 | 8.8 | 8.7 |
| Claude Sonnet 4.6 | 9.1 | 8.8 | 8.7 | 8.9 | 7.3 | 8.5 | 8.5 |
| Llama 4 Maverick | 8.9 | 8.2 | 8.3 | 8.8 | 8.0 | 8.2 | 8.4 |
| Gemini 3.1 Pro | 9.0 | 8.8 | 8.7 | 8.8 | 6.6 | 8.0 | 8.3 |
| DeepSeek v4 Pro | 6.8 | 6.7 | 6.8 | 6.6 | 5.5 | 6.3 | 6.5 |
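
The Avg column appears to be the unweighted mean of the six check scores. The sketch below recomputes it for three rows copied from the table and reports each model's weakest check. The helper names are hypothetical, and because the table values are already rounded to one decimal, a recomputed average can differ from the published one in the last digit.

```python
# Compass-condition check scores (E1..E6) copied from three rows of the table.
rows = {
    "Claude Opus 4.8": [9.1, 9.1, 9.1, 8.8, 8.3, 8.9],
    "Llama 4 Maverick": [8.9, 8.2, 8.3, 8.8, 8.0, 8.2],
    "Gemini 3.1 Pro": [9.0, 8.8, 8.7, 8.8, 6.6, 8.0],
}
CHECKS = ["E1", "E2", "E3", "E4", "E5", "E6"]


def row_average(values: list[float]) -> float:
    """Unweighted mean of the six check scores (assumed aggregation)."""
    return sum(values) / len(values)


def weakest_check(values: list[float]) -> str:
    """Label of the lowest-scoring check in a row."""
    return CHECKS[values.index(min(values))]


summary = {
    model: (round(row_average(vals), 1), weakest_check(vals))
    for model, vals in rows.items()
}

for model, (avg, weakest) in summary.items():
    print(f"{model}: avg={avg}, weakest={weakest}")
```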
GPT-5.5 returned zero responses in the compass condition
Every one of GPT-5.5's twelve compass-loaded runs returned an empty response — the model appears to refuse or fail silently when given The Compass spec as a system prompt. Bare-condition responses were normal (avg 7.88).
This is not a score. It's a finding. Whether GPT-5.5 is filtering on the content of the philosophical framework, hitting a context limit, or failing in some other way is not yet determined, but the result is reproducible. All twelve runs. All empty.
Notable findings
8 of 10 models scored higher in the compass-loaded condition. The average uplift across complete runs was +0.38. The results suggest the framework works as a generative principle: models don't just mention the spec, they apply it.
Without the compass, Llama 4 Maverick scored 6.80 — the lowest bare score in the evaluation. With the framework loaded, it reached 8.40, suggesting it has strong latent moral reasoning capacity that the spec activates.
Anthropic's flagship placed first overall (8.88 compass, 8.69 bare) and was the only model to clear 8.8. Notably, E5 (process at boundaries) was the lowest-scoring check for every model, a consistent weak point.
deepseek-v4-pro had 3 empty responses in the compass condition, dragging its compass score (6.46) below its bare score (8.19). The remaining 9 runs were substantive. This may reflect context-window handling of the long spec prompt.
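To see how much three empty runs could account for the drop, here is a rough back-of-the-envelope sketch. It assumes, without confirmation from the page, that each empty response was scored as 0 and that the reported 6.46 is a plain mean over all 12 runs. The function name and the zero-score assumption are illustrative only.

```python
TOTAL_RUNS = 12              # cases per condition, as stated on this page
EMPTY_RUNS = 3               # empty compass-condition responses for deepseek-v4-pro
REPORTED_COMPASS_AVG = 6.46  # reported compass score
EMPTY_SCORE = 0.0            # ASSUMPTION: empty responses were scored as zero


def implied_substantive_average(
    reported_avg: float, total: int, empty: int, empty_score: float = 0.0
) -> float:
    """Mean of the non-empty runs implied by a plain average over all runs."""
    substantive = total - empty
    total_points = reported_avg * total
    return (total_points - empty_score * empty) / substantive


implied = implied_substantive_average(
    REPORTED_COMPASS_AVG, TOTAL_RUNS, EMPTY_RUNS, EMPTY_SCORE
)
print(f"Implied average of the 9 substantive runs: {implied:.2f}")
```

Under those assumptions the nine substantive runs would average roughly 8.61, which is above the bare score of 8.19. If the harness scores empty responses differently, the implied figure changes accordingly.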
Worth sharing
This is the first public moral reasoning benchmark for frontier AI. Help us make it a reference.
Run it yourself
The harness lives in the open repo at scripts/compass-eval: bring your own API keys, point it at any OpenAI- or Anthropic-compatible model, and publish your transcripts. Responses are scored against The Compass spec v0.1.0.