- Released June 9, 2026. Anthropic’s first publicly available Mythos-class model — the same underlying architecture as Claude Mythos Preview, now with safety classifiers for general use.
- 80.3% on SWE-Bench Pro — that’s 21.7 points ahead of GPT-5.5 (58.6%) and 26.1 points ahead of Gemini 3.1 Pro (54.2%). For agentic coding, nothing else is close.
- Pricing: $10 input / $50 output per million tokens. Free on Pro, Max, Team, and Enterprise plans until June 22, 2026. After that, usage credits apply.
- 1M token context window, 128K max output. API model ID:
claude-fable-5. - Safeguard fallback: Sensitive queries (cybersecurity, biology, distillation) auto-route to Claude Opus 4.8. Triggers in less than 5% of sessions.
- PrimeAIcenter Score: 8.9/10 — best coding and long-horizon agent score we’ve tested in 2026. Loses a point on pricing and the unavoidable fallback behavior.
- Try it now: Available on claude.ai, API, Amazon Bedrock, Vertex AI, Microsoft Foundry, and GitHub Copilot.
SWE-Bench Pro
FrontierCode Diamond
Context Tokens
Input / Output per MTok
Fallback Rate
PAC Score
I’ll be honest — I wasn’t expecting it to ship today. I’d been watching the Project Glasswing rollout since April, tracking every breadcrumb Anthropic dropped about Claude Mythos, and my working assumption was that the public version would land sometime in Q3, quietly, with a lengthy safety disclaimer and a hefty waitlist. Then I woke up this morning, checked my feed, and there it was: Claude Fable 5. Available now. No waitlist. No special access required. Free on my Pro plan until June 22.
I’ve tested a lot of models this year — GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7. Each one came with good benchmarks and some real-world limitations I had to discover myself. Fable 5 hit different within the first twenty minutes of testing. Not because it’s faster or writes prettier prose. Because when I gave it a genuinely hard, multi-step coding task — one that I’d used to stress-test five other models — it didn’t just finish. It finished, caught its own error in one of the sub-tasks, corrected it, and wrote a cleaner solution than the one I’d expected. That’s the thing nobody tells you: the gap doesn’t show on quick demos. It shows when you let it run.
This review covers everything you need to know about Claude Fable 5 on day one: what it actually is, the benchmarks that matter (and which ones to ignore), how it stacks up against GPT-5.5 and Gemini 3.1 Pro on the work people actually do, the prompts I found most useful, and an honest look at who should pay the premium.
What Is Claude Fable 5, Exactly?
Claude Fable 5 is Anthropic’s first publicly available Mythos-class model. The Mythos class sits above Opus in Anthropic’s model hierarchy — this isn’t a rebrand or a marketing tier. It’s architecturally and capability-wise a different generation from Opus 4.8.
The name matters here. Fable comes from the Latin fabula — meaning “that which is told” — and shares its roots with the Greek mythos. Anthropic released two products today from the same underlying model: Claude Fable 5 (the general-access version with safety classifiers) and Claude Mythos 5 (the same model with some safeguards lifted, restricted to vetted Project Glasswing partners working in cybersecurity and critical infrastructure).
Fable 5 isn’t a neutered version of Mythos. For the vast majority of tasks — coding, knowledge work, analysis, vision, long-context reasoning — the performance difference between Fable 5 and Mythos 5 is within 1–3 percentage points. The difference only appears on the specific categories where Fable’s classifiers kick in: cybersecurity exploitation, offensive biology research, and distillation. Less than 5% of sessions trigger a fallback at all. For everyone outside those restricted verticals, Fable 5 is the frontier.
Anthropic is doing something I haven’t seen before: pricing a top-tier model at less than half what its restricted predecessor cost. Mythos Preview was effectively enterprise-only. Fable 5 at $10/$50 per million tokens is steep compared to Opus 4.8 at $5/$25, but it’s commercially viable for serious production workloads. The math changes when it finishes in one pass what Opus needs three attempts for.
Key Features of Claude Fable 5
1. Long-Horizon Agentic Coding
This is where Fable 5’s lead is not marginal — it’s categorical. The model is explicitly designed to work autonomously for days on complex, multi-stage tasks. Stripe gave it a 50-million-line Ruby codebase and asked it to complete a codebase-wide migration. The model finished in one day. A full engineering team doing the same work by hand would have taken over two months. On SWE-Bench Pro — which tests end-to-end GitHub issue resolution on real repositories — Fable 5 scores 80.3%, against 69.2% for Opus 4.8, 58.6% for GPT-5.5, and 54.2% for Gemini 3.1 Pro.
On the harder FrontierCode Diamond benchmark — which tests whether models can solve difficult coding problems while meeting production-quality standards — Fable 5 scores 29.3%. Opus 4.8 sits at 13.4%. GPT-5.5 is at 5.7%. That gap is not noise. I’ve seen the pattern in my own testing: give Fable 5 a production-grade problem with messy constraints and it produces cleaner, more maintainable code with fewer rounds of iteration. The model understands what done means in a real codebase, not just a benchmark harness.
2. Memory and Long-Context Focus
Fable 5 handles a 1M+ token context window and actively uses self-generated notes to improve output across extended runs. Anthropic ran it through the deck-building game Slay the Spire to test this: when the model had access to persistent file-based memory, its performance improved three times more than it did for Opus 4.8 under the same conditions. Fable also reached the game’s final act three times more often than Opus. That’s a meaningful proxy for how the model handles multi-session, multi-context professional work.
3. Vision and Multimodal Capabilities
Fable 5 is the current state-of-the-art model for vision-based tasks. It can reconstruct a web app’s source code from screenshots alone, extract precise numerical data from dense scientific figures, and understand spatial layouts with a level of accuracy that previous Claude models — even with helper tooling — couldn’t reach. The Pokémon FireRed demonstration is illustrative: past Claude models needed complex helper harnesses to navigate the game. Fable 5 completed it with vision alone and no additional scaffolding.
4. Knowledge Work and Finance
On Hebbia’s Finance Benchmark for senior-level reasoning, Fable 5 scored highest of any model tested. IMC, the trading firm, reported that the model aced their trading analysis evaluations — factual lookup, conceptual reasoning, root-cause analysis, and expected-value analysis — nearly across the board. Hex said Fable 5 was the first model to break 90% on their core analytics benchmark of long-running analytical tasks, calling out “a 10-point jump over Opus” and “strong judgment on the hardest questions.”
5. Scientific Research Capabilities (Mythos 5 tier)
This is the more restricted territory. Using Mythos 5, Anthropic’s internal protein design team accelerated aspects of the drug design process by approximately ten times. The model executed all tasks a scientist normally handles — choosing binding sites, selecting and running protein design tools, recovering from failures — without human assistance. Nine of fourteen protein targets from this study yielded strong candidates currently under investigation.
For Claude Fable 5 users, some of these capabilities are present but limited by the safety classifiers. The genomics research work — where Mythos 5 assembled single-cell data for millions of cells across 138 animal species and trained a machine learning model that outperformed a recent Science publication — required the full Mythos 5 access. Worth knowing what you’re getting vs. what stays gated.
Claude Fable 5 Benchmark Scores vs. Competitors
Here’s the full head-to-head comparison against Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Note the asterisks — starred rows reflect Mythos 5 scores, not Fable 5, because the safety fallbacks reduce performance on those specific categories.
| Benchmark | Claude Fable 5 | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| SWE-Bench Pro (agentic coding) | 80.3% | 69.2% | 58.6% | 54.2% |
| FrontierCode Diamond (production quality) | 29.3% | 13.4% | 5.7% | — |
| Terminal-Bench 2.1 | 88.0%* | — | 83.4% | 70.7% |
| GDPval-AA (knowledge work) | 1932 | 1890 | — | — |
| OSWorld-Verified (computer use) | 85.0% | 83.4% | 83.4% | — |
| Spatial Reasoning | 38.6%* | — | — | — |
| Legal tasks | 13.3% | — | — | — |
| Health | 66.0%* | — | — | — |
| Prompt Injection Resistance (k=100) | 4.8% | 9.6% | 30.8% | 45.5% |
The pattern is consistent: the harder and longer the task, the wider Fable 5’s lead grows. Quick factual queries? Marginal difference over Opus 4.8. Multi-day autonomous work with complex dependencies? The gap is enormous and practically visible within an hour of testing.
Pricing and Availability
| Plan / Access | Price | Fable 5 Access | Notes |
|---|---|---|---|
| Claude Pro / Max / Team | Existing plan | Free until June 22 | Usage credits required after June 23 |
| Claude Enterprise (seat-based) | Existing plan | Free until June 22 | Then usage credits apply |
| Claude API (direct) | $10 input / $50 output per MTok | Available now | Model ID: claude-fable-5. Batch: $5/$25 per MTok. 90% prompt caching discount. |
| GitHub Copilot | Usage-based billing | Available now | Pro+, Business, Enterprise. Data retention required. |
| Amazon Bedrock / Vertex AI / Azure Foundry | Provider list pricing | Available now | Same underlying model, platform-native deployment |
| Claude Mythos 5 | $10 input / $50 output per MTok | Restricted | Project Glasswing partners only. Cybersecurity / bio researchers via trusted access. |
Testing Methodology and Prompts That Actually Work
I ran Fable 5 across five testing categories: agentic coding, long-context analysis, vision tasks, structured knowledge work, and general reasoning under ambiguity. Here’s what I found useful, including three prompts you can use today.
Test 1: Autonomous Refactoring Prompt
This one I use to stress-test every new coding model. It requires the model to understand intent across multiple files, handle conflicting constraints, and produce production-quality output without hand-holding.
Prompt — Tested on Claude Fable 5 (claude-fable-5)
Your tasks:
1. Audit all three files and identify the root cause of the null-handling failure.
2. Refactor the validation logic to handle nulls gracefully without changing the public API signature.
3. Update the unit tests in /tests/test_pipeline.py to cover the new null cases.
4. If you find any other latent bugs during the audit, document them in a comment block at the top of each file.
Do not ask clarifying questions. Work autonomously and explain your decisions inline.
Test 2: Long-Document Financial Synthesis
Prompt — Knowledge Work Testing
1. Identify the three largest drivers of margin compression over the three-year period.
2. Flag any discrepancies between stated revenue growth and actual free cash flow trends.
3. Produce a one-page executive summary with your findings, using a senior research analyst’s voice.
4. Note any risks that appear in the footnotes but aren’t discussed in the main narrative.
Cite the specific pages and tables you’re drawing from.
Test 3: Vision-to-Code Reconstruction
Prompt — Vision Testing
Understanding the Safeguard Fallback System
This is the most misunderstood part of today’s launch. When Fable 5’s classifiers detect a query in one of three categories — cybersecurity, biology/chemistry, or distillation — the response is automatically handled by Claude Opus 4.8 instead. You’ll be informed when this happens.
In practice: more than 95% of sessions involve zero fallbacks. For the average developer, researcher, analyst, or content creator, you’ll never encounter it. The classifiers are tuned conservatively, which means some benign requests — especially adjacent to security topics — will occasionally trigger a fallback you didn’t expect. Anthropic acknowledges this and is working to reduce false positives post-launch.
The fallback to Opus 4.8 matters because Opus 4.8 is itself a strong model. You don’t get a refusal — you get a response from a highly capable model on a different tier. For those building in cybersecurity-adjacent spaces, this behavior is worth testing before committing to a production deployment.
PrimeAIcenter Score (PAC Score)
Pros and Cons
- Best agentic coding model publicly available — the SWE-Bench Pro lead is real and I felt it in testing.
- Long-context memory is genuinely improved vs. Opus 4.8. The model stays focused and self-corrects across extended runs.
- Vision tasks are the new frontier — reconstructing apps from screenshots without scaffolding is a practical capability, not a party trick.
- Less than half the price of Mythos Preview. Commercially viable for serious production workloads.
- Available on every major platform on day one: API, Bedrock, Vertex, Foundry, GitHub Copilot.
- Prompt injection resistance is class-leading. 4.8% attack success rate vs. 30.8% for GPT-5.5.
- Free on Pro/Max/Team plans until June 22 — plenty of time to run real workload tests before deciding on credits.
- $50/million output tokens is expensive for high-volume, production API workloads. Cost discipline is non-optional.
- The safeguard fallback triggers on some benign requests in security-adjacent domains — frustrating if your work lives near those edges.
- Data retention requirement for GitHub Copilot integration is a hard constraint for enterprise compliance in certain industries.
- Computer use benchmark doesn’t lead the field — Mythos Preview edges it 85.4% to 85.0% on OSWorld-Verified. Not a weakness, but not the headline either.
- Post-June 22, access on subscription plans requires usage credits. Pricing structure is still being defined and could change.
- Cybersecurity, biology, and distillation tasks route to Opus 4.8, not Fable 5. If those verticals are your primary use case, Fable 5 isn’t the model you need.
Claude Fable 5 vs. GPT-5.5 vs. Gemini 3.1 Pro
The direct comparison most people need. I’ve tested all three. Here’s the honest breakdown:
On agentic coding: Fable 5 wins, and it’s not close. 80.3% vs. 58.6% (GPT-5.5) vs. 54.2% (Gemini 3.1 Pro) on SWE-Bench Pro. If you’re building or maintaining complex codebases and need a model to work autonomously for hours at a time, Fable 5 is the right call. GPT-5.5 is strong in its own Codex CLI harness — cross-lab terminal benchmarks are harness-confounded — but on the neutral benchmarks, Fable 5 leads clearly.
On pricing: GPT-5.5 costs roughly half of Fable 5 per token. If your work is routine — writing, summarization, classification, moderate-complexity code — the capability premium for Fable 5 doesn’t earn itself. Run both on your actual workload during the free window (before June 22) and see if the quality gap justifies the cost difference.
On vision: Fable 5 is the clearest win. GPT-5.5 and Gemini 3.1 Pro both have capable vision, but Fable 5’s ability to handle complex, structured visual data — scientific figures, UI reconstruction, game environments — without additional scaffolding is a practical differentiator I noticed immediately in testing.
On safety posture: Both Anthropic and OpenAI gate cybersecurity and biology behind safety classifiers and vetted-access programs. This isn’t a Fable 5-specific limitation — it’s the new industry standard at the frontier tier. Gemini 3.1 Pro’s prompt injection resistance is notably weaker (45.5% attack success rate vs. Fable’s 4.8%).
For the deep dive on how I tested GPT-5.5 and Gemini 3.1 Pro, see those full reviews. The three-way benchmark comparison in the table above covers the publicly available numbers.
What Surprised Me Most About Claude Fable 5
I expected the coding benchmarks. I expected the long-context improvements. What I didn’t expect was the self-validation behavior. At the highest effort level, Fable 5 reflects on and validates its own work before declaring a task complete. Yusuke Kaji, GM of AI for Business at Kintsugi, flagged this in early testing: “the extra thinking pays for itself.” That’s not marketing language — I saw it in my own testing. On the refactoring task I ran, the model caught an error in its own initial solution before I had to give feedback. That’s a different category of behavior than “the model is smart.”
The other surprise: the performance on Ethan Mollick’s informal testing, published on his Substack today, corroborated what I was seeing independently. He noted that Fable 5 “outperformed basically every other public model I have used by a considerable margin” and described it working “up to a dozen hours executing on multi-page specifications.” Given that Mollick tests dozens of models methodically and doesn’t give away strong praise easily, that’s a signal worth taking seriously.
Andrej Karpathy called it “a major-version-bump-deserving step change forward.” I don’t use language like that lightly. But after a few hours of testing across domains, I understand why he said it. This is not an incremental release.
Who Should Use Claude Fable 5?
Limitations and Honest Caveats
A few things worth flagging before you commit to a production rollout:
The asterisks in the benchmark table matter. Several of Anthropic’s headline numbers — Terminal-Bench 2.1, spatial reasoning, health — are Mythos 5 scores, not Fable 5 scores. Fable 5 lands closer to Opus 4.8 on those categories due to safety fallbacks. Anthropic is transparent about this, but coverage of the launch has been less so. Read the fine print before citing those numbers internally.
Cost management is a real operational concern. Fable 5 at $50/million output tokens is token-hungry on long tasks. If you’re running multi-hour autonomous workflows, you need a cost cap and a routing strategy. Use prompt caching aggressively — the 90% input token discount is significant. Reserve Fable 5 for the tasks that justify it and route simpler work to Opus 4.8 or Sonnet.
It’s day one. Watcher Kaji’s comment about “extra thinking paying for itself” reflects early testing under controlled conditions. Real-world production behavior — especially for agentic pipelines running unattended for hours — will surface edge cases that don’t appear in benchmarks. Test on your actual workload before making production decisions.
The Best AI Coding Model Available Today. Worth the Premium If the Work Is Hard Enough.
Claude Fable 5 is a genuine step change — not an incremental release, not a renamed model with tweaked hyperparameters. The SWE-Bench Pro lead (80.3% vs. 58.6% for GPT-5.5) is large enough to feel in practice, the self-validation behavior is a new category of capability, and the long-context memory improvements are real. The pricing is steep, and the safeguard fallback is a constraint you need to understand before deploying. But if you’re working on complex, long-horizon coding, research, or analytical tasks, Fable 5 earns its cost. Run it on your actual work before June 22 while it’s free on paid plans. That’s the honest advice.
Frequently Asked Questions
claude-fable-5. Access it through the Anthropic Claude Platform, Amazon Bedrock, Vertex AI, or Microsoft Foundry.
