jmv.dev
---
title: "Claude Fable 5 (Mythos): Impressive, and Concerning"
date: 2026-06-11
tags: [claude-fable-5, mythos, anthropic, model-review, opinion]
reading: 15 min
---

Claude Fable 5 (Mythos): Impressive, and Concerning

The first Claude Mythos class model just shipped to the public. Claude Fable 5 is out, and I have spent the last two days reviewing it. The results are in, and they are impressive, both in a good way and in a genuinely concerning way. This review covers coding, design, research, how the thing actually feels to use (very different), how much it costs (a lot), and the broader worries this release is already raising.

A bit of context first, because the rollout matters as much as the model. Anthropic hyped this one harder than I have ever seen them hype anything. There was Project Glass Wing, where the model was handed to a select set of companies before the public got it. And do not let the naming throw you: Fable is Mythos, just with some safeguards built in on top. With that much buildup I half expected the model to underdeliver. It did not. If anything it cleared the bar.

The benchmarks, with the usual caveats

Anthropic's own line is that Fable 5's capabilities exceed anything they have made generally available, state of the art on nearly all benchmarks, strong across engineering, knowledge, vision, and scientific research. Take a lab grading its own homework with a grain of salt, obviously. Some of the margins they put up, a 70 to 58 lead over GPT 5.5 in agentic coding for example, I find genuinely hard to believe at face value. But the through-line is real enough: this thing looks state of the art across basically everything.

The one that caught my eye was spatial reasoning. OpenAI has quietly owned that category for a long time, and here Fable tops even their best. Hold that thought, because it is what sent me down the next rabbit hole.

The third-party numbers tell the same story with less spin. On the Artificial Analysis intelligence index, Fable 5 is first, which I expected. What I did not expect was the size of the gap. It is about a five point jump over GPT 5.5x, which is the same distance as GPT 5x high to Minimax M3, and wider than the gap between Gemini 3.1 Pro and GPT 5.5x high. That scale is the headline for me, not the ranking. The price is its own headline: roughly double GPT 5.5x, which is a little insane, and we will get there.

It is not on DeepSWE yet, which is a shame, because that has been my favorite benchmark and the one that tracks my real-world experience best. But CursorBench 3.1 is a clean sweep: the top four spots all go to Fable, from max reasoning all the way down to medium. The next model down is Opus 4.7, neck and neck with GPT 5.5 extra high, and amusingly Opus 4.7 lands above Opus 4.8. The distance between those runners-up and Fable is large.

One more capability flex worth mentioning. Anthropic had the model play Pokémon FireRed with vision only, no elaborate harness, and it beat the game in about 50 hours. If you have watched older models attempt Pokémon, they were all over the place and needed increasingly complicated scaffolding to get anywhere. Vision only, start to finish, is a real leap. Separately, it also looks strong on alignment, which matters a lot more when the model is this capable.

So I asked it to build a game

If it is genuinely good at coding and, for the first time, genuinely good at spatial reasoning, what better stress test than a strategy game. If you went on X around launch you saw the wave of it: 3D worlds in three.js, CAD prototypes, all kinds of impressive demos. I wanted to push past a browser toy. I asked it to use Godot, the engine behind Slay the Spire 2, and build a real game with a real engine from scratch.

The brief was deliberately ambitious, and deliberately light on technical direction. I did not want to make the stack choices. I wanted to see where it would take me. Roughly:

Build me an RTS with three civilizations based on the Warcraft 3 engine, but with classical antiquity as the backdrop. Romans, Gauls, and Carthaginians as the first civs. Heroes like in Warcraft 3, similar economy and micro and macro mechanics to the Blizzard engine. Use whatever tech stack is best. It must run well on my Mac so I can play it. Keep going until it is polished and ready. Use whatever gets the best assets, create them yourself when needed. No limits. Surprise me. This needs to be AAA quality, better than Age of Empires 4. Mac native, polished 3D graphics, no cutting corners. Heroes must be historical characters and the lore has to be accurate. No half-assed indie looking. AAA quality that would make Lord Gaben jelly.

That is the whole point of the test. Almost no technical input, just a vision and a very high bar. Here is what came back.

A full main menu, a how-to-play screen, settings, and a soundtrack that Fable composed itself. Three playable civilizations: the Roman Republic, the Gallic tribes, and the Carthaginian Empire. It built the AI, the engine, the armor types and attack types and the multiplier tables between them. The whole thing took about two and a half hours.

In-game you have workers you can send to a gold mine, rally points, houses to raise your population cap, the gold and timber economy, units, and heroes that gain experience, level up, and unlock abilities. There is fog of war. There are creeps to farm, just like Warcraft 3, so you can level your heroes off them. I gave it zero assets. Every model, every texture, the music, all of it came from the prompt above.
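If you have not played Warcraft 3, the armor type and attack type tables mentioned above work roughly like the sketch below. To be clear about what is assumed: the type names and every number here are made up for illustration, they are not the values Fable generated, and the actual game is a Godot project, not Python.

```python
from enum import Enum


class Attack(Enum):
    NORMAL = "normal"
    PIERCE = "pierce"
    SIEGE = "siege"


class Armor(Enum):
    LIGHT = "light"
    HEAVY = "heavy"
    FORTIFIED = "fortified"


# Hypothetical multipliers, chosen only to illustrate the idea.
# Any (attack, armor) pair not listed here deals unmodified damage.
MULTIPLIERS = {
    (Attack.NORMAL, Armor.FORTIFIED): 0.5,
    (Attack.PIERCE, Armor.LIGHT): 2.0,
    (Attack.PIERCE, Armor.HEAVY): 0.75,
    (Attack.PIERCE, Armor.FORTIFIED): 0.5,
    (Attack.SIEGE, Armor.FORTIFIED): 1.5,
}


def damage(base: float, attack: Attack, armor: Armor) -> float:
    """Scale base damage by the attack-vs-armor multiplier (1.0 if unlisted)."""
    return base * MULTIPLIERS.get((attack, armor), 1.0)


if __name__ == "__main__":
    # Archers (pierce) shred light units but bounce off buildings (fortified).
    print(damage(20, Attack.PIERCE, Armor.LIGHT))
    print(damage(20, Attack.PIERCE, Armor.FORTIFIED))
```

The point of a lookup table like this is that unit counters fall out of data rather than special-case code, which is why it is a decent proxy for whether a model understood the genre.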

Would I publish it to Steam? No. But that is not the point. This is a huge jump, and not only in raw execution. Maybe, with a lot of careful prompting, I could coax something comparable out of GPT 5.5. Maybe. But look at the brief again. The thing that stands out is how little steering it needed to understand the assignment and just go do it.

The UI test

I had Fable design a landing page and a SaaS dashboard, and I lined them up against the same test I ran on Opus 4.8. Both done in Claude Code so the harness is held constant.

The Fable landing page is a clear step up. The layout is actually right. It does not read as generic or vibe-coded the way the Opus one did. It still has its tells, some of the icons and sections feel a little stock, and the gradients drifted to green instead of the usual purple, but the spacing, the structure, the craft, all noticeably better.

For contrast, the Opus 4.8 version has a messier layout, clumped sections with no breathing room, inconsistent details like some cards carrying timestamps and others not. It is not bad, exactly. It is just more generic, less considered, less finished.

I had it generate five styles. The purple gradients are back in one of them. Another is animated and genuinely polished. The editorial style is probably the weakest of the set, not quite right for a landing page, but even that one is cleanly executed. There is a pop style that is not my taste at all but is very well done, and a final very clean one that holds up. The takeaway across all five is that they are close to shippable. You will still have your own taste and your own rounds of back and forth, but as a first pass that is not embarrassing to put in front of someone, Fable gets there in roughly one prompt.

The dashboard told the same story, more starkly. I gave both models the same loose prompt, just build me a SaaS dashboard, as much freedom as possible. Opus mostly built the overview page and left it there, with a basic sidebar and decent but unremarkable taste. Fable built every page. Every menu, every separator, the toggles, the navigation, almost all of it actually works. More thorough, with less direction.

That is the pattern worth naming. The trajectory is clearly toward needing less and less technical input to get something usable. Usable is not the same as exactly what is in your head, and I want to be precise about that. It is not reaching into your brain and rendering your vision pixel for pixel. But it will ship something that works and does the job, with very little technical hand-holding. That is the core breakthrough of the Mythos class: far less steering to get to something close to a final product.

What it is actually like to use

I also put Fable through real software engineering tasks across a few of my projects, and this is where I came away most impressed. The highlight is not even the raw intelligence. It is the ability to understand things you used to have to explain.

My previous go-to was GPT 5.5. It gets things right in one shot very often, but it asks for careful prompting, steering, and explaining, sometimes to a fairly technical degree, so that it does not drift from what you wanted. It also runs well in loops and self-improvement cycles if you set those up with the right prompts. Fable needs none of that. You describe a rough idea, and it identifies the right questions on its own, answers them, and brings the idea to life. Not a rough draft. Your polished version.

It is extremely thorough, and it made it obvious to me why Dario recently said that not just coding but end-to-end software engineering is going away. Fable does not hand you a scrappy prototype and stop. It goes all the way. It is resourceful in a way that surprised me, building its own way to test the app, to check the UI, to find bugs, without being told to. It understands what it built, launches it, runs it, and flags UI issues itself, with almost no steering.

This reframed something for me. Anthropic has been saying their engineers just build agent loops now, and when I read those posts I assumed they meant orchestrating loops inside something like Hermes or OpenClaw, or some homegrown harness around Claude Code. It is none of that. It is just Claude Code. They have been running this model internally for a while, and it needs very little orchestration. It spins up subagents the way a team lead delegates, it stays on track for a long time, and the degradation as the context window fills up is much gentler than with Opus.

And do not read this as only being about thoroughness. The model is noticeably smarter than Opus or GPT 5.5, and you can feel it. I handed it a pile of hard problems from hobby projects, the kind where I had given up because they were taking too much time and steering to be worth it, and Fable just gets them. So far there is basically nothing it could not solve, and where it did not nail it on the first try, one or two more rounds got there.

Research

I am not a researcher, but I wanted to see how it handles investigating a topic and pulling together real information. My go-to for that has been the GPT Pro models since the o1 days. Anthropic had never shown me anything at that level of reasoning on genuinely hard problems, until now.

Fable's unlock here is, again, asking the right questions. Elon Musk likes to talk about this, a lesson he took from The Hitchhiker's Guide to the Galaxy: the hard part of finding the meaning of life is not being smart enough to answer a difficult question, it is knowing which question to ask. That is a clean analogy for what Fable does. It figures out the right questions and goes and answers them without you steering it there.

The test I ran was simple. I told it to use a TrustMRR API key, analyze startup data, and find the best startup idea for me assuming I wanted to sell it through Facebook ads, Reddit ads, or influencers. I added those constraints on purpose, to see how far it would dig into how the TrustMRR companies were actually marketing themselves. It scraped the data, then went deep with multiple subagents each chasing the most promising angles. It came back with what it thought were the best ideas, but flagged that it could not find Facebook or TikTok ad activity. I pushed back and asked whether that was not itself a red flag. So it pressure-tested its own ideas and killed several of them on fairly in-depth grounds. One example: it had suggested building a third-party app for Vinted, then discovered on its own that Vinted has been cracking down on exactly those apps and that the API, while it exists, probably would not be open to me. Figuring that out manually would have cost me real time.

That is the shift. The model anticipated the problem and dug deeper without being prompted to. Before, you needed enough experience in the domain to guess where the landmines were, and then you had to ask it to go check each one. It is a basic test as research goes, but the principle holds, and I think we are about to see advances in physics, medicine, chemistry, and pharma at a pace we have not seen in a very long time, maybe ever. It also makes me wonder what it would be like if OpenAI let us run the Pro models inside Codex. Maybe they will. Maybe they will not, especially with Anthropic about to pull Fable out of the subscriptions.

Now the cost

This is the bad part of the conversation. I built a little tool that totals the token cost of a project's sessions and compares it to what it would run at API prices. Here is what the review cost me.

| What I built | Approx. API cost |
| --- | --- |
| The Godot RTS (about a 2.5h session) | ~$430 |
| Five landing pages | ~$20 |
| The SaaS dashboard | ~$30 |
| The cost-analysis CLI tool itself | ~$15 |
| The startup research run | ~$60 |

The game at $430 for a couple of hours is the eye-watering one, but look at the cost-analysis tool: $15 for a basic CLI utility. Caching is clearly doing a lot of the heavy lifting and it is still expensive. The research run breaks down as about $20 on the main loop and $37 on subagents, roughly $57 in total (the ~$60 above is rounded). That is the shape of everything with this model: very good, very thorough, very token-hungry.
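For anyone who wants to build a similar tool, the core of it is just multiplying token counts by per-token prices and summing across sessions. Here is a minimal sketch of that idea, not my actual tool. The `Usage` fields, the function names, and above all the prices are placeholders I picked for the example; they are not Fable 5's real price list, so swap in the published rates before trusting any number it prints.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counts for one session (hypothetical field names)."""
    input_tokens: int
    cached_input_tokens: int
    output_tokens: int


# Placeholder prices in USD per million tokens. NOT real Fable 5 pricing.
PRICE_PER_MTOK = {"input": 10.0, "cached_input": 1.0, "output": 50.0}


def session_cost(usage: Usage, prices: dict = PRICE_PER_MTOK) -> float:
    """API-equivalent cost of one session in USD."""
    total = (
        usage.input_tokens * prices["input"]
        + usage.cached_input_tokens * prices["cached_input"]
        + usage.output_tokens * prices["output"]
    )
    return total / 1_000_000


def project_cost(sessions: list) -> float:
    """Sum the API-equivalent cost over every session of a project."""
    return sum(session_cost(s) for s in sessions)


if __name__ == "__main__":
    sessions = [
        Usage(input_tokens=1_000_000, cached_input_tokens=10_000_000, output_tokens=200_000),
    ]
    print(f"${project_cost(sessions):.2f}")
```

Pricing cached input separately matters: in a long agentic session most of the input is re-read context, so the cached rate dominates the bill. For reference, the five line items in the table above add up to roughly $555 for the whole review.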

A warning on ultracode. I did not use it for most of this, on purpose. It blows up your usage. I tried it early and stopped, because it burns tokens like crazy. The thing about Mythos is that it works like a grand orchestrator, understanding the full scope of a task and spinning up subagents to handle each piece really well, which is exactly why it is so effective and exactly why it is so expensive.

On the 20x Max subscription I was able to run all of these tests and landed around 70 to 80 percent of the weekly limit. Before I upgraded I was on the 5x, used ultracode, and watched it vanish in five to ten minutes. Be careful.

And here is the worst part. From launch through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost. On June 23, they remove it from those plans, and using it after that requires usage credits, with only a vague promise: "if capacity allows, we'll extend the included windows." I expect a lot of people to simply stop using it. I will use it sparingly, only when I really need it, because day-to-day coding on this thing would be a thousand dollars a day or more at API prices.
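The thousand-dollar figure is easy to sanity-check against the RTS session, on the rough assumption that its burn rate is typical of heavy agentic work. It may well not be, since lighter tasks like the landing pages cost far less. A quick sketch of the arithmetic, using only the $430 and 2.5 hour figures reported above:

```python
# Figures from the Godot RTS session reported above.
SESSION_COST_USD = 430
SESSION_HOURS = 2.5

# API-equivalent burn rate, assuming the whole session cost is spread evenly.
HOURLY_RATE = SESSION_COST_USD / SESSION_HOURS


def daily_cost(hours: float, rate: float = HOURLY_RATE) -> float:
    """Cost of running at the given hourly rate for `hours` hours."""
    return hours * rate


def hours_to_reach(budget: float, rate: float = HOURLY_RATE) -> float:
    """How many hours at the given rate it takes to spend `budget`."""
    return budget / rate


if __name__ == "__main__":
    print(HOURLY_RATE)                        # 172.0 USD per hour
    print(daily_cost(6))                      # 1032.0 USD for six hours
    print(round(hours_to_reach(1000), 1))     # 5.8 hours to hit $1,000
```

So at the RTS session's pace, a bit under six hours of continuous work crosses $1,000.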

What makes the rug pull sting is that this is not a model that only earns its keep on hard problems. Fable makes the simple tasks better too, faster, with less prompting from you. It is the opposite of the GPT Pro models, which you reach for only when something is genuinely complex. I would want Fable as a daily driver. So pulling it out of the subscriptions on June 23 is a real shame, and I would bet usage drops off a cliff if they go through with it.

The part that worries me

The last thing, and the one I keep coming back to, is not really a safety concern so much as a concern about how this release was handled.

This is Jeremy Howard's point, and he is worth listening to. He founded Fast.ai, which is still the way I would tell any programmer to learn AI, and he has been thinking hard about these issues since the early days. He democratized access to this stuff. When Fable came out, he was not happy, and what he said stuck with me. Roughly: silently sabotaging experiments in order to stifle scientific progress and protect a technological lead, that sounds familiar, welcome to Self-anthropic.

What he is reacting to is the shape of Project Glass Wing, where Anthropic hand-picked which companies got Mythos class models before general availability, under the banner of checking for cybersecurity issues first. Related to that is the fact that Fable will block you from certain prompts and route you to Opus 4.8 instead. Put those together and Anthropic is putting itself in a position to choose who gets access first, which means choosing who gets to move fastest on scientific progress and who gets the technological lead. Imagine that applied to biology, or pharma, or any research field. If one company decides who is first in line for the best model, that is a deeply unfair lever, and one they can steer in whatever direction suits them.

Howard's other point is sharper. The easy way to slow down recursive AI self-improvement, he argues, is for the lab with the top-ranked model to agree not to use it for frontier AI work, while letting everyone else have access. By definition, the frontier stops racing ahead. Anthropic has done the opposite. They are letting themselves, the current top lab, use their top model for frontier AI research, and they have signaled they will hold back others who try. So the frontier keeps advancing and the power imbalance widens.

Strip it down and it is this: Anthropic is using its best model to build its next best model, and not extending that to anyone else. That is the move that turns a field of a few close labs into one lab breaking away and pulling the ladder up behind it. That worries me. OpenAI is almost certainly at something like a code red right now, and I would expect them to push out a fast answer to Fable, or at least try.

That is the review. Fable 5 is the real thing, a genuine leap in what these models can do across coding, design, and research, and at the same time it is the most expensive model I have used and the one whose rollout gives me the most pause. I am very curious what you end up building with it, and what you make of how Anthropic is handling it.
