Stop Defaulting to Claude Code
Imagine you hire the best mechanic in the world. Brilliant, experienced, knows exactly how to fix your cars. On day one you hand him a screwdriver and a giant ethics manual, and you tell him: every single time you touch this tool, stop and make sure that tightening this screw will not harm humanity. Not just before something dangerous. Every single time he touches a screw.
That mechanic does not suddenly become bad at his job. But you just handed him a terrible operating system.
That, more or less, is my problem with Claude Code.
I want to be clear up front, because this topic gets flattened into tribal nonsense fast. I don't think Claude models are bad. For some things I think they are still the best in the world. What I think is that Claude Code, the harness wrapped around the model, is often not the best way to use it.
So this is not a "Claude sucks" post. It is closer to: stop defaulting to Claude Code just because everyone on X told you to. Below I will walk through where Claude Code falls short, where Claude models are still the best, and which agents I would actually use depending on the job.
How Claude Code became the default
It helps to understand how we got here, because the popularity is not an accident.
The first big push came when Opus 4.5 landed late last year. Two things happened at once. Opus 4.5 made Claude feel genuinely magical for coding at the exact moment a wave of non-engineers were getting into app building. Anthropic called it their best model for coding, agents, and computer use, and the whole launch was positioned around agentic work. Perfect timing.
The second push was reputational. During the Pentagon dispute, the messy details got compressed by the internet into a very simple story: Anthropic was the principled AI company, and OpenAI was the one that would let the government do whatever it wanted. I don't actually believe that framing is true, but it is clearly how it landed in public, and for the purposes of adoption that is what mattered.
Put those together and Claude Code became the safe, smart default. Demand shifted so fast that Anthropic started making some questionable moves just to keep up with everyone switching over. The hype cycle that had been building since late 2025 kept feeding itself. And here we are, with Anthropic about to IPO at a valuation higher than OpenAI's latest one.
I bring all of that up for two reasons. First, people are not dumb for using Claude Code. There is real context behind why it blew up. Second, and this is the actual point: there are other options, and for a lot of the work you are doing right now, they are probably better.
Being the default is a story about momentum and marketing. It is not a measurement of quality. The two get confused constantly, and the reason the difference matters comes down to one idea, the same one from the mechanic story. The model is the mechanic. The harness is the operating system you make it work inside. A bloated harness can tank a brilliant model's performance, not because the model got dumber, but because you buried it in instructions and tools it has to wade through every time it acts.
That single idea explains almost everything else here, so let me make it concrete.
What an agent harness actually is
At the core of every AI agent is an LLM, a large language model. It is the thing you talk to when you open ChatGPT or Claude. You can think of it as a program that takes text (sometimes images) as input and produces text as output. You ask a question in text, it answers in text. That is genuinely most of what an LLM does. It works with language. Hence the name.
The big breakthrough was not the model itself. It was realizing we could make these models do work for us, not just talk to us.
Picture a superintelligent being with no arms and no legs. It cannot move or act. It is basically an oracle, or one of those Magic 8 Balls. You ask, it answers, and that is the end of it. What we did was bolt some robotic arms and legs onto the oracle so it could start taking action. The model still only outputs text. The trick is that we let that text do things. If the model can write text to a computer terminal, then the terminal lets it take almost any action on your machine: read a file, run a command, edit code, hit an API.
But the model does not magically know which tools exist or when to use them. You have to tell it. If a task needs it to read a file, it should call the read tool. If it needs to change a file, it calls the write tool. Those tools, and the instructions for using them, are handed to the model through the harness.
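To make the "text that does things" idea concrete, here is a minimal sketch of the dispatch step a harness performs. Everything in it is an assumption for illustration: the JSON call format, the `read_file` and `write_file` tool names, and the `dispatch` function are hypothetical, and no real product is claimed to work exactly this way.

```python
import json
from pathlib import Path
from typing import Callable


# Hypothetical tools. A real harness ships many more, with stricter checks.
def read_file(path: str) -> str:
    return Path(path).read_text()


def write_file(path: str, content: str) -> str:
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"


# The registry the harness consults when the model asks for a tool by name.
TOOLS: dict[str, Callable[..., str]] = {
    "read_file": read_file,
    "write_file": write_file,
}


def dispatch(model_output: str) -> str:
    """Run the tool named in the model's text output, if there is one.

    Assumed call format: {"tool": "<name>", "args": {...}}.
    Output that is not a JSON tool call is treated as a plain answer
    and returned unchanged.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output
    if not isinstance(call, dict) or "tool" not in call:
        return model_output
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return f"error: unknown tool {call['tool']!r}"
    return tool(**call.get("args", {}))
```

The model never touches the disk itself. It only emits text, and the harness decides what that text is allowed to do, which is why the same model can behave so differently from one tool to the next.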
So when you send a message to your AI agent, the model at the core sees a lot more than the message you typed. It also receives, all as text:
[ system prompt ] who you are, your role, what you may and may not do
[ tool index ] the tools you can call, and when to call each one
[ context ] relevant files, prior steps, environment details
[ user message ] the thing the human actually asked for
You never see most of that. The agent provider wrote it into the product you are using. That whole bundle of plumbing, the part that explains the model's job, its limits, and its available tools so it can act in the real world, is the agent harness.
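The bundle above can be sketched in a few lines of code. This is a rough illustration under my own assumptions: the `Harness` and `Tool` classes, the `build_prompt` method, and the example prompt text are all hypothetical, and real harnesses use structured API fields rather than one flat string.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    when_to_use: str


@dataclass
class Harness:
    system_prompt: str
    tools: list[Tool] = field(default_factory=list)

    def build_prompt(self, context: str, user_message: str) -> str:
        """Assemble everything the model sees, in the order listed above."""
        tool_index = "\n".join(f"- {t.name}: {t.when_to_use}" for t in self.tools)
        sections = [
            ("system prompt", self.system_prompt),
            ("tool index", tool_index),
            ("context", context),
            ("user message", user_message),
        ]
        return "\n\n".join(f"[ {label} ]\n{body}" for label, body in sections)


harness = Harness(
    system_prompt="You are a coding agent. Only edit files inside the project.",
    tools=[
        Tool("read", "when you need a file's contents"),
        Tool("write", "when you need to change a file"),
    ],
)
prompt = harness.build_prompt(context="cwd: /project", user_message="Rename foo to bar.")
print(prompt)
```

The user typed four words. The model received four sections, and three of them came from the harness.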
Now bring back the mechanic. The model is the mechanic, and the ethics manual he has to consult before touching a single screw, with all its standing instructions and procedures, is the harness. Every time the model wants to act, it has to push through that wall of instructions and tool definitions first. Give it a heavy, defensive, over-engineered operating system and it will behave like a worse model, even though it is exactly the same weights underneath.
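One crude way to picture that wall is to ask what share of the model's reading comes from the harness rather than from you. The sketch below is an assumption-heavy toy: `overhead_share` is a hypothetical helper, whitespace-separated words stand in for real tokens, and the example prompts are invented. It also only measures size, which is not the only way a harness can help or hurt.

```python
def overhead_share(system_prompt: str, tool_index: str, user_message: str) -> float:
    """Fraction of the words the model reads that come from the harness.

    Words split on whitespace are a rough proxy for tokens, not a real
    tokenizer count.
    """
    def count(text: str) -> int:
        return len(text.split())

    harness_words = count(system_prompt) + count(tool_index)
    total = harness_words + count(user_message)
    return harness_words / total if total else 0.0


user_message = "Fix the failing test."

# A lean, made-up harness: one short instruction and two tool names.
lean = overhead_share("Be a helpful coding agent.", "read write", user_message)

# A bloated, made-up harness: the same rule repeated 200 times.
bloated = overhead_share("Always check policy before acting. " * 200, "read write", user_message)

print(f"lean harness share:    {lean:.1%}")
print(f"bloated harness share: {bloated:.1%}")
```

In the bloated case the actual request is a rounding error next to the standing instructions, which is the mechanic flipping through the manual before every screw.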
That is why the identical LLM can feel sharp in one tool and sluggish in another. Run Claude in Claude Code, then run the same Claude in Cursor, Replit, or Lovable, and you can get noticeably different results. Same brain, different operating system. I did not want this to stay at the level of vibes, so let me put numbers on it.
The benchmarks
Terminal-Bench compares CLI coding tools. Terminal only, no desktop apps. I like it here because it isolates exactly the thing I care about: what happens when you run the same Claude model inside Claude Code versus outside it.
| Model | Harness | Accuracy |
|---|---|---|
| Opus 4.6 | Claude Code | 58% |
| Opus 4.6 | Droid | 69.9% |
| Opus 4.5 | Claude Code | 52% |
| Opus 4.5 | Droid | 63% |
Same Opus 4.6, more than a ten-point swing in accuracy depending on the harness. The 4.5 numbers tell the same story, and Terminus, which is not in the table above, outperforms Claude Code on Opus 4.5 by a decent margin too. That is the kind of result that tells you how much weight the harness is carrying. The model is capable of a lot more than it sometimes looks like when it is boxed inside Claude Code.
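If you want the swing spelled out, the arithmetic is short. The scores below are copied from the table; the `harness_gap` helper and the choice to report both absolute points and relative gain are mine.

```python
# Accuracy in percent, copied from the table above.
results = {
    "Opus 4.6": {"Claude Code": 58.0, "Droid": 69.9},
    "Opus 4.5": {"Claude Code": 52.0, "Droid": 63.0},
}


def harness_gap(
    scores: dict[str, float],
    baseline: str = "Claude Code",
    other: str = "Droid",
) -> tuple[float, float]:
    """Return (gap in percentage points, relative gain in percent) of `other` over `baseline`."""
    points = scores[other] - scores[baseline]
    relative = 100 * points / scores[baseline]
    return round(points, 1), round(relative, 1)


for model, scores in results.items():
    points, relative = harness_gap(scores)
    print(f"{model}: +{points} points, {relative}% relative gain")
```

This prints a gap of 11.9 points (20.5% relative) for Opus 4.6 and 11.0 points (21.2% relative) for Opus 4.5. In both cases, swapping the harness is worth roughly a fifth more solved tasks with the model held constant.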
The next one, DeepSWE, is different. Right now it is the benchmark I would point you to first when a new model drops and you want to see where it stands on cost, time, and output tokens. It lines up well with my own experience. This one is not about harnesses. It compares the raw models against each other, and it focuses on backend work. The takeaway is blunt: there is a large gap between Opus 4.8, the latest Claude model, and GPT 5.5. Even GPT 5.4 on high outperforms Claude Opus 4.8 on high. For backend software engineering in general, it is not close.
So hold those two findings side by side. The harness is hurting Claude in the terminal, and on backend work the GPT models are simply ahead on the model itself.
A UI bake-off
Backend is fairly objective. UI is not. You might like one design and I might like another, and outside of obvious usability problems, neither is "correct." So instead of a leaderboard, I ran an experiment.
I gave a deliberately loose prompt: make up a SaaS, come up with a color palette, design a dashboard, then a landing page. The only constraints were the stack: Next.js, shadcn, and Tailwind. Everything else was up to the model, including five variations. No design system, no references, no steering. I ran all of these at high or extra-high reasoning effort so nothing was sandbagged.
Four contestants: Opus 4.8 in Claude Code, Opus 4.8 in the Cursor harness (same model, different harness), Composer 2.5 in Cursor (Cursor's own, much cheaper model), and Codex with GPT 5.5 as the control group.
Opus 4.8 in Claude Code.
Not bad. Some animations and shadows, generally smooth, usable. But you start seeing what I call claudisms. There is the tell-tale outline treatment that Claude loves to slap on elements (if you have seen it, you know it on sight). And there is the sidebar. These models are obsessed with sidebars, they almost never put navigation on top, and it is always roughly the same shape. Claude got the gist of a shadcn-style sidebar but the bar is boring, there is no user menu where you would expect it, and the whole thing could have been a lot better. Perfectly usable, just uninspired.
Opus 4.8 in Cursor.
Same model. Clearly better. The spacing, margins, and padding are cleaner, the filters look tighter, and it got right what should not move: the top cards can bounce a little, but the components below sit still instead of wobbling like in the Claude Code version. The sidebar has a workspace switcher, the user tucked down at the bottom, and a status component that just makes the whole thing look more finished. The claudisms are gone. I cannot tell you whether Cursor is avoiding the Claude Code bloat or adding its own magic on top, but the brain is identical and the output is nicer. That is the point.
Composer 2.5 in Cursor.
This one surprised me. It is obviously not at frontier level, but it is far cheaper, and it finished in about five minutes where the frontier models took 25 to 30 minutes to build out the full palette, brand, and component set. For the price, I am not sure there is anything like it. If your workflow is interactive, going back and forth and iterating quickly rather than firing one prompt and walking away, give Composer 2.5 a real look.
Codex (GPT 5.5).
I have said before that OpenAI models are not as strong at UI out of the box, and this shows it. Usable and decent, but very large margins on everything, and a shadow that overflows from one card onto its neighbor. Here is the thing though: I have used Codex for UI more than any other model, and it works, because if you feed it a large body of references, examples, and markdown that spells out your component library and styling rules, it does a really good job. It needs that steering. It is not the out-of-the-box experience you get from Opus.
The landing page round rhymed with the dashboards.
Opus 4.8 in Claude Code had real spacing and layout issues and a lot of "where do I even look" busyness. The same Opus 4.8 in Cursor was more intentional: more breathing room, a top bar that blends in with a nice glassmorphism effect, more diverse sections, less of that low-effort vibe-coded feel. Composer 2.5 was again fast and decent for the price. Codex showed the usual GPT quirks, like a big title-and-subtitle block on one side of the hero with an awkward empty space beside it, and sections that all blur together.
So what does this prove? One leaderboard should not decide your entire workflow. But the pattern is hard to ignore: the harness is doing real work, and Claude Code is not automatically the best way to run Claude. The model is the same. The result is not. And those claudisms, the repeated design fingerprints that let you spot a Claude-built UI from across the room, mostly melt away when you move the same model into a better harness.
So which agent should you actually use?
Here is the whole thing as a cheat sheet, then the reasoning.
| If you are doing... | Use | Notes |
|---|---|---|
| UI / frontend | Claude (Opus) | Best out of the box. Ideally outside Claude Code, but even inside it beats GPT for UI without steering. |
| Backend / systems | GPT 5.5 on Codex | Ahead on the model itself, and it does not need much steering. |
| Brainstorming / chat | Claude | It "gets" what you mean with the least hand-holding. |
Most people are doing this exactly backwards. Claude got famous for Claude Code and coding, a category that is honestly led by OpenAI and Codex right now, while the things Claude is genuinely best at get overlooked.
For frontend, Claude tends to win, and the best version of that is Claude in a harness that is not Claude Code. Even inside Claude Code it generally out-designs the GPT models out of the box. The key phrase is out of the box. With enough steering you can absolutely get GPT models to good UI, but Claude gets you there without much direction, which is what most people want from a one-shot prompt.
For backend, the GPT models have been ahead for a while, and GPT 5.5 is genuinely that good. If you want properly built systems without babysitting the model, work in OpenAI's Codex app with GPT 5.5. And here is the part nobody brings up: portability. Anthropic effectively forces you into Claude Code unless you are willing to pay much steeper API prices. OpenAI lets you use your subscription almost anywhere, which means you can run GPT 5.5 in whatever harness you like, including tools worth tinkering with like OpenCode, Pi, OpenClaw, or Hermes. If you like to experiment with your setup, that freedom is a real advantage, not a footnote.
Chat and brainstorming are a different job, and there I would reach for Claude. I have said this for a while: while everyone was using ChatGPT to talk about life, Claude was quietly the better choice for it. It understands what you are going for without much steering, and that matters most for open-ended, not-strictly-technical thinking. That said, I expect GPT 5.6 to close this gap too. GPT 5.5 was a big jump in how the model feels to talk to, a long way from the rough 5.1, 5.2, and 5.3 era.
The verdict, then. If you want slightly more taste and the feeling that the model just gets you out of the box, go for Claude, ideally outside of Claude Code. If you want the job done right on the first try, reliably, go for GPT 5.5 on Codex.
The deeper point is not that one tool wins. It is that Claude Code is not automatically the king just because it became the cultural default. Pick the model for the model, pick the harness on purpose, and match both to the job in front of you.