One Agent or a Team of Agents? I Built the Team to Find Out.
OpenRouter just beat the performance of frontier models like Fable 5 by combining cheaper ones. I wanted to know if the same trick works for the thing I actually do all day, which is building software with agents, not research. So I built a system to test it. My first attempt at a multi-agent setup performed worse than a single agent doing the job alone. After three iterations I got a clear, repeatable improvement. This is the story of what changed between those two points, because that gap is where all the interesting stuff lives.
Let me start with the chart that kicked this off.
What OpenRouter actually showed
OpenRouter put out a piece called "Surpassing Frontier Performance with Fusion." The claim is that if you synthesize the results of several models instead of trusting one, you can beat what any single model can do on its own. Their fusion chart is the kind of thing that stops you mid-scroll. Solo Fable 5 gets outperformed by two instances of Opus. Then by Opus paired with GPT 5.5. Then by Opus, GPT 5.5, and Gemini 3.1 Pro together, and the only thing that finally edges back ahead is Fable 5 fused with GPT 5.5, for a lift of about 4 percent over Fable solo. There is even a panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek, with no Anthropic and no OpenAI in it at all, landing just under Fable 5 by itself.
You look at that and the obvious thought is: why would I pay for Fable 5 when three budget models stapled together get me to the same place? Fair question. It is also not that simple, and the reason is buried in one line of their methodology that is easy to skim past.
The fusion benchmark is a deep research benchmark. It measures reasoning, tool usage, and knowledge, and the tool usage they mean is mostly browsing the web to gather information. They ran it on 100 deep research tasks. Not agentic coding. Research.
That distinction is the whole post, so I want to sit on it for a second.
Why fusion is made for research, and why coding is different
Think about what a research task actually is. You go find information and you produce a good summary of it. That is the job. Now remember that these models are non-deterministic, which means two things at once. Run the same model twice and you get different information both times. Run different models and the spread gets wider still, because each one tends to reach for different sources and arrive at different conclusions. So when you hand all of that to a judge at the end, the judge is not averaging mush. It is sitting on a pile of varied findings and contrarian takes, and from that it can write a genuinely better conclusion than any single run produced. For information tasks, more models is more coverage, and more coverage is strictly good.
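A toy sketch of that coverage argument, under loud assumptions: the three "models" are hypothetical stand-in functions that return fixed sets of findings, and the judge is reduced to a plain set union. A real judge synthesizes and weighs findings rather than merging them, so treat this only as an illustration of why more varied runs widen coverage.

```python
from typing import Callable, List, Set

# Hypothetical stand-in: a "model" maps a question to the findings it surfaced.
Model = Callable[[str], Set[str]]

def fuse(question: str, models: List[Model]) -> Set[str]:
    """Toy judge: pool the findings from every run into one set."""
    findings: Set[str] = set()
    for model in models:
        findings |= model(question)
    return findings

# Made-up findings, only to show that different runs overlap partially.
def model_a(question: str) -> Set[str]:
    return {"source 1", "source 2"}

def model_b(question: str) -> Set[str]:
    return {"source 2", "source 3"}

def model_c(question: str) -> Set[str]:
    return {"source 4"}

solo = model_a("some research question")
panel = fuse("some research question", [model_a, model_b, model_c])
print(len(solo), len(panel))  # 2 4
```

The panel can never cover less than any single member in this toy version, which is the property that makes fusion a natural fit for information gathering.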
Agentic work does not behave like that, and the reason is the same one that makes human teams hard.
When one person owns an entire job, that person has all the context. They know exactly what the task is. They know where they are in it at every moment, because they are the only one in it. Hand that same job to five people and you have invented a new problem that did not exist before: communication. Now everyone needs to know where everyone else is working so nobody steps on anyone's toes. Someone has to keep the output aligned. The coordination is not free, and on a small task it can cost more than it gives back. A single capable agent holding the whole problem in its head is a real baseline, not a strawman. That is why naive fusion can lose.
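One rough way to put a number on that communication problem is to count the pairs of people (or agents) who might need to stay in sync. This is the standard pair-counting argument, not something measured in these runs, and the function name is made up for illustration.

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct pairs among n workers: n * (n - 1) / 2."""
    return n * (n - 1) // 2

for n in (1, 2, 5):
    print(n, pairwise_channels(n))
# 1 0
# 2 1
# 5 10
```

One owner has zero channels to maintain. Five have ten, which is overhead a small task may never earn back.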
So the question I actually wanted to answer was narrower than the chart: can you get the fusion uplift to show up in real coding work, where the coordination tax is real? The answer turned out to be yes. It just was not the answer my first build gave me.
The test
I picked one app and used it for every run so the comparison stayed honest. The brief was deliberately loose, because I did not want to direct the models. I wanted to see how far they would get on vision alone.
Build me the ultimate collaborative workspace. Think Notion plus Slack plus Dropbox in one, with anything else you think would make it better added on top. Surprise me. Local only, just for me to test. Strip out auth and anything you would need for prod, but make it fully featured so I can exercise every feature locally, exactly as it would behave in production.
That is the kind of prompt that separates models. There is nowhere to hide when you are not told which stack to use, what it should look like, or where to start.
Run one: a team of roles, and a worse result
My first instinct was to model a real team of humans, one role per model.
| Role | Model | Job |
|---|---|---|
| Team leader | Opus 4.8 | Takes my prompt, frames the request, routes the work |
| Tech lead | GPT 5.5 (Codex) | Owns the technical half of the spec |
| Designer | Opus 4.8 | Owns the UX and visual direction |
| Critics | Gemini 3.1 Pro + Kimi K2.6 | Tear the spec apart, adversarially |
| Engineer | GPT 5.5 (Codex) | Builds the thing |
| Reviewer | Opus 4.8 | Checks the build against the spec, then approves or loops |
A couple of choices in there are deliberate. I put Opus on design because GPT models are weak at UI without heavy steering, and I leaned hard on adversarial critics because forcing one model to demolish another model's plan, or its own, tends to surface insights that no amount of agreeable brainstorming will. When you make an LLM argue against a position, it digs.
The result was bad. The UI was all over the place: buttons that looked broken, and a canvas feature that technically existed but barely functioned. Worse than that, it lost to GPT 5.5 running alone in the Codex app on the exact same prompt. The single agent beat my carefully staffed team.
That stung enough that I went looking for why, and I found two bugs in my own setup before I found anything profound about agents. Opus was running on high reasoning, but GPT 5.5 had been left on its default, which is medium, so the comparison was rigged against it from the start. GPT 5.5 was also boxed into a sandbox that would not let it really test its own work. So the run was not a clean read on multi-agent versus single-agent. It was a clean read on me misconfiguring an experiment.
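A mismatch like this is cheap to catch before a run starts. Below is a minimal sketch of a parity check, assuming each model's settings live in a plain dictionary. The function name and the setting keys (`reasoning`, `sandboxed`) are hypothetical and do not correspond to any real CLI flags.

```python
def mismatched_settings(configs: dict) -> dict:
    """Return every setting whose value is not identical across all models."""
    keys = set().union(*(cfg.keys() for cfg in configs.values()))
    mismatches = {}
    for key in sorted(keys):
        values = {name: cfg.get(key) for name, cfg in configs.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches

# Reasoning levels as described for the first run.
run_one = {
    "Opus 4.8": {"reasoning": "high"},
    "GPT 5.5": {"reasoning": "medium"},
}
# Settings as described for the final run.
run_three = {
    "Opus 4.8": {"reasoning": "extra-high", "sandboxed": False},
    "GPT 5.5": {"reasoning": "extra-high", "sandboxed": False},
}

print(mismatched_settings(run_one))
# {'reasoning': {'Opus 4.8': 'high', 'GPT 5.5': 'medium'}}
print(mismatched_settings(run_three))
# {}
```

An empty result does not prove the comparison is fair, but a non-empty one is a good reason to stop and fix the setup first.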
For a baseline I also handed the same brief to Opus 4.8 alone in Claude Code. Messy UI, but clearly ahead of both my first multi-agent run and the solo Codex run. It had a Notion-ish feel, a chat that looked decent, a file organizer, boards that mostly worked, even a rough whiteboard. A single frontier agent, no orchestration, quietly outscoring my whole apparatus. Point taken. I rebuilt.
Run two: stop handing off, start dueling
The thing I suspected was hurting me was the handoff. One model writes the code, a different model reviews it, and the context bleeds out in the gap between them. So I changed the shape. Keep the same spec-building team up front, but instead of a loop where GPT builds and Opus reviews, have both of them build the whole thing end to end, in parallel, and then let the reviewer pick the best parts of each.
I built a small app to run this, which made the whole thing easier to reason about. There is a war room where the team leader drafts the first spec and the critics and designer tear into it and enrich it. Then a build duel, where the Opus engineer and the GPT 5.5 engineer each build their own prototype. Then a judging room, where the reviewer puts the two side by side and decides.
brief
→ spec (team leader + tech lead + designer draft it)
→ critique (Gemini + Kimi tear the spec apart, it gets better)
→ build duel (Opus and GPT 5.5 each build the whole thing, in parallel)
→ judge (reviewer takes the best of both)
GPT 5.5's output here was the worst of the bunch, and this is the run where I finally caught the medium-reasoning bug, because the result was so broken it forced me to look. A chat that sort of worked, a table-board thing that did not, a calendar that did not really exist. Genuinely terrible.
Opus, on high reasoning, was a different conversation. Full client-server architecture. Real documents. A chat with reactions that felt responsive and alive, the way a Discord-style channel is supposed to feel. I have built exactly this kind of thing before, a community chat that has to feel fast and fluid, back when the best help you could get was something like Sonnet 3.5, and I built it the slow way and it was hard. Watching a half-page prompt with zero technical references produce something this close to it, I almost wanted to cry. There were databases like Notion's, a file management system, and the taste was a real step up from Opus on its own, not just the feature count.
That was the moment it clicked. There was something here. So I did one more run to see how far it went.
Run three: clean conditions, and a clear win
For the final run I fixed everything I had been sloppy about. I pulled both models out of the sandbox, so neither one got an unfair edge from my local setup, and I bumped both Opus 4.8 and GPT 5.5 to extra-high reasoning. Same adversarial system: spec, critique, build duel, judge.
By this point the spec phase was the part I was most impressed by, and honestly it might be the real headline. Letting these models debate a spec against each other for about ten minutes gets you a document with a depth you would not get out of a single model, or out of yourself in that time. If you have ever stared at a blank spec, this alone is worth the setup.
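The shape of that debate can be sketched as a small loop: critics raise objections, the spec gets revised, and the loop stops when nobody objects or the round budget runs out. Everything here is a hypothetical stand-in. The critics are simple functions that check for a required section, where the real ones are models, and the `rounds` budget is an assumption rather than a setting from the actual app.

```python
from typing import Callable, List

Spec = List[str]                      # a spec as a list of section names
Critic = Callable[[Spec], List[str]]  # returns a list of objections

def refine_spec(draft: Spec, critics: List[Critic],
                revise: Callable[[Spec, List[str]], Spec],
                rounds: int = 2) -> Spec:
    """Alternate critique and revision until no objections or rounds run out."""
    spec = draft
    for _ in range(rounds):
        objections = [o for critic in critics for o in critic(spec)]
        if not objections:
            break
        spec = revise(spec, objections)
    return spec

def make_critic(required: str) -> Critic:
    """Toy critic that objects whenever one required section is missing."""
    def critic(spec: Spec) -> List[str]:
        return [] if required in spec else [f"missing: {required}"]
    return critic

def revise(spec: Spec, objections: List[str]) -> Spec:
    """Toy reviser that simply adds whatever the critics asked for."""
    return spec + [o.split(": ", 1)[1] for o in objections]

draft = ["chat", "documents"]
critics = [make_critic("search"), make_critic("file storage")]
final = refine_spec(draft, critics, revise)
print(final)  # ['chat', 'documents', 'search', 'file storage']
```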
Then the builds. GPT 5.5 on extra-high, inside the system, came out clearly better than GPT 5.5 on extra-high alone in the Codex app under identical conditions. Same model, same reasoning, same prompt. The only difference between them was that one went through the adversarial spec phase and the other started building straight from my loose prompt. The system version was usable. I would not ship its UI, but the functionality held together: file management, search, inbox, home, documents. OpenAI models still struggle with visuals when you do not hand them a design system, and that has not changed. But the spec phase moved it from rough to workable.
Opus 4.8 on extra-high, inside the system, was the most impressive result of the whole experiment.
A user switcher. Channels that looked genuinely good. You could reply to a message and it threaded the reply off to the side. It flickered, it had bugs, it was not perfect, but it worked, and it had taste. The document system looked great and the formatting held, which is the kind of thing that used to fall apart constantly. Comments that you could resolve. File embeds. A homepage that took you into tasks you could check off and tag, a board view, a calendar you could drop events into, custom views you could add. The depth of it was the thing. This was not a prototype that falls over the moment you click past the happy path. It was something I would actually start iterating on if it were my project.
And here is the comparison that makes the point. The same model, Opus 4.8, given the same prompt, alone in Claude Code, produced a dashboard that was completely messed up, with taste nowhere near this. Identical brain. The structure around it is the entire difference.
So what actually carries the uplift
Two things, and neither of them is "use more models."
The first is the adversarial spec. Making models debate, critique, and rewrite the plan before a single line of code exists is where most of the gain comes from. A good spec in ten minutes is leverage that follows you through the entire build.
The second is the build duel with a judge. Two models build the whole thing in parallel, and a reviewer takes the best of each instead of one model handing off to another and losing context across the seam. The parallelism sidesteps the coordination tax that killed my first run, because the two builders never have to coordinate at all. They only meet at the judge.
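A minimal sketch of that shape, assuming hypothetical builder and judge functions. In the real system these are model calls, and the per-feature scores below are made-up numbers that exist only to give the toy judge something to compare. The point is structural: the builders receive the same spec, never talk to each other, and their outputs meet only in the judge.

```python
from concurrent.futures import ThreadPoolExecutor

def build_duel(spec, builders, judge):
    """Run every builder on the same spec in parallel, then hand all results to the judge."""
    with ThreadPoolExecutor(max_workers=len(builders)) as pool:
        futures = {name: pool.submit(build, spec) for name, build in builders.items()}
        prototypes = {name: future.result() for name, future in futures.items()}
    return judge(spec, prototypes)

# Hypothetical builders: each returns {feature: quality score} for its prototype.
def builder_a(spec):
    return {"chat": 3, "docs": 1}

def builder_b(spec):
    return {"chat": 1, "docs": 2, "calendar": 2}

def best_of_each(spec, prototypes):
    """Toy judge: for each feature, keep the builder with the highest score."""
    best = {}
    for name, features in prototypes.items():
        for feature, score in features.items():
            if feature not in best or score > best[feature][1]:
                best[feature] = (name, score)
    return {feature: name for feature, (name, _) in best.items()}

result = build_duel("the agreed spec", {"a": builder_a, "b": builder_b}, best_of_each)
print(result)  # {'chat': 'a', 'docs': 'b', 'calendar': 'b'}
```

There is no shared state between the two builder calls, so there is nothing for them to coordinate on.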
What does not work is the obvious thing, the thing the fusion chart tempts you into: throw a pile of models in a room, tell them to discuss it, and let them go. That produced worse output than a single good agent. The uplift in agentic work is real, but you have to engineer the team. It does not fall out of stacking models the way it falls out of stacking research runs.
The verdict
The fusion result is real, but read the fine print. Panels of cheaper models beating frontier models is a fact about research, where non-determinism is a feature and a judge turns spread into signal. Coding is not that. Coding carries a communication cost, and naive fusion pays it without getting anything back. Get the structure right, though, an adversarial spec and a parallel build duel judged at the end, and you can pull a clear, repeatable lift out of multiple models on real software work too.
I wrote a couple of weeks ago that software engineering is not changing, it is over. This is more of the same evidence, from a different angle. I gave a system a half-page prompt with no stack, no references, and no direction, and it debated its own spec and built me a working collaborative workspace with taste. I think I am going to keep iterating on this ensemble setup, because there is clearly upside left in it, and I would love to run the whole thing on Fable next, where I suspect it gets better still. If you want me to put the app on GitHub so you can try it, or to go deeper in another video, let me know.