AI Has No Taste, and We're Finally Fixing It
An AI agent will write you a thousand lines of solid code in thirty seconds, then turn around and design a UI that looks like every other AI UI. Purple gradients everywhere, no spacing, no animation, the whole thing screaming vibe-coded from across the room. You know the look. The question worth asking is why.
It is not that the models are dumb. They clearly are not. It is that they have no idea what good taste actually looks like, and until recently there was no real way to teach them. That is starting to change, and to see how, you have to look at the one thing every AI lab is obsessed with right now: getting these models to improve themselves. The whole gold rush at the moment is recursive learning, agent loops that run, check their own work, and get better without a human in the middle. But here is the wall. How do you get a model with no taste to improve its own taste? You need to build a way for it to recursively teach itself what good looks like, and the thing that makes or breaks that loop is evals.
So that is where I want to start, and then work toward how a new way of thinking about evals is finally giving agents taste.
Why code improves itself and taste doesn't
Recursive learning is the entire game now. It is the same shape as the loops everyone is suddenly building: something trips a trigger, the agent does the work, checks it, iterates, and goes quiet until the next trigger. For any of that to work, the model has to grade its own output and steer on the result. That grading mechanism is an eval.
In plain terms, an eval is the thing you bolt onto an agentic system to grade its outputs and feed those grades back, so it can course-correct in the direction you want. It is basically data labeling. A human looks at an output, or at a piece of real-world data, and labels it good or bad, and that label goes back into the system so it keeps optimizing. It runs the whole range, from labeling good driving versus bad driving in footage to train self-driving cars, all the way to good UI versus bad UI for a coding agent that works on frontend.
Traditionally a human does the grading. You keep a golden set, a reference for what good output looks like, and you test the AI against it. Two things make that hard to turn into a real loop. Someone has to keep updating the golden set, and to actually close the loop the machine has to grade itself instead of waiting on you. One shortcut the labs lean on is letting users do the evals for them. They ask whether you were happy with the result, and they feed that back into the system's memory so it can correct next time.
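To make the golden-set idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `GoldenExample`, `token_overlap`, and `run_eval` are names I made up, and the word-overlap grader is a toy stand-in for whatever a real eval would use to compare an output with its reference.

```python
from dataclasses import dataclass


@dataclass
class GoldenExample:
    prompt: str
    reference: str  # output a human already judged to be good


def token_overlap(candidate: str, reference: str) -> float:
    """Toy grader: Jaccard overlap of lowercase word sets, in [0, 1]."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    if not cand and not ref:
        return 1.0
    return len(cand & ref) / len(cand | ref)


def run_eval(model, golden_set, threshold=0.5):
    """Grade the model on every golden example and return the pass rate."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(
        token_overlap(model(ex.prompt), ex.reference) >= threshold
        for ex in golden_set
    )
    return passed / len(golden_set)


# Hypothetical data: two golden examples and a model that gets one right.
golden_set = [
    GoldenExample("greet", "hello world"),
    GoldenExample("bye", "goodbye now"),
]


def toy_model(prompt: str) -> str:
    return "hello world" if prompt == "greet" else "see you"


pass_rate = run_eval(toy_model, golden_set)
```

The two hard parts from the paragraph above show up directly. Someone has to keep `golden_set` current, and `run_eval` only closes the loop if the grader can run without a person in it.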
We have been running this exact playbook for a while now, and it has run straight into one wall: evals are failing to capture taste reliably. Which is the real reason LLMs are so good at code and so bad at UI. The recursive loop only works when there is a hard signal to push against, and code has one. Run the app, read the error log, fix it, run it again. Tight, fast, objective. Try something, hit a wall, read the error, fix, repeat, and eventually it works.
UI does not behave like that. The model designs something ugly, no spacing, no transitions, gradients for days. It ships it, runs it, the thing loads, and it concludes: nice job, done. There is no stack trace that says your UI is garbage, kill the purple gradients, the spacing is wrong. So as far as the model is concerned, all is well and it did great work. That missing error log is the whole problem.
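The asymmetry is easy to show in a few lines of Python. This is a sketch, not anyone's real harness: `check_code`, `check_ui`, and `repair_loop` are hypothetical names, and the list of attempts stands in for an agent producing successive fixes.

```python
import traceback
from typing import Callable, List, Optional


def check_code(source: str) -> Optional[str]:
    """Run the candidate and return the error text, or None if it ran cleanly."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
    except Exception:
        return traceback.format_exc()
    return None


def check_ui(html: str) -> Optional[str]:
    """A page that loads raises nothing, however ugly it is."""
    return None


def repair_loop(attempts: List[str], check: Callable[[str], Optional[str]]) -> Optional[int]:
    """Return the 1-based index of the first attempt the checker accepts."""
    for index, attempt in enumerate(attempts, start=1):
        if check(attempt) is None:
            return index
    return None


code_attempts = ["print(undefined_name)", "x = 1 + 1"]
ui_attempts = ["<div style='background: purple'>hi</div>", "<div>a better page</div>"]

accepted_code = repair_loop(code_attempts, check_code)  # second attempt
accepted_ui = repair_loop(ui_attempts, check_ui)        # first attempt
```

The code loop rejects the broken first attempt because the `NameError` gives it something to push against. The UI loop accepts the very first thing it sees, because its checker has nothing to say.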
The fix is not teaching the machine to feel
The first person I heard call this out cleanly was Edwin Chen, the founder of Surge AI. The entire company is built on this premise, and you can read it straight off their homepage. They open with Hemingway, Kahlo, and von Neumann, and ask what made them extraordinary. Their life experiences. War, love, triumph, loss. Then the line that frames the whole thing: data does for AI what life did for them. And then the fork they put in front of you: which future do we choose, AGI that cures cancer and unlocks the universe, or AI optimized for clicks and hype?
That fork is exactly the point I started with. A feedback loop with the wrong incentive will optimize for the wrong thing, blindly and forever. What Surge actually does is data labeling, but a level up from the usual. They will have real literary experts label poems around the sounds and the feelings the poem is meant to convey, which gives the model a window into genuine human emotion. That is not going to hand an LLM feelings, obviously. But these things are giant statistical pattern machines, and if you feed them enough of how humans respond to a poem, they will reproduce something we actually read as tasteful.
The takeaway worth holding onto is this. It is hard, maybe impossible, to make a machine understand human experience from the inside, so we should stop trying. Stop trying to build recursive improvement that lets the machine bootstrap taste out of nothing. Build a different kind of eval instead. Make the model recursively compare its output against a large corpus of state-of-the-art human references. The reference becomes the golden set, and the golden set is the best work people have produced.
Ploy, and the lookbook trick
It is not just Surge anymore. The rest of the industry is catching on. The example that made it click for me is Ploy, the new company from Bryant Chou, who co-founded Webflow and ran it as CTO for over a decade. Ploy uses AI to build your marketing assets, landing pages and the like, except the output does not look like AI did it. It captures the essence of what the company is and produces something that looks nothing like what you would expect a model to spit out.
Chou went on YC's Lightcone podcast and ran a demo I keep thinking about. He pulled old YC partner startup sites off the Wayback Machine and fed nothing but the premise of each one into Ploy. First up was Posterous, Garry Tan's startup, dead simple blogs by email, the version from around 2008. Garry now runs YC, and the original site looks every bit its age, little Gmail buttons and all. Then he dropped the premise into Ploy and out came a 2026 Posterous that is, honestly, gorgeous. It does not read as AI slop at all.
He did another with a 2007 startup from a different partner, a YouTube for documents, the kind of thing you hosted on a physical server in your dorm room closet before AWS existed. Same result. The modern version looks like a real, considered, modern site.
The secret was in the admin panel he opened up. They call it the lookbook, and it is their curation of what they believe is the frontier of web design, somewhere around 3,500 prompts of the best designs they could assemble. Ploy does not copy any single one of them. It takes the vibes. And the line Chou used is the important one: that is how human designers actually work. The very best might invent something genuinely bespoke that nobody has seen. Most good designers get inspired by the best work around them and build from there.
See the pattern. You educate the model against a curated set of the very best human work, and you force it to keep coming back to it. The gate it has to clear is matching the patterns it can see in the reference, the layout, the spacing, the flow, the UX. It iterates until it matches. That is the emerging shape that is replacing the old idea of evals for recursive learning, and the whole thing starts from real human taste instead of trying to manufacture taste out of thin air.
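Here is a toy version of that gate in Python, to show the shape of the loop. It is not how Ploy works, which I have not seen beyond the demo. I am assuming a design can be boiled down to a few positive numbers (the `spacing_px` and `font_sizes` features are invented), and `revise` is a stand-in for the agent actually editing the design.

```python
from statistics import mean


def reference_profile(lookbook):
    """Average each measured feature across the curated references."""
    return {key: mean(ref[key] for ref in lookbook) for key in lookbook[0]}


def gap(design, profile):
    """Largest relative distance between the design and the reference profile."""
    return max(abs(design[key] - profile[key]) / profile[key] for key in profile)


def revise(design, profile, step=0.5):
    """Stand-in for an agent edit: move each feature part-way to the reference."""
    return {key: design[key] + step * (profile[key] - design[key]) for key in profile}


def loop_until_match(design, lookbook, tolerance=0.05, max_rounds=20):
    """Keep revising until the design clears the gate or the budget runs out."""
    profile = reference_profile(lookbook)
    rounds = 0
    while gap(design, profile) > tolerance and rounds < max_rounds:
        design = revise(design, profile)
        rounds += 1
    return design, rounds


# Hypothetical measurements: two curated references and one cramped first draft.
lookbook = [
    {"spacing_px": 24, "font_sizes": 3},
    {"spacing_px": 40, "font_sizes": 5},
]
first_draft = {"spacing_px": 4, "font_sizes": 8}

final_design, rounds_used = loop_until_match(first_draft, lookbook)
```

With these numbers the draft clears the 5 percent gate after five rounds. The point of the sketch is the gate itself: the loop has a concrete distance to shrink, and that distance is defined entirely by the curated references.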
So what? Look at what Meta just did
The question I try to ask about anything in AI is, fine, so what. Here is the so what. Meta seems to have decided this is the bottleneck, and they reorganized a chunk of the company around it.
Earlier this year Meta stood up an Applied AI division and reassigned somewhere between 30 and 50 percent of the engineers on its core product, infrastructure, and security teams into it. Around 6,500 people. Most of them found out through a surprise email, with two options: join the unit or leave. The work is data labeling and reinforcement learning from human feedback, the human-in-the-loop grunt work that makes the models better. These are seasoned engineers, not new hires excited about AI, dropped into a role one of them described to Wired as, and I am quoting, literally the gulag. Zero purpose all of a sudden, barely interacting with anyone, just tasks every week. It got bad enough that someone hijacked a livestreamed internal AI presentation and let loose an expletive-laced tirade aimed at a senior Meta AI exec.
The irony writes itself. Meta's current head of AI is Alexandr Wang, who co-founded Scale AI, the data labeling company Meta went and bought a 49 percent stake in. Meta has been lagging in AI and mostly surviving off acquisitions, and they have apparently concluded that their biggest constraint is not model architecture, it is data. In today's context I think they are right. With enough of the right data, and agents that can now loop almost endlessly, you can finally build recursive learning that works. That is the bet, and reassigning thousands of engineers to feed it is how serious they are about it.
The real arms race is for your data
This is not only a Meta thing. It is intensifying across the whole industry, and you can read it off the subscriptions. Notice how Anthropic suddenly got more generous with Claude usage, and how OpenAI has been handing out Codex resets like candy. The next arms race is not for silicon. It is for human data, and you are the supply.
Look at the actual economics.
| Subscription | Price | Inference value at API prices |
|---|---|---|
| Claude, top tier | $200/mo | ~$8,000 |
| ChatGPT, top tier | $200/mo | ~$14,000, plus free resets |
| ChatGPT, top tier, maxed out with at least one reset a month | $200/mo | $28,000+ |
At the highest Claude tier, your $200 buys you something like $8,000 of inference. OpenAI gives you closer to $14,000 of inference for the same $200 on the top ChatGPT plan, and that is before the free resets. If you really wanted to max it out, with at least one reset a month, you could realistically pull over $28,000 of inference out of a $200 subscription. Why would anyone torch that much money? Because how you use the model is free labeling. Every time you accept an output, reject it, retry, or rephrase the ask, you are telling them what the model got right and what it got wrong, and by now they have built ways to separate the useful signal from the noise. They are subsidizing your usage to harvest your evals. That is the most valuable raw material for the next great model, and it is not compute, it is you.
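The size of the subsidy is easier to see as a multiple of what you pay. A quick Python check using the figures above, with one assumption of mine: I am reading a reset as a second full allowance, which is how $14,000 becomes $28,000.

```python
PRICE = 200  # monthly subscription price in dollars

# Approximate monthly inference value at API prices, taken from the table.
inference_value = {
    "Claude, top tier": 8_000,
    "ChatGPT, top tier": 14_000,
    "ChatGPT, top tier, one reset": 2 * 14_000,  # assumption: a reset doubles it
}


def subsidy_multiple(value, price=PRICE):
    """How many dollars of inference each subscription dollar buys."""
    return value / price


for plan, value in inference_value.items():
    print(f"{plan}: {subsidy_multiple(value):.0f}x")
```

That prints 40x, 70x, and 140x. Put differently, on the maxed-out plan your $200 covers well under one percent of what the same usage would cost at API prices.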
The good news, if you like to token-max, is that this is probably not going away soon. The only thing that really kills it is the money drying up, an AI bubble correction or something like it. And separately, inference keeps getting cheaper as data centers get more efficient, so the dollar figures might shrink over time, not because you are getting less, but because each token just costs less to serve.
I pointed a loop at real references
I did not want to leave this at the level of theory, so I built an experiment. The thing LLMs fail at hardest is taste in UI design, so that is exactly what I aimed at.
I use Mobbin a lot for design inspiration. You can search screenshots and full UX flows from basically every serious app and site, and, importantly, they ship an MCP. So I wired the Mobbin MCP straight into the agent and started with a simple prompt.
Retrieve and compile a UX study on Notion and present it to me with the relevant screenshots.
It came back with a genuinely thorough teardown of Notion's most important flows, screenshots and all, down to transitions. The editor, the emoji picker, page covers, how database views work, the desktop client, sharing and collaboration, even the marketing side. Because it could actually see what the app looks like and the key screens of the experience, it had something real to aim at.
Then I pointed it at Future OS, the throwaway app I have been using as a test bench across the last couple of videos. It is a fully vibe-coded fusion of Notion, Slack channels, and Dropbox file management, nothing production about it, and I have never touched its code by hand. I built a loop: pull the Notion UX study from Mobbin, then iterate between that study and the live app running in a real browser, and keep going until the app is at least as good as the real thing. Then I did the same for Slack, then for Dropbox. Each loop ran about two to three hours. The only thing I wrote was that one instruction. Go to Mobbin, learn the UX, and loop until you have replicated it in my app. That was it.
Here is how far it got.
The Notion side picked up a page cover, an emoji picker, the full slash-command menu, headings, code blocks, numbered and bulleted lists, dividers, @-mentions of people, nested sub-pages, and the ability to create new pages. I prompted none of those specifics.
The Slack side got reactions, threaded replies that tuck under the message, a thread pane you can open, channel members, an invite flow, and public or private channels.
The Dropbox side got folders, files it generated and dropped in itself, and downloads.
It got there for one reason. It had real human work to iterate against. Notion, Slack, and Dropbox, built by people with taste, three teams of them. Hand the loop a reference and it climbs to it. That is the entire thesis sitting in one experiment.
What this fixes, and what it doesn't
So here is where I land. The thing we assumed would stay broken forever, taste, is about to stop being broken. Not because the models learned to feel, but because we stopped asking them to and started feeding them the best human work as the bar they have to clear. A whole category of stuff LLMs were bad at, the stuff that needs taste, is quietly going to get good.
But I want to repeat the line from my last post on loops, because it still holds. This does not let an agent loop drive a product forward. Matching the best reference is not the same as knowing what to build, and it is nowhere near inventing something new. Curated references close the taste gap. They do not close the judgment gap. Knowing what humans actually want, and coming up with something genuinely new, is still intrinsically ours, and I am not convinced that part is going anywhere. It lines up with what I wrote about software engineering being over: the build quality keeps climbing, and the human part keeps shifting up the stack toward taste, judgment, and knowing what is worth making.
The part I keep thinking about
If the job is increasingly about overseeing agents, there is a human cost that is easy to skip past. Fiona Fung, who leads the Claude Code teams at Anthropic, said something on Lenny's podcast that stuck with me. As engineers increasingly work on their own with their own fleets of agents, loneliness is turning into a real problem. Her team started pairwise programming lunches, where engineers sit and work side by side, not even necessarily on the same thing, and pick up new patterns just from watching how other people drive Claude Code. They run hackathons too, partly just to make sure everyone is in a room together.
That is a problem coming for a lot of us. Sit and manage agents all day and you can go a full day barely speaking to another person. But I do not read it grimly. I think it is the start of figuring out new and better ways to work, collaborating at a higher level of abstraction and letting the agents handle the floor. It fits the long arc of all of this: we keep relegating the repetitive work to machines and freeing ourselves to focus on the higher-level problems. How software teams rethink the way they work together over the next year or so is going to matter a lot, and honestly it might be the genuinely exciting part.
The fix for AI's missing taste turned out to be simple, and a little humbling. You give it ours. So point your next loop at the best work you can find, and watch it climb.