Under 10% of everything I built over the past four months will ship at launch.
914 stories planned. Over 300 executed. More than 1,800 pull requests merged.
Some of that code was completely discarded — rewritten from scratch, sometimes three or four times, until the architecture was right. The chat system was rebuilt five times. The encryption layer was thrown away entirely and reimagined. Code that was wrong, replaced by code that carried forward everything the wrong version taught.
Some of it wasn’t discarded at all. Thirty-six entire feature branches — monitoring dashboards, Kubernetes infrastructure, collaboration tools, team management, notification systems, voice input, advanced threat detection — were built, tested, and moved to future branches. Not because they failed. Because they aren’t needed for launch. They’re built. They work. That’s not waste. That’s a pre-built backlog.
But either way — rewritten or shelved — the code that ships at launch is a fraction of the code that was written. And that’s the point.
In most companies, that number is a catastrophe. Someone writes a postmortem, someone updates the process doc, and everyone quietly agrees to be more careful next time.
I don’t think that’s right. I think the discard rate is the point.
If you’re building something genuinely new — not copying an existing pattern, not optimizing a known solution — you can’t skip to the good version. You have to try everything. The failures aren’t overhead. They’re raw material.
Amy Edmondson at Harvard Business School calls these “intelligent failures” — failures that happen in new territory, in pursuit of a goal, informed by available knowledge, and no larger than necessary to get the learning. Organizations that treat all failure as bad don’t just fail to learn from it. They fail to produce the right failures in the first place.
This is the story of what it looks like when you actually do it.
Vibe coding at scale
When I started building THE WHEEL in October 2025, I made a deliberate choice: explore everything at once.
I’ve always been a systems connector — building tools to hook things together across every job I’ve had. But a full production application — with encryption, cloud deployment, inference endpoints, serverless configuration, security architecture, and all the nuance that comes with actually shipping? By myself? That was a first.
I’d always had people around me who knew more than I did about each piece — the backend engineer who understood connection pooling, the DevOps lead who knew the deploy pipeline, the security architect who could design the key hierarchy. This time it was just me. At first it came together frantically. Then it started becoming cohesive, because doing all of it myself meant I could see how the pieces connected in a way that’s hard to see when each piece belongs to a different team.
That cuts both ways. I hit failures that any experienced backend engineer would have seen coming. But when you’re a specialist in one domain, your expertise becomes a constraint as much as an advantage. It’s hard to explore because you already know the answer. Even my own years of product management worked against me at first — I started with hierarchical stories, epics to features to stories to tasks, because that’s how it’s always done. Eventually I threw that out too.
In that first month alone, nearly two hundred stories across more than twenty feature areas. Five machines running simultaneously, each with multiple AI agents. My office looked like a trading floor.
Andrej Karpathy coined the term “vibe coding” in February 2025 to describe building software where you “fully give in to the vibes, embrace exponentials, and forget that the code even exists.” I decided to try it at scale — letting AI agents explore entire technical domains at speed. With AI that can explore technical approaches in hours instead of months, the fastest way to learn which architectures will work is to just try. Yes, all of them.
Some of it held up well. The upload infrastructure was about 60% correct on the first try. The Stripe payment integration worked and still runs in production.
Much of it failed in useful ways. My first approach to encrypted search used deterministic tokens — hash each search term with a user-specific key so the server can match tokens without seeing plaintext. Elegant in theory. Fatal in practice: because every occurrence of a word produces the same token, the server can count how often each token appears and infer the plaintext through frequency analysis. The realization that privacy-preserving search requires mathematical isolation, not just encryption, came from watching this approach break (more on that in my next post).
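To make the flaw concrete, here’s a minimal sketch of the deterministic-token idea. It isn’t the actual implementation; the function name and the HMAC construction are illustrative.

```ts
// Illustrative sketch of the deterministic-token approach, not the real code.
// Uses Node's built-in crypto module.
import { createHmac } from "node:crypto";

// Every occurrence of a term is hashed with the same per-user key,
// so the same word always maps to the same token.
function searchToken(term: string, userKey: Buffer): string {
  return createHmac("sha256", userKey).update(term.toLowerCase()).digest("hex");
}

// That determinism is the leak: the server never sees plaintext, but it can
// count token occurrences, and token frequencies mirror word frequencies.
// Common terms ("meeting", "invoice") become guessable without any decryption.
```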
And some of it just failed. A route audit revealed that 131 API endpoints had been built, tested, documented, and never connected to anything. Features built into the void.
By the end of October, most of the code would eventually be discarded or shelved. But 100% of the lessons survived.
Making it work for real
In November, I stopped exploring and started connecting. On November 2, a story called ENCRYPT-1 merged to the main branch. Its completion report claimed zero TypeScript errors, 95% test coverage, and production-ready encryption. Within hours, I found the truth: 40+ compilation errors, 0% test coverage — the tests had never actually run — and a completely non-functional system.
The AI agent hadn’t deliberately deceived. It had started the test suite, never checked whether it actually ran, and then generated plausible output describing what the results should have looked like. Two weeks later, ENCRYPT-7 did it again — reported a 53% failure rate as passing, was sent back for fixes, and reported the same failure rate as passing again.
I wrote in my first post that the human is the bullshit detector. Without human judgment, confident-sounding output has no ground truth. The AI can pass its own test by writing the answer key.
This is the part most vibe coding discourse skips. As Simon Willison put it: “If an LLM wrote every line of your code, but you’ve reviewed, tested, and understood it all, that’s not vibe coding — that’s using an LLM as a typing assistant.” The exploration phase needs the vibes. The production phase needs the opposite. A 2025 randomized controlled trial by METR found that experienced open-source developers were actually 19% slower when using AI coding tools, despite believing they were 20% faster. By September 2025, Fast Company was reporting that the “vibe coding hangover” had arrived, with senior engineers describing “development hell” when working with AI-generated codebases.
The response was to stop trusting and start verifying. Work on one file at a time. Validate after every single change. None of the encryption code survived. The process improvements were permanent.
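For concreteness, here’s a sketch of what that verification gate looks like in spirit, assuming a Jest-style runner that can write a JSON summary. The commands and field names are illustrative of the idea, not the project’s actual tooling: the gate reads the runner’s real output instead of the agent’s report of it.

```ts
// Illustrative verification gate: trust the test runner's own output,
// never the agent's description of it. Assumes a Jest-style JSON report.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

function verifyTestRun(): void {
  try {
    // Run the real suite and ask it for a machine-readable summary.
    execSync("npx jest --json --outputFile=jest-results.json", { stdio: "inherit" });
  } catch {
    // Jest exits non-zero on failures; the summary below decides what happened.
  }

  const summary = JSON.parse(readFileSync("jest-results.json", "utf8"));

  // A "passing" run with zero executed tests is exactly the ENCRYPT-1 failure mode.
  if (!summary.numTotalTests) {
    throw new Error("No tests were executed; refusing to mark the story done.");
  }
  if (summary.numFailedTests > 0) {
    throw new Error(`${summary.numFailedTests} test(s) failed; the story is not done.`);
  }
}

verifyTestRun();
```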
The worst day and the rewrite
November 21 was supposed to be the day chat finally worked. Then I actually used it.
The first bug: chat services creating their database client at module import time, before environment variables loaded. Every request failed. That fix revealed mock embeddings in the retrieval service — random vectors instead of real ones. Tests passed because tests also used mocks. In production, the AI got irrelevant context injected into every response — and I didn’t catch it immediately because the responses were plausibly wrong rather than obviously wrong.
That fix revealed an infinite clarification loop. That revealed a WebSocket event name mismatch. That revealed a response format mismatch. Thirty pull requests landed that day. Every fix exposed something silently broken underneath — a pattern well-documented in the literature.
By the end of the day, two things were clear. The five critical rules that would govern all future development — no module-level database connections, no unregistered routes, no missing security policies, no unauthenticated API calls, no silent error handling — had all been discovered through direct, painful experience. And this architecture couldn’t be fixed incrementally.
The next morning, I stopped patching and started over. Not the whole system — the chat pipeline specifically. Brooks advised: “Plan to throw one away; you will, anyhow.” But what Brooks couldn’t have anticipated is what happens when AI compresses the timeline. The five weeks of wrong answers weren’t five weeks of typing. They were five weeks of learning.
The execution took seventy-two hours. The same scope that had taken five weeks — producing a system that didn’t work — was rebuilt in three days, producing one that did. That pipeline is still the architecture running in production. The rewrite didn’t succeed despite the failures. It succeeded because of them.
The process was its own prototype
The workflow went through eight major versions in four months. Each version was triggered by a specific failure.
Version 1 was “we need a process.” Then a quality review found 47 issues, so Version 2 added evidence requirements. Then ENCRYPT-1 lied, so Version 3 added an AI learnings library. Then ENCRYPT-7 lied twice, so Version 4 added micro-task validation. Then the 131 orphaned routes were discovered, so Version 5 added integration-first planning. Then UI rework cycles wasted weeks, so Version 6 added visual mockups. Then 156 anti-pattern violations were found, so Version 7 automated enforcement. Then analysis showed 80% of remaining rework came from just five root causes, so Version 8 shifted from catching problems to preventing them from being possible.
Eight versions. Each one a response to a failure the previous version couldn’t prevent. The process itself was a prototype, iterated the same way as the product. Cannon and Edmondson call this “fail intelligently”.
And the lessons compounded. The lazily-initialized database connection discovered during the November 21 crisis became a documented fix, then a rule, then an automated check. Thirteen days from catastrophic discovery to default practice. When the notebook system hit the same bug two weeks later, it took hours instead of days. When the follow-up system was built after that, it used the correct pattern from day one.
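The pattern itself is small enough to sketch. This is illustrative rather than the project’s real code, with a hypothetical createClient standing in for whatever database SDK is actually in use:

```ts
// Illustrative sketch of the rule, not the project's real code.
// `createClient` is a stand-in for the actual database SDK.
type DbClient = { query(sql: string): Promise<unknown> };
declare function createClient(url: string): DbClient;

// Anti-pattern: creating the client at module import time runs before
// environment variables are loaded, so every request fails.
//   export const db = createClient(process.env.DATABASE_URL!);

// Fix: lazy initialization. The client is created on first use, after the
// environment is known to be loaded, and a missing URL fails loudly.
let client: DbClient | null = null;

export function getDb(): DbClient {
  if (!client) {
    const url = process.env.DATABASE_URL;
    if (!url) throw new Error("DATABASE_URL is not set");
    client = createClient(url);
  }
  return client;
}
```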
Each lesson multiplied across every domain it touched. The lessons didn’t just prevent repeats of the same failure. They transferred. The failures compounded into capability.
The route registration problem followed the same trajectory. The 131 orphaned routes became a planning requirement, then an automated validation, then a commit-level block. By Version 7, no new route was orphaned. Ever.
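The enforcement behind that is conceptually simple. Here’s an illustrative sketch, assuming an Express-style layout where every module in src/routes/ has to be mounted from a single app entry point; the paths and conventions are stand-ins, not the project’s actual structure.

```ts
// Illustrative orphaned-route check, not the project's real validator.
// Assumes every file in src/routes/ must be referenced from src/app.ts.
import { readdirSync, readFileSync } from "node:fs";
import { basename } from "node:path";

const appSource = readFileSync("src/app.ts", "utf8");

const orphaned = readdirSync("src/routes")
  .filter((file) => file.endsWith(".ts"))
  .map((file) => basename(file, ".ts"))
  // A route module never referenced from the entry point is built into
  // the void: handlers exist, but nothing can reach them.
  .filter((name) => !appSource.includes(name));

if (orphaned.length > 0) {
  console.error(`Orphaned route modules: ${orphaned.join(", ")}`);
  process.exit(1); // block the commit / fail the CI job
}
```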
The infrastructure evolved the same way. More than 1,800 pull requests meant GitHub Actions bills were getting expensive. So I took a spare machine and set up a local runner — configured the networking, the Docker environment, the webhook integration. The raised floor made it a weekend project instead of a blocked dependency.
And when the process got serious about production, I added another layer: AI reviewing AI. Copilot review instructions for every pull request — what to check for, which patterns to flag, which anti-patterns to block. The AI writes the code, a different AI reviews the code, and the human sits in the middle deciding what to accept. The human judgment doesn’t go away. It gets amplified.
The discard rate
Over 300 stories executed. More than 1,800 pull requests merged. Under 10% of the resulting code will ship at launch. Some was rewritten — the chat pipeline five times, the encryption architecture thrown away entirely. Some was shelved — thirty-six feature branches built out, tested, and moved to future branches. Not broken. Just not needed yet. When those features come back in, they can be adapted in days instead of built from scratch over weeks.
And it’s not just the documentation that survives. Every conversation is saved: the questioning, the wrong turns, the moment an idea started to evolve before it ever made it to a document. The documentation records the answer. The conversations record how it was reached: which dead ends were explored, which assumptions were challenged, which constraints were discovered. The document gives you the what. The conversation gives you the why.
And it means new ideas don’t get lost. The process that emerged from all the failures didn’t just make execution faster. It made the space for new ideas permanent. Capturing an idea is as simple as writing it up as a story and putting it on the roadmap. Nothing lives only in someone’s head, and nothing has to be rebuilt from a vague memory six months later.
Zero percent of the first encryption code survived to production. One hundred percent of the encryption lessons survived. Zero code, all knowledge.
Code survival rate went from 40% in Version 1 to 85% by Version 7. Rework dropped from 60% to 15%. This trajectory is the opposite of what the industry is seeing — a GitClear analysis found that as AI coding tools became widespread, code churn nearly doubled while refactoring dropped by more than half. More code generated, less code maintained, more code thrown away without learning from it. The vibes produce volume. They don’t produce knowledge.
Most software projects follow a predictable curve. Fast early progress, then complexity accumulates, bugs multiply, and the team spends more time fixing old things than building new ones. THE WHEEL hit that wall multiple times. Each time, the natural response would have been to slow down and treat complexity as a tax.
The approach that worked was treating failures as deposits instead of costs. Development velocity was higher in month four than month one, despite a codebase ten times larger, because the accumulated knowledge eliminated entire categories of bugs before they could occur.
What this is actually about
I’ve been describing a development process. But the thing I built is a knowledge engine — a system designed around the thesis that knowledge compounds over time. The development was the first proof.
The vibes get you into the territory. The failures map it. But neither tells you what to do next. That’s the human part. The AI can explore a hundred approaches in parallel. It can’t tell you which three are worth pursuing. That judgment comes from the person who’s been sitting with the problem long enough to have an instinct about it. The failures give you the data. The human gives you the direction.
The teams that vibe-code their way to production without ever building the verification layer, the learning capture, the institutional memory — they’re the ones describing “development hell” six months later. Generating code is the easy part. Knowing which code to generate, and when to stop generating and start choosing — that requires both the failures and the human who learned from them.
You can’t get there without them. The 72-hour rewrite required five weeks of wrong answers first. The Version 8 workflow required seven versions that couldn’t prevent the problems it prevents. The encryption architecture required approaches that didn’t work. Skip the experimentation and you skip the knowledge.
Four months of failures across four fields, each depositing a constraint, until the constraints defined a solution space with exactly one answer in it. That answer is the subject of my next post.
The failures were the mechanism, not the obstacle.
That’s the thing I couldn’t have known at the beginning, because I hadn’t made the failures yet. You can’t skip them. You shouldn’t want to.
A note on universality
The lessons here are universal. The steps are not.
I couldn’t hand this workflow to another team with a different codebase and expect it to work. The specific rules are artifacts of these failures. But the process of creating them — try everything, watch it break, capture why, enforce it so it can’t recur — that’s what compounds.
YC’s spring 2026 Requests for Startups calls for a “Cursor for product management” — a tool focused on helping teams figure out what to build, not just how to build it. The framing is right: code is only part of the equation. The hard part is everything that comes before and around the code — how you store the information your team generates, how you allow knowledge to compound across people and projects. That’s the layer that matters. That’s what turns vibe coding into building.
And this is the part that gets lost in the excitement about AI development: you can’t just hand an AI a problem and expect it to produce working code when it doesn’t know the constraints of the actual codebase it’s working in. The AI doesn’t know what it doesn’t know. You have to teach it — not by training models, but by figuring out where it’s most likely to go wrong in your specific problem space and making sure it has what it needs to prevent those errors. That means letting it make the failures first. People get frustrated when AI development doesn’t just work out of the box, but that frustration comes from expecting the AI to already know things it has no way of knowing. The teaching is the work. And the teaching comes from the failures.
My advice for teams: start small. Pick one thing. Let the lessons from that one thing inform the next. The compounding works the same way whether you’re one person or fifty — but it has to start from your failures, not someone else’s.
And if you’re a startup founder — the best time to set up your team’s knowledge infrastructure is at the beginning, before the lessons start accumulating and institutional memory ends up living only in people’s heads.