Debugging With AI Is Still a Human Process

51 messages from me. 6 hours. 3 PRs. And one human who kept saying “that doesn’t make sense.”

In the previous post I wrote about a bug that looked like a catastrophic privacy failure and turned out to be contaminated reference material in the prompt-construction layer. That post was about the bug itself. This one is about what the collaboration with AI actually looked like while finding it.

I went back and pulled the full transcript from the debugging session. Not a summary, not a reconstruction — the actual conversation between the AI and me, timestamped and complete. I had even taken screenshots during the session itself, once I realized this might make an interesting post.

The session by the numbers

My messages: 51
AI messages: 345
Hours: 6
PRs shipped: 3

One thing you will notice in the quotes and screenshots below: I am not writing carefully optimized prompts. Debugging is a flow state. The language has to match the speed of the thinking. I am talking to the AI the way you talk to a colleague when you are both deep in a problem — short, direct, no time for polish. If I had to stop and optimize every message mid-investigation, I would lose the thread.

The AI’s first instinct was wrong

When I saw a new user getting back data that looked like it belonged to someone else, I told the AI this was a critical data leak and to investigate immediately. It started with the obvious theory: cross-user retrieval in the database layer.

That was the natural first theory. It was also wrong.

The AI did incredibly useful work with minimal prompting in those first few minutes: it pulled logs, traced the request through the system, identified what the retrieval path looked like. It was fast. It was thorough. And the theory it was building was plausible enough to be dangerous.

Because if I had accepted it, I would have spent the afternoon fixing the wrong thing.

The redirections are the real work

I counted roughly 15 moments in the transcript where I redirected the AI — told it to look somewhere else, questioned an assumption, or pointed out that its explanation did not fit what I was seeing. Those moments are not interruptions. They are the actual debugging.

And sometimes the redirection is just hitting Escape. Once you can see the AI is building confidently toward the wrong conclusion, the most valuable thing you can do is stop it before it wastes time and tokens going further down that path. Letting it finish a wrong answer does not make the next prompt easier. Cutting it off and redirecting does.

Here are some of them.

Early in the investigation, the AI concluded that the system had loaded another user’s stored data. I pushed back:

Me: also check the actual embeddings because that should not have been able to happen because of the matrices and the keys

The AI had not considered the isolation layer. It was tracing the data path at the application level and had skipped the infrastructure that is specifically designed to make cross-user retrieval impossible. That redirection opened a completely different line of investigation.

Later, the AI built a theory about stale frontend state persisting across logout. That sounded reasonable. I corrected it:

Me: but they should only be called during a chat. they shouldn’t persist.

The AI was assuming the data was cached client-side and written to the new account during onboarding. I knew the architecture: core facts are loaded per-request, not stored in the browser. That killed the theory. The AI had to re-examine from scratch.

The AI found a discrepancy: four facts retrieved for a user who had only saved one. That was a real clue. But was it evidence of cross-user contamination, or something else entirely? The AI surfaced it. I had to decide what it meant.

Even after we had narrowed down to the actual root cause, the AI treated the contaminated prompt examples as a minor cleanup task — something to tidy up now that the real investigation was done. It was not. The examples were the root cause. I set the scope. Here is what that part of the conversation actually looked like:

The AI removed the names from the prompt examples — and then, very confidently, added code comments listing the original names with a note that they should never be used again. The names were still right there in the codebase, just moved from one place to another.

So the AI tried again, this time suggesting other common names as replacements. Still not right. I needed names that could never be mistaken for real children.

The example children in every prompt template are now named after ’86 Mets players.

Setting severity is a human decision

There is a moment in the transcript where the AI was describing the situation as a “nuanced partial failure” — some layers held, some did not, the impact was limited. I responded:

Me: this is a complete %#&@!$ failure

That is not a technical correction. It is a severity decision. The AI was framing the problem in a way that was technically accurate but operationally wrong. If you are building a privacy product and something that looks like a cross-user data leak turns out to be prompt contamination, the right response is not “well, at least the database layer held.” The right response is to figure out exactly where it is happening and then trace through everywhere that could access user data and make sure there are no other leaks.

After that message, the AI stopped qualifying its findings and started treating every anomaly as something that needed to be resolved before moving on.

Knowing the architecture is the unfair advantage

The single most valuable thing I brought to this session was not debugging skill or pattern recognition. It was knowing how the system is built.

The AI can read every file in the codebase, trace every log line, and generate theories faster than I can type. It has access to documentation describing the architecture, the flows, the design decisions. But having access to information and knowing it are different things. The AI does not know which theories to trust. It does not understand why the isolation layer makes cross-user retrieval impossible. It does not understand why the way data loads matters for ruling out a caching theory. It does not know that a validation layer already exists specifically for the problem it was trying to solve a different way.

Every one of those architectural facts killed a theory that the AI was pursuing confidently. Without them, the investigation would have gone in circles.

This is the part that does not show up in demos or marketing material about AI coding tools. The AI is fast, thorough, and tireless. It is also wrong in ways that are extremely convincing. The gap between “sounds right” and “is right” is filled by the person who built the thing.

What the AI was actually good at

None of this is a case against using AI for debugging. The opposite. Here is what it did well:

It pulled six hours of Cloud Run logs and found the relevant request traces in seconds. It read thousands of lines of source code across multiple services to trace a data path. It held multiple active theories in context simultaneously and updated its mental model when I redirected it. It generated targeted queries for each failure theory. It created three pull requests with tests, addressed four rounds of automated code review, and kept track of what had been fixed and what had not.

One person cannot do all of that in an afternoon without AI. The investigation would have taken days, not hours.

Then the root cause became clear: not a cross-user data leak, but hardcoded example data in a prompt template that the model treated as real. The AI found the line. The human had already narrowed the search to the point where that line was the only thing left.

But the AI did not choose which theory to pursue first. It did not decide when to abandon a plausible explanation. It did not set the severity posture. It did not know that the architecture made certain failure modes impossible. And it did not decide that the fix should become prevention rather than a patch.

The fix is not the fix

Finding the root cause is not the end. Cleaning up the contaminated examples is not the end either. Cleanup is a one-time action. The same class of problem can come back the next time someone logs a real scenario for testing, or the next time an example gets copy-pasted into a prompt template without thinking about what data is in it.

So the real work after the debugging session was turning the fix into prevention. That meant: document the failure mode. Turn it into a rule. Add an automated check. Put a hard stop in CI so it cannot quietly ship again.

In this case, the names in every prompt example got replaced. An ESLint rule now warns if any name outside the approved example set appears anywhere in the codebase. The rule lives in the linter, not in a wiki page. It runs on every commit, not when someone remembers to check. And the principle got written into the project’s development rules, so every future AI session starts with the context that real PII in example material is not allowed, even in comments, even in test fixtures.
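To make that kind of guard concrete, here is a minimal sketch of the core check in Python. The project’s actual safeguard is an ESLint rule, and the names and function below are invented for illustration; nothing here is taken from the real rule.

```python
import re

# Hypothetical lists for illustration only; the real rule maintains its own.
BANNED_NAMES = ["Alice", "Bob"]          # real-sounding names that must never appear
ALLOWED_EXAMPLES = ["Mookie", "Keith"]   # safe stand-ins, '86 Mets style

def find_banned_names(source: str) -> list[str]:
    """Return every banned name that appears as a whole word in source."""
    return [name for name in BANNED_NAMES
            if re.search(rf"\b{re.escape(name)}\b", source, re.IGNORECASE)]

# A CI gate would run this over every file and fail the build on any hit:
sample = 'const example = "Alice asked about bedtime stories";'
print(find_banned_names(sample))  # ['Alice']
```

The important property is not the regex but where the check runs: in the linter and in CI, on every commit, rather than in a document someone has to remember to consult.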

That is the part the AI cannot initiate. It can build every one of those safeguards. It can write the ESLint rule, update the CI config, rewrite the prompt templates. But the decision that a one-time cleanup is not enough — that the fix needs to become a rule, the rule needs to become a check, and the check needs to become a hard stop — that decision comes from the person who understands what kind of product this is.

Debugging is not just finding the bug

The thing I keep coming back to from this session is that the AI never once said “wait, that does not make sense.” It moved confidently from theory to theory. It never questioned its own framing. When I gave it new information, it incorporated it smoothly and kept going. But it never generated the doubt on its own.

The doubt came from me. Every redirection in that transcript started with me noticing something that did not fit. The AI processed the redirections beautifully. It just never initiated them.

That is not a flaw in the tool. It is a description of what the collaboration actually looks like. The AI raises the floor of what one person can accomplish in a debugging session. The human supplies the judgment, the architectural knowledge, the severity calibration, and the willingness to say “I do not believe you yet.”

After the session, the prompt contamination became a rule. The rule became a check. The check became a hard stop in CI. The names got replaced. An ESLint rule warns if any name outside the approved example set ever reappears. And every future AI session starts knowing that real PII in example material is not acceptable.

The AI did not decide any of that. It built all of it.

This is the sixth in a series about building THE WHEEL. The first, AI Raises the Floor, is about what AI actually enables when the floor of accessible expertise rises. The second, You Can’t Skip the Failures, is about what happens when you use that raised floor to explore everything at once. The third, The Unsolved Search, is about the architecture problem behind private, multi-tenant semantic search. The fourth, Every Pixel Is a Decision, is about what happens when product decisions stop being abstract. The fifth, Small Things Can Look Catastrophic, is about what it feels like when a small bug convincingly impersonates a catastrophic one. This one is about what the collaboration with AI actually looked like from the inside.