I’ve been getting ready to send out UAT invites to those of you who have been warned, I mean, asked nicely. Which meant doing the least glamorous and most important kind of testing: creating fresh users, going through onboarding from scratch, entering personal data, and then asking the same questions over and over again.
That repetition matters. If you want to know whether a system is stable, you do not test it once with a clever edge case. You do the same setup repeatedly, with minor tweaks, until the differences start to mean something.
Solo founding also means solo testing. And in practice, that meant I had relaxed one of my own testing standards: never use real PII as sample data.
I had created new users before. I had asked the same question before. The one thing I had not done before was create a synthetic child profile inside the new account and then ask, immediately afterward, “how many children do I have?”
That new user asked how many children they had. The system answered with the names of my kids.
My test accounts contained actual information about me, including the names and ages of my three kids. Part of that was convenience. Part of it was that it genuinely made QA better. When I entered facts about one of my actual kids and then asked the system questions, I could evaluate the response with human judgment, not just technical judgment. I could tell whether the answer felt right, not just whether it was structurally correct. I was testing as a founder, but also as a mom.
Which is why the failure was so scary: these were not random model-generated names. They were the exact names of my three children.
This is the kind of sequence that instantly looks catastrophic. A brand-new test user with one synthetic child should not answer like this.
If you are building a private knowledge system, that screenshot has only one honest first interpretation: assume this is a serious privacy failure until you can prove it is not.
Why it looked so bad
This was not just a weird answer. It looked like the exact nightmare the architecture is built to prevent: a newly created account appearing to see personal family data from somewhere else.
That kind of symptom points at all the scary layers at once. Maybe retrieval boundaries were broken. Maybe the isolation layer was crossing users. Maybe the logout flow was leaving stale frontend state behind. Maybe onboarding was writing the wrong account context. Maybe the answer model was seeing the wrong prompt context. Maybe several of those things were failing at the same time.
The hardest part of debugging this kind of issue is that small bugs often perfectly mimic large ones. A tiny mistake in the response path can look, from the outside, exactly like a total failure of retrieval isolation.
The wrong move is to trust the first explanation that feels satisfying
The temptation in a moment like that is to grab the first plausible root cause and build the whole story around it. In this case there were several attractive bad theories.
Same browser session? That could mean stale frontend state. A synthetic child entry? That could mean the new profile merged with an old one. A family-related question? That could mean retrieval crossed an account boundary. All of those theories fit the screenshot. And this is one of the places AI is incredibly useful: it can help you surface the worst-case scenarios quickly, and then systematically test them. In this case that meant generating targeted queries for each failure theory, pulling the relevant logs, and comparing actual behavior against expected behavior across every layer — in an afternoon, not a week. None of those theories survived testing.
So the job became subtraction. Noticing what didn’t fail was the only way to narrow it down.
The new user account had been created correctly. The synthetic child had been saved correctly. The account had exactly the kind of background data I expected it to have. The entity and session validation path was working. The account-specific context was being loaded from the right place. The isolation layer was doing what it was supposed to do. Even a real decryption issue I found in one path was not actually the source of the answer that showed up on screen.
That is the side effect of following a scare all the way through: you end up with the process documented, so that when one of those pieces actually does fail later, you already know where to look, what to pull, and what to rule out. You’ve already debugged an issue you haven’t had yet.
The second test made it worse, and narrowed the problem
The follow-up test made it even worse. I asked what I should get John for his birthday, and the answer mixed John’s name and interests with my kids’. I don’t even have a screenshot of that one because I closed it so quickly.
But once the bigger failure theories started to fall apart, the next step was obvious: trace exactly how chat was getting its context.
What the trace eventually proved
Tracing the logs was the whole thing. That was not a side detail or a supporting step. It was the method. I had to walk the exact request through the chat flow, the retrieval path, the account-context loading path, and the final response path. Not in theory. In logs. In the actual request.
What changed the story was not that the logs revealed some dramatic new failure. It was that they kept removing suspects.
One by one, the scary layers checked out. Retrieval boundaries were holding. The isolation layer was not crossing users. The right account-specific context was loading from the right place. The system was not retrieving another user’s stored private data across account boundaries.
That was the real turning point: it was not the system underneath. It was the prompt-construction layer (the code that assembles what the model actually sees before answering) wrapped around it.
Earlier in development I had built prompt construction around structured templates plus a small set of example scenarios showing how those templates should be used. I had tried using an LLM call to build the prompt for the actual LLM call, but that quickly became too much AI-on-AI-on-AI and too hard to reason about. Structured templates kept things centered. The examples came from real scenarios I had logged while testing. What I had not fully realized was that some of those examples still contained information about my kids.
So the bug was not cross-user retrieval. It was contaminated reference material making its way into generation. Smaller models are especially prone to treating examples like live context, and that is exactly what happened here.
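The failure mode is easy to sketch. This is not the real codebase — the template, the helper, and the names (Ava, Ben, Cleo as placeholder children) are all hypothetical — but it shows how logged examples baked into a prompt template can carry data from one account into a brand-new one:

```python
# Minimal sketch of the contamination path: a structured template plus
# logged example scenarios, assembled into one prompt. Everything here
# (names, template, helper) is illustrative, not the actual system.

TEMPLATE = """You are a family assistant.
User context:
{context}

Example interactions:
{examples}

Question: {question}"""

# Example scenarios captured during earlier testing. The bug: these
# carry real family data instead of synthetic placeholders.
LOGGED_EXAMPLES = [
    "Q: how many children do I have? A: Three: Ava, Ben, and Cleo.",
]

def build_prompt(context: str, question: str) -> str:
    """Assemble the final prompt the answer model actually sees."""
    return TEMPLATE.format(
        context=context,
        examples="\n".join(LOGGED_EXAMPLES),
        question=question,
    )

# A brand-new account with one synthetic child...
prompt = build_prompt(
    context="children: [Testchild, age 4]",
    question="how many children do I have?",
)

# ...still carries the old names into the model's context, right next
# to the legitimate account data. A small model can treat either as truth.
assert "Ava" in prompt        # contamination: not from this account
assert "Testchild" in prompt  # the real (synthetic) account context
```

Every layer below this point can be perfectly isolated and the output still leaks, because the leak rides in with the reference material rather than the retrieved context.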
That bug was still serious. It still produced a terrifying symptom. It still needed a real response. But it was not what it first appeared to be.
Treating it like a catastrophe was the point
There is a lazy version of this story where the moral is: see, it wasn’t actually a catastrophe. That is not what I took from it.
The real lesson was that treating it like a catastrophe is what made the deeper truth visible. If I had hand-waved the symptom away as “probably just a prompt bug,” I would not have learned anything trustworthy about the system underneath it.
Because I followed it all the way down, I came away with something more useful than a patch: a specific map of what failed and what did not.
Small bugs matter because they impersonate big ones
A hardcoded example carrying real family data into a prompt is a tiny mistake relative to a full-stack privacy architecture. It is not a deep cryptographic flaw. It is not mathematically interesting. It is the kind of error that looks dumb in retrospect.
But small mistakes in systems like this can impersonate catastrophic failures with shocking realism. That is why they matter. Not because they prove the whole system is unsound, but because they sit close enough to the blast radius of the real nightmare that you cannot treat them casually.
In this case, a prompt contamination bug looked exactly like a cross-user privacy breach from the outside. The only way to tell the difference was to follow the path all the way down.
The system was legible enough to be interrogated
The reason this story has a specific ending at all is that the system produced enough evidence to be understood. The traces existed. Every stage of the flow — account creation, retrieval, prompt construction — could be interrogated separately.
If everything had been opaque, the investigation would have stalled at “something terrible might be happening.” Instead, it reached a very specific conclusion: this piece works, this piece works, this piece works, and this exact prompt-construction layer is the thing that poisoned the output.
That is what observability is for. Not dashboards for their own sake. Not logs for their own sake. A good system lets you tell the difference between structural failure and ugly, localized bugs.
And once I understood it, the fix could not just be “clean up the examples.” Cleaning up is a one-time action; the same kind of contamination could reappear the next time someone logs a real scenario for testing. So the fix had to become prevention: remove the real data, document the failure mode, turn it into a rule, add automated checks, and put a hard stop in CI and linting so contaminated reference material cannot quietly ship again.
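That kind of hard stop can be very small. A minimal sketch, assuming a denylist of real names kept outside version control and a hypothetical `prompts/examples` directory — none of these paths or names come from the actual project:

```python
# Sketch of a CI guard that fails the build if any denylisted term
# appears in prompt examples or test fixtures. Paths, names, and the
# denylist itself are hypothetical.
import pathlib
import re

# Terms that must never ship in reference material. In practice this
# would be loaded from a local, untracked file, not hardcoded.
DENYLIST = ["Ava", "Ben", "Cleo"]

def scan(paths):
    """Return (path, line_no, term) for every denylisted term found."""
    hits = []
    for path in paths:
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            for term in DENYLIST:
                if re.search(rf"\b{re.escape(term)}\b", line):
                    hits.append((path, i, term))
    return hits

def main(root: str) -> int:
    """Scan every file under root; return a nonzero exit code on hits."""
    base = pathlib.Path(root)
    files = [p for p in base.rglob("*") if p.is_file()] if base.is_dir() else []
    hits = scan(files)
    for path, line_no, term in hits:
        print(f"{path}:{line_no}: contains denylisted term {term!r}")
    return 1 if hits else 0

# CI wiring, e.g. in the lint job: sys.exit(main("prompts/examples"))
```

The nonzero exit code is what turns documentation into a hard stop: the rule fails the pipeline instead of relying on anyone remembering it.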
Ordinary behavior is where the real bugs live
UAT is not really about testing for edge cases you already know about. It is about finding the things that only appear when users do ordinary things in ordinary ways. Create an account. Go through onboarding. Save one normal-looking fact. Ask one normal question. The dangerous bugs are often hiding there, in the path that feels too basic to fail.
The answer was wrong in a way that looked unforgivable. But following it all the way through taught me two things at once: small prompt-level mistakes can create symptoms that look indistinguishable from major privacy failures, and if you have the right instrumentation, those moments can increase confidence in the core system instead of destroying it.
That is not only comforting while you are in the middle of it. It is useful afterward, because the investigation becomes reference, then rule, then check, then hard stop.