The Hardest Part of AI Is Shutting Up

On building an AI assistant that knows which capability to use, when to ask for more context, and when the smartest thing it can do is stop.

There’s a quality that almost every AI product shares: confidence. Ask it anything and it will give you an answer. A fluent, well-structured, completely assured answer. Even when it’s wrong. Even when it doesn’t have the information. Even when the honest response would be “I don’t know” or, more importantly, “I need one more piece before I answer that.”

Some of this is structural: language models predict the next token, and the next token is never nothing. Some of it is trained in on top of that: the fine-tuning that makes models “helpful” rewards answering over abstaining. Two reasons to keep talking when it should stop, and the user has no reliable way to tell “I found this in your information” from “I generated something that sounds like it could be.”

The trained-in part has a name: sycophancy. It became hard to ignore in April 2025, when OpenAI rolled back a GPT-4o update four days after release because it had started endorsing users’ delusional beliefs and praising bad ideas uncritically. The research backs this up: Cheng et al. found AI affirms users fifty percent more than humans do, and Sharma et al. found that both humans and preference models prefer sycophantic responses over correct ones a non-negligible fraction of the time.

But sycophancy is the visible part. The deeper problem is one layer earlier: before you can make an AI more honest, you have to decide what kind of answer the question deserves in the first place.

When I started building, I already knew the difference between a classifier and a generative model. I had spent years working with both. But the generative models had gotten so capable that I thought maybe the distinction didn’t matter anymore. One model, give it tools, let it figure out the routing on its own. The large AI systems work this way now. They have tool calls, they show you “thinking” traces, they can search the web or run code mid-conversation. So I tried the same pattern. The problem I kept hitting was that the routing and the generation were inseparable. The model’s instinct about what kind of answer I wanted was shaping which tools it reached for, which meant the decision about how to answer was already made inside the answer itself.

The first question is not what the model should say

The first question is: what kind of question is this?

If someone asks what the weather is, that’s one kind of problem. If they ask what they told me last week about their son’s school situation, that’s another. If they ask what to get Darryl for his birthday, that’s not one question at all. It is several hiding inside one sentence: who is Darryl, what do I know about him, what has he mentioned wanting, what budget exists, and does the user want a factual reminder or an actual recommendation? “What did we decide in last quarter’s planning meeting about the API migration?” is the same shape of problem in a work context: which meeting, which quarter, what counts as a decision versus a discussion, and is the answer in a document, a conversation, or both?

The first version of the system didn’t make that distinction. I did what felt natural: take the user’s message, send it to a generative model with as much context as I could attach, and let the model sort out whether it was dealing with memory retrieval, live search, synthesis, or small talk. It worked sometimes. Other times I’d get three paragraphs of confident mush because nothing in the architecture had ever decided what kind of question it was looking at. The model was improvising the routing and the answer simultaneously, and when it got the routing wrong, the answer inherited the mistake.

The fix was boring. Genuinely boring.

Before anything else happens, classify the message. Does this question need the user’s saved knowledge? Does it need live information? Does it depend on conversation history? Does it require actual reasoning, or is the answer already sitting in the data, just waiting to be returned cleanly? Not glamorous. Not creative. But that first step determines everything that follows.
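If I were to sketch that classification step in code, it would look something like this. Everything here is illustrative: the flag names and the keyword heuristics are my stand-ins, not the production classifier, which is a small model rather than a rule list. The shape is what matters: flags in, flags out, no prose.

```python
from dataclasses import dataclass

@dataclass
class RouteFlags:
    """Hypothetical classification output: a set of flags, not prose."""
    needs_memory: bool = False      # user's saved knowledge
    needs_live_data: bool = False   # web / external lookup
    needs_history: bool = False     # conversation context
    needs_reasoning: bool = False   # synthesis vs. plain lookup
    is_greeting: bool = False

def classify(message: str) -> RouteFlags:
    """Toy stand-in for the small internal model's classification call."""
    text = message.lower().strip()
    if text in {"hi", "hello", "hey"}:
        return RouteFlags(is_greeting=True)
    flags = RouteFlags()
    if "weather" in text or "nearby" in text:
        flags.needs_live_data = True
    if "last week" in text or "we decide" in text:
        flags.needs_history = True
    if text.startswith(("what should", "should")):
        flags.needs_reasoning = True
    return flags

assert classify("hey").is_greeting
assert classify("What should I get Darryl for his birthday?").needs_reasoning
```

The routing never generates anything, which is exactly why it cannot hallucinate its way into an answer.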

The expensive lesson about cheap work

The system always had two layers: a smaller model running inside our own infrastructure, and a frontier model for the final response. The mistake was assuming the internal model could do everything generatively with specialized prompts: classification, extraction, greeting detection, routing. One model, one mode, every stage handled the same way.

The fix was splitting those stages apart. Now, before the model that writes the response ever sees the question, a separate step has already decided what kind of question it is, what data it needs, and whether a generative response is warranted at all. The classification call returns a set of flags, not prose. A greeting comes in, a flag comes back, the greeting gets a greeting, and the expensive model never sees it. Hallucination has no place at that stage. Every time I let a model handle routing generatively, it would start answering before the routing was done, and the routing would bend to accommodate whatever answer it had already begun constructing.

No step in the pipeline is immune to mistakes. But when the classification is wrong, the failure mode is a wrong turn, not a wrong answer, and a wrong turn is easier to detect and recover from.

There’s another class of work that took me longer to separate out: extraction. If the system already retrieved the relevant context, a lot of the remaining work is just pulling structure out of what’s already there. If someone asks “When is Darryl’s birthday?” and the answer is sitting in a saved note, the job is to pull the date out, not to write a paragraph about it. Extraction benefits from precision and restraint, not eloquence.

Generative calls, whether to the internal model or the external one, are the stages with the most upside and the most risk. It took me a while to treat them that way instead of reaching for generation as the default at every stage.

Retrieval is really several different questions

Even after the system decides a question needs context, the natural impulse is to say “great, search everything.” I know because I built it that way first. The retrieval layer searched all available context for anything semantically close to the query and stuffed the results into the prompt. It felt thorough. The results were mediocre.

The problem is that different questions need different retrieval behavior. A question about a specific person is not the same as searching long-form documents. A question that depends on something you just said five minutes ago is not the same as a question about long-term saved background. A question about the best restaurant nearby is not solved by memory at all; it needs live data, and ideally it needs live data tied to an actual location instead of whatever generic city the model feels like imagining.

The API migration question from earlier is a good example. To the user it is one question. To the system it might mean: search meeting notes for “API migration,” filter to last quarter, look for decision language versus exploratory discussion, pull any follow-up tasks that were created afterward, and check whether the conversation continued in chat after the meeting. That is not one retrieval step. It is orchestration. The Darryl birthday question is the same pattern: look up the person, search documents, pull conversation history, maybe search the web for gift ideas within a budget.
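That orchestration can be sketched as a plan, where every step name below is hypothetical and stands in for a real retrieval backend:

```python
def plan_retrieval(question_kind: str) -> list[str]:
    """One user question expands into an ordered list of retrieval steps.
    All step names here are illustrative; the shape is the point."""
    plans = {
        "meeting_decision": [
            "search_meeting_notes:API migration",
            "filter:last_quarter",
            "match:decision_language",
            "pull:follow_up_tasks",
            "check:chat_after_meeting",
        ],
        "gift_recommendation": [
            "graph_lookup:person",
            "search:documents",
            "pull:conversation_history",
            "web_search:gift_ideas_within_budget",
        ],
    }
    # Unrecognized kinds fall back to a single flat semantic pass.
    return plans.get(question_kind, ["semantic_search:query"])

assert len(plan_retrieval("meeting_decision")) == 5
assert plan_retrieval("weather") == ["semantic_search:query"]
```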

Semantically close and actually useful are not the same thing.

This matters. When I was using one flat search pass for everything, the system would surface documents that felt related while still not answering the question. A model given that pile of half-relevant material does what models do best: connect dots that were near each other and present the result as if the dots were already connected. That is how grounded systems quietly become ungrounded. The research calls this context faithfulness failure: the model has the right information in its context window and still generates something else. I didn’t understand how common it was until I saw it happen repeatedly with questions I knew the system had the right context for.

The thing that changed the retrieval quality most was adding a knowledge graph alongside the semantic search.

How retrieval works

There are really three ways to find something. All three use the privacy-first architecture described in The Unsolved Search: the server performs math on transformed data and never sees the content in plaintext.

  • Deterministic search — exact matching, used for tags and metadata. Tags are stored encrypted with one-way tokens for lookup. You search for a tag, the system matches the token without ever decrypting the tag name. Fast and precise.
  • Semantic search — meaning-based. Your query and your documents are both mathematically transformed before storage and search. The server finds the closest matches using the transformed values. It never sees the original meaning, only the math. You say “birthday gift ideas” and the system finds notes that mention “present” or “what he wants” without the server knowing what any of those words are.
  • Graph-based search — think of it as a map of how things in your knowledge are connected to each other. It follows actual relationships, not text similarity. Darryl is a person. Darryl is connected to specific notes. One mentions pottery, another a budget entry for gifts from October. The graph traces those connections instead of guessing based on word proximity. The server sees the structure but not the content.

All three matter, and the system uses them together. Think of it as three lenses on the same library. Deterministic search handles the cases where the user’s words are the right words. The semantic layer catches what the exact words miss. The graph keeps both honest. And none of them require the server to see what you stored.
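The deterministic layer is the easiest of the three to sketch. Assuming a keyed one-way hash as the token scheme (an illustrative choice; the actual construction isn't specified here), the server can match tags without ever holding them in plaintext:

```python
import hmac
import hashlib

def tag_token(tag: str, key: bytes) -> str:
    """One-way token for a tag: the server stores and compares these
    values, never the plaintext tag. A keyed hash is an illustrative
    choice, not the documented production scheme."""
    return hmac.new(key, tag.strip().lower().encode(), hashlib.sha256).hexdigest()

key = b"user-held secret"                    # stays on the client in this sketch
server_index = {tag_token("birthday", key)}  # what the server actually stores

# Lookup: the client tokenizes the query tag; the server matches blindly.
assert tag_token("Birthday", key) in server_index
assert tag_token("pottery", key) not in server_index
```

Because the hash is keyed and one-way, a match tells the server only "these two tokens are equal," never what the tag says.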

Without the graph, the system surfaced documents that felt related. With it, the system surfaces documents that are related.

That difference turns out to be most of the difference between a useful answer and a plausible one.

When the question does need the external model, this retrieved context is what gets sent to it, which means the quality of the retrieval directly determines the quality of the generation. But the external model is not supposed to see the real names or identifiers. Personal details are stripped before context leaves the platform and restored internally after the response returns. The external model works with protected context (your information with the personal details removed), and on our side, its output is discarded once the response is delivered to the user.
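A minimal sketch of that strip-and-restore flow, with a hypothetical placeholder scheme standing in for whatever the real de-identification layer does:

```python
def protect(text: str, entities: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Swap personal identifiers for placeholders before context leaves
    the platform; return the map needed to restore them locally afterward.
    The placeholder format is a hypothetical stand-in."""
    restore_map = {}
    for i, (name, kind) in enumerate(entities.items()):
        placeholder = f"<{kind}_{i}>"
        text = text.replace(name, placeholder)
        restore_map[placeholder] = name
    return text, restore_map

def restore(text: str, restore_map: dict[str, str]) -> str:
    """Re-insert the real identifiers on our side, after the response returns."""
    for placeholder, name in restore_map.items():
        text = text.replace(placeholder, name)
    return text

masked, mapping = protect("Darryl's birthday is Oct 14.", {"Darryl": "PERSON"})
assert "Darryl" not in masked
assert restore(masked, mapping) == "Darryl's birthday is Oct 14."
```

The external model only ever sees the masked text; the mapping never leaves the platform.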

Because the graph is built entirely from the user’s own content, they control what goes into it.

How data is handled
  • Nothing is shared, surfaced externally, or used for training without the user’s explicit permission.
  • External providers receive protected context, not originals.
  • Internal logs retain request metadata for debugging but not the content itself.
  • Delete a note and the system is designed to remove the source content, the graph relationships, and the derived embeddings. That is how we think about the right to erasure.

The privacy questions around personal knowledge tools (what happens when you store information about other people, who controls that data, what deletion actually means at every layer of the stack, and how frameworks like GDPR apply specifically) are real and worth a longer conversation. I wrote about how we think about ownership and encryption in Privacy Is Not Protection.

All of which is another form of restraint. The retrieval layer’s job is not to find everything that might be related. It is to find what actually is, and stop there.

I knew better

I knew all of this before I started. A graph is not a language model. A classifier is not a language model. A deterministic match is not a language model. I still took the shortcut of making everything a generative call because it was easier to think about one architecture than five.

The system that actually works is mostly the other stuff: structured retrieval, routing decisions, context gathering, verification, graph traversal. The generative model enters late and leaves early. Most of the pipeline runs without one. And the shortcut I kept reaching for, just let the big model sort it out, turned out to be the long way around. It was slower, more expensive, and less accurate than building the parts that don’t generate anything at all. Many questions that come through the system now never reach the expensive external model, because the routing figured out they didn’t need it.

The work is not “retrieve context.” The work is deciding which kinds of context this question is actually entitled to.


Clarification is a real answer

This was the hardest behavior to actually commit to building.

If someone asks for a local recommendation and the system doesn’t know where they are, the correct answer is not a generic recommendation with a smile on it. It is: tell me your city. If they mention “my mom” and nothing in memory supports that reference, the correct answer is not a softer hallucination. It is a targeted follow-up.

This sounds obvious written down. It was not obvious to build. Recent work on selective prediction frames this as the abstention problem: a model that knows when it doesn’t know is more useful than one that always answers. There is a persistent temptation to guess rather than ask, because guessing feels like intelligence and asking feels like admitting a gap. But guessing is usually how the bluffing starts. The model senses the shape of what the user probably meant, fills in the blank, and you get an answer that sounds smooth right up to the moment you notice it was built on an assumption nobody approved.

Restraint, in this context, does not mean silence for its own sake. It means refusing to pretend the missing piece was optional. And when the question is about personal information and the retrieval layer fails (it cannot reach the data, cannot confirm what is there), the system does not fall back to generating without context. It stops. It would rather say nothing than say something it cannot ground.

Reasoning should be earned

There’s a related problem I kept creating for myself: treating every query as if it naturally culminates in a big reasoning pass.

It doesn’t. “When is Darryl’s birthday?” is not the same task as “What should I get Darryl for his birthday?” The first is a factual lookup. The second is a recommendation problem. “When did we decide to deprecate the v1 API?” is a search. “Should we deprecate the v1 API?” is a judgment call that depends on usage data, migration status, and team capacity. When I sent both through the same heavy generative path, I paid more, waited longer, and created more opportunity for the model to decorate, soften, or drift.

This is where the earlier routing decisions start paying off. If the system has already determined that a question does not need reasoning, it can skip that expensive stage entirely and go straight to formatting the grounded result. That is not a degraded path. For factual questions it is the cleaner path. The answer stays close to the source material because no one asked the model to do interpretive work it didn’t need to do. It also means that when one stage fails (a retrieval source is down, a classifier returns low confidence, an external provider times out) the system can fall back to a simpler path instead of producing a confidently wrong answer. If retrieval finds nothing for a question that needed personal context, the system tells the model the context is missing. The response comes back shaped around that gap instead of pretending the gap does not exist.
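A sketch of that explicit-gap behavior, with hypothetical names throughout:

```python
def build_prompt(question: str, retrieved: list[str], needs_personal: bool) -> str:
    """If a question needed personal context and retrieval came back
    empty, say so explicitly in the prompt rather than letting the model
    improvise around the gap. All names here are illustrative."""
    if needs_personal and not retrieved:
        return (
            f"Question: {question}\n"
            "Context: NONE FOUND. The user's saved knowledge has no match. "
            "Do not guess; ask for the missing detail or say you don't know."
        )
    context = "\n".join(retrieved) if retrieved else "(no personal context needed)"
    return f"Question: {question}\nContext:\n{context}"

prompt = build_prompt("When is Darryl's birthday?", [], needs_personal=True)
assert "NONE FOUND" in prompt
```

The model still writes the response, but it writes it around an explicit gap instead of an invisible one.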

The unnecessary reasoning step is often where honesty degrades. The answer was already there. Then the model got invited to improve it.

After generation

There is a related check I added late. After the model generates a response, the system checks whether its claims trace back to something that was actually retrieved. If a fact does not appear in the context the model was given, it gets flagged. When that happens, the response gets rewritten to remove the unsupported claim, or in serious cases, blocked entirely. In practice, this most often catches the model attributing a preference or a date to a person when the retrieved context mentioned the person but not that specific detail. It is not a perfect filter, but it catches the most common failure mode: the model connecting dots that were never actually connected.
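A toy version of that check. Real grounding verification would lean on an entailment model; keyword overlap against the retrieved context is just the shape of the idea, and the threshold is my assumption:

```python
def unsupported_claims(claims: list[str], context: str,
                       threshold: float = 0.5) -> list[str]:
    """Flag generated claims whose key terms mostly fail to appear in the
    retrieved context. Substring overlap is a toy proxy for entailment."""
    ctx = context.lower()
    flagged = []
    for claim in claims:
        terms = [w.strip(".,!?") for w in claim.lower().split() if len(w) > 3]
        if not terms:
            continue
        hits = sum(t in ctx for t in terms)
        if hits / len(terms) < threshold:
            flagged.append(claim)
    return flagged

context = "Note from March: Darryl mentioned his pottery class meets on Tuesdays."
assert unsupported_claims(["Darryl does pottery"], context) == []
assert unsupported_claims(["Darryl prefers jazz vinyl"], context) \
    == ["Darryl prefers jazz vinyl"]
```

A flagged claim triggers a rewrite without it, or a block when the claim is load-bearing.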

The sentence still matters

None of this means the final response layer doesn’t matter. It does. People do not want to read model mannerisms all day. They do not want “Great question!” before every answer. They do not want “As an AI assistant” in a product they are already visibly using. They do not want five-paragraph-essay energy wrapped around a two-line fact. Someone has already made a t-shirt that just says “You’re Absolutely Right.” It sells.

So there is still a final shaping pass. But even there, the same restraint applies. I added a length guard after watching the humanization layer quietly inflate answers that were already fine. If the answer grows while being shaped, the safer move is to keep the original. Humanization is supposed to remove friction, not add performance.
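The length guard is simple enough to show almost literally, though the growth tolerance here is my assumption, not a measured number:

```python
def shape_with_guard(original: str, humanized: str,
                     max_growth: float = 1.2) -> str:
    """Keep the shaped answer only if it didn't inflate the original.
    The 1.2 tolerance is an illustrative choice."""
    return original if len(humanized) > len(original) * max_growth else humanized

short = "Darryl's birthday is October 14."
inflated = "Great question! I'm happy to report that Darryl's birthday is October 14!"
assert shape_with_guard(short, inflated) == short   # inflation rejected
assert shape_with_guard(short, "Oct 14.") == "Oct 14."  # tightening kept
```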

The writing problem and the routing problem are the same problem seen from different distances. The response should sound like someone who understood the weight of the question. The system should only reach that response by taking a path proportional to that weight.


The invisible part has to become visible

The other thing I’ve learned is that users eventually need to see some part of the route. Not all of it all the time, but enough to understand what kind of answer they just got.

Without that visibility, every answer collapses into “the AI said so.” That is bad for trust and even worse for debugging. I spent weeks chasing bad answers that turned out to be retrieval problems, not model problems, and I only figured that out after I started surfacing the route.

Every response now carries a trace of which stages ran, what each one decided, and how long it took. The user can open a panel during any query to see that trace in real time. The trace lives in memory for the conversation and is not persisted afterward.
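A minimal sketch of that per-stage trace, with hypothetical names and an in-memory record that is simply dropped when the conversation ends:

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageTrace:
    stage: str       # which stage ran
    decision: str    # what it decided
    ms: float        # how long it took

@dataclass
class RouteTrace:
    """Illustrative in-memory trace; nothing here is persisted."""
    stages: list[StageTrace] = field(default_factory=list)

    def record(self, stage, fn, *args):
        start = time.perf_counter()
        decision = fn(*args)
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.stages.append(StageTrace(stage, str(decision), elapsed_ms))
        return decision

trace = RouteTrace()
kind = trace.record("classify", lambda m: "greeting" if m == "hi" else "question", "hi")
assert kind == "greeting"
assert trace.stages[0].stage == "classify"
```

Each stage wraps its work in `record`, so the panel the user opens is just a rendering of this list.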

Once the route becomes legible, bad answers get easier to understand and good answers get easier to trust. A wrong answer is no longer mystical. It is a classification miss, a context gap, a reasoning error. The AI cannot verify its own work. The human is the one who catches it. The visibility exists so that when they do, the feedback has somewhere specific to land instead of just “the AI was off.”


The intelligence is in the restraint

Honesty, clarification, right-sized model choice, response shaping, traceability: these sound like separate concerns until you build them. Then they collapse into one underlying principle: the system should do the minimum amount of work necessary to be genuinely useful, and nothing more.

A greeting should stay a greeting. A factual answer should stay close to the fact. A missing-data case should become a clarification. A classification task should use a classifier. An extraction task should use an extraction step. A generative model should only enter when the job is actually generative. And when the question does deserve full reasoning across memory, documents, history, and live context, then yes, spend the tokens. But spend them on the right thing.

I did not start here. I started where it was easiest to start: one model, one prompt, hope for the best. Everything described in this post is what came after.

The hardest part of building an AI assistant is not making it talk.

It’s building a system that knows when to classify, when to extract, when to look something up, when to ask one more question, when to reason, when to trace its own work, and when the most intelligent thing it can do is stop.

The hardest part, still, is making it shut up.