Dogfooding Is Not an Eval Strategy
Every team building an AI product reaches the same comfortable conclusion: "We use it every day, and it works great." That sentence feels like evidence. It is not. It is the single most misleading signal in the room, and it gets stronger — more convincing, more wrong — the better your team is.
Dogfooding tells you the product runs. It does not tell you the product works. Those are different claims, and the gap between them is exactly where your launch goes sideways. The people who built the system are, statistically, the worst possible sample of the people who will use it. They share its mental model, they know its soft spots, and they have spent months training themselves to phrase requests the way the model likes. That is not a test population. That is a control group for a study you never ran.
This matters more for AI products than for traditional software because the input surface is open-ended. A button either gets clicked or it doesn't. A prompt can be phrased ten thousand ways, and your team only ever discovers the dozen phrasings that succeed. Dogfooding a SaaS dashboard misses some onboarding friction. Dogfooding an LLM product misses most of the actual input distribution.
The Expert-User Blind Spot
The curse of knowledge is a one-way gate. Once you know how a system expects to be addressed, you cannot un-know it, and your brain quietly assumes everyone else knows it too. For an engineer dogfooding their own agent, this means every prompt you type is pre-corrected. You don't ask the ambiguous question because you already know it confuses the router. You don't paste the messy 4,000-token document because you know the context window handling is shaky. You front-load the constraints the model needs because you wrote the system prompt that needs them.
None of this is conscious. That is what makes it dangerous. You are not gaming the eval — you genuinely believe you are using the product normally. But "normally" for you is a narrow, friendly corridor through the input space, worn smooth by months of practice. Your users arrive with no map and walk straight into the walls.
The effect compounds with seniority. The more expert your team, the more fluent their prompting, the more their daily usage drifts toward the happy path — and the more confident they become precisely because nothing breaks. False confidence scales with team skill. A weaker team would at least stumble into the failure modes by accident. A strong team routes around them so smoothly they forget the failure modes exist.
There is a second layer to this. Your team also knows which features are fragile, and they unconsciously avoid them. The half-built export flow, the tool call that times out under load, the retrieval path that returns garbage for short queries — the team has an internal heat map of "don't go there." Users have no such map. They go everywhere. Dogfooding systematically under-samples the exact regions of the product most likely to fail.
The Failure Classes Dogfooding Structurally Cannot Reach
Some bugs are not just unlikely to be found by your team — they are impossible for your team to find, because finding them requires not being your team. These deserve naming, because "test more" does not address them.
First-run confusion. Veteran internal users essentially never experience the first run. They onboarded once, months ago, and have never seen the empty state with fresh eyes since. Yet first-run is where users decide whether to stay. Industry data on AI products is brutal here: a large share of users abandon AI tools within the first week, and the most-cited reason is not model accuracy — it is confusion about what to do. The blank prompt box is a wall for anyone who isn't already an expert. Your team has never once seen that wall.
Malformed and naive input. Your team writes well-formed prompts because they know what well-formed looks like for this system. Real users paste half a spreadsheet, write one-word queries, switch languages mid-sentence, include the email signature, and ask the agent to "fix it" with no antecedent for "it." This is not edge-case input. For a large fraction of your user base, this is the input distribution. Your eval set, if it was seeded from internal usage, contains almost none of it.
Adversarial probing. Users will do things your team would never think to do, sometimes out of curiosity and sometimes out of malice. They will try to jailbreak the system prompt, paste prompt-injection payloads from a webpage, ask the agent to do something obviously outside its remit, and screenshot whatever weird thing happens. Your team shares the builder's mental model of "what the product is for," which is exactly the mental model an adversarial user does not have and does not care about.
Domain gaps. Your team has the domain knowledge the product assumes. If you built a legal-research agent, your team probably skews toward people who already understand legal research, or who have absorbed enough to dogfood plausibly. Your users include the paralegal on day three and the founder who has never read a contract. The product's implicit assumption of background knowledge is invisible to the people who have that knowledge.
- https://dev.to/polluterofminds/dogfooding-your-own-product-isn-t-enough-2gb9
- https://userpilot.com/blog/product-dogfooding/
- https://thevaluable.dev/expert-blind-spot-software-development/
- https://en.wikipedia.org/wiki/Curse_of_knowledge
- https://www.nngroup.com/articles/new-AI-users-onboarding/
- https://medium.com/procreator-design/why-do-most-ai-products-fail-at-onboarding-and-how-can-ux-fix-it-98b4669f1c78
- https://www.applied-ai.com/briefings/llm-evaluation-gap/
- https://earezki.com/ai-news/2026-03-21-llm-evals-on-real-traffic-not-just-test-suites/
