
Found Capabilities: When Users Ship Features Your Team Never Roadmapped

10 min read
Tian Pan
Software Engineer

A customer emails support to ask why your CRM agent stopped drafting their NDAs. You did not know your CRM agent drafted NDAs. A power user complains that your support bot's Tagalog translations have gotten worse since last week. You did not know your support bot did Tagalog. A forum thread spreads a prompt that turns your code-review assistant into a passable security scanner, and within a quarter you are getting CVE reports filed against findings the assistant produced. Each of these is a feature with adoption, business impact, and zero institutional ownership — no eval, no SLA, no surface in the UX, no roadmap entry, and a quiet bus factor of one: the customer who figured it out.

This is what happens once your product is wrapped around a model whose capability surface is wider than the surface you scoped. Users explore the wider surface, find behaviors that solve their problems, build workflows on top of those behaviors, and then experience your next model upgrade as a regression even though nothing on your roadmap moved. The contract between you and your users is no longer the one you wrote down. It includes everything the model happened to do for them that you happened not to break.

Treating this as an engineering surprise — "we will harden the prompt, we will add a guardrail, we will catch it next time" — is a category error. Found capabilities are a product-management problem. The discipline is not preventing them; it is detecting them, deciding what to do with them, and remembering that you decided.

The Anatomy of a Found Capability

A found capability has three properties that classic features do not. First, it has users before it has owners — adoption precedes any team's awareness that the behavior exists. Second, its boundaries are defined by the underlying model rather than by your code: the feature is "whatever the model happens to do well on this kind of input today," which is a moving target. Third, its existence is invisible to your eval suite: you wrote evals against the capabilities you decided to ship, so the model could lose this one entirely in an upgrade and your CI would still report green.

The examples accumulate quickly. ChatGPT users routinely treat the assistant as a lawyer; lawsuits are now in flight over whether OpenAI is liable when the legal advice is wrong. Customer-service chatbots have been instructed by users to "agree to all requests" and then quoted as authoritative for absurd commitments. Code assistants get pressed into duty as security scanners, refactoring planners, dependency auditors, and documentation generators. None of those were on a roadmap. All of them have users.

The reason this rhymes with shadow IT is that the failure mode is the same: the demand exists, the official offering does not cover it, and users route around the gap with whatever tool is closest to hand. The difference is that with shadow IT the tool comes from outside your perimeter; with found capabilities, the tool is your product. You do not get to disclaim it.

Telemetry That Sees Intent, Not Just Tokens

Most production AI systems log the wrong thing for this problem. They log latency, token counts, prompt and completion text, refusal rates, and tool-call traces. Those are the right primitives for debugging an individual request. They are the wrong altitude for noticing that 8% of last week's traffic is now your CRM agent being asked to draft contracts.

The signal you want is intent drift: a change in the distribution of what users are actually asking the system to do. Intent drift is invisible at the request level and obvious at the cohort level. Surfacing it means treating each request as having a latent intent label, clustering those labels over time, and watching for clusters that grow without the team having shipped anything that should make them grow.

Practically, this looks like a few moving parts working together: a lightweight intent classifier running over sampled traffic (often a smaller, cheaper model is enough), a stable taxonomy that distinguishes the intents you scoped from the catch-all "other" bucket, and a dashboard whose job is to make the "other" bucket impossible to ignore. When "other" climbs from 3% to 15% over a quarter, that is your signal. The team that does not have that dashboard learns about its found capabilities from churn interviews.
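A minimal sketch of that loop, assuming a stand-in classify_intent() backed by a small model or keyword rules; the taxonomy, the example labels, and the function names are illustrative, not any particular tool's API:

```python
from collections import Counter
from dataclasses import dataclass

# The intents the team actually scoped; everything else lands in the "other" bucket.
SCOPED_INTENTS = {"draft_email", "summarize_account", "log_activity"}

@dataclass
class SampledRequest:
    request_id: str
    user_text: str

def classify_intent(text: str) -> str:
    """Stand-in for the cheap classifier (a small model, or keyword rules to start).
    It should return a label from a stable taxonomy, scoped or not."""
    lowered = text.lower()
    if "nda" in lowered or "contract" in lowered:
        return "draft_contract"   # not in SCOPED_INTENTS, so it counts toward "other"
    return "draft_email"

def other_bucket(sample: list[SampledRequest]) -> tuple[float, Counter]:
    """Share of sampled traffic whose intent falls outside the scoped taxonomy,
    plus a breakdown so the dashboard can show which unscoped intent is growing."""
    labels = [classify_intent(r.user_text) for r in sample]
    other = Counter(label for label in labels if label not in SCOPED_INTENTS)
    share = sum(other.values()) / max(len(labels), 1)
    return share, other
```

Run it weekly over a fixed-size sample, chart the share, and alert when it crosses whatever threshold would have caught the 3%-to-15% climb early enough to matter.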

A second layer is helpful: anomaly detection on input phrasing and output structure. If the model starts producing JSON when it used to produce prose, or starts answering in a language you never tested, the structural shift shows up before the intent shift does. Standard AI observability tools can detect these statistically; the discipline is wiring those alerts to a human who is allowed to ask "should we ship this on purpose now?"
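The same comparison logic works for structure. A minimal sketch, assuming JSON-shaped output is the fingerprint you care about; real observability tools track more fingerprints (language, length, markdown density), but the window-over-window check is the same shape:

```python
import json

def looks_like_json(output: str) -> bool:
    """Cheap structural fingerprint: does the completion parse as JSON?"""
    try:
        json.loads(output)
        return True
    except (ValueError, TypeError):
        return False

def json_share(outputs: list[str]) -> float:
    return sum(looks_like_json(o) for o in outputs) / max(len(outputs), 1)

def structural_shift(baseline: list[str], current: list[str], delta: float = 0.10) -> bool:
    """Alert when the share of JSON-shaped outputs moves by more than `delta`
    between the baseline window and the current one."""
    return abs(json_share(current) - json_share(baseline)) > delta
```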

The Triage Decision: Promote, Deprecate, or Tolerate

Once a found capability is visible, you have three choices, and pretending you have only two is the trap most teams fall into. You can promote it — bring it under the eval suite, name it in the UX, give it an owner, treat it as a contract. You can deprecate it — refuse the intent at the prompt or guardrail layer, communicate the change to affected users, and accept the churn. Or you can tolerate it — explicitly leave it unsupported but not blocked, while accepting that the next model upgrade may remove it.

Tolerate is a real and often correct option. Promoting every found capability turns your roadmap into a backlog of other people's accidents. Deprecating every found capability gives competitors free permission to take the workflow over. The mistake is making the choice implicitly. A capability that is "tolerated" without anyone deciding to tolerate it is identical to one that is unsupported by accident — which means the next person who asks "do we support this?" gets a different answer depending on who they ask.

A useful decision tree, in roughly the order to apply it (a code sketch follows the list):

  • Brand fit. If a found capability is a customer asking your support bot to play therapist, you may want to refuse on principle even when the eval scores look fine. Brand exposure is not negotiable downstream.
  • Eval coverage feasibility. Can you build a held-out set that captures this capability well enough to detect regression? If the answer is "we have no labeled examples and no obvious source of them," promotion is premature.
  • Rollback story. If the next model upgrade silently degrades this behavior, what is your detection lag and what is your fix? "We will pin to the old model" is a real answer, but only if you have done it before.
  • Volume threshold. A found capability used by twelve customers per month is a tolerate-by-default decision. The same capability used by 12% of MAU is a promote-or-deprecate decision; "tolerate" at that scale means an outage when the model changes.
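Compressed into a single pass, the triage can be sketched roughly as below. The thresholds are illustrative, and the one case the sketch has to force is a high-volume capability with no feasible eval set or rollback story: it lands on an explicit deprecate rather than a silent tolerate.

```python
from enum import Enum

class Verdict(Enum):
    PROMOTE = "promote"
    DEPRECATE = "deprecate"
    TOLERATE = "tolerate"

LOW_VOLUME_SHARE = 0.01   # illustrative: a dozen customers a month, not 12% of MAU

def triage(brand_safe: bool, eval_set_feasible: bool,
           rollback_story: bool, mau_share: float) -> Verdict:
    if not brand_safe:
        return Verdict.DEPRECATE    # refuse on principle, even when eval scores look fine
    if mau_share < LOW_VOLUME_SHARE:
        return Verdict.TOLERATE     # tolerate by default, but record it and revisit on a cadence
    # At real volume, "tolerate" is an outage waiting for the next model change,
    # so the remaining honest options are promote or deprecate:
    if eval_set_feasible and rollback_story:
        return Verdict.PROMOTE
    return Verdict.DEPRECATE        # cannot detect or undo a regression yet, so refuse explicitly
```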

The output of the decision tree should not just be a verdict. It should be an entry in a found-capabilities registry — a list of capabilities the team is aware of, has classified, has assigned an owner to, and revisits on a cadence. The registry is distinct from the roadmap because the roadmap describes capabilities you committed to ship; the registry describes capabilities the model committed to ship for you.
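One plausible shape for a registry entry, with the caveat that the field names are illustrative rather than prescriptive:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class FoundCapability:
    """One row in the found-capabilities registry: something the model does for users
    that the roadmap never promised, plus what the team decided about it."""
    name: str                        # e.g. "NDA drafting via the CRM agent"
    discovered_on: date
    discovered_via: str              # "intent-drift dashboard", "support escalation", ...
    verdict: str                     # "promote" | "deprecate" | "tolerate"
    owner: str                       # a named person, even for tolerated capabilities
    weekly_volume: int               # tracked so a silent drop is itself a signal
    example_prompts: list[str] = field(default_factory=list)   # seeds the shadow-eval set
    next_review: date | None = None  # "tolerate" is a decision with an expiry, not a shrug
```

The example_prompts field is the bridge to the next section: it is where the shadow-eval prompts come from.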

The Silent-Upgrade Failure Mode

The reason all of this matters as urgently as it does is that the most common way teams discover their found capabilities is by losing them. Inference platforms quietly update model weights, swap quantizations, change reasoning budgets, and adjust default sampling parameters. None of these touch your endpoint name. None of them appear in your changelog. But each can remove a capability that the model used to exhibit and that some cohort of your users had built a workflow around.

Postmortems from the past year are full of this pattern: a coding assistant whose CI failure rate doubled across three days because a backend model swap shipped without notice; a writing tool whose user base reported a noticeable quality drop after a system prompt was tightened to a 100-word ceiling for latency reasons; reasoning regressions that appeared as scattered Reddit complaints weeks before the vendor confirmed a default change. Each of these affected found capabilities first because found capabilities are exactly the ones with no eval coverage to catch them.

The mitigation is unglamorous. Pin model versions where pinning is supported. Run shadow evals against a held-out set of user-discovered prompts harvested from your registry, not just your scoped prompts. Treat any reduction in the volume of a registered capability — even a tolerated one — as a signal worth investigating. The team that does this finds out about regressions in days rather than from churn interviews in quarters.
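A sketch of that loop, assuming the registry entries above and a run_model() / score() pair you supply yourself; neither exists in any particular SDK, and the thresholds are placeholders:

```python
from statistics import mean

def run_model(prompt: str, model: str) -> str:
    """Your actual inference call; `model` is whatever version pin your provider supports."""
    raise NotImplementedError

def score(prompt: str, output: str) -> float:
    """Your scorer: exact match, a rubric graded by a judge model, or a human spot check."""
    raise NotImplementedError

def shadow_eval(example_prompts: list[str], pinned: str, candidate: str,
                max_drop: float = 0.05) -> bool:
    """Run the registry's user-discovered prompts against the pinned model and a candidate
    upgrade; block the upgrade if the mean score drops by more than `max_drop`."""
    pinned_scores = [score(p, run_model(p, pinned)) for p in example_prompts]
    candidate_scores = [score(p, run_model(p, candidate)) for p in example_prompts]
    return mean(candidate_scores) >= mean(pinned_scores) - max_drop

def volume_drop(weekly_volumes: list[int], ratio: float = 0.5) -> bool:
    """Even for tolerated capabilities: a week-over-week halving of volume deserves a human look."""
    if len(weekly_volumes) < 2 or weekly_volumes[-2] == 0:
        return False
    return weekly_volumes[-1] < weekly_volumes[-2] * ratio
```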

The Org Realization: This Is PM Work, Not Engineering Work

The temptation when found capabilities surface is to push them to engineering: "harden the prompt to refuse it" or "add an eval and fix it." Neither of those is the first move. The first move is a product decision: do we want this to be part of the contract with our users, and if so, on what terms?

That decision needs an owner with the authority to commit roadmap capacity, the visibility to weigh brand and legal risk, and the judgment to call when "tolerate" is the right answer. That is product management. Engineering implements whichever of the three choices product makes — but engineering is the wrong layer to make the choice itself, because engineering's incentives push toward "remove the variance" (deprecate) when the right answer might be "we just got handed a feature for free" (promote).

The organizations that handle this well are the ones whose PMs read the intent-drift dashboard the same way they read the funnel. Found capabilities show up in PRDs as a recurring section: what new behaviors are users exercising, what is our position on each, what is changing in the registry this cycle. The PMs who do not do this discover their found capabilities at customer advisory boards, in churn interviews, or — worst — in lawsuits.

The Contract You Never Wrote

The deeper architectural realization is that the contract between you and your users is not the one in your terms of service or your feature list. It is the union of every behavior the model produced reliably enough that some user built a workflow on top of it. Your users do not know the difference between a capability you scoped and a capability the model happened to support. From their side of the wire, those are the same product.

You can pretend otherwise. Many teams do. The cost is that every model upgrade becomes a Russian-roulette spin against capabilities you did not know you owned, and every customer escalation is a chance to discover a workflow that your roadmap never acknowledged but your retention curve has been quietly dependent on.

The teams that get this right are not the ones with the cleanest prompts or the strictest guardrails. They are the ones that built the feedback loop — telemetry that sees intent, triage that produces decisions, a registry that remembers them, and a PM function that owns the result. The model will keep handing your users behaviors you did not ship. Your job is to decide which of those become contracts, which get refused, and which you tolerate with eyes open. The architectural realization that follows is small and uncomfortable: you are no longer the only party shipping features into your product.
