The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For
A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.
Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.
This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.
The discovery moment is the trap
Phantom skills get discovered the same way every time. A user finds the behavior, the team is delighted, leadership labels it "emergent," and the workflow gets baked into customer expectations within weeks. The trap isn't the discovery — it's the moment after, when the team treats user-demonstrated capability as if it were team-engineered capability.
There's a category error embedded in the word "emergent." When researchers use it, they mean a capability that appears as model scale crosses a threshold — a property of the model itself. When product teams use it about their own agent, they almost always mean a capability that emerged in users' hands, not in the lab. The agent didn't suddenly get smarter; the user found a use case the team didn't imagine.
That distinction matters because the two have completely different reliability profiles. A capability that the team designed for has been tested across input variants, adversarial cases, edge conditions, and probably has regression evals protecting it. A phantom skill has none of that. It has exactly one data point: the screenshot that surfaced it.
When a user builds their workflow on top of that one data point, you've now committed to maintaining a behavior whose support surface you've never characterized. The next time the underlying model updates, the prompt gets a routine edit, or the tool inventory changes, you might break the phantom skill — and you won't know, because there's no test telling you it existed.
Why phantom skills accumulate fast in agent products
Phantom skills are not equally distributed across software. They pile up specifically around language models because of three properties that don't apply to deterministic systems.
The first is open-ended input. Conventional software has a bounded input grammar — a form has fields, an API has parameters, a button does one thing. An agent accepts free-form natural language, and any natural-language utterance is a potential probe of capability. There is no edge of the input space. You can't enumerate all the ways users will ask, so you can't enumerate all the capabilities they'll discover.
The second is composition. Even if every individual tool in your agent's inventory is well-tested, the combinatorial space of tool sequences is not. A user who chains a calendar lookup → a translation tool → a draft-email tool in an order your evals never covered has just exercised a capability your team didn't design. The model decides the sequence; the user decides the goal; nobody decided the joint behavior.
The third is the model's own willingness. LLMs are eager. They will attempt almost any task plausibly within their context, with confidence calibrated to nothing in particular. This eagerness is what makes them useful, and it is what makes phantom skills accumulate so fast: the agent will try, and most of the time it will produce output that looks correct. Users mistake "looks correct" for "is supported," and the team finds out later.
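The composition point is easy to quantify. Below is a minimal sketch — the tool names and the covered set are hypothetical — of how fast the ordered tool-chain space outgrows any realistic eval inventory:

```python
# Illustrative only: tool names and the covered eval set are hypothetical.
from itertools import permutations

tools = ["calendar_lookup", "translate", "draft_email", "search_contacts", "timezone_convert"]

# Every ordered chain of 2-4 tools the model could plausibly emit.
possible_chains = [
    chain
    for length in range(2, 5)
    for chain in permutations(tools, length)
]

# Sequences your eval suite actually exercises (typically a handful).
covered_chains = {
    ("calendar_lookup", "draft_email"),
    ("search_contacts", "calendar_lookup", "draft_email"),
}

uncovered = [c for c in possible_chains if c not in covered_chains]
print(f"{len(possible_chains)} possible chains, {len(covered_chains)} covered, "
      f"{len(uncovered)} joint behaviors nobody designed or tested")
```

With just five tools and chains of length two to four, that's 200 possible sequences against a covered set of two. Real agents have more tools and longer chains.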
Taken together, these three properties make phantom skill accumulation a structural feature of agent products, not a sign of sloppy engineering. Any team shipping a sufficiently general agent will accumulate them. The question is whether the team has a discipline for finding them.
The model upgrade is when phantom skills die
Phantom skills don't usually break from your changes. They break from the provider's changes. When a foundation model gets an upgrade — a new minor version, a quiet refresh of the underlying weights, a provider-side change to safety filtering — the behaviors you've measured in evals tend to survive, because that's what your evals are for. The behaviors you haven't measured don't have that protection.
Concrete instance: a team migrating from one model version to a newer one of the same family found their prompt-injection resistance had dropped from 94% to 71% on the same eval harness. That regression was visible because they had an eval. The same upgrade likely shifted dozens of other behaviors that no eval covered. Some of those shifts improved things. Some of them broke phantom skills the team didn't know existed.
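That kind of regression is only visible if the comparison across model versions is automated. A minimal sketch of a migration gate follows; `run_eval` is a stand-in for whatever harness you already use, and the behavior names and five-point threshold are placeholders, not recommendations:

```python
# Sketch of a migration gate: compare per-behavior pass rates between the
# current and candidate model. `run_eval` is a stub for your own harness.

def run_eval(model: str, behavior: str) -> float:
    """Return the pass rate (0-100) for one behavior on one model."""
    raise NotImplementedError("wire this to your eval harness")

BEHAVIORS = ["prompt_injection_resistance", "tool_call_accuracy", "refusal_correctness"]
REGRESSION_THRESHOLD = 5.0  # percentage points; pick your own tolerance

def migration_report(current_model: str, candidate_model: str) -> list[str]:
    """List behaviors whose pass rate dropped by more than the threshold."""
    regressions = []
    for behavior in BEHAVIORS:
        before = run_eval(current_model, behavior)
        after = run_eval(candidate_model, behavior)
        if before - after > REGRESSION_THRESHOLD:
            regressions.append(f"{behavior}: {before:.0f}% -> {after:.0f}%")
    return regressions
```

A gate like this only protects the behaviors it knows about — which is exactly the point: phantom skills never make it onto the `BEHAVIORS` list.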
The pattern in incident reports is consistent. A user reports that a workflow they've used for months "just stopped working." The team investigates. The prompt is unchanged. The tool definitions are unchanged. The application code is unchanged. The only thing that changed is the model. And nobody can say for sure when the capability worked or stopped working, because nobody was watching for it.
This is the organizational failure mode in a sentence: phantom skills enter your product surface as customer expectations, and they exit your product silently when the model that produced them updates. Users notice. Support tickets follow. The team has no engineering response because the capability was never theirs to maintain.
Discovery one: production-trace mining
The first discipline is a structured one: regularly mine production traces for capabilities that your eval suite does not cover. This is not the same as standard observability. Standard observability is set up to flag failures — error rates, latency spikes, tool call timeouts. Phantom skill discovery requires looking at successes and asking which ones are unexplained.
A workable version of this loop has three steps. First, sample successful agent traces — completed user sessions with positive implicit signals (user didn't retry, didn't abandon, returned the next day). Second, characterize the input patterns: language, domain, tool sequence, output type. Third, cross-reference against your eval inventory. The set of input patterns that appear in production but don't appear in any eval is your phantom skill surface.
The instinct here is to immediately add evals for everything you find. Resist that. Most of what you find will be one-off probes — users testing the limits, never to repeat. The signal is recurring uncovered patterns: capabilities that appear in dozens of sessions, used by multiple distinct users. Those are the workflows that will burn you when they break.
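A minimal sketch of the loop, with the recurrence filter built in. The trace fields, the `signature` function, and the thresholds are assumptions about what a logging pipeline might record, not any particular product's schema:

```python
# Sketch of the trace-mining loop: successes only, coarse input signatures,
# cross-referenced against eval coverage, filtered for recurrence.
from collections import Counter

def signature(trace: dict) -> tuple:
    """Reduce a successful trace to a coarse capability signature."""
    return (
        trace.get("language"),                   # e.g. "en", "ja", "en+ja"
        trace.get("domain"),                     # e.g. "scheduling"
        tuple(trace.get("tool_sequence", [])),   # ordered tool chain
        trace.get("output_type"),                # e.g. "bilingual_slots"
    )

def phantom_skill_surface(successful_traces: list[dict],
                          eval_signatures: set[tuple],
                          min_sessions: int = 20,
                          min_users: int = 5) -> list[tuple]:
    """Recurring capability signatures seen in production but covered by no eval."""
    counts: Counter = Counter()
    users_by_sig: dict[tuple, set] = {}
    for trace in successful_traces:              # step 1: successes only
        sig = signature(trace)                   # step 2: characterize the input pattern
        counts[sig] += 1
        users_by_sig.setdefault(sig, set()).add(trace.get("user_id"))
    return [                                     # step 3: cross-reference against evals
        sig for sig, n in counts.items()
        if sig not in eval_signatures
        and n >= min_sessions
        and len(users_by_sig[sig]) >= min_users
    ]
```

The `min_sessions` and `min_users` thresholds encode the point above: one-off probes are noise, and only signatures that recur across distinct users are worth turning into evals.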
A useful framing: production traces are the only complete behavioral spec your agent has. Your prompt is a hint. Your evals are a check on a small slice. The traces are ground truth about what your agent actually does in the world. Mining them isn't a nice-to-have observability feature — it's the only way to discover what you're actually shipping.
Discovery two: anti-eval audits
The second discipline reverses the usual eval question. Standard eval review asks: "Are our evals covering the behaviors we want?" The anti-eval audit asks: "What is this agent doing in production that no test verifies?"
The mechanic is uncomfortable on purpose. Pull a list of features the marketing site claims the product supports. Pull a list of capabilities the system prompt instructs the agent to perform. Pull a list of customer use cases from sales calls and support tickets. Now compare those lists to your eval coverage. The behaviors that appear in any of the input lists but not in evals are anti-evals — assertions of capability without verification.
Most teams discover their anti-eval surface is larger than their eval surface. That's not a failure of the eval team; it's a structural feature of agent products. Marketing teams describe capabilities, customers report capabilities, the system prompt aspires to capabilities — and the eval inventory is always smaller than the union of those claims.
The audit's value is that it makes the gap concrete. Every entry in the anti-eval list is a decision: write the eval and bring the capability into the supported set, or delete the claim and tell users this isn't a supported workflow. Letting it stay in the limbo of "we say we support this but we don't test it" is the actual failure mode.
A practical cadence: run the audit quarterly, after each major prompt revision, and after every model migration. The output is a simple two-column ledger: claimed capability, eval coverage status. Anything in the "claimed but not covered" column is either getting an eval or getting deprecated.
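The audit itself is mostly set arithmetic. A sketch, with every claim list and the eval inventory invented for illustration — in practice these come from your marketing site, system prompt, CRM notes, and eval repo:

```python
# Sketch of the anti-eval audit: union the capability claims, subtract eval
# coverage, and print the two-column ledger. All entries are illustrative.

marketing_claims = {"multilingual_scheduling", "timezone_negotiation", "email_drafting"}
system_prompt_claims = {"timezone_negotiation", "email_drafting", "meeting_summaries"}
customer_use_cases = {"multilingual_scheduling", "expense_report_triage"}

eval_coverage = {"timezone_negotiation", "email_drafting"}

claimed = marketing_claims | system_prompt_claims | customer_use_cases
anti_evals = sorted(claimed - eval_coverage)  # claimed but never verified

# The two-column ledger: claimed capability, coverage status.
for capability in sorted(claimed):
    status = "covered" if capability in eval_coverage else "CLAIMED BUT NOT COVERED"
    print(f"{capability:30} {status}")
```

Every entry that prints in the uncovered column is a pending decision: write the eval or retract the claim.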
Discovery three: capability-scope contracts
The third discipline is preventative: write down what the agent is supposed to do, and explicitly mark everything else as out of scope. This sounds obvious. It is rare in practice.
Most agent products document capabilities the way feature lists work — bullet points of what the agent can do. That format is incomplete. A capability-scope contract is two-sided: the supported list, and the explicit not-supported list, with a maintained "incidentally functional" middle category for behaviors that work but aren't guaranteed.
The middle category is the real innovation. It captures phantom skills without committing to them. When the agent does Japanese-business-etiquette reasoning in mixed-language sessions, the contract says: "this works in current production but is not tested, not guaranteed across model updates, and not part of the support surface." Customers who build on it know they're building on sand. The team knows what they don't owe.
This isn't legalistic hedging. It's the explicit acknowledgment that an agent has three layers of behavior: tested-and-supported, tested-and-known-to-fail, and untested-but-functional. Treating the third layer as if it doesn't exist is what produces the surprise outage when a phantom skill breaks. Naming it gives the team a category for it and gives users an honest signal about what to depend on.
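One lightweight way to make the three layers explicit is to keep the contract as data the team can diff and review. The capability names below are hypothetical; the three scope values mirror the layers just described:

```python
# A scope contract as reviewable data. Capability names are hypothetical.
CAPABILITY_CONTRACT = {
    "two_party_scheduling_en":       "supported",      # tested, regression evals exist
    "timezone_conflict_resolution":  "supported",
    "three_way_scheduling_en_ja":    "incidental",     # the phantom skill from the screenshot
    "business_etiquette_reasoning":  "incidental",     # works today, not guaranteed across updates
    "legal_contract_review":         "not_supported",  # explicitly out of scope
}
```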
The contract becomes load-bearing during model migrations. Before the migration, you run evals on the supported set — that's regression protection. You sample-test the incidentally-functional set — that's drift detection. You don't worry about the not-supported set. The migration playbook becomes legible because the scope contract told you which behaviors deserve which level of protection.
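Written down that way, the contract can drive the migration playbook directly. A sketch, where `run_full_eval_suite` and `sample_test` are placeholders for whatever tooling the team already has:

```python
# Sketch of a contract-driven migration playbook: full regression evals for
# supported capabilities, light drift sampling for incidental ones, nothing
# for out-of-scope ones. Both helpers are placeholders.

def run_full_eval_suite(capability: str, model: str) -> bool:
    """Run every eval tagged with this capability against the candidate model."""
    raise NotImplementedError

def sample_test(capability: str, model: str, n: int = 20) -> bool:
    """Replay n recent production traces for this capability and spot-check outputs."""
    raise NotImplementedError

def migration_checks(contract: dict[str, str], candidate_model: str) -> dict[str, str]:
    results = {}
    for capability, scope in contract.items():
        if scope == "supported":
            ok = run_full_eval_suite(capability, candidate_model)
            results[capability] = "pass" if ok else "REGRESSION: block the migration"
        elif scope == "incidental":
            ok = sample_test(capability, candidate_model)
            results[capability] = "stable" if ok else "drift: flag it, update the contract"
        else:  # not_supported: no protection owed
            results[capability] = "skipped"
    return results
```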
The org-level shift
Phantom skills are, fundamentally, an honesty problem. The team doesn't want to admit that some of the agent's behavior is unintentional. Leadership doesn't want to admit that the demo-worthy capability is unverified. Customers don't want to hear that the workflow they're building on isn't supported. So everyone agrees to call it "emergent" and move on.
The shift the discipline asks for is small but uncomfortable: treat user-demonstrated capability as a hypothesis, not a feature. The hypothesis becomes a feature when an eval covers it. Until then, it's a behavior that happens to work. That single rephrasing changes the conversation. PMs stop adding to the marketing copy until eval coverage exists. Engineers stop being surprised when a model update breaks something they didn't know they had. Customers stop being told "yes our agent does that" when the honest answer is "it has so far."
The teams that handle this well don't have fewer phantom skills. They have a faster pipeline from phantom skill → measured capability → supported feature, and they have an explicit policy for the behaviors stuck in the middle. The teams that don't handle it well discover their phantom skill inventory the hard way: a quiet model upgrade, a wave of "your agent is broken" tickets, and an engineering team trying to reconstruct what the agent used to do from screenshots in customer support threads.
The next time someone on your team uses the word "emergent" about your own product, treat it as a flag. What did the user actually do? Where in your eval coverage does that workflow live? If the answer is nowhere, you've just spotted a phantom skill — and the clock is already running on whether you'll measure it or lose it.
Sources
- https://www.promptfoo.dev/blog/model-upgrades-break-agent-safety/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://blog.langchain.com/agent-evaluation-readiness-checklist/
- https://arxiv.org/html/2512.16921
- https://venturebeat.com/infrastructure/context-decay-orchestration-drift-and-the-rise-of-silent-failures-in-ai-systems
- https://www.microsoft.com/en-us/research/publication/diagnosing-capability-gaps-in-fine-tuning-data/
- https://www.truefoundry.com/blog/llm-benchmarking-enterprise-production
- https://galileo.ai/blog/state-of-ai-evaluation
- https://noma.security/resources/shadow-ai-agents-enterprise-risk/
- https://www.zenml.io/blog/llmops-in-production-457-case-studies-of-what-actually-works
