Capability Elicitation: Getting Models to Use What They Already Know
Most teams debugging a bad LLM output reach for the same fix: rewrite the prompt. Add more instructions. Clarify the format. Maybe throw in a few examples. This is prompt engineering in its most familiar form — making instructions clearer so the model understands what you want.
But there's a different failure mode that better instructions can't fix. Sometimes the model has the knowledge and can perform the reasoning, but your prompt doesn't activate it. The model isn't confused about your instructions — it's failing to retrieve and apply capabilities it demonstrably possesses.
This is the domain of capability elicitation. Understanding the difference between "the model can't do this" and "my prompt doesn't trigger it" will change how you debug every AI system you build.
The Gap Between What Models Know and What They Show
Consider a simple experiment. Ask a frontier model to solve a novel combinatorics problem with a straightforward prompt. It fails. Now add "Let's work through this step by step" — same model, same weights, same training data — and it produces a correct solution. Nothing changed except the activation pattern your prompt triggered.
This isn't a toy example. Research from the UK AI Safety Institute found that elicitation techniques can improve model performance by amounts comparable to increasing training compute by 5–20x. That's the difference between a model that seems incapable and one that performs at state-of-the-art levels — separated only by how you ask.
The canonical demonstration is chain-of-thought prompting. Wei et al. showed that PaLM 540B, prompted with just eight chain-of-thought exemplars, achieved state-of-the-art accuracy on the GSM8K math benchmark — surpassing even fine-tuned GPT-3 with a verifier. The model always had the arithmetic capability; standard prompting just didn't activate it.
Prompt Engineering vs. Capability Elicitation: A Useful Distinction
Prompt engineering, as most teams practice it, focuses on instruction clarity. You're optimizing the signal: reducing ambiguity, specifying format, providing context. It answers the question "does the model understand what I want?"
Capability elicitation asks a different question: "what is the best performance achievable under any reasonable configuration?" You're searching for the prompt strategy, tool configuration, or scaffolding that maximizes what the model can actually do.
This distinction matters because each leads to a different debugging workflow:
- Instruction problem: The model misunderstands the task. Fix: rewrite the prompt to be clearer.
- Elicitation problem: The model understands the task but doesn't engage the right reasoning pathway. Fix: change how the model approaches the problem.
When you're stuck at a performance ceiling and prompt rewrites aren't helping, you're probably facing an elicitation problem, not an instruction problem. Recognizing this saves you from the common trap of endlessly rephrasing instructions when the bottleneck is elsewhere.
Five Elicitation Techniques That Actually Work
Research and production experience point to a hierarchy of approaches, roughly ordered by effectiveness and implementation cost.
1. Structured Decomposition
Break complex tasks into explicit reasoning steps. Chain-of-thought is the best-known version, but the principle extends to any technique that forces the model to show intermediate work. This works because large language models process tokens sequentially — giving them intermediate steps means each subsequent token is conditioned on richer context.
The key nuance: decomposition only helps for models above roughly 100 billion parameters. Smaller models produce illogical chains of thought that actually decrease accuracy. Know your model's scale before assuming decomposition will help.
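As a concrete sketch, here is the difference between a direct prompt and a few-shot chain-of-thought prompt. The exemplar and the `build_*` helpers are illustrative, not from any particular library:

```python
def build_direct_prompt(question: str) -> str:
    """Direct prompting: ask for the answer with no reasoning scaffold."""
    return f"Question: {question}\nAnswer:"

def build_cot_prompt(question: str, exemplars: list[tuple[str, str]]) -> str:
    """Chain-of-thought prompting: prepend worked examples whose answers
    show intermediate steps, then cue step-by-step reasoning."""
    parts = [f"Question: {q}\nAnswer: {worked}" for q, worked in exemplars]
    parts.append(
        f"Question: {question}\nAnswer: Let's work through this step by step."
    )
    return "\n\n".join(parts)

exemplar = (
    "A pen costs $2 and a notebook costs three times as much. "
    "What do both cost together?",
    "The notebook costs 3 * $2 = $6. Together: $2 + $6 = $8. The answer is 8.",
)
prompt = build_cot_prompt(
    "A book costs $5 and a bag costs twice as much. Total?", [exemplar]
)
```

The only change from the direct version is the scaffold: same model, same weights, same question.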
2. Analogical Priming
Recent research published in Nature Communications found that human-provided analogies can amplify LLM performance by up to 10x on novel problems. The mechanism: analogies activate latent knowledge about a structurally similar domain, and that knowledge transfers to the target problem.
In practice, telling the model "this problem is similar to X" — where X is something well-represented in training data — can unlock capabilities that direct questioning misses. If your model struggles with a scheduling optimization problem, framing it as analogous to bin-packing might activate stronger reasoning pathways.
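A sketch of that framing, assuming the caller supplies both the analogy and the structural mapping (the helper name is hypothetical):

```python
def with_analogy(problem: str, analog_domain: str, mapping: str) -> str:
    """Frame the target problem as structurally similar to a domain that
    is likely well-represented in training data, and state the mapping
    explicitly so the transfer is not left implicit."""
    return (
        f"This problem is structurally similar to {analog_domain}.\n"
        f"Mapping: {mapping}\n\n"
        f"Using that analogy, solve the following:\n{problem}"
    )

prompt = with_analogy(
    problem="Assign 40 variable-length meetings to 6 rooms with no overlaps.",
    analog_domain="the bin-packing problem",
    mapping="meetings are items, rooms are bins, room-hours are bin capacity",
)
```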
3. Expertise Framing
Assigning the model a specific expert role ("You are a senior distributed systems engineer") does more than flavor the output. Role prompts activate domain-specific knowledge clusters in the model's representations. Specially designed role configurations consistently outperform default settings across multiple domains.
The important detail: generic roles ("you are an expert") help less than specific ones ("you are a database reliability engineer specializing in PostgreSQL replication"). Specificity determines how targeted the knowledge activation is.
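In code, the difference is simply how much the role string pins down; the role text below is an illustrative example, not a tested recipe:

```python
GENERIC_ROLE = "You are an expert."  # activates little that is task-specific

def specific_role(title: str, specialty: str) -> str:
    """A specific role names discipline, seniority, and sub-domain,
    which is what makes the knowledge activation targeted."""
    return f"You are a {title} specializing in {specialty}."

system_prompt = specific_role(
    "senior database reliability engineer",
    "PostgreSQL streaming replication",
)
```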
4. Multi-Attempt Selection
Generate multiple candidate responses and select the best one. This sounds obvious, but the gain is substantial because LLM generation is stochastic: a single sample might land in a poor reasoning trajectory, while five samples dramatically increase the probability that at least one follows a strong path.
Self-consistency — generating multiple chain-of-thought paths and taking the majority vote — is the formalized version. In production, even simple best-of-N sampling with a lightweight scoring function can meaningfully improve output quality.
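A minimal, runnable sketch of self-consistency; `noisy_sampler` is a toy stand-in for a temperature-above-zero LLM call that returns the right final answer only 60% of the time:

```python
import random
from collections import Counter

def self_consistent_answer(sample_fn, question, n=5, seed=0):
    """Self-consistency: draw n independent samples and majority-vote on
    the final answers. sample_fn(question, rng) stands in for an LLM call."""
    rng = random.Random(seed)
    answers = [sample_fn(question, rng) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def noisy_sampler(question, rng):
    # Right answer "42" with probability 0.6: a single draw is unreliable,
    # but the majority over five draws usually recovers it.
    return "42" if rng.random() < 0.6 else "17"

answer = self_consistent_answer(noisy_sampler, "toy question")
```

For free-form generation where final answers cannot be voted on literally, replace the majority vote with a lightweight scoring function and take the best of n.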
5. Tool-Augmented Elicitation
Sometimes the latent capability exists but requires external scaffolding to express. Giving a model access to a code interpreter doesn't teach it math — it lets the model offload computation it can reason about but can't reliably execute in-context. The AI Safety Institute's protocol specifically identifies tool access as a primary elicitation vector: command lines, Python interpreters, and web access can all unlock capabilities that pure text generation constrains.
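A sketch of the pattern, with a tiny arithmetic "tool" and a made-up `CALC(...)` marker convention for the model's tool requests (real frameworks use structured tool calls, but the offloading logic is the same):

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str):
    """Safely evaluate a plain arithmetic expression (no names, no calls)."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

# The model decides WHAT to compute; the scaffold executes it exactly and
# substitutes the result back into the transcript.
model_output = "Total cost is CALC(17 * 23 + 250)."
start = model_output.index("CALC(")
end = model_output.index(")", start)
result = calc(model_output[start + len("CALC("):end])
final = model_output[:start] + str(result) + model_output[end + 1:]
```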
The Diagnostic Framework: Can't vs. Won't vs. Doesn't
Before you start optimizing, you need to distinguish between three failure modes:
The model lacks the capability. No amount of prompt engineering or elicitation will help. The knowledge simply isn't in the weights. Signs: the model produces confident-sounding but factually wrong answers across all prompting strategies, or the task requires knowledge from after the training cutoff.
The model has the capability but your prompt doesn't trigger it. This is the elicitation gap. Signs: the model sometimes gets it right (inconsistent performance), it succeeds on simpler versions of the same task, or different prompting strategies produce wildly different quality levels.
The model has the capability but its safety training blocks it. This is alignment-related. Research from the Elicitation Game study shows this is a distinct category: anti-refusal techniques and fine-tuning can unlock capabilities that prompting alone cannot reach, particularly in code-generation tasks where circuit-broken models require stronger interventions.
The diagnostic sequence:
- Try five to ten different prompting strategies. If performance variance is high, it's an elicitation problem.
- Try simpler versions of the task. If the model succeeds at easier variants, the capability exists — you need better elicitation for the harder version.
- Check if other models of similar scale succeed. If they do, the capability likely exists in your model too.
- If nothing works across multiple models and strategies, you've probably hit a genuine capability boundary.
Why Teams Get Stuck on Instruction Optimization
There's a reason most teams default to rewriting instructions rather than exploring elicitation: instruction problems are legible. When the model misunderstands your format requirements or ignores a constraint, you can see exactly what went wrong. Elicitation problems are opaque — the output looks wrong, but you can't point to a specific instruction it violated.
This creates a systematic bias. Teams spend weeks refining instruction language when the real bottleneck is that their single-shot prompt doesn't give the model enough reasoning runway, their generic role assignment doesn't activate the right knowledge domain, or they're evaluating single samples when the model needs three attempts to reliably succeed.
Most organizations stop at standardizing a single prompt template when what they need is systematic evaluation: testing multiple elicitation strategies and measuring which one actually moves the performance needle for each task.
Applying This in Production
The practical workflow for capability elicitation in production systems:
Start with the diagnostic. Before optimizing anything, run your task across three to five different prompting strategies (direct, chain-of-thought, role-framed, analogical, decomposed). If performance varies by more than 20%, you have elicitation headroom.
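The first step can be mechanized. The 20% spread threshold below mirrors the rule of thumb above, and the strategy scores are made-up illustrations:

```python
def diagnose(scores_by_strategy: dict[str, float],
             spread_threshold: float = 0.20) -> str:
    """Score one task under several prompting strategies. A large spread
    between best and worst suggests an elicitation gap; uniformly low
    scores suggest a genuine capability boundary."""
    best = max(scores_by_strategy.values())
    worst = min(scores_by_strategy.values())
    if best - worst > spread_threshold:
        return "elicitation gap: the capability exists, prompting matters"
    if best < 0.2:
        return "possible capability boundary: try simpler task variants"
    return "consistent performance: optimize instructions, not elicitation"

verdict = diagnose({"direct": 0.15, "chain_of_thought": 0.62,
                    "role_framed": 0.48, "analogical": 0.55})
```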
Match technique to task type. Reasoning-heavy tasks benefit most from decomposition. Knowledge-retrieval tasks respond to expertise framing. Novel or unusual tasks improve with analogical priming. Generation tasks benefit from multi-attempt selection.
Layer techniques deliberately. The most effective production configurations combine multiple elicitation approaches: a specific role frame, chain-of-thought decomposition, and best-of-three selection. But add techniques incrementally and measure each addition — stacking everything at once makes it impossible to attribute improvements.
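One way to keep attribution clean is an ablation-style loop that adds one layer at a time and records the score after each addition. The layers and the length-based scorer below are toy stand-ins for real prompt transforms and a real eval harness:

```python
def ablate(base_prompt, layers, evaluate):
    """Apply elicitation layers one at a time, scoring after each, so any
    improvement can be attributed to exactly one addition."""
    prompt, report = base_prompt, []
    report.append(("base", evaluate(prompt)))
    for name, apply_layer in layers:
        prompt = apply_layer(prompt)
        report.append((name, evaluate(prompt)))
    return report

layers = [
    ("role_frame", lambda p: "You are a senior SQL performance engineer.\n" + p),
    ("chain_of_thought", lambda p: p + "\nLet's work through this step by step."),
]
# Toy scorer that rewards longer, more scaffolded prompts; swap in real
# eval-set accuracy in practice.
report = ablate("Optimize this query.", layers, evaluate=lambda p: len(p) / 100)
```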
Budget for the cost. Elicitation techniques trade compute for capability. Chain-of-thought increases token usage. Multi-attempt selection multiplies API calls. Tool augmentation adds latency. The question isn't whether elicitation is free — it isn't — but whether the capability gain justifies the cost for your specific use case.
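A back-of-envelope model of that trade-off; every number here (token multiplier, sample count, price) is an illustrative assumption, not a vendor quote:

```python
def elicited_cost(base_output_tokens: int, cot_multiplier: float = 3.0,
                  n_samples: int = 3, price_per_1k_tokens: float = 0.01) -> float:
    """Chain-of-thought multiplies output tokens; best-of-n multiplies
    the number of calls. Returns estimated dollars per request."""
    tokens = base_output_tokens * cot_multiplier * n_samples
    return tokens / 1000 * price_per_1k_tokens

baseline = elicited_cost(200, cot_multiplier=1.0, n_samples=1)
layered = elicited_cost(200)  # CoT plus best-of-three: 9x the baseline cost
```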
The Frontier Keeps Growing
Capability elicitation matters more, not less, as models improve. Each new generation has more latent capability to unlock, and the gap between naive prompting and optimized elicitation is growing because models develop capabilities faster than standard evaluation reveals them.
For engineering teams, this means your model evaluation process needs an elicitation step. Before concluding "the model can't do X" and reaching for fine-tuning or a different model, invest in systematic elicitation. The capability you need might already be there — waiting for the right prompt to activate it.
- https://forum.effectivealtruism.org/posts/xZLCGJKf8i73AdxDK/the-elicitation-game-evaluating-capability-elicitation
- https://www.aisi.gov.uk/blog/our-approach-to-ai-capability-elicitation
- https://aisecurityandsafety.org/en/glossary/capability-elicitation/
- https://arxiv.org/abs/2201.11903
- https://arxiv.org/abs/2212.03827
- https://www.nature.com/articles/s41467-026-70873-7
- https://www.sciencedirect.com/science/article/pii/S2666389925001084
- https://www.science.org/doi/10.1126/sciadv.adz2924
