Capability Elicitation vs. Prompt Engineering: Your Model Already Knows the Answer
Most prompt engineering advice focuses on the wrong problem. Teams spend weeks refining instruction clarity — adding examples, adjusting tone, restructuring formats — when the actual bottleneck is that the model fails to activate knowledge it demonstrably possesses. The distinction matters: prompt engineering tells a model what to do, while capability elicitation gets a model to use what it already knows.
This isn't a semantic quibble. The UK's AI Safety Institute found that proper elicitation techniques can improve model performance by an amount equivalent to increasing training compute by five to twenty times. That's not a marginal gain from better wording. That's an entire capability tier sitting dormant inside models you're already paying for.
The Knowledge Is There — Your Prompt Doesn't Trigger It
Every large language model is an iceberg. Standard benchmarks measure the tip. The vast majority of what a model learned during pretraining — reasoning patterns, domain knowledge, problem-solving strategies across billions of tokens — sits below the waterline, accessible only under the right conditions.
Recent research from NeurIPS 2025 demonstrated this concretely: training as few as 10 to 100 randomly chosen parameters can recover up to 50% of the performance gap between a pretrained-only model and a fully fine-tuned one. Scale to a few thousand parameters and you recover 95%. The capabilities aren't missing — they're latent, waiting for the right activation signal.
This is why the same model fails a task with one prompt and nails it with another. The knowledge didn't appear between requests. The second prompt simply found the right path through the model's existing representations.
Prompt Engineering Optimizes the Wrong Layer
Traditional prompt engineering operates on a clear mental model: if the model's output is wrong, the instructions must be unclear. So you iterate on phrasing. You add constraints. You provide examples. You specify output formats.
This works — up to a point. For tasks that genuinely require instruction clarity — format compliance, role adherence, safety boundaries — better prompts yield better results. But for tasks that require the model to reason, recall, or synthesize across domains, instruction clarity isn't the bottleneck.
Consider a concrete scenario. You ask a model to diagnose why a distributed system experiences cascading failures under specific load patterns. The model gives a shallow answer about "adding more servers." You add more detail — describing the architecture, specifying the failure mode, requesting step-by-step analysis. The answer improves, but only because you did the reasoning yourself and embedded it in the prompt.
Capability elicitation takes a different approach. Instead of pre-digesting the problem, you structure the interaction so the model activates its own knowledge about distributed systems, failure modes, and cascading effects. The goal is getting the model to think, not telling it what to think.
Five Elicitation Techniques That Actually Work
Research from the AI Safety Institute and "The Elicitation Game" (a systematic evaluation of elicitation methods) points to a clear hierarchy of techniques, each unlocking progressively deeper capabilities.
1. Structured Decomposition
Break complex problems into sub-problems before asking for solutions. This isn't chain-of-thought prompting — it's asking the model to identify the sub-problems rather than solve them all at once. Chain-of-thought says "think step by step." Structured decomposition says "what are the three distinct problems embedded in this question?"
The difference is significant. Chain-of-thought prompting shows diminishing returns with newer reasoning models — recent Wharton research found only marginal benefits for models that already reason internally, while adding 35-600% latency overhead. Decomposition, by contrast, redirects the model's attention to activate different knowledge clusters rather than just generating more intermediate tokens.
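A minimal sketch of this multi-pass pattern is below. The `call_model` helper is a hypothetical placeholder for whatever client you already use; the prompt wording is illustrative, not prescriptive.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in; replace with your provider's chat/completion call."""
    return "..."  # placeholder response

def decompose_and_solve(question: str) -> str:
    # Pass 1: ask the model to name the sub-problems, not to solve them.
    decomposition = call_model(
        "List the distinct sub-problems embedded in this question, one per line, "
        "without solving any of them:\n\n" + question
    )
    # Pass 2: solve each sub-problem separately, so each call activates a
    # narrower slice of the model's knowledge.
    partials = [
        call_model(f"Question: {question}\n\nSolve only this part: {sub}")
        for sub in decomposition.splitlines() if sub.strip()
    ]
    # Pass 3: synthesize the partial analyses into one answer.
    return call_model(
        "Combine these partial analyses into a single coherent answer:\n\n"
        + "\n\n".join(partials)
    )
```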
2. Analogical Priming
A study published in Nature Communications found that providing cross-domain analogies amplified LLM performance by up to 10x on complex reasoning tasks. The mechanism is revealing: analogies don't provide new information. They redirect the model's attention within existing semantic structures, activating latent knowledge that the original framing failed to reach.
In practice, this means that when a model struggles with a novel problem, you shouldn't add more context about that specific problem. Instead, offer an analogy to a domain where similar patterns exist. "This is similar to how circuit breakers prevent cascading power failures" activates entirely different knowledge pathways than "analyze this distributed system failure," even though both prompts target the same answer.
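As a sketch, assuming the distributed-systems scenario from earlier, the only difference between the two prompts below is the analogy; the task text is identical:

```python
# Analogical priming: prepend a cross-domain analogy rather than more task detail.
task = (
    "Our order service times out under peak load, retries pile up, and the "
    "downstream inventory service falls over. Diagnose the cascading failure."
)

baseline_prompt = f"Analyze this distributed system failure:\n\n{task}"

primed_prompt = (
    "This is similar to how circuit breakers prevent cascading power failures: "
    "a local overload is isolated before it propagates through the grid.\n\n"
    f"With that analogy in mind, analyze this distributed system failure:\n\n{task}"
)

# Compare the two responses: both target the same answer, but the primed prompt
# routes through different knowledge (isolation, backpressure, load shedding).
```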
3. Expertise Framing — With Caveats
Telling a model to "act as an expert" is one of the most common prompt engineering techniques. Recent research reveals it's also one of the most misunderstood.
A 2026 study found that expert personas improve alignment-dependent tasks — writing quality, safety compliance, format following — by activating instruction-tuning behaviors. But they damage accuracy on knowledge-dependent tasks. MMLU scores dropped from 71.6% to 68.0% when expert personas were applied. Math, coding, and factual recall all suffered.
The explanation is mechanistic: persona prompts activate the model's "instruction-following mode," which competes with factual knowledge retrieval. The model becomes better at sounding like an expert while becoming worse at being one.
The practical takeaway: use expertise framing for tone and structure, never for factual accuracy. If you need the model to recall precise knowledge, drop the persona and ask directly.
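One way to operationalize that split is to treat the persona as a switch rather than a default. The task categories below are assumptions for illustration:

```python
# Personas for alignment-dependent tasks; plain, direct prompts for
# knowledge-dependent tasks (math, coding, factual recall).
def build_messages(task_type: str, user_request: str) -> list[dict]:
    if task_type in {"writing", "formatting", "safety_review"}:
        # Alignment-dependent: a persona helps tone, structure, and compliance.
        system = "You are a senior technical editor. Be precise and structured."
    else:
        # Knowledge-dependent: drop the persona and ask directly.
        system = "Answer directly and accurately."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_request},
    ]

messages = build_messages("factual_recall", "What does the CAP theorem state?")
```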
4. Combinatorial Prompting
"The Elicitation Game" — a systematic evaluation of elicitation techniques — found that combining methods dramatically outperforms using any single technique. Specifically, combining N-shot prompting with response prefilling achieved 55-72% capability recovery on tasks where either method alone recovered far less.
This suggests that different elicitation techniques activate different internal mechanisms. Few-shot examples prime pattern matching. Prefilling activates continuation behavior. Together, they create a richer activation landscape than either alone.
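A sketch of the combined message layout follows. Whether a trailing assistant message is treated as a prefill depends on your provider, so treat the structure as an assumption to check against your API's documentation:

```python
# Few-shot examples prime the diagnostic pattern; the prefill steers the
# continuation into that same pattern.
few_shot = [
    {"role": "user", "content": "Log: 'OOMKilled, restart count 14'. Root cause?"},
    {"role": "assistant", "content": "Root cause: memory limit below peak heap. Fix: raise the limit or shrink caches."},
    {"role": "user", "content": "Log: 'connection pool exhausted'. Root cause?"},
    {"role": "assistant", "content": "Root cause: connections leaked on error paths. Fix: close them in a finally block."},
]

target = {"role": "user", "content": "Log: 'replica lag 900s, writes timing out'. Root cause?"}

# Partial assistant turn acting as a prefill (provider-dependent behavior).
prefill = {"role": "assistant", "content": "Root cause:"}

messages = few_shot + [target, prefill]
```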
5. Tool-Augmented Elicitation
Giving models access to external tools (code interpreters, search, calculators) doesn't just compensate for model limitations — it changes how the model reasons. The AI Safety Institute found that tool access unlocks capabilities that exist within the model but require an external scaffold to express. A model that can't solve a math problem in natural language may solve it perfectly when given access to a Python interpreter, not because Python does the math, but because the coding context activates different reasoning pathways.
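A minimal sketch of wiring up such a scaffold: the tool schema below follows the JSON-schema convention several chat APIs use, but exact field names vary by provider, so treat the layout as an assumption and adapt it to your client.

```python
import subprocess
import sys

# Tool definition the model can call to reason in a coding context.
run_python_tool = {
    "name": "run_python",
    "description": "Execute a short Python snippet and return its output.",
    "parameters": {
        "type": "object",
        "properties": {"code": {"type": "string", "description": "Python source to execute"}},
        "required": ["code"],
    },
}

def run_python(code: str) -> str:
    """Runs model-written code in a subprocess; use a real sandbox in production."""
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=5
    )
    return result.stdout or result.stderr
```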
The Elicitation Gap Is a Measurement Problem
One of the most important implications of capability elicitation research is that standard benchmarks systematically underestimate model capabilities. "The Elicitation Game" demonstrated this clearly: models that appeared to have lost capabilities (through safety training or alignment interventions) often retained them — they were just harder to access.
For practitioners, this creates a measurement problem. When your model fails a task, you face an ambiguity: does the model lack the capability, or does your prompt fail to elicit it? The answer determines whether you should switch models (expensive), fine-tune (complex), or restructure your prompts (cheap).
The research suggests a practical diagnostic, sketched in code after the list. Before concluding that a model "can't do" something:
- Try at least three structurally different prompts (not just rephrased versions of the same approach)
- Test with and without analogies from adjacent domains
- Combine techniques: few-shot examples plus decomposition plus tool access
- Test with explicit anti-framing: "Don't worry about being comprehensive, just focus on the core mechanism"
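A sketch of that diagnostic as a loop over structurally different framings. Both `call_model` and `passes` are hypothetical stand-ins for your client and your task-specific success check:

```python
def call_model(prompt: str) -> str:
    return "..."  # replace with your provider's API call

def passes(response: str) -> bool:
    return False  # replace with a task-specific check (regex, test suite, grader)

def diagnose(task: str) -> str | None:
    variants = [
        task,                                                            # direct ask
        f"What distinct sub-problems are embedded here?\n\n{task}",      # decomposition
        f"This resembles circuit breakers in a power grid.\n\n{task}",   # adjacent-domain analogy
        f"Don't worry about being comprehensive, just focus on the core mechanism.\n\n{task}",  # anti-framing
    ]
    for prompt in variants:
        response = call_model(prompt)
        if passes(response):
            return prompt  # this framing elicits the capability
    return None  # only now consider fine-tuning or switching models
```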
If none of these work, the capability may genuinely be absent. But in practice, teams abandon models or reach for fine-tuning far too early.
When Elicitation Isn't Enough
Capability elicitation has clear limits. "The Elicitation Game" found a stark divide between task types: on multiple-choice tasks, prompting techniques recovered a large share of the hidden capability (50-72%). On code generation tasks, prompting recovered almost nothing (0-7%). Fine-tuning was the only technique that achieved full recovery on complex generative tasks.
The pattern generalizes: elicitation works best when the model needs to select or recognize the right answer from its knowledge. It works poorly when the model needs to construct complex, multi-step outputs from scratch. The expanded output space of generative tasks requires more than attention redirection — it requires the kind of behavioral conditioning that only parameter updates provide.
This creates a practical decision framework (a minimal routing sketch follows the list):
- Recognition tasks (classification, QA, analysis, diagnosis): invest in elicitation first
- Generation tasks (code, long-form writing, complex planning): elicitation helps at the margins, but fine-tuning may be necessary
- Hybrid tasks (code review, debugging, editing): elicitation for the analysis component, fine-tuning for the generation component
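The same framework expressed as a routing table. The task categories and strategy labels are illustrative assumptions, not a fixed taxonomy:

```python
# Default to the cheap path (elicitation); escalate only for generative tasks.
STRATEGY = {
    "classification": "elicitation_first",
    "qa": "elicitation_first",
    "diagnosis": "elicitation_first",
    "code_generation": "consider_finetuning",
    "long_form_writing": "consider_finetuning",
    "code_review": "elicit_analysis_then_finetune_generation",
    "debugging": "elicit_analysis_then_finetune_generation",
}

def plan(task_type: str) -> str:
    return STRATEGY.get(task_type, "elicitation_first")
```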
The Cost Asymmetry You're Ignoring
The economic argument for capability elicitation is overwhelming. Fine-tuning requires data collection, compute, evaluation pipelines, and ongoing maintenance as base models update. A single fine-tuning run can cost thousands of dollars and weeks of engineering time. Switching to a larger model multiplies inference costs permanently.
Better elicitation costs nothing beyond engineering time for prompt development. And unlike fine-tuning, elicitation techniques transfer across model versions. The analogical priming that works on one model generally works on the next, because you're leveraging fundamental properties of how language models represent and retrieve knowledge.
Yet most teams default to the expensive path. They see a model fail, try one or two prompt variations, then escalate to fine-tuning or model switching. The gap between "adequate prompting" and "proper elicitation" is likely the single largest source of wasted AI spend in production systems today.
What Changes When You Think in Terms of Elicitation
Shifting from prompt engineering to capability elicitation changes how you approach every interaction with a language model. Instead of asking "how do I tell the model what I want?" you ask "what does the model already know that's relevant, and how do I activate it?"
This reframing leads to different engineering practices. You stop adding context and start removing constraints that block knowledge retrieval. You stop iterating on instructions and start testing structurally different activation patterns. You stop blaming the model and start treating failures as signal about your elicitation strategy.
The models are more capable than your prompts let them be. The question isn't whether to invest in elicitation — it's how much capability you're leaving on the table by not doing so.
- https://www.aisi.gov.uk/blog/our-approach-to-ai-capability-elicitation
- https://arxiv.org/html/2502.02180
- https://arxiv.org/abs/2603.18507
- https://www.nature.com/articles/s41467-026-70873-7
- https://arxiv.org/abs/2212.03827
- https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5285532
- https://arxiv.org/abs/2201.11903
