Capability Elicitation vs. Prompt Engineering: Getting Models to Use What They Already Know

· 8 min read
Tian Pan
Software Engineer

Most teams optimizing their LLM prompts are solving the wrong problem. They spend weeks refining instruction clarity — tweaking wording, reordering constraints, adjusting tone — when the real bottleneck is that the model already knows how to solve the task but their prompt never triggers the right capability.

This is the difference between prompt engineering and capability elicitation. Prompt engineering is about communicating what you want. Capability elicitation is about activating what the model can already do. The distinction matters because the fixes are completely different, and misdiagnosing which problem you have wastes months of iteration on the wrong lever.

The Knowledge Is There — Your Prompt Doesn't Reach It

Here's a pattern every practitioner has seen: you ask a model a question and get a mediocre answer. You rephrase the question with more context, and suddenly the response is dramatically better — not because you provided new information, but because you activated a different region of the model's learned representations.

This isn't a quirk. It's a fundamental property of how large language models work. These models compress vast amounts of knowledge during training, but that knowledge is organized associatively, not hierarchically. The path from your prompt to the relevant knowledge depends on which associations your input activates. A slightly different phrasing can route through entirely different internal representations.

Research on provision-based versus elicitation-based prompt optimization makes this concrete. When researchers tested elicitation methods — techniques that try to unlock knowledge already in the model — they found that optimized prompts incorporated fewer than 15 domain-specific key points, producing learning gains below 20%. The prompts achieved better validation scores through pattern matching while failing to resolve underlying knowledge deficiencies. In other words, elicitation methods hit a ceiling when the knowledge genuinely isn't there.

The critical diagnostic question is: does the model lack the knowledge, or does the model lack the activation path? Getting this wrong sends you down a costly dead end.

Three Elicitation Techniques That Actually Work

When the knowledge is present but dormant, three techniques consistently outperform standard prompt engineering.

Structured Decomposition

Instead of asking the model to solve a complex problem in one shot, you break the problem into sub-problems that each activate a different knowledge domain. Chain-of-thought prompting is the most famous example: adding "let's think step by step" to math problems dramatically improved accuracy on benchmarks like GSM8K. But structured decomposition goes beyond chain-of-thought.

The key insight is that decomposition scaffolding unlocks reasoning that raw scale alone doesn't: a 540-billion-parameter model prompted with chain-of-thought exemplars reached accuracy that smaller models couldn't match regardless of how many plain examples they were given. Structured decomposition also helps smaller models punch above their weight by routing each sub-problem through the model's strongest relevant capability.

The practical pattern: instead of "analyze this system's failure modes," try "first, list the components and their dependencies; second, for each component, describe what happens when it fails; third, identify which failures cascade." Each step activates a different capability — taxonomy, causal reasoning, and graph analysis — rather than hoping a single prompt activates all three simultaneously.
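The three-step pattern above can be sketched as a small pipeline. This is a minimal sketch, assuming a placeholder `call_model` function standing in for whatever LLM client you use; it is not a real API.

```python
def call_model(prompt: str) -> str:
    """Placeholder: replace with your actual LLM call."""
    return f"[model response to: {prompt.splitlines()[0]}]"

def analyze_failure_modes(system_description: str) -> dict:
    # Step 1: taxonomy -- enumerate components and their dependencies.
    components = call_model(
        "List the components of this system and their dependencies:\n"
        + system_description
    )
    # Step 2: causal reasoning -- feed step 1's output forward so the model
    # works from its own taxonomy rather than re-deriving it implicitly.
    failures = call_model(
        "For each component below, describe what happens when it fails:\n"
        + components
    )
    # Step 3: graph analysis -- trace which failures propagate.
    cascades = call_model(
        "Given these failure descriptions, identify which failures cascade:\n"
        + failures
    )
    return {"components": components, "failures": failures, "cascades": cascades}
```

Each call carries the previous step's output, so the final prompt is grounded in the model's own intermediate reasoning instead of one monolithic instruction.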

Analogical Priming

Research published in Nature Communications demonstrated that human-provided analogical guidance amplified LLM performance by up to 10x on certain tasks. The mechanism is striking: structured guidance activates latent model capabilities via analogical bridges that the model wouldn't discover autonomously.

This works because analogies create activation shortcuts. When you say "think of this distributed system problem like a traffic routing problem," you're not teaching the model about distributed systems. You're connecting its distributed-systems knowledge to its traffic-routing knowledge, giving it access to solution patterns it wouldn't otherwise retrieve.

The practical application is to identify which domain the model handles most naturally for a given problem structure, then frame your actual problem as an analogy to that domain. Database consistency problems map well to banking transaction analogies. Concurrency issues map to restaurant kitchen coordination. The model knows both domains — the analogy just builds the bridge.
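As a sketch of that application, a prompt builder can keep a domain-to-analogy map and wrap the real problem in the bridge framing. The map entries here are illustrative assumptions; tune them to where your model is strongest.

```python
# Illustrative domain -> analogy bridges (assumptions, not a validated mapping).
ANALOGY_MAP = {
    "database consistency": "banking transactions",
    "concurrency": "restaurant kitchen coordination",
}

def with_analogy(problem: str, domain: str) -> str:
    bridge = ANALOGY_MAP.get(domain)
    if bridge is None:
        return problem  # no known bridge: fall back to the plain problem
    return (
        f"Think of this {domain} problem like a {bridge} problem.\n"
        f"First explain how the analogy maps onto it, then solve it.\n\n"
        f"{problem}"
    )
```

Asking the model to articulate the mapping before solving makes the analogical bridge explicit rather than hoping it forms implicitly.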

Expertise Framing (With Caveats)

"You are an expert in X" is probably the most common elicitation technique in production systems. But recent research reveals it's far more nuanced than practitioners assume.

Persona prompting improves performance on extraction tasks (+0.65 score), STEM tasks (+0.60), and reasoning tasks (+0.40). But it actively degrades performance on math, coding, and factual knowledge tasks. On the MMLU benchmark, accuracy dropped from 71.6% baseline to 66.3% with detailed expert personas — a meaningful regression.

The mechanism: expert personas activate the model's instruction-following mode, which prioritizes sounding authoritative over being accurate. For tasks where tone and structure matter (writing, summarization, extraction), this helps. For tasks where precision matters (math, factual recall, code generation), it hurts.

The fix isn't to avoid expertise framing entirely — it's to use it conditionally. Frame the model as an expert when you need better formatting, structure, or domain-appropriate vocabulary. Switch to neutral prompting when verification and accuracy are the priority.
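The conditional switch can be made explicit in code. The task categories and prompt wording below are illustrative assumptions based on the results above, not a validated taxonomy.

```python
# Task types where persona framing helped vs. hurt in the cited results
# (illustrative buckets -- validate against your own task distribution).
PERSONA_HELPS = {"extraction", "stem", "reasoning", "writing", "summarization"}
PERSONA_HURTS = {"math", "coding", "factual"}

def system_prompt(task_type: str, domain: str = "the relevant field") -> str:
    if task_type in PERSONA_HURTS:
        # precision tasks: neutral prompting, no authority framing
        return "Answer precisely. Say 'unsure' rather than guess."
    if task_type in PERSONA_HELPS:
        # tone/structure tasks: expertise framing helps
        return f"You are an expert in {domain}."
    return "Answer the question directly."
```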

When Elicitation Fails: The Provision Alternative

The most important thing about elicitation is knowing when to stop trying. If the model doesn't have the knowledge, no amount of clever prompting will create it.

Research on knowledge-intensive tasks across financial, legal, and medical domains showed provision-based approaches — directly injecting domain knowledge into the prompt — achieved 28.3% learning gains compared to 7.5% for elicitation methods. On specialized medical benchmarks where baseline accuracy was 28.3%, provision-based optimization reached 51.7%, far exceeding what any elicitation technique could achieve.

This creates a practical diagnostic framework:

  • Try elicitation first when the task uses general knowledge the model likely encountered during training. Restructure the problem, add analogies, use decomposition.
  • Switch to provision when elicitation plateaus. If three different elicitation approaches produce similar mediocre results, the model probably lacks the knowledge rather than the activation path. Inject the domain knowledge directly via examples, reference material, or retrieval-augmented generation.
  • Combine both for production systems. Use elicitation techniques to activate the model's general capabilities, then provision domain-specific knowledge that fills the gaps.
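The plateau check in the second bullet can be sketched as a small heuristic. The thresholds here are illustrative assumptions; calibrate them against the noise in your own evals.

```python
def choose_strategy(variant_scores: dict, target: float = 0.8,
                    plateau_spread: float = 0.05) -> str:
    """variant_scores maps elicitation-variant name -> validation score in [0, 1]."""
    best = max(variant_scores.values())
    if best >= target:
        return "elicitation"  # an activation path was found
    spread = best - min(variant_scores.values())
    if spread < plateau_spread:
        # several different elicitation approaches cluster at similar mediocre
        # scores: likely a knowledge gap, so inject the knowledge directly
        return "provision"
    return "iterate-elicitation"  # variants differ; keep exploring phrasings
```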

The worst outcome is spending weeks refining prompts that try to elicit knowledge the model simply doesn't have, when you could have provided that knowledge directly on day one.

The Invisible Capability Gap

Every prompt you send to an LLM is underspecified. You carry implicit assumptions and unstated context that the model fills in using its own understanding. When its fill-in matches your expectations, the system feels like magic. When it doesn't, the system feels broken.

Capability gaps are invisible until you step on them. A model might handle 95% of your use cases perfectly, then fail catastrophically on the 5% where your unstated assumptions diverge from the model's. This isn't a prompt engineering problem — it's an elicitation problem. The model may have the capability to handle those edge cases, but your prompt doesn't activate the right representation.

The practical defense is systematic failure categorization. When your model fails, classify the failure:

  • Spurious bottleneck: The model has the capability but something trivial blocks it — formatting issues, unnecessary refusals, incorrect assumptions about output structure. Fix these at prompt time.
  • Real bottleneck: The model genuinely lacks the capability or knowledge. No amount of elicitation will fix this. Provision the knowledge or switch to a more capable model.
  • Tradeoff: Fixing this failure would break something else. These are the hardest to handle and require careful evaluation across your full task distribution.

This categorization framework, adapted from evaluation methodology used in capability assessments, prevents the most common mistake: treating every failure as a prompt engineering problem when many are actually elicitation problems or genuine capability limits.
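The triage can be reduced to two probe questions per failure. The questions themselves are assumptions about how you'd operationalize the categories; answer them empirically before picking a fix.

```python
from enum import Enum

class Bottleneck(Enum):
    SPURIOUS = "fix at prompt time"
    REAL = "provision knowledge or switch models"
    TRADEOFF = "evaluate across the full task distribution"

def triage(fixed_by_trivial_change: bool, fix_regresses_other_tasks: bool) -> Bottleneck:
    """fixed_by_trivial_change: does reformatting or rewording alone fix it?
    fix_regresses_other_tasks: does every known fix break other cases?"""
    if fixed_by_trivial_change:
        return Bottleneck.SPURIOUS
    if fix_regresses_other_tasks:
        return Bottleneck.TRADEOFF
    return Bottleneck.REAL
```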

The Decreasing Returns of Clever Prompting

As models improve, the marginal value of sophisticated prompting techniques decreases. Chain-of-thought prompting, which was transformative for earlier models, shows diminishing returns on newer architectures that have internalized step-by-step reasoning during training.

This suggests a practical strategy shift. For current-generation models, invest less time in prompt tricks and more time in:

  • Understanding what your model actually knows through systematic probing across your task distribution
  • Building clean activation paths through structured inputs rather than instruction-heavy prompts
  • Providing missing knowledge directly rather than hoping clever phrasing will conjure it
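The first bullet, systematic probing, can start as a tiny harness. `call_model` is again a placeholder for your LLM client, and the probe/check structure is an illustrative assumption.

```python
def run_probes(call_model, probes):
    """probes: list of (name, prompt, check) where check(response) -> bool.
    Returns per-probe pass/fail and the overall pass rate."""
    results = {name: check(call_model(prompt)) for name, prompt, check in probes}
    pass_rate = sum(results.values()) / len(results)
    return results, pass_rate
```

Maintaining a probe set per task cluster and tracking pass rates over time is what turns "what does my model know?" from guesswork into data.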

The teams that ship reliable AI features aren't the ones with the cleverest prompts. They're the ones who correctly diagnose whether each failure is an elicitation problem, a knowledge problem, or a genuine capability limit — and apply the right fix accordingly. That diagnostic skill is worth more than any prompting technique.
