
87 posts tagged with "prompt-engineering"


Capability Elicitation: Getting Models to Use What They Already Know

· 8 min read
Tian Pan
Software Engineer

Most teams debugging a bad LLM output reach for the same fix: rewrite the prompt. Add more instructions. Clarify the format. Maybe throw in a few examples. This is prompt engineering in its most familiar form — making instructions clearer so the model understands what you want.

But there's a different failure mode that better instructions can't fix. Sometimes the model has the knowledge and can perform the reasoning, but your prompt doesn't activate it. The model isn't confused about your instructions — it's failing to retrieve and apply capabilities it demonstrably possesses.

This is the domain of capability elicitation. Understanding the difference between "the model can't do this" and "my prompt doesn't trigger it" will change how you debug every AI system you build.

Capability Elicitation vs. Prompt Engineering: Your Model Already Knows the Answer

· 9 min read
Tian Pan
Software Engineer

Most prompt engineering advice focuses on the wrong problem. Teams spend weeks refining instruction clarity — adding examples, adjusting tone, restructuring formats — when the actual bottleneck is that the model fails to activate knowledge it demonstrably possesses. The distinction matters: prompt engineering tells a model what to do, while capability elicitation gets a model to use what it already knows.

This isn't a semantic quibble. The UK's AI Safety Institute found that proper elicitation techniques can improve model performance by an amount equivalent to increasing training compute by five to twenty times. That's not a marginal gain from better wording. That's an entire capability tier sitting dormant inside models you're already paying for.

Schema-Driven Prompt Design: Letting Your Data Model Drive Your Prompt Structure

· 10 min read
Tian Pan
Software Engineer

Your data schema is your prompt. Most engineers treat these as separate concerns — you design your database schema to satisfy normal form rules, and you design your prompts to be clear and descriptive. But the shape of your entity schema has a direct, measurable effect on LLM output quality, and ignoring this relationship is one of the most expensive mistakes in production AI systems.

A team at a mid-sized e-commerce company discovered this when their product extraction pipeline started generating hallucinated model years. The fix wasn't better prompting. It was changing {"model": {"type": "string"}} to a field with an explicit description and a regex constraint. That single schema change — documented in the PARSE research — drove accuracy improvements of up to 64.7% on their extraction benchmark.
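The shape of that fix can be sketched in a few lines. The field names, description text, and regex below are illustrative stand-ins, not the team's actual schema; the point is the structural difference between a bare type and a constrained field, plus a cheap post-hoc validity check:

```python
import re

# Loose schema: the model sees only a type, so nothing stops it from
# padding the value with an inferred (hallucinated) model year.
loose_field = {"model": {"type": "string"}}

# Constrained schema (illustrative): an explicit description plus a regex
# pattern narrows what the LLM may emit for this field.
strict_field = {
    "model": {
        "type": "string",
        "description": (
            "Product model identifier, e.g. 'WX-2400'. "
            "Never infer a year that is absent from the source text."
        ),
        "pattern": r"^[A-Z]{1,4}-\d{2,4}$",
    }
}

def is_valid(value: str) -> bool:
    """Post-hoc check: reject extractions that violate the pattern."""
    return re.fullmatch(strict_field["model"]["pattern"], value) is not None

print(is_valid("WX-2400"))         # True
print(is_valid("WX-2400 (2019)"))  # False — the appended year is caught
```

The same pattern string does double duty: it constrains generation when passed as part of a structured-output schema, and it validates the result after the fact.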

The Instruction-Following Cliff: Why Adding One More Rule to Your System Prompt Breaks Three Others

· 7 min read
Tian Pan
Software Engineer

Your system prompt started at twelve lines. It worked beautifully. Then product wanted tone guidelines. Legal needed a disclaimer rule. The safety team added three more constraints. Now you're at forty rules and the model ignores half of them — but not the same half each time.

This is the instruction-following cliff: the point where adding one more rule to your prompt doesn't just degrade that rule's compliance — it destabilizes rules that were working fine yesterday. And unlike most engineering failures, this one is maddeningly non-deterministic.

Prompt Sprawl: When System Prompts Grow Into Unmaintainable Legacy Code

· 9 min read
Tian Pan
Software Engineer

Your system prompt started at 200 tokens. A clear role definition, a few formatting rules, a constraint or two. Six months later it's 4,000 tokens of accumulated instructions, half contradicting each other, and nobody on the team can explain why the third paragraph about JSON formatting exists. Welcome to prompt sprawl — the production problem that silently degrades your LLM application while everyone assumes the prompt is "fine."

Prompt sprawl is what happens when you treat prompts like append-only configuration. Every bug gets a new instruction. Every edge case gets a new rule. Every stakeholder gets a new paragraph. The prompt grows, and nobody removes anything because nobody knows what's load-bearing.

This is legacy code — except worse. No compiler catches contradictions. No type system enforces structure. No test suite validates that the 47th instruction doesn't negate the 12th. And unlike a tangled codebase, you can't refactor safely because there's no dependency graph to guide you.

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's the reason prompt changes are the primary source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.

Prompt Versioning in Production: The Engineering Discipline Teams Learn the Hard Way

· 10 min read
Tian Pan
Software Engineer

You get paged at 2am. Users are reporting garbage output. You SSH in, check logs, stare at traces — everything looks structurally fine. The model is responding. Latency is normal. But something is wrong with the answers. Then the question lands in your incident channel: "Which prompt version is actually running right now?"

If you can't answer that question in under thirty seconds, you have a prompt versioning problem.

Prompts are treated like configuration in most early-stage LLM projects. A product manager edits a string in a .env file, a developer updates a hardcoded constant, someone else pastes a slightly different version into a staging Slack channel. Eventually the versions diverge, and nobody has a complete picture of what's running where. The experimentation-phase casualness that got you to launch becomes a liability the moment you have real users.
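A minimal sketch of the alternative, assuming nothing beyond the standard library (the registry and names here are hypothetical, not a specific tool): keep prompts in one place and attach a content hash to every request, so "which prompt version is running?" is answerable from the logs in seconds.

```python
import hashlib

# Single source of truth for prompt text (illustrative registry).
PROMPTS = {
    "support_agent": "You are a concise, friendly support assistant.",
}

def prompt_version(name: str) -> str:
    """Derive a short, stable version id from the prompt's content."""
    text = PROMPTS[name]
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def build_request(name: str, user_msg: str) -> dict:
    """Attach the prompt name and version to every outgoing request."""
    return {
        "system": PROMPTS[name],
        "user": user_msg,
        "metadata": {
            "prompt_name": name,
            "prompt_version": prompt_version(name),
        },
    }

req = build_request("support_agent", "Where is my order?")
print(req["metadata"])  # e.g. {'prompt_name': 'support_agent', 'prompt_version': '...'}
```

Because the version id is derived from content rather than assigned by hand, an edit anywhere — dashboard, config file, constant — produces a different hash, and the 2am question answers itself from request logs.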

Fine-Tuning Is Usually the Wrong Move: A Decision Framework for LLM Customization

· 9 min read
Tian Pan
Software Engineer

Most engineering teams building LLM products follow the same progression: prompt a base model, hit a performance ceiling, and immediately reach for fine-tuning as the solution. This instinct is wrong more often than it's right.

Fine-tuning is a powerful tool. It can unlock real performance gains, cut inference costs at scale, and give you precise control over model behavior. But it carries hidden costs — in data, time, infrastructure, and ongoing maintenance — that teams systematically underestimate. And in many cases, prompt engineering or retrieval augmentation would have gotten them there faster and cheaper.

This post gives you a concrete framework for when each approach wins, grounded in recent benchmarks and production experience.

Prompt Caching: The Optimization That Cuts LLM Costs by 90%

· 7 min read
Tian Pan
Software Engineer

Most teams building on LLMs are overpaying by 60–90%. Not because they're using the wrong model or prompting inefficiently — but because they're reprocessing the same tokens on every single request. Prompt caching fixes this, and it takes about ten minutes to implement. Yet it remains one of the most underutilized optimizations in production LLM systems.

Here's what's happening: every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you're handling 1,000 requests per day, you're paying to process 10 million tokens daily just for the static part of your prompt — context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.
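The arithmetic above is worth making concrete. This back-of-envelope model uses placeholder prices — not any provider's actual rates — and assumes cached reads are billed at roughly a tenth of the base input price, which is in the ballpark of current provider discounts:

```python
# Back-of-envelope cost model for prompt caching.
# Prices are illustrative placeholders, not real provider rates.
PRICE_PER_MTOK = 3.00    # assumed $ per million uncached input tokens
CACHED_DISCOUNT = 0.10   # assumed: cached reads billed at 10% of base

static_tokens = 10_000   # system prompt + fixed context, identical every call
requests_per_day = 1_000

# Without caching, the static prefix is reprocessed on every request.
uncached = static_tokens * requests_per_day / 1e6 * PRICE_PER_MTOK

# With caching, the stored key-value states are read at the discounted rate.
cached = uncached * CACHED_DISCOUNT

print(f"uncached: ${uncached:.2f}/day")  # $30.00/day
print(f"cached:   ${cached:.2f}/day")    # $3.00/day — a 90% reduction
```

The savings scale with both prefix length and request volume, which is why long-context RAG systems and agents with large tool definitions benefit the most.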

Prompt Versioning and Change Management in Production AI Systems

· 9 min read
Tian Pan
Software Engineer

A team added three words to a customer service prompt to make it "more conversational." Within hours, structured-output error rates spiked and a revenue-generating pipeline stalled. Engineers spent most of a day debugging infrastructure and code before anyone thought to look at the prompt. There was no version history. There was no rollback. The three-word change had been made inline, in a config file, by a product manager who had no reason to think it was risky.

This is the canonical production prompt incident. Variations of it play out at companies of every size, and the root cause is almost always the same: prompts were treated as ephemeral configuration instead of software.

Context Engineering: The Discipline That Matters More Than Prompting

· 9 min read
Tian Pan
Software Engineer

Most engineers building LLM systems spend the first few weeks obsessing over their prompts. They A/B test phrasing, argue about whether to use XML tags or JSON, and iterate on system prompt wording until the model outputs something that looks right. Then they hit production, add real data, memory, and tool calls — and the model starts misbehaving in ways that no amount of prompt tuning can fix. The problem was never the prompt.

The real bottleneck in production LLM systems is context — what information is present in the model's input, in what order, how much of it there is, and whether it's relevant to the decision the model is about to make. Context engineering is the discipline of designing and managing that input space as a first-class system concern. It subsumes prompt engineering the same way software architecture subsumes variable naming: the smaller skill still matters, but it doesn't drive outcomes at scale.

Your CLAUDE.md Is Probably Too Long (And That's Why It's Not Working)

· 10 min read
Tian Pan
Software Engineer

Here's a pattern that plays out constantly in teams adopting AI coding agents: a developer has Claude disobey a rule, so they add a clearer version to their CLAUDE.md. Claude disobeys a different rule, so they add that one too. After a few weeks, the file is 400 lines long and Claude is ignoring more rules than ever. The solution made the problem worse.

This happens because of a fundamental property of instruction files that most developers never internalize: past a certain size, adding more instructions causes the model to follow fewer of them. Getting instruction files right is less about completeness and more about ruthless selection — knowing what to include, what to cut, and how to architect the rest.