136 posts tagged with "prompt-engineering"

The System Prompt Is a Software Interface, Not a Config String

· 9 min read
Tian Pan
Software Engineer

Most teams treat their system prompts the way early web developers treated CSS: paste something that works, modify it carefully to not break anything, commit it to a config file, and hope nobody touches it. Then a new team member "cleans it up," a model upgrade subtly changes behavior, and three weeks later a user files a bug that nobody can reproduce because nobody knows what the prompt actually said last Tuesday.

This isn't a workflow problem. It's a category error. System prompts aren't configuration — they're software interfaces. And until engineering teams treat them as such, the LLM features they build will remain fragile, hard to debug, and impossible to scale.

Your Prompts Are Configuration: Treating AI Settings as Production Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most engineering teams can tell you exactly which environment variable controls their database connection pool. Almost none can tell you which system prompt version is serving 90% of their traffic right now — or what changed since the last model behavior complaint rolled in.

This is the AI configuration footprint problem. Teams building LLM-powered features accumulate an implicit configuration layer — model selection, sampling parameters, system prompts, tool schemas, retry budgets — that governs how their product behaves in production. Most of this layer lives in no system of record. It gets updated through direct code edits, spreadsheet hand-offs, or Slack messages. When something breaks, nobody can say what changed.

That's not a process problem. It's an architecture problem. And the fix requires treating AI configuration with the same rigor that mature teams bring to environment config, feature flags, and infrastructure-as-code.

Behavioral Cloning for System Prompts: Preserving Expert Judgment Before It Walks Out the Door

· 9 min read
Tian Pan
Software Engineer

Your best system prompt was written by someone who no longer works here.

That sentence lands differently depending on where you sit in the organization. If you're an engineer who inherited an undocumented 3,000-token prompt that governs a production AI feature, you've already lived this. You've stared at a clause like "Do not include supplementary data unless context warrants it" and had no idea what "context" means, what triggered this rule, or whether removing it would cause a 5% quality improvement or a catastrophic regression. If you're a team lead, you've watched institutional knowledge walk out the door every time a senior engineer or prompt specialist changes jobs — and that knowledge didn't go into the documentation because nobody knew there was anything to document.

This is the system prompt knowledge problem, and it's worse than most teams realize. The fix borrows an idea from robotics research and applies it to a deeply human engineering challenge: behavioral cloning — capturing what an expert does, and why, before they're no longer there to ask.

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.
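
As a rough illustration of what assembling a prompt from modular components at request time can look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `PromptModule` structure, the module names, the prompt text, and the keys in the request context are hypothetical and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PromptModule:
    """One self-contained piece of the system prompt (hypothetical structure)."""
    name: str
    text: str
    # Predicate deciding whether this module applies to a given request context.
    applies: Callable[[Dict[str, object]], bool]


def assemble_prompt(modules: List[PromptModule], ctx: Dict[str, object]) -> str:
    """Join the text of every module whose predicate matches, in list order."""
    selected = [m for m in modules if m.applies(ctx)]
    return "\n\n".join(m.text for m in selected)


# Hypothetical modules mirroring the cases named above: a base prompt,
# a power-user tier, and a compliance mode.
MODULES = [
    PromptModule("base", "You are a support assistant.", lambda ctx: True),
    PromptModule("power_user", "Offer advanced options when relevant.",
                 lambda ctx: ctx.get("tier") == "power"),
    PromptModule("compliance", "Do not give legal or financial advice.",
                 lambda ctx: bool(ctx.get("compliance_mode"))),
]

prompt = assemble_prompt(MODULES, {"tier": "power", "compliance_mode": False})
print(prompt)
```

Keeping the selection logic in small named predicates makes it possible to log which modules fired for a given request, which is one way to start narrowing down the new failure modes that composition introduces.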

Why Your AI Sounds Wrong Even When It's Technically Correct

· 9 min read
Tian Pan
Software Engineer

A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.

This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.

The Words You Choose in Your System Prompt Change What Your Agent Will Risk

· 8 min read
Tian Pan
Software Engineer

Here is something that shouldn't be surprising but is: when you tell an agent "avoid making mistakes" versus "prioritize accuracy," you are not giving it the same instruction. The observable behavior at ambiguous decision points diverges measurably — agents prompted with loss-avoidance framing hedge more, escalate more, and complete fewer tasks end-to-end. Agents prompted with gain-seeking framing complete more tasks but introduce more errors. The difference isn't philosophical; it shows up in eval logs.

This is the behavioral economics of agents, and most engineering teams haven't thought about it systematically. They write system prompts as documentation — a description of what the agent is — when system prompts are actually decision-shaping instruments that encode a risk posture whether the author intended that or not.

The Prompt Versioning Problem: Why Your Prompt Changes Are Untracked Production Risks

· 11 min read
Tian Pan
Software Engineer

Most teams treat a prompt change the way they treated a config file change in 2008: edit the string, redeploy, hope for the best. No version tag, no test suite, no rollback plan. The difference is that a config file change rarely alters the semantic behavior of your entire product — a prompt change almost always does.

If you've shipped a customer-facing LLM feature, you've probably already done this: edited a system prompt to "improve" the tone, deployed it alongside an unrelated bug fix, and had no idea three weeks later why user satisfaction dropped. The prompt was the culprit. You had no way to know.
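
One lightweight way to get a version tag, sketched here under assumptions, is to derive it from the prompt text itself and attach it to every logged request. The function name `prompt_version` and the 12-character truncation are choices made for this example, not an established convention.

```python
import hashlib


def prompt_version(prompt_text: str) -> str:
    """Return a short, deterministic tag derived from the exact prompt text."""
    digest = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()
    # Truncated for readability in logs; the length is an arbitrary choice.
    return digest[:12]


# Hypothetical prompt revisions: any edit, however small, yields a new tag.
old_tag = prompt_version("You are a helpful assistant. Be concise.")
new_tag = prompt_version("You are a helpful assistant. Be concise and friendly.")
print(old_tag != new_tag)
```

Logging the tag alongside each response means a drop in user satisfaction can later be lined up against the prompt revision that was live at the time.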

The Zero-Shot Wall: Why In-Context Examples Stop Working at Production Scale

· 8 min read
Tian Pan
Software Engineer

Most teams discover the zero-shot wall the same way: a new edge case breaks the model, they add an example to the prompt, it helps. Three months later they've got 40 examples, 6,000 tokens of context, the performance metrics haven't moved in weeks, and the prompt engineer who knows where every example came from just left the company.

Few-shot prompting is seductive because it works quickly. You observe a failure, you add a demonstration, the failure goes away. The feedback loop is tight and the wins feel free. What you don't notice is that each subsequent example is buying less than the last — and at some point you're spending tokens, latency, and cognitive overhead for improvements that round to zero.

This is the zero-shot wall: not a hard limit where performance drops off a cliff, but a zone of sharply diminishing returns where in-context learning has hit the ceiling of what it can accomplish for your task, and the only lever left is fine-tuning.

The 200-Token System Prompt That Beats Your 4000-Token One

· 10 min read
Tian Pan
Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel: a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. At one-twentieth the length, it was also far cheaper to run, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Personalization Belongs in a Dotfile, Not a Vector Store

· 12 min read
Tian Pan
Software Engineer

The first time a product team needs per-user agent behavior, somebody usually says "we should fine-tune" or "let's wire up persistent memory." A week later they have a vector database, a feedback-loop pipeline, and a roadmap item to monitor learned-state drift. They have built an ML system to solve a problem that, in nine cases out of ten, is a config file.

Look at what users are actually asking for: terser responses, bullets instead of prose, my company's name in the disclaimer, default to my preferred model, don't escalate to a human under $100, here is the project I am working on this week, never use emoji. None of that needs a model that has learned anything. It needs settings. The dotfile pattern — a versioned, declarative, per-user configuration repo — solved this for shells, editors, and CLIs forty years ago, and it is the right shape for AI agents in 2026.
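
To make the settings-not-learning point concrete, here is a minimal sketch of per-user settings rendered into prompt instructions. The setting names, the defaults, and the JSON layout are all hypothetical; a real dotfile could just as well be TOML or YAML kept in a versioned repo.

```python
import json

# Hypothetical defaults applied when a user has not declared a setting.
DEFAULTS = {"style": "prose", "emoji": True, "escalation_threshold_usd": 0}

# Hypothetical contents of one user's settings file, kept under version control.
USER_DOTFILE = '{"style": "bullets", "emoji": false, "escalation_threshold_usd": 100}'


def load_settings(raw: str) -> dict:
    """Overlay the user's declared settings on top of the defaults."""
    settings = dict(DEFAULTS)
    settings.update(json.loads(raw))
    return settings


def render_preferences(settings: dict) -> str:
    """Turn settings into plain instructions to append to the system prompt."""
    lines = [f"Format responses as {settings['style']}."]
    if not settings["emoji"]:
        lines.append("Never use emoji.")
    if settings["escalation_threshold_usd"] > 0:
        lines.append(
            "Do not escalate to a human for amounts under "
            f"${settings['escalation_threshold_usd']}."
        )
    return "\n".join(lines)


print(render_preferences(load_settings(USER_DOTFILE)))
```

Because the file is declarative, a change to a user's behavior is a diff that can be reviewed and reverted, with no learned state to inspect.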

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

· 11 min read
Tian Pan
Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.
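
A per-PR delta with a regression callout can be sketched in a few lines. This is an illustration under assumptions: the metric names, the scores, and the 0.02 threshold are made up, and posting the text to a code review tool is left out because it depends on the platform.

```python
from typing import Dict, List


def eval_delta_comment(baseline: Dict[str, float], candidate: Dict[str, float],
                       regression_threshold: float = 0.02) -> str:
    """Build a plain-text PR comment comparing candidate scores to the baseline."""
    lines: List[str] = ["Eval results for this PR (candidate vs. baseline):"]
    regressions: List[str] = []
    for name in sorted(baseline):
        if name not in candidate:
            # A metric that disappeared is treated as a regression.
            lines.append(f"- {name}: missing in candidate run")
            regressions.append(name)
            continue
        delta = candidate[name] - baseline[name]
        lines.append(f"- {name}: {candidate[name]:.2f} ({delta:+.2f})")
        if delta < -regression_threshold:
            regressions.append(name)
    if regressions:
        lines.append("REGRESSION: " + ", ".join(regressions))
    else:
        lines.append("No regressions beyond threshold.")
    return "\n".join(lines)


# Hypothetical scores from a baseline run and from the PR's candidate prompt.
comment = eval_delta_comment(
    {"tone": 0.90, "accuracy": 0.85},
    {"tone": 0.80, "accuracy": 0.87},
)
print(comment)
```

Putting the regression line last and in capitals is a deliberate choice for the sketch: it is the part of the comment the reviewer should not be able to miss.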