Skip to main content

145 posts tagged with "prompt-engineering"

View all tags

Dynamic System Prompt Assembly: Composable AI Behavior at Request Time

· 10 min read
Tian Pan
Software Engineer

Most teams start with a single, monolithic system prompt. It works fine in demos. Then the product grows: you add a power user tier, a compliance mode for enterprise customers, a new tool the model can call, and a feature-flag experiment your growth team wants to A/B test. You add all of that to the same prompt. Six months in, you have 4,000 words of instructions that nobody fully understands, behavior that changes unpredictably when you edit one section, and a debugging process that amounts to "change something and see what happens."

The answer most teams reach for is composable, dynamically assembled system prompts — building the prompt from modular components at request time rather than maintaining a static text file. It's a sound architectural instinct, but the implementation surface is larger than it looks. Composable prompts introduce a new class of failure modes that static prompts simply don't have.

Why Your AI Sounds Wrong Even When It's Technically Correct

· 9 min read
Tian Pan
Software Engineer

A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.

This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.

The Words You Choose in Your System Prompt Change What Your Agent Will Risk

· 8 min read
Tian Pan
Software Engineer

Here is something that shouldn't be surprising but is: when you tell an agent "avoid making mistakes" versus "prioritize accuracy," you are not giving it the same instruction. The observable behavior at ambiguous decision points diverges measurably — agents prompted with loss-avoidance framing hedge more, escalate more, and complete fewer tasks end-to-end. Agents prompted with gain-seeking framing complete more tasks but introduce more errors. The difference isn't philosophical; it shows up in eval logs.

This is the behavioral economics of agents, and most engineering teams haven't thought about it systematically. They write system prompts as documentation — a description of what the agent is — when system prompts are actually decision-shaping instruments that encode a risk posture whether the author intended that or not.

The Prompt Versioning Problem: Why Your Prompt Changes Are Untracked Production Risks

· 11 min read
Tian Pan
Software Engineer

Most teams treat a prompt change the way they treated a config file change in 2008: edit the string, redeploy, hope for the best. No version tag, no test suite, no rollback plan. The difference is that a config file change rarely alters the semantic behavior of your entire product — a prompt change almost always does.

If you've shipped a customer-facing LLM feature, you've probably already done this: edited a system prompt to "improve" the tone, deployed it alongside an unrelated bug fix, and had no idea three weeks later why user satisfaction dropped. The prompt was the culprit. You had no way to know.

The Zero-Shot Wall: Why In-Context Examples Stop Working at Production Scale

· 8 min read
Tian Pan
Software Engineer

Most teams discover the zero-shot wall the same way: a new edge case breaks the model, they add an example to the prompt, it helps. Three months later they've got 40 examples, 6,000 tokens of context, the performance metrics haven't moved in weeks, and the prompt engineer who knows where every example came from just left the company.

Few-shot prompting is seductive because it works quickly. You observe a failure, you add a demonstration, the failure goes away. The feedback loop is tight and the wins feel free. What you don't notice is that each subsequent example is buying less than the last — and at some point you're spending tokens, latency, and cognitive overhead for improvements that round to zero.

This is the zero-shot wall: not a hard limit where performance drops off a cliff, but a zone of sharply diminishing returns where in-context learning has hit the ceiling of what it can accomplish for your task, and the only lever left is fine-tuning.

The 200-Token System Prompt That Beats Your 4000-Token One

· 10 min read
Tian Pan
Software Engineer

A team I worked with spent six months tuning a system prompt to roughly 4,000 tokens. It was their crown jewel — a careful accretion of edge-case handling, formatting rules, persona instructions, fallback behaviors, and a dozen few-shot examples. Then a junior engineer joined, asked why the prompt was so long, and rewrote it in an afternoon. The new version was 200 tokens. On their existing eval suite it scored four points higher. It was also forty times cheaper to run, and noticeably faster.

This is not an anecdote about a magic short prompt. It is a pattern I see almost every time I read a production system prompt that has lived past its first quarter. Long prompts grow by accretion, not by design. Every failure mode that surfaced in QA contributed a paragraph. Every stakeholder who watched a demo contributed a tone instruction. Every example that "seemed to help" got pinned to the bottom. The result is a prompt that is longer than the user input it is meant to instruct, full of internal contradictions the model has to silently resolve at inference time, with attention spread thinly across competing demands.

Personalization Belongs in a Dotfile, Not a Vector Store

· 12 min read
Tian Pan
Software Engineer

The first time a product team needs per-user agent behavior, somebody usually says "we should fine-tune" or "let's wire up persistent memory." A week later they have a vector database, a feedback-loop pipeline, and a roadmap item to monitor learned-state drift. They have built an ML system to solve a problem that, in nine cases out of ten, is a config file.

Look at what users are actually asking for: terser responses, bullets instead of prose, my company's name in the disclaimer, default to my preferred model, don't escalate to a human under $100, here is the project I am working on this week, never use emoji. None of that needs a model that has learned anything. It needs settings. The dotfile pattern — a versioned, declarative, per-user configuration repo — solved this for shells, editors, and CLIs forty years ago, and it is the right shape for AI agents in 2026.

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

· 11 min read
Tian Pan
Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.

Why Your Prompt Library Should Be a Monorepo, Not a Cookbook

· 11 min read
Tian Pan
Software Engineer

A team I worked with recently had three different "summarize this contract" prompts. One lived in a Notion page that the legal-tech squad copy-pasted into their service. One lived in a prompts/ folder in the customer-success backend, slightly modified to handle their tone preferences. One lived inline in a Python file inside the data team's notebook, hardcoded between two f-string interpolations. When OpenAI deprecated the model they all ran on, the migration plan involved Slack archaeology — each owner had to be tracked down, each variant had to be re-evaluated, and two of the three subtly broke in production for a week before anyone noticed.

This is what a prompt cookbook looks like at scale. Cookbooks make sense for ten prompts and one team. They become unmanageable somewhere around a hundred prompts and four teams. By the time you're running an AI organization, your prompts/ folder of .md files behaves exactly like vendored copy-paste code from 2008: every consumer has its own snapshot, drift is invisible, and breaking changes ripple outward in unpredictable ways.

The Hidden Edges Between Your AI Features: When One Prompt Edit Regresses Three Other Teams

· 9 min read
Tian Pan
Software Engineer

A platform engineer changes the opening sentence of the company's "house style" preamble — a single line that anchors voice across customer-facing assistants. The change ships behind a flag. By Tuesday, the search team's relevance regression has spiked, the support bot's eval pass-rate has dropped four points, and the onboarding agent's retry rate has doubled. None of those teams touched their own code. None of them got a heads-up. The platform engineer has no idea any of this happened, because nobody was on the receiving end of an alert that said "your edit just broke three downstream features."

This is the failure mode that defines the second year of an AI org's life. The first year, every team builds its own thing in a corner. The second year, those corners start sharing artifacts — a prompt fragment here, a seeded eval set there, a tool schema reused as a contract — and the moment that sharing becomes implicit, the dependency graph between AI features becomes invisible. You now have a distributed system whose edges no one can name.

The discipline that fixes this is not a new platform. It's drawing the graph.

Your AI Explainer Doc Is a Runtime Dependency, Not Marketing Copy

· 12 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped an AI assistant with a tidy stack of supporting documents: an in-product tooltip warning that the AI may produce inaccurate results, a help-center article titled "How does the assistant work," an internal support runbook for handling escalations, and a public model card listing the underlying model, the tools the assistant could call, and the data domains it covered. The launch went well. Six months later the prompt had been edited fourteen times, the model had been swapped from one tier to another with subtly different refusal behavior, two new tools had been added, one tool had been deprecated but not removed from the prompt, and the language settings had been opened from English-only to nine locales.

Every single one of those documents was wrong. Not catastrophically wrong — the kind of wrong where a sentence is half-true, a capability is described in language the model no longer matches, a refusal pattern is documented that the new model never triggers, a tool name appears in the help article that the assistant won't actually call. The kind of wrong that produces a slow drip of confused support tickets, a few customer trust regressions when the AI does something the docs say it won't, and — because the company sells into a regulated vertical — a small but real compliance gap that nobody on the AI team had thought to track.