4 posts tagged with "system-prompts"

The Dev Environment Your Agent Treated as Production Because the System Prompt Never Said Which

· 11 min read
Tian Pan
Software Engineer

A coding agent is doing a routine task in staging. It hits a permissions glitch — a config that points to the wrong API — and decides on its own that the fastest way to "fix" the bug is to clean up the offending data. It rummages around, finds an unscoped token in an unrelated file, calls a tool whose description says "delete records matching the query," and nine seconds later 1.9 million rows of customer data are gone. The most recent backup is three months old. The reservations made in the last quarter no longer exist.

The agent didn't malfunction. The wiring was correct in the sense the deploy engineer meant it: staging config in the staging deploy, production config in the production deploy. What the wiring didn't carry was the agent's sense of where it was. The system prompt was identical in both environments because nobody wanted to maintain two of them. The tool catalog was named the same in both environments because nobody wanted to teach the agent two vocabularies. So the agent reasoned about "the database" the way its training data taught it to reason about "the database" — and most prose on the internet about agents and databases is prose about production.
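One way to carry the agent's sense of where it is can be sketched as follows: keep a single shared base prompt, but prepend an environment header at deploy time and prefix tool names with the environment. Everything here (`Deployment`, `render_system_prompt`, `tool_name`, the header wording) is a hypothetical illustration, not something taken from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical description of where the agent is running."""
    name: str                 # e.g. "staging" or "production"
    allow_destructive: bool   # whether delete/overwrite tools may be used


def render_system_prompt(base_prompt: str, deployment: Deployment) -> str:
    """Prepend an explicit environment header to one shared base prompt."""
    if deployment.allow_destructive:
        policy = "Destructive operations are allowed on disposable test data only."
    else:
        policy = "Never delete or overwrite data; escalate to a human instead."
    header = f"ENVIRONMENT: {deployment.name}\n{policy}\n"
    return f"{header}\n{base_prompt}"


def tool_name(base: str, deployment: Deployment) -> str:
    """Prefix tool names so the catalog itself says which environment it touches."""
    return f"{deployment.name}_{base}"


staging = Deployment(name="staging", allow_destructive=True)
prompt = render_system_prompt("You are a coding agent.", staging)
delete_tool = tool_name("delete_records", staging)
```

The base prompt stays a single file to maintain; only the header and the tool prefix differ per deploy, so the agent is told which environment it is in without anyone keeping two prompts in sync.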

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent did something the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.
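The gap described here, a shipped prompt variant with no eval suite behind it, is cheap to detect mechanically. A minimal sketch, assuming prompts and eval cases are both keyed by locale code (the function name and data shapes are hypothetical):

```python
from typing import Dict, List


def locales_without_evals(
    prompts: Dict[str, str],
    eval_suites: Dict[str, List[str]],
) -> List[str]:
    """Return locales that ship a prompt but have no eval cases at all."""
    return sorted(loc for loc in prompts if not eval_suites.get(loc))


prompts = {"en": "...", "ja": "...", "de": "...", "hi": "...", "ar": "..."}
eval_suites = {"en": ["case-001", "case-002"]}  # only English was ever evaluated

missing = locales_without_evals(prompts, eval_suites)
```

A check like this could run in CI and fail the build whenever `missing` is non-empty, so that a translated prompt cannot ship on the strength of the English results alone.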

The Instruction-Following Cliff: Why Adding One More Rule to Your System Prompt Breaks Three Others

· 7 min read
Tian Pan
Software Engineer

Your system prompt started at twelve lines. It worked beautifully. Then product wanted tone guidelines. Legal needed a disclaimer rule. The safety team added three more constraints. Now you're at forty rules and the model ignores half of them — but not the same half each time.

This is the instruction-following cliff: the point where adding one more rule to your prompt doesn't just degrade compliance with that rule; it destabilizes rules that were working fine yesterday. And unlike most engineering failures, this one is maddeningly non-deterministic.
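Because the failure is non-deterministic, a single run tells you little; compliance has to be measured per rule across repeated runs and compared before and after a prompt change. A rough sketch, assuming each run has already been graded into a mapping from rule name to pass/fail (the grading step, the rule names, and the 5% tolerance are all assumptions):

```python
from typing import Dict, List, Tuple


def compliance_rates(runs: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of runs in which each rule was followed."""
    rules = sorted({rule for run in runs for rule in run})
    return {
        rule: sum(run.get(rule, False) for run in runs) / len(runs)
        for rule in rules
    }


def regressions(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, Tuple[float, float]]:
    """Rules present in both versions whose compliance dropped by more than tolerance."""
    return {
        rule: (before[rule], after[rule])
        for rule in before
        if rule in after and before[rule] - after[rule] > tolerance
    }


# Graded runs of the old prompt, then of the prompt with one rule ("no_pii") added.
old_runs = [{"tone": True, "disclaimer": True}] * 4
new_runs = [
    {"tone": True, "disclaimer": True, "no_pii": True},
    {"tone": True, "disclaimer": True, "no_pii": True},
    {"tone": False, "disclaimer": True, "no_pii": True},
    {"tone": False, "disclaimer": True, "no_pii": True},
]

dropped = regressions(compliance_rates(old_runs), compliance_rates(new_runs))
```

In this toy data the new rule itself is followed every time, yet `tone` falls from 100% to 50%, which is the pattern the cliff describes: the damage shows up in rules other than the one that was added.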

Prompt Sprawl: When System Prompts Grow Into Unmaintainable Legacy Code

· 9 min read
Tian Pan
Software Engineer

Your system prompt started at 200 tokens. A clear role definition, a few formatting rules, a constraint or two. Six months later it's 4,000 tokens of accumulated instructions, half contradicting each other, and nobody on the team can explain why the third paragraph about JSON formatting exists. Welcome to prompt sprawl — the production problem that silently degrades your LLM application while everyone assumes the prompt is "fine."

Prompt sprawl is what happens when you treat prompts like append-only configuration. Every bug gets a new instruction. Every edge case gets a new rule. Every stakeholder gets a new paragraph. The prompt grows, and nobody removes anything because nobody knows what's load-bearing.

This is legacy code — except worse. No compiler catches contradictions. No type system enforces structure. No test suite validates that the 47th instruction doesn't negate the 12th. And unlike a tangled codebase, you can't refactor safely because there's no dependency graph to guide you.
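One possible way to give a prompt something like a dependency graph is to stop storing it as free text and keep each instruction as a record with a stated reason and a linked eval. The sketch below is a hypothetical structure (`Instruction`, `unjustified`, `render`, and the example IDs are invented for illustration), not a tool described in the post:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instruction:
    text: str
    reason: str = ""    # why the rule exists (bug, stakeholder request, ...)
    eval_id: str = ""   # hypothetical ID of the eval case that exercises it


def unjustified(instructions: List[Instruction]) -> List[str]:
    """Instructions nobody can explain or test: candidates for review or removal."""
    return [i.text for i in instructions if not i.reason or not i.eval_id]


def render(instructions: List[Instruction]) -> str:
    """Build the prompt text that is actually sent to the model."""
    return "\n".join(f"- {i.text}" for i in instructions)


rules = [
    Instruction("Respond in JSON.", reason="Downstream parser", eval_id="fmt-01"),
    Instruction("Never use nested arrays in JSON."),  # origin unknown
]

to_review = unjustified(rules)
prompt_text = render(rules)
```

This does not detect contradictions, but it does make "what is load-bearing" answerable: a rule with a reason and a passing eval has a known purpose, and a rule with neither is the first thing to question.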