Skip to main content

158 posts tagged with "prompt-engineering"

View all tags

The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy drops at around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.

AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

· 11 min read
Tian Pan
Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence existed for a reason. It looked like implementation detail. It was actually institutional knowledge.

The Instruction Complexity Cliff: Why LLMs Follow 5 Rules Reliably but Not 15

· 10 min read
Tian Pan
Software Engineer

There's a pattern that shows up in almost every production AI system: the team starts with a focused system prompt, ships the feature, and then iterates. A new edge case surfaces, so they add a rule. Another ticket comes in, another rule. Six months later the system prompt has grown to 2,000 tokens and covers 20 distinct behavioral requirements. The AI still sounds coherent on most requests. But subtle compliance failures have been creeping in for weeks — formatting ignored here, a tone requirement skipped there, an escalation rule quietly bypassed. Nobody flagged it because no individual failure was dramatic enough to page anyone.

This isn't a model quality problem. It's a fundamental architectural characteristic of how transformer-based language models process instructions, and there's a substantial body of empirical research that makes the failure modes predictable. Understanding it changes how you should write system prompts.

The Knowledge Contamination Problem: When Your RAG System Ignores Its Own Retrieval

· 8 min read
Tian Pan
Software Engineer

A team ships a RAG pipeline for internal documentation. Retrieval looks solid — the right passages come back. But in production, users keep getting stale answers. They dig into the logs and find the model is returning facts from its training data, not from the documents it was handed. The retrieval worked. The model just didn't use it.

This is the knowledge contamination problem: the model's parametric memory — the knowledge baked into its weights during training — overrides the retrieved context. It's quiet, it's confident, and it's one of the most common failure modes in production RAG systems.

The Hidden Switching Costs of LLM Vendor Lock-In

· 11 min read
Tian Pan
Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.

The Prompt-Model Coupling Trap: Why Your Prompts Only Speak One Model's Dialect

· 10 min read
Tian Pan
Software Engineer

Most prompt migrations look fine in staging. Ninety percent of test cases pass, the new model's responses feel crisper, and the demo runs cleanly. Then you ship, and within two days your structured output parser is throwing exceptions on 12% of responses, a customer-facing classification pipeline started returning wrong labels, and a tool-calling agent is looping on a schema it used to handle without issue. Nobody changed the prompts. The model changed.

This is the prompt-model coupling trap: prompts that work reliably on one model silently accumulate dependencies on that model's specific behavioral quirks, and those dependencies are invisible until migration day.

Schema-First AI Development: Define Output Contracts Before You Write Prompts

· 9 min read
Tian Pan
Software Engineer

Most teams discover the schema problem the wrong way: a downstream service starts returning nonsense, a dashboard fills up with garbage, and a twenty-minute debugging session reveals that the LLM quietly started wrapping its JSON in a markdown code fence three weeks ago. Nobody noticed because the application wasn't crashing — it was silently consuming malformed data.

The fix was a one-line prompt change. The damage was weeks of bad analytics and one very uncomfortable postmortem.

Schema-first development is the discipline that prevents this. It means defining the exact structure your LLM output must conform to — before you write a single prompt token. This isn't about constraining creativity; it's about treating output format as a contract that downstream systems can rely on, the same way you'd version a REST API before writing the consumers.

Stop Writing Prompts by Hand: Automated Optimization with DSPy and MIPRO

· 9 min read
Tian Pan
Software Engineer

You are going to spend an afternoon tuning a prompt. You'll move a sentence around, swap "classify" for "categorize," add a note about edge cases, and run spot-checks against a handful of examples you keep in a notebook. By end of day the prompt is marginally better — you think. You can't prove it. You don't have a reproducible baseline. A week later a colleague changes a few words and the whole thing regresses.

This is the current state of prompt engineering at most teams. DSPy is Stanford's answer to it. Rather than hand-authoring instruction prose, you declare what your LLM program should do, define a metric, and let an optimizer compile the actual prompts for you. MIPRO — the Multi-prompt Instruction PRoposal Optimizer — is the algorithm that makes this approach competitive with (and often better than) the human-crafted alternative.

Context Compression Changes What Your Model Actually Sees

· 11 min read
Tian Pan
Software Engineer

When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?

The problem is that those benchmarks measure what the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens — it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

· 9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

· 9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and broken at twelve.

The intuition that more examples is strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.