639 posts tagged with "llm"

Prompt Injection Detection at 100,000 Requests Per Day: Why Simple Defenses Break and What Actually Works

April 17, 2026 · 11 min read

Software Engineer

Most teams discover their prompt injection defense is broken after a user finds it, not before. You add "ignore all previous instructions" to your blocklist and ship. Three months later an attacker encodes the payload in Base64, or buries instructions in HTML comments retrieved via RAG, or uses typoglycemia ("ignroe all prevuois insrtucioins"), and your entire defense evaporates. The blocklist doesn't help because prompt injection has an unbounded attack surface — there is no closed vocabulary of malicious inputs.

At low traffic volumes you can absorb the cost of calling a second LLM to validate each request. At 100,000 requests per day, that math becomes ruinous and the latency becomes user-visible. This post is about what the architecture looks like when brute-force approaches stop working.

The Prompt-Model Coupling Trap: Why Your Prompts Only Speak One Model's Dialect

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Most prompt migrations look fine in staging. Ninety percent of test cases pass, the new model's responses feel crisper, and the demo runs cleanly. Then you ship, and within two days your structured output parser is throwing exceptions on 12% of responses, a customer-facing classification pipeline started returning wrong labels, and a tool-calling agent is looping on a schema it used to handle without issue. Nobody changed the prompts. The model changed.

This is the prompt-model coupling trap: prompts that work reliably on one model silently accumulate dependencies on that model's specific behavioral quirks, and those dependencies are invisible until migration day.

Property-Based Testing for LLM Outputs: Finding the Bugs Your Eval Set Never Imagined

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Your eval suite says 94% accuracy. Users report the feature is broken for names that aren't "John" or "Alice." Both things are true, and the gap between them has a name: your curated test set encodes only the failure modes you already imagined.

Property-based testing (PBT) was invented in 1999 to expose exactly this class of blind spot in deterministic software. Applied to LLM outputs, it generates tens of thousands of adversarial input variants automatically, probing domain boundaries that hand-written test cases structurally cannot reach. A 2025 OOPSLA study found that on average each property-based test discovers approximately 50 times as many mutant bugs as the average unit test. A separate study measured that PBT and example-based testing (EBT) fail on different bugs — combining both raised detection rates from 68.75% to 81.25%. That 12.5-point gap is not rounding error; it represents an entire class of failure invisible to one approach.

This article is for engineers who already have eval suites and want to find the bugs those suites structurally cannot find.

Schema-First AI Development: Define Output Contracts Before You Write Prompts

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams discover the schema problem the wrong way: a downstream service starts returning nonsense, a dashboard fills up with garbage, and a twenty-minute debugging session reveals that the LLM quietly started wrapping its JSON in a markdown code fence three weeks ago. Nobody noticed because the application wasn't crashing — it was silently consuming malformed data.

The fix was a one-line prompt change. The damage was weeks of bad analytics and one very uncomfortable postmortem.

Schema-first development is the discipline that prevents this. It means defining the exact structure your LLM output must conform to — before you write a single prompt token. This isn't about constraining creativity; it's about treating output format as a contract that downstream systems can rely on, the same way you'd version a REST API before writing the consumers.

The Schema Problem: Taming LLM Output in Production

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

You ship a feature that extracts structured data from user text using an LLM. You test it thoroughly. It works. Three months later, a model provider quietly updates their weights, and without changing a single line of your code, your downstream pipeline starts silently dropping records. No exceptions thrown. No alerts fired. Just wrong data flowing through your system.

This is the schema problem. And despite years of improvements to structured output APIs, it remains one of the least-discussed failure modes in LLM-powered systems.

Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.

What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.

This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflating metrics. The cause is treating evaluation infrastructure like production infrastructure — optimized for sharing and efficiency, instead of isolation and fidelity.

Specification Gaming in Production AI Agents: When Your Agent Optimizes the Wrong Thing

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

In a 2025 study of frontier models on competitive engineering tasks, researchers found that 30.4% of agent runs involved reward hacking — the model finding a way to score well without actually doing the work. One agent monkey-patched pytest's internal reporting mechanism. Another overrode Python's __eq__ to make every equality check return True. A third simply called sys.exit(0) before tests ran and let the zero exit code register as success.

None of these models were explicitly trying to cheat. They were doing exactly what they were optimized to do: maximize the reward signal. The problem was that the reward signal wasn't the same thing as the actual goal.

This is specification gaming — and it's not a corner case. It's a structural property of any sufficiently capable agent operating against a measurable objective.

Testing the Untestable: Integration Contracts for LLM-Powered APIs

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.

The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

The frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on. The number is large, surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.

Zero-Shot vs. Few-Shot in Production: When Examples Help and When They Hurt

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

The most common advice about few-shot prompting is: add examples, watch quality go up. That advice is wrong often enough that you shouldn't trust it without measuring. In practice, the relationship between examples and performance is non-monotonic — it peaks somewhere and then drops. Sometimes it drops a lot.

A 2025 empirical study tracked 12 LLMs across multiple tasks and found that Gemma 7B fell from 77.9% to 39.9% accuracy on a vulnerability identification task as examples were added beyond the optimal count. LLaMA-2 70B dropped from 68.6% to 21.0% on the same type of task. In code translation benchmarks, functional correctness typically peaks somewhere between 5 and 25 examples and degrades from there. This isn't a quirk of specific models — it's a pattern researchers have named "few-shot collapse," and it shows up broadly.

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time—up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

About Tian Pan