Skip to main content

702 posts tagged with "llm"

View all tags

LLM Code Review in Production: Building a Diff Pipeline That Engineers Actually Trust

· 9 min read
Tian Pan
Software Engineer

Most teams that deploy an LLM code reviewer discover the same failure mode within two weeks: the model produces 10–20 comments per pull request, 80% of which are noise. After the third PR where a developer dismisses every comment without reading them, the tool is effectively dead — notifications routed to a channel no one watches, the bot still spending compute on every push.

The problem isn't the model. It's that the teams shipped a comment generator and called it a reviewer.

The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.

The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it eliminates the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.

Multi-Model Consensus: When One LLM Isn't Enough to Sign Off

· 11 min read
Tian Pan
Software Engineer

Your AI feature ships with 85% accuracy. Leadership is thrilled. Then a compliance audit finds that the 15% wrong answers cluster around a specific regulatory interpretation — one that every model in your provider's family gets wrong in the same way. You called one model. It failed. And because you never compared it to anything else, you had no signal that the failure was systematic.

Multi-model consensus architecture is the structural answer to this problem. Instead of trusting a single LLM, you fan out to multiple models from different provider families, aggregate their responses, and route based on agreement. The disagreement pattern itself becomes a first-class signal in your system, not just a debugging artifact.

This approach costs 2–4× more per inference. For most use cases, that's obviously not worth it. But for a specific class of outputs — legal summaries, medical triage routing, financial risk flags, security assessments — the cost of a wrong answer so far exceeds the cost of extra inference that the math inverts almost immediately.

The Overfitting Org: When Your AI Team's Model Expertise Becomes a Liability

· 9 min read
Tian Pan
Software Engineer

Your best AI engineer can recite Claude's XML formatting preferences from memory. They know that Claude Opus refuses to generalize implicit instructions, that few-shot examples actually hurt performance on o1-series models, and that Azure OpenAI imposes an extra 8–12 seconds of latency versus the direct API in some regions. This expertise took months to accumulate. It also represents one of the most underappreciated risks in AI engineering today.

When a provider deprecates a model or silently shifts behavior, that knowledge doesn't transfer. It vanishes. And teams that built their systems — and their institutional competence — around a single model family often discover this the hard way.

The Population Prompt Problem: Why Your System Prompt Works for 80% of Users and Silently Fails the Other 20%

· 10 min read
Tian Pan
Software Engineer

When you write a system prompt, you have a user in mind. Maybe it's the competent professional asking a focused question in clear English. Maybe it's someone who sends a short, well-scoped request that fits neatly inside your prompt's assumptions. You test against examples that feel representative, tune until the outputs look good, and ship.

Then you see production traffic.

The real population of queries your system prompt must handle is not the median case you designed for. It's a distribution — some narrow, many diffuse — with a long tail of edge cases that expose every assumption baked into your instructions. For most production systems, somewhere between 15% and 30% of real queries fall into categories the prompt handles poorly. The unsettling part: most of these failures are silent. Your system returns a 200, the user gets an answer that looks plausible, and the failure never surfaces in your logs.

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

· 10 min read
Tian Pan
Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.

Prompt Credit Assignment: Finding the Dead Weight in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

Most teams discover their system prompt has a weight problem the same way — a cost review, a latency spike, or an engineer who finally reads the thing end to end. What they find is typically a 2,000-token document that grew organically over six months, with three versions of "be concise" scattered across different sections, instructions that reference a product workflow that was deprecated in February, and a dozen rules that the model visibly ignores on every run. The prompt is large. Most of it isn't doing anything.

This is the prompt credit assignment problem: figuring out which instructions in a multi-thousand-token system prompt actually drive model behavior, and which are just dead weight that burns tokens and dilutes attention. The bad news is that most teams skip this entirely — they add instructions when behavior breaks and never subtract. The good news is there is a repeatable engineering discipline for it.

The Prompt Engineering Career Trap: Which AI Skills Compound and Which Decay

· 9 min read
Tian Pan
Software Engineer

In 2023, "prompt engineer" was one of the most searched job titles in tech. LinkedIn was full of engineers rebranding their profile summaries. Job postings promised six-figure salaries for people who knew how to coax GPT-4 into behaving. What the job descriptions didn't say was that many of the skills they listed were already on borrowed time — and that the engineers who noticed the difference between durable and decaying skills would end up in very different places by 2026.

The prompt engineering career trap is not that the field went away. It's that it changed so fast that skills built over 12 months became liabilities by the 18-month mark. Engineers who invested heavily in the wrong layer and ignored the right one found themselves holding expertise in things the next model revision made irrelevant.

Prompt Mutation Testing: Finding Which System Prompt Instructions Actually Matter

· 10 min read
Tian Pan
Software Engineer

There is a certain kind of engineering debt that never shows up in your metrics. You accumulate it every time someone adds a sentence to the system prompt to fix a one-off complaint — a phrase like "never discuss competitor products" or "always respond in a formal tone" — and then nobody ever verifies whether the model actually enforces it. Over months, the prompt grows to 800 tokens. It sounds authoritative. It contains multitudes. And maybe a third of it does nothing.

Prompt mutation testing is the practice of finding out which third. The technique borrows its name from classical mutation testing in software engineering: systematically introduce small, deliberate faults into your code to determine whether your test suite would actually catch them. Here, you introduce deliberate perturbations into your system prompt — remove a clause, contradict a rule, substitute a critical keyword with a near-synonym — and measure how much the model's output actually changes. Instructions that survive perturbation without affecting behavior are decorative. Instructions that break things when touched are load-bearing.

When RAG Makes Your AI Worse: The Creativity-Grounding Tradeoff

· 8 min read
Tian Pan
Software Engineer

A team at a product company built a brainstorming assistant for their marketing department. They added RAG over their document corpus — campaign briefs, brand guidelines, competitor analyses — figuring the richer context would produce better ideas. Usage dropped within three weeks. The qualitative feedback: outputs felt "too safe," "too predictable," "like it just remixed our existing stuff." They removed retrieval from the brainstorming feature. Ideas improved. Engagement recovered.

This pattern repeats more often than practitioners admit. Retrieval-augmented generation has become the default architecture for grounding LLM outputs in facts, and for factual tasks it earns that default. But for generative tasks — ideation, creative writing, novel solution generation — adding a retrieval layer can silently cap the ceiling of what your model produces. Not because retrieval is broken, but because it's working exactly as designed.

The Stakeholder Explanation Layer: Building AI Transparency That Regulators and Executives Actually Accept

· 12 min read
Tian Pan
Software Engineer

When legal asks "why did the AI deny this loan application?", your chain-of-thought trace is the wrong answer. It doesn't matter that you have 1,200 tokens of step-by-step reasoning. What they need is a sentence that holds up in a deposition — and right now, most engineering teams have no idea how to produce it.

This is the stakeholder explanation gap: the distance between what engineers understand about model behavior and what regulators, executives, and legal teams need to do their jobs. Closing it requires a distinct architectural layer — one that most production AI systems never build.

The System Prompt Is a Software Interface, Not a Config String

· 9 min read
Tian Pan
Software Engineer

Most teams treat their system prompts the way early web developers treated CSS: paste something that works, modify it carefully to not break anything, commit it to a config file, and hope nobody touches it. Then a new team member "cleans it up," a model upgrade subtly changes behavior, and three weeks later a user files a bug that nobody can reproduce because nobody knows what the prompt actually said last Tuesday.

This isn't a workflow problem. It's a category error. System prompts aren't configuration — they're software interfaces. And until engineering teams treat them as such, the LLM features they build will remain fragile, hard to debug, and impossible to scale.