720 posts tagged with "llm"

Function Calling vs Code Generation for Agent Actions: The Tradeoffs Nobody Benchmarks

May 5, 2026 · 10 min read

Software Engineer

An agent running in production once received the instruction "clean up the test data" and executed a DROP TABLE command against a production database. The tool call succeeded. The audit log showed a perfectly structured JSON payload. The agent had done exactly what it was asked — just not what anyone meant. This isn't a story about prompt injection. It's a story about an architectural choice: the team had given their agent the ability to generate and execute arbitrary code, and they had underestimated what that actually means at runtime.

The choice between function calling and code generation as the action layer for AI agents is one of the most consequential decisions in agent architecture, and almost nobody benchmarks it directly. Papers measure accuracy on task completion; they rarely measure the failure modes that matter in production — silent semantic errors, irreversible side effects, security exposure surface, and debugging cost when something goes wrong.

The Generalization Cliff: How Fine-Tuning Creates Silent Capability Regressions

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

A team at an enterprise software company fine-tuned a 7B model on customer support tickets. The target metric — resolution accuracy — improved by 12 percentage points. The team shipped it. Three weeks later, the product had a second failure mode nobody expected: the model had quietly lost the ability to handle multi-step questions. Users would ask something slightly outside the support domain and receive a confident but incoherent answer. The model had traded breadth it didn't know it needed for depth it could measure.

This is the generalization cliff: the silent capability degradation that follows narrow fine-tuning. Unlike a crash or a timeout, it produces no error. The model still responds. It just responds worse on tasks adjacent to its training distribution — and those tasks never appeared in the eval suite.

The Helpful-But-Wrong Problem: Operational Hallucination in Production AI Agents

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI agent just completed a complex database migration task. It called the right tool, used proper terminology, referenced the correct library, and returned output that looks completely reasonable. Then your DBA runs it against a 50M-row production table — and the backup flag was wrong. The flag exists in a neighboring library version, it's syntactically valid, but it silently no-ops the backup step.

The agent wasn't hallucinating wildly. It was confident, fluent, and directionally correct. It was also operationally wrong in exactly the way that causes data loss.

This is the hallucination category the field underinvests in, the one that your evals are almost certainly not catching.

The Hyperparameter Illusion: Why Temperature and Top-P Are the Last Things to Tune

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

When LLM outputs feel wrong, engineers reach for the temperature dial. It's one of the first moves in the debugging playbook — crank it down for more consistency, nudge it up for more creativity. It feels productive because it's easy to change and produces immediately visible effects. It is almost never the right move.

Temperature and top-p are the last 10% of output quality, not the first 90%. The variables that actually determine whether your model succeeds are context quality, instruction clarity, and model selection — in that order. Misconfiguring sampling parameters on top of a broken prompt is like adjusting the seasoning on a dish that hasn't been cooked through. The fundamental problem doesn't move.

The Inherited AI System Audit: How to Take Ownership of an LLM Feature You Didn't Build

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Someone left. The onboarding doc says "ask Sarah" but Sarah is at a different company now. You're staring at a 900-line system prompt with sections titled things like ## DO NOT REMOVE THIS SECTION, and you have no idea what happens if you do.

This is the inherited AI system problem, and it's different from inheriting regular code. With legacy code, a determined engineer can trace execution paths, read tests, and reconstruct intent from behavior. With an inherited LLM feature, the prompt is the logic — but it's written in natural language, its failure modes are probabilistic, and the author's intent is trapped inside their head. There are no stack traces that tell you which guardrail fired and why.

Lazy Evaluation in AI Pipelines: Stop Calling the LLM Until You Have To

May 5, 2026 · 11 min read

Tian Pan

Software Engineer

Most AI pipelines are written as if every request deserves a full LLM call. The user submits a message, the pipeline passes it to the model, waits for a response, and returns it — every time, unconditionally. This works, but it's expensive, slow, and often unnecessary.

The fraction of requests that actually require a full LLM inference is smaller than most engineers assume. Research on token-level routing shows that only about 11% of tokens differ between a 1.5B and a 32B parameter model, and only 4.9% of tokens are genuinely "divergent" — meaning they alter the reasoning path if handled by the smaller model. Production semantic caches show that 65% of incoming traffic is semantically similar to something the pipeline has already answered. These aren't edge cases. They're the majority of your traffic, and you're paying full price to handle them.

The fix is lazy evaluation: don't invoke the expensive model until you've confirmed that the expensive model is actually needed.

LLM Code Review in Production: Building a Diff Pipeline That Engineers Actually Trust

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams that deploy an LLM code reviewer discover the same failure mode within two weeks: the model produces 10–20 comments per pull request, 80% of which are noise. After the third PR where a developer dismisses every comment without reading them, the tool is effectively dead — notifications routed to a channel no one watches, the bot still spending compute on every push.

The problem isn't the model. It's that the teams shipped a comment generator and called it a reviewer.

The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.

The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it eliminates the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.

Multi-Model Consensus: When One LLM Isn't Enough to Sign Off

May 5, 2026 · 11 min read

Tian Pan

Software Engineer

Your AI feature ships with 85% accuracy. Leadership is thrilled. Then a compliance audit finds that the 15% wrong answers cluster around a specific regulatory interpretation — one that every model in your provider's family gets wrong in the same way. You called one model. It failed. And because you never compared it to anything else, you had no signal that the failure was systematic.

Multi-model consensus architecture is the structural answer to this problem. Instead of trusting a single LLM, you fan out to multiple models from different provider families, aggregate their responses, and route based on agreement. The disagreement pattern itself becomes a first-class signal in your system, not just a debugging artifact.

This approach costs 2–4× more per inference. For most use cases, that's obviously not worth it. But for a specific class of outputs — legal summaries, medical triage routing, financial risk flags, security assessments — the cost of a wrong answer so far exceeds the cost of extra inference that the math inverts almost immediately.

The Overfitting Org: When Your AI Team's Model Expertise Becomes a Liability

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Your best AI engineer can recite Claude's XML formatting preferences from memory. They know that Claude Opus refuses to generalize implicit instructions, that few-shot examples actually hurt performance on o1-series models, and that Azure OpenAI imposes an extra 8–12 seconds of latency versus the direct API in some regions. This expertise took months to accumulate. It also represents one of the most underappreciated risks in AI engineering today.

When a provider deprecates a model or silently shifts behavior, that knowledge doesn't transfer. It vanishes. And teams that built their systems — and their institutional competence — around a single model family often discover this the hard way.

The Population Prompt Problem: Why Your System Prompt Works for 80% of Users and Silently Fails the Other 20%

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

When you write a system prompt, you have a user in mind. Maybe it's the competent professional asking a focused question in clear English. Maybe it's someone who sends a short, well-scoped request that fits neatly inside your prompt's assumptions. You test against examples that feel representative, tune until the outputs look good, and ship.

Then you see production traffic.

The real population of queries your system prompt must handle is not the median case you designed for. It's a distribution — some narrow, many diffuse — with a long tail of edge cases that expose every assumption baked into your instructions. For most production systems, somewhere between 15% and 30% of real queries fall into categories the prompt handles poorly. The unsettling part: most of these failures are silent. Your system returns a 200, the user gets an answer that looks plausible, and the failure never surfaces in your logs.

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.

About Tian Pan