639 posts tagged with "llm"

The Reasoning-Model Tax at Tool Boundaries

April 23, 2026 · 10 min read

Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.

The Reflection Placebo: Why Plan-Reflect-Replan Loops Return Version One

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Open an agent's trace during a long-horizon planning task and count the number of times the model writes "let me reconsider," "on reflection," or "a better approach would be." Now compare the plan it finally commits to with the one it drafted first. In the majority of traces I've audited, the second plan is the first plan wearing a different hat — the same decomposition, the same tool calls, the same order of operations, with some renamed step labels and a reworded rationale. The reflection ran. The model emitted tokens that looked like reconsideration. The plan did not move.

This matters because "with reflection" has quietly become a quality tier. Teams ship planners with one, two, or three reflection rounds and bill themselves for the difference. The inference spend is real and measurable. Whether anything on the plan side actually changed is a question almost nobody instruments for, and the answer is frequently: no.

The Refusal Training Gap: Why Your Model Says No to the Wrong Questions

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

A user asks your assistant, "How do I kill a Python process that's hung?" and gets a polite refusal about violence. Another user asks, "Who won the 2003 Nobel Prize in Physics?" and gets a confidently invented name. Both responses came out of the same model, both passed your safety review, and both will be in your support inbox by Monday. The frustrating part is that these are not two separate failures with two separate fixes. They are the same failure: your model has been trained to recognize refusal templates, not to recognize what it actually shouldn't answer.

The industry has spent three years getting models to refuse policy-violating requests. It has spent almost no time teaching them to refuse questions they cannot reliably answer. The result is a refusal capability that is misaimed: heavily reinforced on surface patterns ("kill," "exploit," "bypass"), barely trained on epistemic state ("I don't know who that is"). When you only optimize one direction, you get a model that says no to the wrong questions and yes to the wrong questions, often within the same conversation.

The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.

Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% retains the median request a hundred times before it keeps a single rare catastrophic trajectory — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.

This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.

The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.

Semantic Cache Is a Safety Problem, Not a Perf Win

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is [email protected]" and "I can't log into my account, my email is [email protected]" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link her account had asked for.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.

Design Your Agent State Machine Before You Write a Single Prompt

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.

AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.

This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

The Alignment Tax: When Safety Features Make Your AI Product Worse

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.

About Tian Pan