Skip to main content

861 posts tagged with "insider"

View all tags

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.

SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications

· 11 min read
Tian Pan
Software Engineer

Most teams that decide to fine-tune a model spend weeks debating which method to use before they've written a single line of training code. The debate rarely surfaces the right question. The real question is not "SFT or DPO?" — it's "what kind of gap am I trying to close?"

Supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) are not competing answers to the same problem. Each targets a different failure mode. Reaching for RLHF when SFT would have sufficed wastes months. Reaching for SFT when the problem is actually a preference mismatch produces a model that's fluent but wrong in ways that are hard to detect until they surface in production.

This post is a decision framework. It maps each method to the specific problem it solves, explains what signals indicate which method will dominate, and provides a diagnostic methodology for identifying where your actual gap lives before you commit to a training run.

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 20 steps, it's down to 36%. At 50 steps, you're looking at a coin flip—and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly 0.00002% of the time. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task doesn't linearly increase the chance of failure—it multiplies it.

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

Write Amplification in Agentic Systems: Why One Tool Call Hits Six Databases

· 10 min read
Tian Pan
Software Engineer

When an agent decides to remember something — "the user prefers email over Slack" — it looks like a single write. In practice, it is six writes: a new embedding in the vector store, a row in the relational database, an entry in the session cache, a record in the event log, an entry in the audit trail, and an update to the context store. Each one happens because a different part of the system has a legitimate need for the data, and each one introduces a new failure surface.

This is write amplification at the infrastructure layer, and it's one of the quieter operational crises in production agent deployments. It does not cause dramatic failures. It causes partial failures: the user's preference is searchable semantically but the relational query returns stale data; the audit log shows an action that never fully completed; the cache is warm but the context store wasn't updated, so the next session starts without the learned pattern.

Understanding why this happens — and what to do about it — requires borrowing from database internals rather than the agent framework documentation.

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through API budget that a junior engineer's entire sprint's tests couldn't justify. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.

Agentic Audit Trails: What Compliance Looks Like When Decisions Are Autonomous

· 12 min read
Tian Pan
Software Engineer

When a human loan officer denies an application, there is a name attached to that decision. That officer received specific information, deliberated, and acted. The reasoning may be imperfect, but it is attributable. There is someone to call, question, and hold accountable.

When an AI agent denies that same application, there is a database row. The row says the decision was made. It does not say why, or what inputs drove it, or which version of the model was running, or whether the system prompt had been quietly updated two weeks prior. When your compliance team hands that row to a regulator, the regulator is not satisfied.

This is the agentic audit trail problem, and most engineering teams building on AI agents have not solved it yet.

The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.