
103 posts tagged with "agents"


Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

· 11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.
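
For a concrete picture, here is a minimal head-sampling setup using the OpenTelemetry Python SDK; the 10% ratio and the span names are illustrative assumptions, not a recommended configuration. The point is that the keep-or-drop decision happens once, at the root, before any of the agent's tool-call spans exist.

```python
# A minimal sketch: head-based sampling decides at the root span, so an agent
# turn that fans out into dozens of tool-call spans is kept or dropped
# wholesale. The 10% ratio and span names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

with tracer.start_as_current_span("user_turn"):      # the sampling decision is made here
    for i in range(47):                              # ...long before these spans exist
        with tracer.start_as_current_span(f"tool_call_{i}"):
            pass

# Ninety percent of turns vanish entirely. Orphaned children appear when a
# collector or another service samples independently of this parent decision.
```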

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

· 9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.
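
What that discipline can look like for a prompt: a hedged sketch of a consumer-driven contract test, assuming a hypothetical run_intent_classifier() wrapper around the provider's production prompt. The categories and fixtures are illustrative; the shape is what matters: the consumer pins its assumptions, and the provider's CI runs them on every prompt change.

```python
# A minimal sketch of a consumer-driven contract for a shared prompt.
# run_intent_classifier() is a hypothetical wrapper around the provider's
# production prompt; the categories and fixtures are illustrative.
import json

def run_intent_classifier(message: str) -> str:
    """Hypothetical call into the platform team's intent-classifier prompt."""
    raise NotImplementedError("wire this to the real prompt in CI")

# Published by the routing-agent team (the consumer), versioned in their repo,
# and executed in the provider's pipeline before any prompt edit can merge.
ROUTING_TEAM_CONTRACT = {
    "allowed_categories": {"refund", "shipping", "billing", "other"},
    "pinned_examples": [
        {"input": "I was charged twice for order 4417", "expects": "billing"},
        {"input": "My package still hasn't arrived", "expects": "shipping"},
    ],
}

def test_intent_prompt_honors_routing_contract():
    for case in ROUTING_TEAM_CONTRACT["pinned_examples"]:
        output = json.loads(run_intent_classifier(case["input"]))
        # The provider may improve its own metrics, but it may not emit a
        # category the consumer's schema rejects...
        assert output["category"] in ROUTING_TEAM_CONTRACT["allowed_categories"]
        # ...and may not re-route the consumer's pinned examples.
        assert output["category"] == case["expects"]
```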

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

· 11 min read
Tian Pan
Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.
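
One way to picture the missing gate: a small pre-execution check that validates the agent's proposed write against policy and against the version of the record it read, before anything irreversible happens. The field names, action whitelist, and refund cap below are illustrative assumptions.

```python
# A minimal sketch of a gate between agent intent and system execution.
# Field names, the action whitelist, and the refund cap are illustrative.
from dataclasses import dataclass

@dataclass
class ProposedWrite:
    action: str          # e.g. "refund_order"
    order_id: str
    amount_cents: int
    read_version: int    # version of the record the agent based its decision on

ALLOWED_ACTIONS = {"refund_order", "update_shipping_address"}
MAX_AUTO_REFUND_CENTS = 50_000

def gate(write: ProposedWrite, current_version: int) -> None:
    """Raise instead of executing when the proposed write is unsafe."""
    if write.action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {write.action!r} is not permitted for this agent")
    if not (0 < write.amount_cents <= MAX_AUTO_REFUND_CENTS):
        raise ValueError("amount outside policy bounds; escalate to human review")
    if write.read_version != current_version:
        # The agent decided based on stale state; force a re-read before writing.
        raise RuntimeError("stale read: record changed since the agent observed it")
    # Only after all checks pass does the caller hand the write to the execution layer.
```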

The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, at which point tool selection has effectively stopped working.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.
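
In rough outline, the layer sits between the user message and tool dispatch and hands the LLM a narrowed registry. The intent taxonomy, tool names, and classify() stub below are assumptions for illustration.

```python
# A minimal sketch of an intent classification layer in front of tool dispatch.
# The intent taxonomy, tool names, and classify() stub are illustrative.
TOOLS_BY_INTENT = {
    "billing":  ["get_invoice", "issue_refund", "update_payment_method"],
    "shipping": ["track_package", "update_address", "file_damage_claim"],
    "account":  ["reset_password", "close_account", "export_data"],
}
ALL_TOOLS = [tool for tools in TOOLS_BY_INTENT.values() for tool in tools]

def classify(message: str) -> str | None:
    """Placeholder for a cheap classifier: a small model, embedding kNN, or rules.
    Returns None to abstain on low-confidence inputs."""
    raise NotImplementedError

def tools_for_turn(message: str) -> list[str]:
    intent = classify(message)
    # Abstentions fall back to the full registry, so low-confidence turns
    # degrade to today's behavior instead of misrouting.
    return TOOLS_BY_INTENT.get(intent, ALL_TOOLS)

# The LLM then receives tools_for_turn(message) instead of the full registry,
# and each dispatch can be logged against the predicted intent to feed the
# feedback loop described in the post.
```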

Multi-User Shared Agent State: The Concurrency Primitives You Actually Need

· 11 min read
Tian Pan
Software Engineer

Every agent tutorial starts with a single user, a single session, and a single context window. The agent reads state, reasons, acts, writes back. Clean. Deterministic. Completely wrong for anything teams actually use.

Real collaborative products—shared planning boards, multi-user support queues, document co-pilots, team project assistants—require multiple users to interact with the same agent simultaneously. When two people give the agent contradictory instructions within the same second, one of their changes disappears. The agent doesn't tell them. It doesn't even know it happened.

This is the multi-user shared agent state problem, and it's a distributed systems problem dressed in an AI costume.
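
The primitive that stops updates from silently disappearing is the same one databases have used for decades: version the state and reject writes based on a stale read. A minimal in-memory sketch follows; the dict-backed store is a stand-in for whatever actually backs the agent.

```python
# A minimal sketch of optimistic concurrency for shared agent state.
# The in-memory dict stands in for the real store backing the agent.
import threading

class VersionConflict(Exception):
    """Another user's change landed after this reader's snapshot."""

class SharedAgentState:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state: dict = {}
        self._version = 0

    def read(self) -> tuple[dict, int]:
        with self._lock:
            return dict(self._state), self._version

    def write(self, new_state: dict, expected_version: int) -> int:
        with self._lock:
            if expected_version != self._version:
                # Two users instructed the agent within the same second:
                # surface the conflict so the agent can re-read and reconcile,
                # instead of silently dropping one of the updates.
                raise VersionConflict(
                    f"expected v{expected_version}, store is at v{self._version}"
                )
            self._state = dict(new_state)
            self._version += 1
            return self._version
```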

Proactive Agents: Event-Driven and Scheduled Automation for Background AI

· 11 min read
Tian Pan
Software Engineer

Almost every tutorial on building AI agents starts the same way: user types a message, agent reasons, agent responds. That model works fine for chatbots and copilots. It fails to describe the majority of production AI work that organizations are now deploying.

The agents that quietly matter most in enterprise environments don't wait for a message. They wake up when a database row changes, when a queue crosses a depth threshold, when a scheduled cron fires at 3 AM, or when monitoring detects that a metric drifted outside bounds. They act without a user present. When they fail, nobody notices until the damage has compounded.

Building these proactive agents requires a substantially different design vocabulary than building reactive assistants. The session-scoped mental model that works for conversational AI breaks down when your agent runs in a loop, retries in the background, and has no human to catch its mistakes.
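
The skeleton is closer to a worker than a chatbot. A hedged sketch, with the metric query, threshold, interval, and agent entry point all standing in as placeholders:

```python
# A minimal sketch of an event-driven background agent loop: no user message,
# no session. The threshold, interval, and handler names are assumptions.
import logging
import time

QUEUE_DEPTH_THRESHOLD = 500
POLL_INTERVAL_SECONDS = 60

def get_support_queue_depth() -> int:
    """Placeholder for a real metrics or queue-system query."""
    raise NotImplementedError

def run_triage_agent(depth: int) -> None:
    """Placeholder for the agent invocation that reprioritizes the backlog."""
    raise NotImplementedError

def main() -> None:
    while True:
        try:
            depth = get_support_queue_depth()
            if depth > QUEUE_DEPTH_THRESHOLD:
                # The trigger is a metric crossing a bound, not a human typing.
                run_triage_agent(depth)
        except Exception:
            # Nobody is watching a chat window; failures must be logged and
            # alerted on, or they compound silently.
            logging.exception("triage agent run failed")
        time.sleep(POLL_INTERVAL_SECONDS)
```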

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 14 steps, it's a coin flip. At 20 steps, you're down to 36%, and at 50 steps you're under 8%. That's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly twice in ten billion attempts. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task no longer adds a small, fixed increment of risk. Each step multiplies the odds of success by the per-step reliability, and that product collapses exponentially with task length.
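
The arithmetic is short enough to check yourself; the per-step reliabilities below mirror the estimates above.

```python
# Reproducing the compounding math behind the cliff: success probability at
# n steps is simply the per-step reliability raised to the nth power.
for per_step in (0.95, 0.80):
    for steps in (10, 14, 20, 50, 100):
        print(f"reliability={per_step:.2f}  steps={steps:>3}  "
              f"task success={per_step ** steps:.10f}")

# At 0.95: 10 steps -> 0.599, 14 -> 0.488, 20 -> 0.358, 50 -> 0.077
# At 0.80: 100 steps -> 2.0e-10, roughly two successes in ten billion attempts
```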

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference to a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.
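
To make the framing concrete, here is an illustrative tool definition in the generic JSON-schema style most function-calling APIs accept. The tool and its wording are invented; the only change between the two descriptions is who they were written for.

```python
# A minimal sketch: the same tool, described for humans vs. described for the
# model that must decide when to call it. Tool name and wording are invented.
search_orders_tool = {
    "name": "search_orders",
    # What usually ships: API documentation aimed at the engineer reading the code.
    #   "description": "Searches orders.",
    # What the model actually needs: when to use it, when not to, and what the
    # arguments must look like.
    "description": (
        "Look up a customer's existing orders by email address or order ID. "
        "Use this only when the user refers to a purchase they already made. "
        "Do not use it to create, modify, or cancel orders."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Email address or order ID, copied verbatim from the user.",
            }
        },
        "required": ["query"],
    },
}
```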

Context Poisoning in Long-Running AI Agents

· 9 min read
Tian Pan
Software Engineer

Your agent completes step three of a twelve-step workflow and confidently reports that the target API returned a 200 status. It didn't — that result was from step one, still sitting in the context window. By step nine, the agent has made four downstream calls based on a fact that was never true. The workflow "succeeds." No error is logged.

This is context poisoning: not a security attack, but a reliability failure mode where the agent's own accumulated context becomes a source of wrong information. As agents run longer, interact with more tools, and manage more state, the probability of this failure climbs sharply. And unlike crashes or exceptions, context poisoning is invisible to standard monitoring.
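
One way to picture a mitigation: scope tool results to the step that produced them, so stale facts age out of the working context instead of being re-read as current. A minimal sketch, with the record shape and the two-step window as assumptions:

```python
# A minimal sketch of step-scoped tool results: old observations age out of
# the working context instead of being mistaken for current state.
# The record shape and the two-step window are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ToolResult:
    step: int      # workflow step that produced this observation
    tool: str
    content: str

def working_context(history: list[ToolResult], current_step: int,
                    window: int = 2) -> list[ToolResult]:
    """Keep only recent observations; anything older must be re-fetched, not trusted."""
    return [r for r in history if current_step - r.step <= window]

# Example: the 200 OK from step one is no longer visible at step nine.
history = [ToolResult(1, "http_post", "200 OK"), ToolResult(8, "db_read", "row v3")]
assert working_context(history, current_step=9) == [ToolResult(8, "db_read", "row v3")]
```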

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were 69 more pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.
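
One concrete version of the fix: a test double that keeps the real API's pagination contract will catch the bug the all-in-one mock hides. The class and field names and the page size below are illustrative.

```python
# A minimal sketch of a fake that preserves the real API's pagination contract,
# instead of a mock that returns every record in one page. Names are illustrative.
class PaginatedExportFake:
    def __init__(self, records: list, page_size: int = 200):
        self.records = records
        self.page_size = page_size

    def fetch_page(self, cursor: int = 0) -> dict:
        page = self.records[cursor: cursor + self.page_size]
        next_cursor = cursor + self.page_size
        return {
            "items": page,
            # The detail the all-in-one mock hides: the agent must notice this
            # and keep fetching, or it silently exports 200 of 14,000 records.
            "next_cursor": next_cursor if next_cursor < len(self.records) else None,
        }

def export_all(client) -> list:
    """Toy consumer loop an agent would be expected to implement."""
    items, cursor = [], 0
    while cursor is not None:
        page = client.fetch_page(cursor)
        items.extend(page["items"])
        cursor = page["next_cursor"]
    return items

# A test against the fake fails the moment the agent stops after page one.
assert len(export_all(PaginatedExportFake(list(range(14_000))))) == 14_000
```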