191 posts tagged with "agents"

Where to Put the Human: Placement Theory for AI Approval Gates

April 17, 2026 · 12 min read

Software Engineer

Most teams add human-in-the-loop review as an afterthought: the agent finishes its chain of work, the result lands in a review queue, and a human clicks approve or reject. This feels like safety. It is mostly theater.

By the time a multi-step agent reaches end-of-chain review, it has already sent the API requests, mutated the database rows, drafted the customer email, and scheduled the follow-up. The "review" is approving a done deal. Declining it means explaining to the agent — and often to the user — why nothing that happened for the past 10 minutes will stick.

The damage from misplaced approval gates isn't always dramatic. Often it's subtler: reviewers who approve everything because the real decisions have already been made, engineers who add more checkpoints after incidents and watch trust in the product crater, and organizations that oscillate between "too much friction" and "not enough oversight" without ever solving the underlying placement problem.

What Semantic Versioning Actually Means for AI Agents

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your customer service agent has been running reliably for three months. A routine model update rolls in on a Tuesday. By Wednesday afternoon, three downstream services are silently parsing the wrong fields from the agent's responses—the JSON keys shifted subtly but nothing returned an error. By Thursday you've traced a drop in order completions to a JSON field renamed from "status" to "current_state". The model updated, the agent stayed at v2.1.0, and nobody got paged.

This is the versioning gap that nobody in traditional API design had to solve. Semver works when you can deterministically reproduce outputs from a specification. AI agents can't make that promise. Yet downstream services depend on their behavior just as critically as they depend on any microservice API. The gap between "we tagged a release" and "downstream consumers are protected" has never been wider.

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

The frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on. The number is large, surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.

Agent Fleet Observability: Monitoring 1,000 Concurrent Agent Runs Without Dashboard Blindness

April 16, 2026 · 12 min read

Tian Pan

Software Engineer

Running a hundred agents in production feels manageable. You have traces, you have dashboards, you know when something breaks. Running a thousand concurrent agent runs is a different problem entirely — not because the agents are more complex, but because the monitoring model you built for ten agents silently stops working long before you notice.

The failure mode is subtle. Everything looks fine. Your span trees are there. Your error rates are low. And then a prompt regression that degraded output quality for 40% of sessions for six hours shows up only because a customer complained — not because your observability stack caught it.

This is the dashboard blindness problem: per-agent tracing works beautifully at small scale and fails quietly at fleet scale. Here is why it happens and what to do instead.

Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.

The Intent Classification Layer Most Agent Routers Skip

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Multi-User Shared Agent State: The Concurrency Primitives You Actually Need

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Every agent tutorial starts with a single user, a single session, and a single context window. The agent reads state, reasons, acts, writes back. Clean. Deterministic. Completely wrong for anything teams actually use.

Real collaborative products—shared planning boards, multi-user support queues, document co-pilots, team project assistants—require multiple users to interact with the same agent simultaneously. When two people give the agent contradictory instructions within the same second, one of their changes disappears. The agent doesn't tell them. It doesn't even know it happened.

This is the multi-user shared agent state problem, and it's a distributed systems problem dressed in an AI costume.

Proactive Agents: Event-Driven and Scheduled Automation for Background AI

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Almost every tutorial on building AI agents starts the same way: user types a message, agent reasons, agent responds. That model works fine for chatbots and copilots. It fails to describe the majority of production AI work that organizations are now deploying.

The agents that quietly matter most in enterprise environments don't wait for a message. They wake up when a database row changes, when a queue crosses a depth threshold, when a scheduled cron fires at 3 AM, or when monitoring detects that a metric drifted outside bounds. They act without a user present. When they fail, nobody notices until the damage has compounded.

Building these proactive agents requires a substantially different design vocabulary than building reactive assistants. The session-scoped mental model that works for conversational AI breaks down when your agent runs in a loop, retries in the background, and has no human to catch its mistakes.

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

About Tian Pan