175 posts tagged with "agents"

AI-Native Logging: Capture Decisions, Not Just I/O

May 5, 2026 · 10 min read

Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

AI as the Permanent Intern: The Role-Task Gap in Enterprise Workflows

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

There's a pattern that appears in nearly every enterprise AI deployment: the tool performs brilliantly in the demo, ships to production, and then quietly stalls at 70–80% of its potential. Teams attribute the stall to model quality, context window limits, or retrieval failures. Most of the time, that diagnosis is wrong. The actual problem is that they're asking the AI to play a role it structurally cannot occupy — not yet, possibly not ever in its current form.

The gap between "AI can do this task" and "AI can play this role" is the most expensive misunderstanding in enterprise AI.

The Boring AI Manifesto: Why a Single Prompt Outperforms Your Autonomous Agent

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Here's an uncomfortable fact: 80% of AI projects fail to deliver business value, yet teams keep reaching for the most complex solution available. A multi-agent orchestration system with tool-calling, memory retrieval, and autonomous planning makes for a compelling demo. A single prompt that routes customer support tickets to the right queue makes your company $2M in the first year. These two outcomes are not equally likely, and they are not equally common, and the industry has been choosing the wrong one.

The pattern is predictable. An engineering team builds something impressive, demos it for leadership, gets approval to ship it — and then watches it silently degrade in production. Meanwhile, a competitor quietly deploys a two-hundred-line Python script wrapping a classifier, never demos it, and outperforms them on every business metric that matters.

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.

Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.

Non-Blocking AI: Async UX Patterns That Keep Applications Responsive While Agents Work

May 5, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover the synchronous UI problem the same way: a user clicks "Generate report" and the browser tab goes silent for forty seconds. No spinner, no progress, just a frozen button. Half the users hit refresh and submit twice. The other half assume the product is broken and close the tab.

The root issue is not agent latency — it's that LLM-backed agents operate on timescales that break every assumption baked into synchronous request-response UX. A single GPT-4o call averages 8–15 seconds. A multi-step agent that searches the web, reads three documents, writes a draft, then formats the output can take two to four minutes. You cannot make that feel fast by optimizing the agent. You have to redesign the contract between your backend and your UI.

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.

Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed 90% of the work. The question isn't whether to set timeouts — every production system needs them. The question is what an agent does when the clock runs out.

The Agentic Debugger's Trap: When Your Agent Patches Faster Than You Can Diagnose

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A staff engineer I worked with last quarter caught a bug that had already been "fixed" three times in the previous six weeks. Three different engineers. Three different files. Three green CI runs. Three accepted agent-generated patches. Each patch made the failing test pass and the user-reported error disappear. Each one moved the bug somewhere else, where it waited until a different surface area triggered it again. The fourth time it surfaced, the data corruption it caused had been silently compounding for forty days.

The bug was a single off-by-one in a pagination cursor. The agent had been right that the symptom would go away. It had been wrong about why. And the engineers — competent, senior, well-intentioned — had each accepted a passing patch before they understood the failure mechanism.

This is the agentic debugger's trap: your agent can produce a fix faster than you can build the mental model needed to evaluate whether the fix is correct. Patch velocity outruns diagnosis. The bug count drops, the CI dashboard goes green, and you ship a codebase whose failure modes you no longer understand.

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

May 2, 2026 · 11 min read

Tian Pan

Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. The bot's "queue transferred" metric and the other bot's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.

About Tian Pan