191 posts tagged with "agents"

Tool Call Convergence: Designing Agents That Know When to Stop

May 7, 2026 · 10 min read

Software Engineer

A LangChain analyzer/verifier agent pair ran for 264 hours straight and racked up $47,000 in API costs. It produced nothing useful. The verifier kept rejecting the analyzer's output without saying what was wrong. The analyzer defaulted to trying again. No one had written a stopping criterion. The loop ran until someone noticed the invoice.

This is the failure mode that doesn't make it into architecture diagrams: agents that know how to call tools but don't know when to stop. The canonical agent loop is a while True that asks the model "should I call a tool?" — but that question has no built-in answer for "I've seen enough." Without convergence logic, you're not building an agent. You're building an expensive polling function.

Agent Memory Contamination: How One Bad Tool Response Poisons a Whole Session

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent completes 80% of a multi-step research task correctly, then confidently delivers a conclusion that's completely wrong. You trace back through the logs and find the culprit at step three: a tool call returned stale data, the agent integrated that data as fact, and every subsequent reasoning step built on that poisoned premise. By the end of the session, the agent was correct about everything except the thing that mattered.

This is agent memory contamination — and it's one of the most insidious reliability failures in production agentic systems. Unlike a crash or timeout, it produces a confident wrong answer. Observability tooling records a successful run. The user walks away with bad information.

Agentic Systems Are Distributed Systems: Apply Microservices Lessons Before You Learn Them the Hard Way

May 5, 2026 · 12 min read

Tian Pan

Software Engineer

The failure rates for multi-agent AI systems in production are embarrassing. A landmark study analyzing over 1,600 execution traces across seven popular frameworks found failure rates ranging from 41% to 87%. Carnegie Mellon researchers put leading agent systems at 30–35% task completion on multi-step benchmarks. Gartner is predicting 40% of agentic AI projects will be cancelled by the end of 2027.

Here is the uncomfortable truth: these aren't AI problems. They're distributed systems problems that engineers already solved between 2010 and 2018, documented exhaustively in blog posts, conference talks, and eventually in Martin Kleppmann's Designing Data-Intensive Applications. The teams that are shipping reliable agent systems today aren't doing anything magical — they're applying circuit breakers, bulkheads, event sourcing, and idempotency keys. The teams that are failing are treating agents as a new paradigm when they're a new deployment target for old patterns.

AI-Native API Design: Building Backends That Agents Can Actually Use

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your REST API works fine. Documentation is thorough. Error codes are consistent. Every human-authored client you've ever tested handles it well. Then your team integrates an AI agent and within an hour it's generated 2,000 failed requests by retrying variations of an endpoint that doesn't exist — bulk_search_users, search_all_users, bulk_user_search — each attempt triggering real downstream processing.

This isn't a prompt engineering failure. It's an API design failure.

REST APIs were built for clients that parse documentation, respect contracts, and call exactly what's specified. AI agents are different: they reason about what an endpoint probably does based on names and descriptions, retry without tracking state, and treat error messages as instructions rather than diagnostic codes. Designing an API for an agentic caller requires rethinking assumptions that most backend engineers have never had to question.

AI-Native Logging: Capture Decisions, Not Just I/O

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

AI as the Permanent Intern: The Role-Task Gap in Enterprise Workflows

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

There's a pattern that appears in nearly every enterprise AI deployment: the tool performs brilliantly in the demo, ships to production, and then quietly stalls at 70–80% of its potential. Teams attribute the stall to model quality, context window limits, or retrieval failures. Most of the time, that diagnosis is wrong. The actual problem is that they're asking the AI to play a role it structurally cannot occupy — not yet, possibly not ever in its current form.

The gap between "AI can do this task" and "AI can play this role" is the most expensive misunderstanding in enterprise AI.

The Boring AI Manifesto: Why a Single Prompt Outperforms Your Autonomous Agent

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Here's an uncomfortable fact: 80% of AI projects fail to deliver business value, yet teams keep reaching for the most complex solution available. A multi-agent orchestration system with tool-calling, memory retrieval, and autonomous planning makes for a compelling demo. A single prompt that routes customer support tickets to the right queue makes your company $2M in the first year. These two outcomes are not equally likely, and they are not equally common, and the industry has been choosing the wrong one.

The pattern is predictable. An engineering team builds something impressive, demos it for leadership, gets approval to ship it — and then watches it silently degrade in production. Meanwhile, a competitor quietly deploys a two-hundred-line Python script wrapping a classifier, never demos it, and outperforms them on every business metric that matters.

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.

Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

May 5, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

May 5, 2026 · 9 min read

Tian Pan

Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.

Non-Blocking AI: Async UX Patterns That Keep Applications Responsive While Agents Work

May 5, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover the synchronous UI problem the same way: a user clicks "Generate report" and the browser tab goes silent for forty seconds. No spinner, no progress, just a frozen button. Half the users hit refresh and submit twice. The other half assume the product is broken and close the tab.

The root issue is not agent latency — it's that LLM-backed agents operate on timescales that break every assumption baked into synchronous request-response UX. A single GPT-4o call averages 8–15 seconds. A multi-step agent that searches the web, reads three documents, writes a draft, then formats the output can take two to four minutes. You cannot make that feel fast by optimizing the agent. You have to redesign the contract between your backend and your UI.

About Tian Pan