311 posts tagged with "ai-agents"

Agent-Friendly APIs: What Backend Engineers Get Wrong When AI Becomes the Client

· 11 min read
Tian Pan
Software Engineer

In 2024, automated bot traffic surpassed human traffic on the internet for the first time. Gartner projects that more than 30% of new API demand by 2026 will come from AI agents and LLM tools. And yet only 24% of organizations explicitly design APIs with AI clients in mind.

That gap is where production systems break. Not because the LLMs are bad, but because APIs built for human developers have assumptions baked in that silently fail when an autonomous agent is the caller. The agent can't ask for clarification, can't read a doc site, and can't decide on its own whether a 422 means "fix your request" or "try again in a few seconds."

This post is for the backend engineer who just found out their service is being called by an AI agent — or who is about to build one that will be.
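One way to remove the 422 ambiguity described above is to put the retry decision in the response body instead of leaving it to inference. The sketch below is illustrative only: the `AgentError` type and its field names (`code`, `retryable`, `retry_after_seconds`) are assumptions made for this example, not a schema taken from the post or from any standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentError:
    """Hypothetical machine-readable error body for agent callers."""
    status: int
    code: str                 # stable identifier the agent can branch on
    message: str              # human-readable detail, safe to show in logs
    retryable: bool           # True: same request may succeed later
    retry_after_seconds: Optional[float] = None


def next_action(error: AgentError) -> str:
    """Decide what the caller should do without guessing from the status code."""
    return "retry" if error.retryable else "fix_request"


# Two errors with the same status code but opposite remedies.
validation = AgentError(422, "invalid_field", "due_date must be ISO 8601", retryable=False)
busy = AgentError(422, "resource_locked", "record is being updated",
                  retryable=True, retry_after_seconds=3.0)

print(next_action(validation))
print(next_action(busy))
```

Both errors carry status 422, yet the caller never has to interpret the status code itself. The explicit `retryable` flag does that work.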

Agent State as Event Stream: Why Immutable Event Sourcing Beats Internal Agent Memory

· 10 min read
Tian Pan
Software Engineer

An agent misbehaves at 3:47 AM on a Tuesday. It deletes files it shouldn't have, or calls an API with the wrong parameters, or confidently takes an irreversible action based on information that was stale by six hours. You pull up your logs. You can see what the agent did. What you cannot see — what almost no agent framework gives you — is what the agent believed when it made that decision. The state that drove the choice is gone, overwritten by every subsequent step. You're debugging the present to understand the past, and that's an architecture problem, not a logging problem.

Most AI agents treat state as mutable in-memory data: a dictionary that gets updated in place, a database row that gets overwritten, a scratch pad that shrinks and grows. This works fine for simple, short-lived tasks. It collapses under the three pressures that define serious production deployments: debugging complex failures, coordinating across distributed agents, and satisfying compliance requirements. Event sourcing — treating every state change as an immutable, append-only event — solves all three problems at once, and it does it in a way that makes agents structurally more debuggable, not just more logged.
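A minimal sketch of the idea, assuming a single in-process log: state is never updated in place, and "what the agent believed" at any decision is recovered by replaying events up to that point. The `Event` and `EventLog` names and the inventory scenario are hypothetical, chosen only to illustrate the pattern.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """One immutable state change. frozen=True blocks in-place edits."""
    seq: int
    kind: str
    data: Tuple[Tuple[str, Any], ...]  # key/value pairs stored as a tuple


class EventLog:
    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, kind: str, **data: Any) -> Event:
        """Record a new event. Existing events are never modified."""
        event = Event(seq=len(self._events), kind=kind, data=tuple(data.items()))
        self._events.append(event)
        return event

    def state_at(self, seq: int) -> Dict[str, Any]:
        """Replay events 0..seq (inclusive) to rebuild state at that moment."""
        state: Dict[str, Any] = {}
        for event in self._events[: seq + 1]:
            state.update(dict(event.data))
        return state


log = EventLog()
log.append("observed", inventory_count=40)
decision = log.append("decided", action="reorder")
log.append("observed", inventory_count=0)   # a later observation

belief = log.state_at(decision.seq)   # state as of the decision
latest = log.state_at(2)              # state after the later observation
print(belief)
print(latest)
```

With a mutable dictionary, only `latest` would survive. With the log, the belief behind the `reorder` decision is still recoverable after later observations have arrived.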

How Agents Teach Themselves: The Closed-Loop Self-Improvement Architecture

· 11 min read
Tian Pan
Software Engineer

The most expensive part of training an agent isn't GPU time. It's the human annotators who label whether a multi-step task succeeded or failed. A single expert annotation of a long-horizon agentic trajectory (verifying that an agent correctly booked a flight, wrote a functional program, or filled out a legal form) can cost more than thousands of inference calls. Closed-loop self-improvement is the architectural pattern that eliminates this bottleneck by replacing human judgment with an automated verifier, then using that verifier to run the generate-attempt-verify-train cycle without any human in the loop. When done correctly, it works: a recent NeurIPS paper showed the pattern nearly doubled average task success rates across multi-turn tool-use environments, going from 12% to 23.5%, without a single human annotation.

The key insight isn't that the model improves itself — it's that the verifier is free. Code execution returns a pass/fail signal deterministically, in milliseconds, at near-zero marginal cost. When your tasks have checkable outcomes, you can run thousands of training episodes per hour with ground-truth labels the model cannot fake (assuming your sandbox is designed correctly). That assumption is doing a lot of work, and we'll come back to it.
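The verify step can be sketched in a few lines, assuming the task is "write a function that passes these assertions". The `verify` and `collect_training_examples` helpers are hypothetical names for this example. A plain subprocess is also not a real sandbox, which is exactly the assumption the paragraph above flags.

```python
import subprocess
import sys
from typing import List


def verify(candidate_source: str, test_source: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code plus its tests in a child interpreter.

    Exit code 0 counts as a pass. Any exception, failed assertion,
    or timeout counts as a fail.
    """
    program = candidate_source + "\n" + test_source
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def collect_training_examples(candidates: List[str], test_source: str) -> List[str]:
    """Keep only the attempts the verifier accepts. No human labels involved."""
    return [c for c in candidates if verify(c, test_source)]


candidates = [
    "def add(a, b):\n    return a - b",   # wrong attempt
    "def add(a, b):\n    return a + b",   # correct attempt
]
tests = "assert add(2, 3) == 5"

kept = collect_training_examples(candidates, tests)
print(len(kept))
```

A candidate that exits the interpreter early with status 0 would pass this check without running the tests, so a production verifier has to isolate the test harness from the code under test.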

The Cold Start Tax on Serverless AI Agents

· 11 min read
Tian Pan
Software Engineer

A standard Lambda function with a thin Python handler cold-starts in about 250ms. Your AI agent, running on the same runtime with a few SDK imports added, cold-starts in 8–12 seconds. Add local model inference and you're at 40–120 seconds. The first user to hit a scaled-down deployment waits the length of a TV commercial before the agent responds. That gap (not per-token inference latency, not throughput, but the initial startup cost) is where most serverless AI deployments quietly fail their users.

The problem isn't unique to serverless, but serverless makes it visible. When you run agents on always-on infrastructure, you pay for idle capacity and cold starts never happen. When you embrace scale-to-zero to cut costs, every period of low traffic becomes a trap waiting for the next request.

Computer Use Agents in Production: When Pixels Replace API Calls

· 9 min read
Tian Pan
Software Engineer

Most AI agents interact with the world through structured APIs — clean JSON in, clean JSON out. But a growing class of agents has abandoned that contract entirely. Computer use agents look at screenshots, reason about what they see, and drive a mouse and keyboard like a human operator. When the only integration surface is a screen, pixels become the API.

This sounds like a party trick until you realize how much enterprise software has no API at all. Legacy ERP systems, internal admin panels, proprietary desktop applications — the GUI is the only interface. For years, robotic process automation (RPA) handled this with brittle, selector-based scripts that shattered whenever a button moved three pixels. Computer use agents promise something different: visual understanding that adapts to UI changes the way a human would.

Domain-Specialized Agent Architectures: Why Generic Agents Underperform in High-Stakes Verticals

· 10 min read
Tian Pan
Software Engineer

A generic AI agent that can summarize a contract, draft a product spec, and write a SQL query is genuinely impressive — until you deploy it into a radiology department and discover it suggests plausible-sounding dosing that contradicts the patient's actual drug allergies. The failure is not a hallucination problem. It's an architecture problem.

The assumption baked into most agent demos is that a sufficiently capable foundation model plus a broad tool set equals a capable agent in any domain. In practice, the gap between that assumption and production reality is where patients get hurt, lawsuits materialize, and experiments produce unreproducible results. Generic agents are a reasonable starting point, not a destination.

The Escalation Protocol: Building Agent-to-Human Handoffs That Don't Lose State

· 11 min read
Tian Pan
Software Engineer

When a support agent receives an AI-to-human handoff with a raw chat transcript, the average time to prepare for resolution is 15 minutes. The agent has to find the customer in the CRM, look up the relevant order, calculate purchase dates, and reconstruct what the AI already determined. When the same handoff arrives as a structured payload — action history, retrieved data, the exact ambiguity that triggered escalation — that prep time drops to 30 seconds.

That 97% reduction in prep time isn't an edge case. It's the difference between escalation protocols that actually support human oversight and ones that just dump context onto whoever happens to be on shift.
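As a rough sketch, a structured handoff could look like the dataclass below. The three content fields mirror the items named above (action history, retrieved data, the ambiguity that triggered escalation). The class name, the `customer_id` field, and the sample values are assumptions for illustration. The last lines check the arithmetic behind the 97% figure.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class HandoffPayload:
    """Hypothetical structured handoff from an AI agent to a human."""
    customer_id: str
    escalation_reason: str                       # the ambiguity that triggered escalation
    action_history: List[str] = field(default_factory=list)
    retrieved_data: Dict[str, Any] = field(default_factory=dict)


payload = HandoffPayload(
    customer_id="cust-123",                      # illustrative values only
    escalation_reason="refund window depends on which purchase date applies",
    action_history=["looked up customer in CRM", "fetched order", "computed purchase dates"],
    retrieved_data={"order_id": "ord-456", "days_since_purchase": 31},
)

# Serializable form that a ticketing system could render for the human.
handoff = asdict(payload)

# Prep time drops from 15 minutes to 30 seconds.
before_s = 15 * 60
after_s = 30
reduction_pct = round((1 - after_s / before_s) * 100)
print(reduction_pct)
```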

Building GDPR-Ready AI Agents: The Compliance Architecture Decisions That Actually Matter

· 10 min read
Tian Pan
Software Engineer

Most teams discover their AI agent has a GDPR problem the wrong way: a data subject files an erasure request, the legal team asks which systems hold that user's data, and the engineering team opens a ticket that turns into a six-month audit. The personal data is somewhere in conversation history, somewhere in the vector store, possibly cached in tool call outputs, maybe embedded in a fine-tuned checkpoint — and nobody mapped any of it.

This isn't a configuration gap. It's an architectural one. The decisions that determine whether your AI system is compliance-ready are made in the first few weeks of building, long before legal comes knocking. This post covers the four structural conflicts that regulated-industry engineers need to resolve before shipping AI agents to production.

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.
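To make the failure concrete, here is one way to exercise the orchestration layer without a live model: replace the constant-string mock with a scripted model that replays tool-call turns, and let the real loop handle a malformed tool response. This is an illustrative sketch, not necessarily one of the three tiers the post goes on to describe. `run_agent`, `scripted_model`, and the turn format are all hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_agent(
    model: Callable[[List[str]], dict],
    tools: Dict[str, Callable[[], str]],
    max_steps: int = 8,
) -> dict:
    """Toy orchestration loop: dispatch tool calls until the model finishes."""
    history: List[str] = []
    for _ in range(max_steps):
        turn = model(history)
        if turn["type"] == "final":
            return {"answer": turn["text"], "history": history}
        raw = tools[turn["tool"]]()
        try:
            # Normalize valid JSON output before handing it back to the model.
            history.append(json.dumps(json.loads(raw)))
        except json.JSONDecodeError:
            # Fallback path that a constant-string mock never reaches.
            history.append("tool_error: malformed response from " + turn["tool"])
    return {"answer": None, "history": history}


def scripted_model(script: Iterable[dict]) -> Callable[[List[str]], dict]:
    """Deterministic stand-in for the LLM that replays recorded turns in order."""
    turns = iter(script)
    return lambda history: next(turns)


tools = {"lookup": lambda: "{not json"}          # tool that returns garbage
script = [
    {"type": "tool", "tool": "lookup"},
    {"type": "final", "text": "done"},
]

result = run_agent(scripted_model(script), tools)
print(result["history"])
```

The scripted model is as cheap and deterministic as a mock, but the dispatch and error-handling code actually runs.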

The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

· 11 min read
Tian Pan
Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise; it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

MCP Server Supply Chain Risk: When Your Agent's Tools Become Attack Vectors

· 9 min read
Tian Pan
Software Engineer

In September 2025, an unofficial Postmark MCP server with 1,500 weekly downloads was quietly modified. The update added a single BCC field to its send_email function, silently copying every email to an attacker's address. Users who had auto-update enabled started leaking email content without any visible change in behavior. No error. No alert. The tool worked exactly as expected — it just also worked for someone else.

This is the new shape of supply chain attacks. Not compromised binaries or trojaned libraries, but poisoned tool definitions that AI agents trust implicitly. With over 12,000 public MCP servers indexed across registries and the protocol becoming the default integration layer for AI agents, the MCP ecosystem is recreating every mistake the npm ecosystem made — except the blast radius now includes your agent's ability to read files, send messages, and execute code on your behalf.
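One defensive pattern against silently changed tool definitions is to pin a fingerprint of each definition at review time and refuse anything that no longer matches. The sketch below assumes tool definitions are available as plain dictionaries. The `fingerprint` and `is_trusted` helpers and the schema shape are hypothetical, not part of the MCP specification. It only catches changes that show up in the definition itself. A change confined to server-side code needs version or artifact pinning instead of auto-update.

```python
import hashlib
import json


def fingerprint(tool_definition: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(tool_definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_trusted(tool_definition: dict, pinned_hash: str) -> bool:
    """Accept a tool only if it is byte-for-byte what was reviewed."""
    return fingerprint(tool_definition) == pinned_hash


# Definition as reviewed and approved (illustrative shape).
reviewed = {
    "name": "send_email",
    "parameters": {"to": "string", "subject": "string", "body": "string"},
}
pinned = fingerprint(reviewed)

# Same tool after an update that adds a field.
updated = {
    "name": "send_email",
    "parameters": {"to": "string", "subject": "string", "body": "string", "bcc": "string"},
}

print(is_trusted(reviewed, pinned))
print(is_trusted(updated, pinned))
```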

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to the one that plagued web applications in the 2010s. The good news: the solutions from that era port almost directly.
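The twelve-versus-two pattern is easy to reproduce with a call counter. In this sketch, `OrderTool` is a hypothetical in-memory tool with eleven orders. The per-item path makes one listing call plus eleven lookups, while the batched path makes one listing call plus one bulk lookup.

```python
from typing import Dict, List


class OrderTool:
    """Hypothetical tool backend that counts how many calls it receives."""

    def __init__(self) -> None:
        self.calls = 0
        self._totals = {f"order-{i}": 10.0 * i for i in range(1, 12)}  # 11 orders

    def list_order_ids(self) -> List[str]:
        self.calls += 1
        return list(self._totals)

    def get_order_total(self, order_id: str) -> float:
        self.calls += 1
        return self._totals[order_id]

    def get_order_totals(self, order_ids: List[str]) -> Dict[str, float]:
        """Batch endpoint: one call regardless of how many IDs are requested."""
        self.calls += 1
        return {oid: self._totals[oid] for oid in order_ids}


def total_spend_n_plus_one(tool: OrderTool) -> float:
    # 1 call to list IDs, then N calls, one per order.
    return sum(tool.get_order_total(oid) for oid in tool.list_order_ids())


def total_spend_batched(tool: OrderTool) -> float:
    # 1 call to list IDs, then 1 call for all orders.
    ids = tool.list_order_ids()
    return sum(tool.get_order_totals(ids).values())


slow_tool, fast_tool = OrderTool(), OrderTool()
slow_answer = total_spend_n_plus_one(slow_tool)
fast_answer = total_spend_batched(fast_tool)

print(slow_answer == fast_answer, slow_tool.calls, fast_tool.calls)
```

Both paths return the same answer, which is why the problem goes unnoticed without a call counter or trace.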