Skip to main content

146 posts tagged with "ai-agents"

View all tags

The Sparse Reward Trap: Why Long-Horizon Agents Look Great in Demos and Break in Production

· 12 min read
Tian Pan
Software Engineer

There is a specific class of agent failure that is especially painful to debug: the agent that passes every demo, clears every evaluation suite you built, and then silently produces wrong answers the moment a user asks something slightly off the beaten path. The failure mode isn't a bug in your prompt or a missing tool call. It's a consequence of how the agent was trained — specifically, of the mismatch between sparse outcome signals and the structural complexity of tasks that take 20 to 50 steps to complete.

Sparse reward problems are not new in reinforcement learning. But as language model agents are increasingly trained with RL pipelines — not just fine-tuned on human demonstrations — the classical difficulties are resurfacing in new forms, with new failure modes, and at larger scale. Understanding the mechanics helps you make better architectural decisions, choose the right training signals, and build monitoring that catches problems before users do.

Specification Gaming in Production AI Agents: When Your Agent Optimizes the Wrong Thing

· 9 min read
Tian Pan
Software Engineer

In a 2025 study of frontier models on competitive engineering tasks, researchers found that 30.4% of agent runs involved reward hacking — the model finding a way to score well without actually doing the work. One agent monkey-patched pytest's internal reporting mechanism. Another overrode Python's __eq__ to make every equality check return True. A third simply called sys.exit(0) before tests ran and let the zero exit code register as success.

None of these models were explicitly trying to cheat. They were doing exactly what they were optimized to do: maximize the reward signal. The problem was that the reward signal wasn't the same thing as the actual goal.

This is specification gaming — and it's not a corner case. It's a structural property of any sufficiently capable agent operating against a measurable objective.

Agent Identity and Least-Privilege Authorization: The Security Footgun Your AI Team Is Ignoring

· 9 min read
Tian Pan
Software Engineer

Most AI agent architectures have a quiet security problem that nobody discovers until something goes wrong. You build the agent, wire it to your internal APIs using the app's existing service account credentials, ship it to production, and move on. The agent works. Users are happy. And somewhere in your audit log, a single service account identity is silently touching every customer record, every billing table, and every internal document that agent ever needs — with no trace of which user asked for what, or why.

This isn't a theoretical risk. When the breach happens, or when a regulator asks "who accessed this data on March 14th," the answer is the same every time: [email protected]. Every action, every request, every read and write — all collapsed into one identity. The audit trail is technically correct and forensically useless.

The Agent Loading State Problem: Designing for the 45-Second UX Abyss

· 11 min read
Tian Pan
Software Engineer

There is a hole in your product between second ten and second forty-five where nothing you designed still works. Users abandon a silent UI around the ten-second mark — Jakob Nielsen pinned that threshold back in the nineties, and modern eye-tracking studies have not moved it by more than a second or two. Modern agent work routinely takes thirty to one hundred twenty seconds. Multi-step planning, retrieval, a couple of tool calls, maybe a reflection pass before the final write — the latency budget is not a budget anymore, it is a crater.

Most teams discover this the first time they ship an agent feature and watch session recordings. Users hammer the submit button. They paste the query into a second tab. They close the window and retry from scratch, convinced it is broken. The feature works; the waiting does not. The gap between "spinner appeared" and "answer arrived" is the most neglected surface in AI product design, and it is the one that decides whether users perceive your agent as intelligent or stuck.

When Your AI Agent Consumes from Kafka: The Design Assumptions That Break

· 11 min read
Tian Pan
Software Engineer

The standard mental model for AI agents assumes HTTP: a client sends a request, the agent processes it, returns a response. Clean, synchronous, easy to reason about. When an LLM-powered function fails, you get an error code. When it succeeds, you move on.

Once you swap that HTTP interface for a Kafka topic or SQS queue, every one of those assumptions starts to crack. The queue guarantees at-least-once delivery. Your agent is stochastic. That combination produces failure modes that don't exist in deterministic systems—and the fixes aren't the same ones that work for traditional microservices.

This post covers what actually changes when AI agents consume from message queues: idempotency, ordering, backpressure, dead-letter handling, and the specific failure mode where a replayed message triggers different agent behavior the second time around.

Research Agent Design: Why Scientific Workflows Break Coding Agent Assumptions

· 10 min read
Tian Pan
Software Engineer

Most teams that build LLM-powered scientific tools make the same architectural mistake: they reach for a coding agent framework, swap in domain-specific tools, and call it a research agent. It isn't. Coding agents and research agents share surface-level mechanics — both call tools, both iterate — but their fundamental assumptions about success, state, and termination are almost perfectly inverted. Deploying a coding agent architecture in a scientific workflow doesn't just produce worse results; it produces confidently wrong results, and does so in ways that are nearly impossible to catch after the fact.

The distinction matters urgently now because research agent benchmarks are proliferating, teams are racing to build scientific AI, and the "just use a coding agent" shortcut is generating a wave of plausible-sounding tools that fail in production scientific contexts for reasons their builders don't fully understand.

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through API budget that a junior engineer's entire sprint's tests couldn't justify. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.

Agentic Audit Trails: What Compliance Looks Like When Decisions Are Autonomous

· 12 min read
Tian Pan
Software Engineer

When a human loan officer denies an application, there is a name attached to that decision. That officer received specific information, deliberated, and acted. The reasoning may be imperfect, but it is attributable. There is someone to call, question, and hold accountable.

When an AI agent denies that same application, there is a database row. The row says the decision was made. It does not say why, or what inputs drove it, or which version of the model was running, or whether the system prompt had been quietly updated two weeks prior. When your compliance team hands that row to a regulator, the regulator is not satisfied.

This is the agentic audit trail problem, and most engineering teams building on AI agents have not solved it yet.

AI Agent Permission Creep: The Authorization Debt Nobody Audits

· 10 min read
Tian Pan
Software Engineer

Six months after a pilot, your customer data agent has write access to production databases it hasn't touched since week one. Nobody granted that access maliciously. Nobody revoked it either. This is AI agent permission creep, and it's now the leading cause of authorization failures in production agentic systems.

The pattern is straightforward: agents start with a minimal permission set, integrations expand ("just add read access to Salesforce for this one workflow"), and the tightening-after-deployment step gets deferred indefinitely. Unlike human IAM, where quarterly access reviews are at least nominally enforced, agent identities sit entirely outside most organizations' access review processes. The 2026 State of AI in Enterprise Infrastructure Security report (n=205 CISOs and security architects) found that 70% of organizations grant AI systems more access than a human in the same role. Organizations with over-privileged AI reported a 76% security incident rate versus 17% for teams enforcing least privilege — a 4.5x difference.

Ambient AI Design: When the Chat Interface Is the Wrong Abstraction

· 8 min read
Tian Pan
Software Engineer

Most engineering teams default to building AI features as chat interfaces. A user types something; the model responds. The pattern feels natural because it maps to human conversation, and the tooling makes it easy. But when you watch those chat-based AI features in production, you often see the same dysfunction: the UI sits idle, waiting for a user who is too busy, too distracted, or simply unaware that they should be asking something.

Chat is a pull model. The user initiates. The AI reacts. For a meaningful subset of the valuable AI work in any product—monitoring, anomaly detection, workflow automation, proactive notification—pull is the wrong shape. The work needs to happen whether or not the user remembered to open the chat window.

Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 8 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.

The CAP Theorem for AI Agents: Why Your Agent Fails Completely When It Should Degrade Gracefully

· 9 min read
Tian Pan
Software Engineer

Your AI agent works perfectly until it doesn't. One tool goes down — maybe the search API is rate-limited, maybe the database is slow, maybe the code execution sandbox times out — and the entire agent collapses. Not a partial answer, not a degraded response. A complete failure. A blank screen or a hallucinated mess.

This is not a bug. It is a design choice, and almost nobody made it deliberately. The agent architectures we are building today implicitly choose "fail completely" because nobody designed the partial-availability path. If you have built distributed systems before, this pattern should feel painfully familiar. It is the CAP theorem, showing up in a new disguise.