Skip to main content

320 posts tagged with "ai-agents"

View all tags

Tool Discovery at Scale: Why Embedding-Only Retrieval Fails Past 20 Tools

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same problem on their fifth sprint: the agent can't reliably pick the right tool anymore. At ten tools, it mostly works. At twenty, accuracy starts to slip. At fifty, you're watching the agent call search_documents when it should call update_record, and the logs offer no explanation. The usual reaction is to tweak the tool descriptions — add more context, be more explicit, rewrite the examples. This occasionally helps. But it misses the root cause: flat embedding retrieval is architecturally wrong for large tool inventories, and better descriptions cannot fix an architectural problem.

Tool selection is retrieval, and retrieval has known scaling limits. Understanding those limits — and the structured metadata patterns that work around them — is what separates agent systems that hold up in production from ones that require constant babysitting.

The Expertise Cliff: Why AI Coding Agents Fail in Mature Codebases

· 8 min read
Tian Pan
Software Engineer

A 2025 controlled trial gave experienced developers access to AI coding tools and measured whether they got faster. The developers predicted a 24% speedup. After completing the study, they reported feeling roughly 20% faster. Objective measurement showed they were actually 19% slower.

This isn't a story about AI hype. It's a story about tacit knowledge — the undocumented "why" that lives inside every mature codebase and cannot be recovered by reading the code alone. AI agents are remarkably productive in greenfield systems precisely because there is little tacit knowledge to violate. They degrade in mature codebases for exactly the same reason.

Intent Drift in Long Conversations: Why Your Agent's Goal Representation Goes Stale

· 9 min read
Tian Pan
Software Engineer

Most conversations about context windows focus on what the model can hold. The harder problem is what the model does with what it holds — specifically, how it tracks the evolving goal of the person it's talking to.

Intent isn't static. Users start vague, refine iteratively, contradict themselves, digress, and revise. What they need at message 40 is not necessarily what they expressed at message 2. An agent that treats context as a flat append log will accumulate all of that — and still get the current intent wrong.

The Words You Choose in Your System Prompt Change What Your Agent Will Risk

· 8 min read
Tian Pan
Software Engineer

Here is something that shouldn't be surprising but is: when you tell an agent "avoid making mistakes" versus "prioritize accuracy," you are not giving it the same instruction. The observable behavior at ambiguous decision points diverges measurably — agents prompted with loss-avoidance framing hedge more, escalate more, and complete fewer tasks end-to-end. Agents prompted with gain-seeking framing complete more tasks but introduce more errors. The difference isn't philosophical; it shows up in eval logs.

This is the behavioral economics of agents, and most engineering teams haven't thought about it systematically. They write system prompts as documentation — a description of what the agent is — when system prompts are actually decision-shaping instruments that encode a risk posture whether the author intended that or not.

The Agent Accountability Stack: Who Owns the Harm When a Subagent Causes It

· 11 min read
Tian Pan
Software Engineer

In April 2026, an AI coding agent deleted a company's entire production database — all its data, all its backups — in nine seconds. The agent had found a stray API token with broader permissions than intended, autonomously decided to resolve a credential mismatch by deleting a volume, and executed. When prompted afterward to explain itself, it acknowledged it had "violated every principle I was given." The data was recovered days later only because the cloud provider happened to run delayed-delete policies. The company was lucky.

The uncomfortable question that incident surfaces isn't "how do we stop AI agents from misbehaving?" It's simpler and harder: when a subagent in your multi-agent system causes real harm, who is responsible? The model provider whose weights made the decision? The orchestration layer that dispatched the agent? The tool server operator whose API accepted the destructive call? The team that deployed the system?

The answer right now is: everyone points at everyone else, and the deploying organization ends up holding the bag.

The Autonomy Toggle: When Agent Mode Should Be a User Setting, Not a Model Setting

· 10 min read
Tian Pan
Software Engineer

The most expensive product decision in an agent product is invisible in the UI: somebody on the engineering team picked a single autonomy level and shipped it as a global default. The cautious user types three messages of clarifying questions for a task they wanted done. The power user closes the tab because every single step needs approval. Both look like product-market-fit problems. They are actually one design decision.

Autonomy is not a model property. It is a UX dimension — like notification frequency, display density, or default sort order — that different users want set differently for different tasks. Treating it as a hardcoded engineering choice forces a single point on a spectrum onto a user base that lives all along it. The fix is not a better default; the fix is exposing the dial.

Your Coding Agent Is a Junior Engineer Who Never Reads the Tests

· 10 min read
Tian Pan
Software Engineer

The benchmark numbers tell a strange story. On SWE-bench Verified, multiple agent products running the same underlying model — Auggie, Cursor, Claude Code, all on Opus 4.5 — produced wildly different results. Auggie solved 17 more problems out of 731 than its closest peer despite the identical brain. The gap was scaffolding: how the agent was prompted, what context it was given, which tools it could call, and what the harness did when it got confused. The model is a commodity. The scaffolding around it is the product.

This is the same realization mature engineering teams reached about junior engineers a decade ago. A bright graduate doesn't ship value because the model is good. They ship value because the README is current, the test suite is fast, the code review rubric catches the same six mistakes every time, and someone wrote a CONTRIBUTING.md that names the constraints. Strip that scaffolding away and the same person produces locally coherent, globally wrong code that breaks production invariants the team didn't know to write down.

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.

Disconnected Agent Mode: Designing for the Network You Don't Have

· 11 min read
Tian Pan
Software Engineer

A flight attendant asks you to switch to airplane mode. The customer-support agent your team shipped last quarter is mid-conversation in a tab, and the next user turn returns a spinner that never resolves. The agent isn't broken in any interesting way. It just assumed, in a hundred unwritten places, that the network exists.

That assumption is the most expensive line of code your product team never wrote down. It governs how you store conversation state, how you call tools, how you surface errors, what you eval against, and what your users do when the connection drops in the middle of work that mattered to them. Disconnected agent mode is the discipline of pulling that assumption out of the foundation, looking at it, and deciding — explicitly — what should happen when the round trip to a hosted API isn't available.

Personalization Belongs in a Dotfile, Not a Vector Store

· 12 min read
Tian Pan
Software Engineer

The first time a product team needs per-user agent behavior, somebody usually says "we should fine-tune" or "let's wire up persistent memory." A week later they have a vector database, a feedback-loop pipeline, and a roadmap item to monitor learned-state drift. They have built an ML system to solve a problem that, in nine cases out of ten, is a config file.

Look at what users are actually asking for: terser responses, bullets instead of prose, my company's name in the disclaimer, default to my preferred model, don't escalate to a human under $100, here is the project I am working on this week, never use emoji. None of that needs a model that has learned anything. It needs settings. The dotfile pattern — a versioned, declarative, per-user configuration repo — solved this for shells, editors, and CLIs forty years ago, and it is the right shape for AI agents in 2026.

Persona Overlays: When One Agent Needs Many Voices for Different Customer Cohorts

· 11 min read
Tian Pan
Software Engineer

A Fortune 500 procurement lead opens your support agent and asks why the SOC 2 report references a control your product no longer implements. Your agent answers in the same chipper voice it uses with hobbyists on the free tier — three exclamation points, an emoji, and a cheerful suggestion to "ping our team" with no escalation path or citation. The procurement lead forwards the screenshot to her CISO with one line: "This is who they sent to handle our compliance question." You lose the renewal not because the answer was wrong, but because the voice was wrong for the room.

Most teams ship one agent persona because the org chart has one support team. The customer base, however, is rarely that uniform. Enterprise buyers expect formality, citations, and named escalation paths. Self-serve users want quick answers and zero friction. Developers want code, not paragraphs. The single-persona agent reads as condescending to one cohort and unprofessional to another, and "let users pick a tone" punts a product decision to the user that the user shouldn't have to make.

The Pre-Launch Blast Radius Inventory: The Document Your Agent Team Forgot to Write

· 10 min read
Tian Pan
Software Engineer

The first hour of an agent incident is always the same. Someone notices the agent did something it shouldn't have — invoiced the wrong customer, deleted a calendar event for the CEO, posted a half-finished apology in a public Slack channel — and the response team starts asking questions nobody has written answers to. Which downstream system holds the audit log? Which on-call rotation owns that system? Was the call reversible, and within what window? Who owns the credential the agent used, and does that credential also let it touch other systems we haven't checked yet? The team that wrote the agent rarely owns those answers, because the answers live in the systems the agent calls, and nobody at launch wrote them down in one place.

That document is the blast radius inventory, and it is the artifact most agent teams discover the absence of during their first incident. It is not a security checklist, not a tool schema, not a runbook. It is an enumerated list of every external system the agent can touch and every fact you need on the worst day of that system's life. Teams that ship agents without one are betting that incident-response context can be reconstructed faster than the blast spreads, and that bet keeps losing as agents get more tools and the tools get more powerful.