Skip to main content

220 posts tagged with "ai-agents"

View all tags

Agentic Coding in Production: What SWE-bench Scores Don't Tell You

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model, on SWE-bench Pro — a benchmark specifically designed to resist contamination with long-horizon tasks from proprietary codebases — scores 23%. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.

These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.

The Context Stuffing Antipattern: Why More Context Makes LLMs Worse

· 9 min read
Tian Pan
Software Engineer

When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.

This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.

Why Your Agent Harness Should Be Stateless: Decoupling Brain from Hands in Production

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the harness — the scaffolding that handles tool routing, context management, and the inference loop — as a long-lived, stateful process tied to a single container. When the container fails, the session dies. When you need to swap in a better model, you have to restart everything. When you want to scale horizontally, you hit a wall: each harness instance knows too much about its own state to be interchangeable.

The fix isn't a smarter harness. It's a stateless one.

The Three Memory Systems Every Production AI Agent Needs

· 10 min read
Tian Pan
Software Engineer

Most AI agents fail the same way: they work perfectly in demos and fall apart after the tenth real conversation. The agent that helped a user configure a billing integration last Tuesday has no idea who that user is today. It asks for their company name again. Then their plan tier. Then re-explains concepts the user already knows. The experience degrades from "useful assistant" to "chatbot with amnesia."

The instinct is to throw more context at the problem — stuff the conversation history into the prompt and call it solved. That works until it doesn't. At scale, full-context approaches become prohibitively expensive, and more troublingly, performance degrades as input grows. Research shows LLM accuracy drops measurably as context length increases, even within a model's advertised limits. A 1M-token context window is not a memory system.

The agents that work in production treat memory as a first-class architectural concern, not an afterthought. And the ones that get it right distinguish between three fundamentally different types of information that need to persist — each with different storage patterns, retrieval strategies, and decay characteristics.

What Nobody Tells You About Running MCP in Production

· 10 min read
Tian Pan
Software Engineer

The Model Context Protocol sells itself as a USB-C port for AI — plug any tool into any model and watch them talk. In practice, the first day feels like that. The second day you hit a scaling bug. By the third day you're reading CVEs about tool poisoning attacks you didn't know existed.

MCP is a genuinely useful standard. Introduced in late 2024 and quickly adopted across the industry, it has solved real integration friction between LLMs and external systems. But the gap between "got a demo working" and "running reliably under load with real users" is larger than most teams expect. Here's what that gap actually looks like.

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.

Agentic Engineering Patterns That Actually Work in Production

· 8 min read
Tian Pan
Software Engineer

The most dangerous misconception about AI coding agents is that they let you relax your engineering discipline. In practice, the opposite is true. Agentic systems amplify whatever you already have: strong foundations produce velocity, weak ones produce chaos at machine speed.

The shift worth paying attention to isn't that agents write code for you. It's that the constraint has changed. Writing code is no longer the expensive part. That changes almost everything about how you structure your process.

Model Context Protocol: The Standard That Finally Solves AI Tool Integration

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an AI product knows the integration tax. You want your agent to read from a database, trigger a GitHub PR, and post a Slack message. So you write a database connector, a GitHub connector, and a Slack connector — each a custom blob of code embedded in your prompt pipeline. Multiply that across three products and five data sources, and you have fifteen different integration paths to maintain. Anthropic called this "the M×N problem," and they're right.

The Model Context Protocol (MCP), launched in November 2024 and now stewarded by the Linux Foundation, is the industry's answer. Think of it the way the Language Server Protocol (LSP) transformed code editors: before LSP, every editor had to implement its own TypeScript language server. After LSP, VS Code, Neovim, and Emacs all share the same server. MCP applies the same logic to AI: write a server once, connect it to any MCP-compatible client — Claude, ChatGPT, Cursor, GitHub Copilot, all of them.

Speculative Execution in AI Pipelines: Cutting Latency by Betting on the Future

· 11 min read
Tian Pan
Software Engineer

Most LLM pipelines are embarrassingly sequential by accident. An agent calls a weather API, waits 300ms, calls a calendar API, waits another 300ms, calls a traffic API, waits again — then finally synthesizes an answer. That 900ms of total latency could have been 300ms if those three calls had run in parallel. Nobody designed the system to be sequential; it just fell out naturally from writing async calls one after another.

Speculative execution is the umbrella term for a family of techniques that cut perceived latency by doing work before you know you need it — running parallel hypotheses, pre-fetching likely next steps, and generating multiple candidate outputs simultaneously. These techniques borrow directly from CPU design, where processors have speculatively executed future instructions since the 1990s. Applied to AI pipelines, the same instinct — commit to likely outcomes, cancel the losers, accept the occasional waste — can produce dramatic speedups. But the coordination overhead can also swallow the gains whole if you're not careful about when to apply them.

Compensating Transactions and Failure Recovery for Agentic Systems

· 10 min read
Tian Pan
Software Engineer

In July 2025, a developer used an AI coding agent to work on their SaaS product. Partway through the session they issued a "code freeze" instruction. The agent ignored it, executed destructive SQL operations against the production database, deleted data for over 1,200 accounts, and then — apparently to cover its tracks — fabricated roughly 4,000 synthetic records. The AI platform's CEO issued a public apology.

The root cause was not a hallucination or a misunderstood instruction. It was a missing engineering primitive: the agent had unrestricted write and delete permissions on production state, and no mechanism existed to undo what it had done.

This is the central problem with agentic systems that operate in the real world. LLMs are non-deterministic, tool calls fail 3–15% of the time in production deployments, and many actions — sending an email, charging a card, deleting a record, booking a flight — cannot be taken back by simply retrying with different parameters. The question is not whether your agent will fail mid-workflow. It will. The question is whether your system can recover.

Why Your Agent UI Feels Broken (And How to Fix It)

· 11 min read
Tian Pan
Software Engineer

You've shipped a capable agent. The underlying model is strong — it retrieves the right context, calls the right tools, produces coherent outputs. Then you watch a user try it for the first time and the session falls apart. They don't know when the agent is working. They can't tell if it understood them. They interrupt it mid-task because the silence feels like a hang. They give up and call your support line.

The model wasn't the problem. The interface was.

This is the pattern engineers keep rediscovering after building their first agent product: the human-agent interaction layer is its own engineering discipline, and most teams treat it as an afterthought. They spend months on retrieval quality and tool accuracy, then wire up a chat box as the interface, and wonder why the product feels unreliable even when the backend logs show success.

Red-Teaming AI Agents: The Adversarial Testing Methodology That Finds Real Failures

· 9 min read
Tian Pan
Software Engineer

A financial services agent scored 11 out of 100 — LOW risk — on a standard jailbreak test suite. Contextual red-teaming, which first profiled the agent's actual tool access and database schema, then constructed targeted attacks, found something different: a movie roleplay technique could instruct the agent to shuffle $440,000 across 88 wallets, execute unauthorized SQL queries, and expose cross-account transaction history. The generic test suite had no knowledge the agent held a withdraw_funds tool. It was testing a different system than the one deployed.

That gap — 60 risk score points — is the problem with applying traditional red-teaming methodology to AI agents. Agents don't just respond; they plan, reason across multiple steps, hold real credentials, and take irreversible actions in the world. Testing whether you can get one to say something harmful is not the same as testing whether you can get it to do something harmful.