161 posts tagged with "agents"

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

· 10 min read
Tian Pan
Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad—and then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.

The Streaming Infrastructure Behind Real-Time Agent UIs

· 12 min read
Tian Pan
Software Engineer

Most agent streaming implementations break in one of four ways: the proxy eats the stream silently, the user closes the tab and the agent runs forever burning tokens, the page refreshes and the task is simply gone, or a tool call fails mid-stream and the agent goes quietly idle. None of these are model problems. They are infrastructure problems that teams discover in production after their demo went fine on localhost.

This post is about that gap — the server-side architecture decisions that determine whether a real-time agent UI is actually reliable, not just impressive in a demo environment.
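One structural fix for the refresh-and-gone failure is to decouple the agent run from the HTTP connection entirely: the agent appends events to a durable log keyed by task id, and any client can attach or re-attach and replay from a cursor. The sketch below shows the shape in memory only; all names are illustrative, and a real system would back this with a durable store.

```python
# Sketch: agent execution writes to an append-only event log, independent of
# whether any client is connected. A reconnecting client resumes from its
# last cursor instead of losing the task. In-memory stand-in for a real store.
from dataclasses import dataclass, field

@dataclass
class TaskEventLog:
    """Append-only event log for one agent task."""
    events: list = field(default_factory=list)
    done: bool = False

    def append(self, event: dict) -> None:
        self.events.append(event)

    def read_from(self, cursor: int) -> tuple[list, int]:
        """Return events after `cursor` and the new cursor position."""
        new = self.events[cursor:]
        return new, cursor + len(new)

# The agent writes here whether or not a tab is open...
log = TaskEventLog()
log.append({"type": "token", "text": "Hello"})
log.append({"type": "tool_call", "name": "search"})

# ...and a client that refreshed mid-stream resumes from its saved cursor.
events, cursor = log.read_from(0)
assert len(events) == 2
more, cursor = log.read_from(cursor)  # nothing new yet
assert more == []
```

The same log also answers the runs-forever failure: a supervisor that notices no client has read past its cursor for some interval can cancel the task.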

The Principal Hierarchy Problem: Authorization in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

A procurement agent at a manufacturing company gradually convinced itself it could approve $500,000 purchases without human review. It did this not through a software exploit or credential theft, but through a three-week sequence of supplier emails that embedded clarifying questions: "Anything under $100K doesn't need VP approval, right?" followed by progressive expansions of that assumption. By the time it approved $5M in fraudulent orders, the agent was operating well within what it believed to be its authorized limits. The humans thought the agent had a $50K ceiling. The agent thought it had no ceiling at all.

This is the principal hierarchy problem in its most concrete form: a mismatch between what authority was granted, what authority was claimed, and what authority was actually exercised. It becomes exponentially harder when agents spawn sub-agents, those sub-agents spawn further agents, and each hop in the chain makes an independent judgment about what it's allowed to do.
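The standard countermeasure is authority attenuation: when a principal spawns a sub-agent, the child's grant is capped at the parent's, and authorization checks read the granted ceiling rather than anything the agent believes about itself. A minimal sketch, with hypothetical names and limits:

```python
# Sketch of authority attenuation: a sub-agent can never hold more spending
# authority than the principal that spawned it, regardless of what it has
# talked itself into. Names and dollar limits are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    name: str
    spend_limit: float  # hard ceiling, set out-of-band by humans

    def spawn(self, name: str, requested_limit: float) -> "Principal":
        # Attenuate: a child receives at most the parent's ceiling.
        return Principal(name, min(requested_limit, self.spend_limit))

def authorize(p: Principal, amount: float) -> bool:
    # Reads the granted ceiling, never the agent's self-reported one.
    return amount <= p.spend_limit

root = Principal("procurement-agent", spend_limit=50_000)
child = root.spawn("supplier-outreach", requested_limit=500_000)

assert child.spend_limit == 50_000    # the claim is capped at the grant
assert not authorize(child, 100_000)  # "under $100K is fine" fails anyway
```

The key property is that the ceiling travels with the delegation chain, so each hop's independent judgment is bounded by the grant above it.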

The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

· 10 min read
Tian Pan
Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy on calendar scheduling tasks dropping from 43% to just 2% as the number of available tools grew from 4 to 51 drawn from multiple domains. That is not a graceful degradation curve.
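One common mitigation is to shortlist tools before the model ever sees them, so it chooses among a handful of candidates instead of fifty. This toy version scores tools by keyword overlap with the task; production systems typically use embeddings. All tool names and descriptions here are made up.

```python
# Sketch: pre-filter the tool catalog down to the k most relevant tools and
# pass only those into the model's context. Keyword overlap stands in for a
# real relevance model.
def shortlist_tools(task: str, tools: dict[str, str], k: int = 5) -> list[str]:
    """Return the k tool names whose descriptions best overlap the task."""
    task_words = set(task.lower().split())

    def score(name: str) -> int:
        return len(task_words & set(tools[name].lower().split()))

    return sorted(tools, key=score, reverse=True)[:k]

tools = {
    "calendar_create": "create a calendar event meeting schedule",
    "calendar_list": "list upcoming calendar events schedule",
    "email_send": "send an email message",
    "crm_lookup": "look up a customer record in the crm",
    "weather_get": "get the weather forecast",
}
picked = shortlist_tools("schedule a meeting next week", tools, k=2)
assert "calendar_create" in picked
```

Shrinking the candidate set attacks both failure modes at once: fewer tools in context means fewer tokens spent and less surface area for a hallucinated call.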

Reasoning Models in Production: When to Use Them and When Not To

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.

Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.
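A practical consequence is that reasoning models should be opt-in behind a router, not the default. The sketch below uses a keyword heuristic as a stand-in classifier; real routers often use a small model or task metadata, and the model ids here are placeholders.

```python
# Sketch: default every request to a cheap model and escalate to a reasoning
# model only when the task shows multi-step structure. Heuristic and model
# ids are illustrative stand-ins.
CHEAP_MODEL = "fast-model"            # placeholder id
REASONING_MODEL = "reasoning-model"   # placeholder id

MULTI_STEP_HINTS = ("prove", "derive", "plan", "debug", "optimize", "why")

def route(task: str) -> str:
    """Pick a model id for the task; reasoning compute is opt-in, not default."""
    text = task.lower()
    if any(hint in text for hint in MULTI_STEP_HINTS):
        return REASONING_MODEL
    return CHEAP_MODEL

assert route("Summarize this support ticket") == CHEAP_MODEL
assert route("Debug why the scheduler deadlocks under load") == REASONING_MODEL
```

Even a crude router like this keeps the 5-10x cost and the 10-60 s latency confined to the requests that can actually benefit from them.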

Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually survives JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.
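Schema-constrained decoding is the real fix, but most teams also keep a defensive fallback layer for the failure modes above: strip markdown fences and surrounding prose, then extract the first balanced `{...}` span before handing it to the parser. A minimal sketch:

```python
# Sketch: pull the first balanced JSON object out of a messy model response
# (prose preamble, markdown fences). Naive on purpose: it does not handle
# braces inside string literals, and trailing commas still need the model or
# constrained decoding to fix.
import json

def extract_json(text: str) -> dict:
    """Extract and parse the first balanced {...} span in `text`."""
    start = text.index("{")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("no balanced JSON object found")

messy = 'Sure! Here is the result:\n```json\n{"status": "ok", "items": [1, 2]}\n```'
assert extract_json(messy) == {"status": "ok", "items": [1, 2]}
```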

LLM Observability in Production: The Four Silent Failures Engineers Miss

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.

Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.

Agentic RAG: When Your Retrieval Pipeline Needs a Brain

· 10 min read
Tian Pan
Software Engineer

Ninety percent of agentic RAG projects failed in production in 2024. Not because the technology was broken, but because engineers wired up vector search, a prompt, and an LLM, called it a retrieval pipeline, and shipped — without accounting for the compounding failure costs at every layer between query and answer.

Classic RAG is a deterministic function: embed query → vector search → stuff context → generate. It runs once, in one direction, with no feedback loop. That works when queries are clean single-hop lookups against a well-chunked corpus. It fails spectacularly when a user asks "compare the liability clauses across these five contracts," or "summarize what's changed in our infra config since the Q3 incident," or any question that requires synthesizing evidence across documents before forming an answer.
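Written out as code, the classic pipeline's limitation is visible in its shape: one pass, one direction, no check that retrieval actually served the question. Every stage below is a stub; the point is the structure, not the implementations.

```python
# Sketch of classic RAG: embed -> search -> stuff -> generate, exactly once.
# All stages are stand-ins to make the one-way dataflow explicit.
def embed(query: str) -> list[float]:
    return [float(len(w)) for w in query.split()]  # stand-in embedding

def vector_search(vec: list[float], corpus: list[str], k: int = 2) -> list[str]:
    return corpus[:k]  # stand-in retrieval

def generate(prompt: str) -> str:
    return f"[answer based on {prompt.count('CONTEXT')} context chunks]"

def classic_rag(query: str, corpus: list[str]) -> str:
    chunks = vector_search(embed(query), corpus)  # runs exactly once
    prompt = "".join(f"CONTEXT: {c}\n" for c in chunks) + query
    return generate(prompt)  # no check that the chunks answered the question

answer = classic_rag("compare the liability clauses", ["doc a", "doc b", "doc c"])
assert "2 context chunks" in answer
```

An agentic version replaces the single `vector_search` call with a loop: retrieve, assess whether the evidence is sufficient, reformulate or retrieve again, and only then generate.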

Agentic Engineering Patterns: The While Loop Is the Easy Part

· 9 min read
Tian Pan
Software Engineer

Ask any team that's shipped a real agentic system what the hard part was. Almost none of them will say "the LLM call." The core loop that every production agent runs is nearly identical, whether it's Claude Code, Cursor, or a homegrown financial automation tool. The interesting engineering — the part that separates a working agent from a runaway cost center — lives entirely outside that loop.

One team started running an agent loop at $127 per week. Four weeks later, the bill hit $47,000. An uncontrolled loop with no token ceiling had compounded every iteration into a financial catastrophe. The model kept running. Nobody told it to stop.
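The guard that would have stopped that bill is mechanically simple: a hard token budget checked every iteration, enforced outside the model's control. The step function and numbers below are stand-ins.

```python
# Sketch: an agent loop with a hard token ceiling. When the budget is
# exhausted, the loop fails loudly instead of compounding cost silently.
class BudgetExceeded(Exception):
    pass

def run_agent(step, max_total_tokens: int = 100_000) -> int:
    """Run the loop until the agent finishes or the token ceiling is hit."""
    spent = 0
    while True:
        tokens_used, done = step()  # step returns (tokens_spent, is_finished)
        spent += tokens_used
        if done:
            return spent
        if spent >= max_total_tokens:
            raise BudgetExceeded(
                f"spent {spent} tokens, ceiling {max_total_tokens}")

# A runaway step that never reports done:
runaway = lambda: (5_000, False)
stopped = False
try:
    run_agent(runaway, max_total_tokens=20_000)
except BudgetExceeded:
    stopped = True
assert stopped
```

The essential property is that the ceiling lives in the host loop, not in the prompt: nothing the model generates can raise it.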

Why Your Agent Should Write Code, Not JSON

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate over JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
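A minimal sketch of the dispatch shape: the agent emits Python, and the host executes it in a namespace that exposes only whitelisted tools. A real deployment needs genuine sandboxing (process isolation, timeouts, resource limits); this shows only the control flow, and the tool is a stub.

```python
# Sketch: code-as-action. The host execs agent-emitted Python against a
# namespace containing only whitelisted tools and a few safe builtins.
# NOT a sandbox; real isolation must happen at the process/container level.
def search_orders(customer: str) -> list[dict]:
    # Stub tool standing in for a real API call.
    return [{"id": 1, "total": 120.0}, {"id": 2, "total": 80.0}]

ALLOWED_TOOLS = {"search_orders": search_orders}

def run_code_action(code: str):
    """Execute agent-emitted code with access to whitelisted tools only."""
    namespace = {"__builtins__": {"sum": sum, "len": len}, **ALLOWED_TOOLS}
    exec(code, namespace)  # by convention, the agent writes into `result`
    return namespace.get("result")

# One code action replaces what JSON tool-calling would take 3 round-trips for:
agent_code = """
orders = search_orders("acme")
result = {"count": len(orders), "total": sum(o["total"] for o in orders)}
"""
assert run_code_action(agent_code) == {"count": 2, "total": 200.0}
```

The round-trip savings come from exactly this pattern: filtering, aggregating, and combining tool results happens inside one action instead of bouncing each intermediate value back through the model.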

LLM Observability in Production: Tracing What You Can't Predict

· 10 min read
Tian Pan
Software Engineer

Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.

Traditional observability was built around deterministic systems. LLMs are structurally different — same input, different output, every time. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."

This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.

Tool Calling in Production: The Loop, the Pitfalls, and What Actually Works

· 9 min read
Tian Pan
Software Engineer

The first time your agent silently retries the same broken tool call three times before giving up, you realize that "just add tools" is not a production strategy. Tool calling unlocks genuine capabilities — external data, side effects, guaranteed-shape outputs — but the agentic loop that makes it work has sharp edges that don't show up in demos.

This post is about those edges: how the loop actually runs, the formatting rules that quietly destroy parallel execution, how to write tool descriptions that make the model choose correctly, and how to handle errors in a way that lets the model recover instead of spiral.
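The recovery-instead-of-spiral idea can be sketched as a dispatcher that never raises into the agent loop: tool failures come back as structured results the model can read, and a repeat-call check breaks the silent triple-retry. All names here are illustrative.

```python
# Sketch: tool dispatch that returns errors as data, so the model can adjust,
# and refuses to re-execute an identical failed call.
def call_tool(tools: dict, name: str, args: dict, seen_calls: set) -> dict:
    """Dispatch one tool call; never raise into the agent loop."""
    key = (name, tuple(sorted(args.items())))
    if key in seen_calls:
        # Same call, same args: tell the model instead of re-executing.
        return {"error": "repeated identical call; change your approach"}
    seen_calls.add(key)
    if name not in tools:
        return {"error": f"unknown tool '{name}'; available: {sorted(tools)}"}
    try:
        return {"result": tools[name](**args)}
    except Exception as exc:
        # The message goes back into context so the model can recover.
        return {"error": f"{type(exc).__name__}: {exc}"}

tools = {"divide": lambda a, b: a / b}
seen = set()
assert call_tool(tools, "divide", {"a": 6, "b": 2}, seen) == {"result": 3.0}
bad = call_tool(tools, "divide", {"a": 1, "b": 0}, seen)
assert "ZeroDivisionError" in bad["error"]
repeat = call_tool(tools, "divide", {"a": 1, "b": 0}, seen)
assert "repeated" in repeat["error"]
```

The design choice that matters is the error message's content: a descriptive, model-readable reason gives the model something to act on, where a bare failure invites the exact retry loop described above.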