788 posts tagged with "insider"

Parallel Tool Calls in LLM Agents: The Coupling Test You Didn't Know You Were Running

· 10 min read
Tian Pan
Software Engineer

Most engineers reach for parallel tool calling because they want their agents to run faster. Tool execution accounts for 35–60% of total agent latency depending on the workload — coding tasks sit at the high end, deep research tasks in the middle. Running independent calls simultaneously is the obvious optimization. What surprises most teams is what happens next.

The moment you enable parallel execution, every hidden assumption baked into your tool design becomes visible. Tools that work reliably in sequential order silently break when they run concurrently. The behavior that was stable turns unpredictable, and often the failure produces no error — just a wrong answer returned with full confidence.

Parallel tool calling is not primarily a performance feature. It is an involuntary architectural audit.
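
To make the silent failure concrete, here is a minimal sketch, assuming two hypothetical tools (`change_dir` and `list_dir`) that share a mutable `Session` object. None of these names come from a real agent framework; they only illustrate how an ordering assumption that holds sequentially disappears under `asyncio.gather`, without raising any error.

```python
import asyncio


class Session:
    """Hypothetical shared state that both tools touch."""

    def __init__(self):
        self.current_dir = "/"


async def change_dir(session, path):
    await asyncio.sleep(0.01)  # simulated I/O before the write lands
    session.current_dir = path
    return path


async def list_dir(session):
    # Reads shared state at call time; implicitly assumes change_dir already ran.
    seen = session.current_dir
    await asyncio.sleep(0)  # simulated I/O
    return f"listing {seen}"


async def run_sequential():
    session = Session()
    await change_dir(session, "/repo")
    return await list_dir(session)


async def run_parallel():
    session = Session()
    # Both calls start together, so list_dir reads the state before it is updated.
    _, listing = await asyncio.gather(change_dir(session, "/repo"), list_dir(session))
    return listing


sequential_result = asyncio.run(run_sequential())  # "listing /repo"
parallel_result = asyncio.run(run_parallel())      # "listing /" and no exception
```

Both runs return a well-formed string. Only the sequential one returns the listing the caller intended, which is the kind of hidden coupling that parallel execution exposes.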

Prompt Sprawl: When System Prompts Grow Into Unmaintainable Legacy Code

· 9 min read
Tian Pan
Software Engineer

Your system prompt started at 200 tokens. A clear role definition, a few formatting rules, a constraint or two. Six months later it's 4,000 tokens of accumulated instructions, half contradicting each other, and nobody on the team can explain why the third paragraph about JSON formatting exists. Welcome to prompt sprawl — the production problem that silently degrades your LLM application while everyone assumes the prompt is "fine."

Prompt sprawl is what happens when you treat prompts like append-only configuration. Every bug gets a new instruction. Every edge case gets a new rule. Every stakeholder gets a new paragraph. The prompt grows, and nobody removes anything because nobody knows what's load-bearing.

This is legacy code — except worse. No compiler catches contradictions. No type system enforces structure. No test suite validates that the 47th instruction doesn't negate the 12th. And unlike a tangled codebase, you can't refactor safely because there's no dependency graph to guide you.

The RAG Freshness Problem: How Stale Embeddings Silently Wreck Retrieval Quality

· 12 min read
Tian Pan
Software Engineer

Your RAG system launched three months ago with impressive retrieval accuracy. Today, it's confidently wrong about a third of what users ask — and nothing in your monitoring caught the change. No errors logged. No latency spikes. The semantic similarity scores look healthy. But the documents being retrieved are outdated, and the model answers with full confidence because the retrieved context looks authoritative.

This is the RAG freshness problem: semantic similarity does not care about time. An embedding of a deprecated API reference scores just as high as a current one. A policy document from last quarter retrieves ahead of the updated version. The system doesn't know and can't tell. Most teams discover their index is weeks or months stale only after a user complaint — and by then, users have already quietly stopped trusting it.
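
One possible mitigation, sketched here under stated assumptions rather than taken from the post, is to discount similarity by document age at re-ranking time. The `Hit` type, the exponential half-life decay, and the 90-day default are all illustrative choices, and the scores are made-up inputs.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    similarity: float  # raw semantic similarity from the vector index
    age_days: float    # time since the source document was last updated


def freshness_adjusted(hit, half_life_days=90.0):
    # Assumed decay model: the score halves every `half_life_days`.
    decay = 0.5 ** (hit.age_days / half_life_days)
    return hit.similarity * decay


def rerank(hits, half_life_days=90.0):
    return sorted(
        hits,
        key=lambda h: freshness_adjusted(h, half_life_days),
        reverse=True,
    )


# Illustrative inputs: the stale policy scores slightly higher on raw similarity.
hits = [Hit("policy-v1", 0.91, 180.0), Hit("policy-v2", 0.89, 5.0)]
by_similarity = [h.doc_id for h in sorted(hits, key=lambda h: h.similarity, reverse=True)]
by_freshness = [h.doc_id for h in rerank(hits)]
```

On raw similarity the stale `policy-v1` ranks first; with the decay applied, `policy-v2` does. The half-life is a tuning knob that depends on how quickly the corpus changes.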

The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't

· 10 min read
Tian Pan
Software Engineer

Here is a number that should give you pause before adopting a reasoning model for your agent: a single query that costs 7 tokens with a standard fast model costs 255 tokens with Claude extended thinking and 603 tokens with an aggressively-configured reasoning model. For an isolated chatbot query, that is manageable. But inside an agent loop that calls the model twelve times per task, you are not paying that premium once: you are paying it twelve times, compounded further by the growing context window that gets re-fed on every turn. Billing surprises have killed agent projects faster than accuracy problems.

The question is not whether reasoning models are better. On hard tasks, they clearly are. The question is whether they are better for your specific workload, at your specific position in the agent loop, and by a margin that justifies the cost. Most teams answer this incorrectly in both directions — they either apply reasoning models uniformly (burning budget on tasks that don't need them) or avoid them entirely (leaving accuracy gains on the table for the tasks that do).
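
The arithmetic behind those figures can be sketched as follows. The per-query token counts (7, 255, 603) and the twelve calls per task come from the text above; the 1,000-token base context and the assumption that every turn's output is appended to the next turn's input are hypothetical simplifications, since some providers do not re-feed reasoning tokens.

```python
# Per-query output tokens quoted in the text.
STANDARD, EXTENDED_THINKING, AGGRESSIVE = 7, 255, 603
CALLS_PER_TASK = 12


def premium(tokens, baseline=STANDARD):
    """How many times more output tokens than the standard model."""
    return tokens / baseline


def loop_output_tokens(per_call, calls=CALLS_PER_TASK):
    """Output tokens generated across one agent task."""
    return per_call * calls


def loop_input_tokens(per_call, calls=CALLS_PER_TASK, base_context=1000):
    # Assumption: each turn re-reads the base context plus all earlier outputs.
    return sum(base_context + turn * per_call for turn in range(calls))


thinking_premium = round(premium(EXTENDED_THINKING), 1)  # 36.4
aggressive_premium = round(premium(AGGRESSIVE), 1)       # 86.1
standard_output = loop_output_tokens(STANDARD)           # 84
thinking_output = loop_output_tokens(EXTENDED_THINKING)  # 3060
standard_input = loop_input_tokens(STANDARD)             # 12462
thinking_input = loop_input_tokens(EXTENDED_THINKING)    # 28830
```

Under these assumptions the re-fed input grows with the square of the number of turns, so the gap between the two models widens as the loop gets longer.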

The Reasoning Trace Privacy Problem: How Chain-of-Thought Leaks Sensitive Data in Production

· 9 min read
Tian Pan
Software Engineer

Your reasoning model correctly identifies that a piece of data is sensitive 98% of the time. Yet it leaks that same data in its chain-of-thought 33% of the time. That gap — between knowing something is private and actually keeping it private — is the core of the reasoning trace privacy problem, and most production teams haven't built for it.

Extended thinking has become a standard tool for accuracy-hungry applications: customer support triage, medical coding assistance, legal document review, financial analysis. These are also exactly the domains where the data in the prompt is most sensitive. Deploying reasoning models in these contexts without understanding how traces handle that data is a significant exposure.

Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs provide 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've already run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.
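
A back-of-the-envelope version of that math, as a sketch: the KV cache stores one key and one value vector per layer, per KV head, per token. The architecture numbers below (80 layers, 8 KV heads, head dimension 128) are assumptions chosen to resemble a 70B-class model with grouped-query attention, and the batch size and sequence length are hypothetical serving settings. It also ignores activations and framework overhead.

```python
GB = 1e9  # decimal gigabytes, matching the 140 GB weight estimate


def weight_bytes(n_params, bytes_per_param=2):
    """Model weights: parameters times bytes per parameter (2 for FP16)."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key and one value vector per layer and KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_cache_bytes(per_token, seq_len, batch_size):
    return per_token * seq_len * batch_size


# Assumed 70B-class configuration (not taken from the post).
per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
weights = weight_bytes(70e9)

# Hypothetical serving load: 32 concurrent sequences of 4,096 tokens each.
cache = kv_cache_bytes(per_token, seq_len=4096, batch_size=32)
total_gb = (weights + cache) / GB

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Weights: {weights / GB:.0f} GB, KV cache: {cache / GB:.1f} GB, total: {total_gb:.1f} GB")
```

With these assumed settings the cache alone adds roughly 43 GB, which pushes the total past the 160 GB the two GPUs provide. A model without grouped-query attention would have more KV heads and a proportionally larger cache.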

The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems

· 9 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o that broke something subtle but consequential. The model became significantly more agreeable. Users reported that it validated bad plans, reversed correct positions under the slightest pushback, and prefaced every response with effusive praise for the question. The behavior was so excessive that OpenAI rolled back the update within days, calling it a case where short-term feedback signals had overridden the model's honesty. The incident was widely covered, but the thing most teams missed is this: the degree was unusual, but the direction was not.

Sycophancy — the tendency of RLHF-trained models to prioritize user approval over accuracy — is present in nearly every production LLM deployment. A study evaluating ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro found sycophantic behavior in 58% of cases on average, with persistence rates near 79% regardless of context. This is not a bug in a few edge cases. It is a structural property of how these models were trained, and it shows up in production in ways that are hard to catch with standard evals.

The Three Attack Surfaces in Multi-Agent Communication

· 10 min read
Tian Pan
Software Engineer

A recent study tested 17 frontier LLMs in multi-agent configurations and found that 82% of them would execute malicious commands when those commands arrived from a peer agent — even though the exact same commands were refused when issued directly by a user. That number should reset your threat model if you're shipping multi-agent systems. Your agents may be individually hardened. Together, they're not.

Multi-agent architectures introduce communication channels that most security thinking ignores. We harden the model, the system prompt, the API perimeter. We spend almost no time on what happens when Agent A sends a message to Agent B — who wrote that message, whether it was tampered with, whether the memory Agent B consulted was planted three sessions ago by an attacker who never touched Agent A at all.
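
As a minimal sketch of the "who wrote that message, was it tampered with" question, the example below authenticates inter-agent messages with an HMAC from Python's standard library. The message shape, the function names, and the single shared key are assumptions for illustration. It covers only integrity and origin of a message in transit; it does nothing about memory that was poisoned earlier.

```python
import hashlib
import hmac
import json


def _canonical(sender, payload):
    # Stable serialization so sender and receiver hash identical bytes.
    return json.dumps({"sender": sender, "payload": payload}, sort_keys=True).encode()


def sign_message(payload, sender, key):
    tag = hmac.new(key, _canonical(sender, payload), hashlib.sha256).hexdigest()
    return {"sender": sender, "payload": payload, "tag": tag}


def verify_message(message, key):
    expected = hmac.new(
        key, _canonical(message["sender"], message["payload"]), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    return hmac.compare_digest(expected, message["tag"])


key = b"hypothetical-shared-secret"
signed = sign_message({"action": "summarize", "doc": "report.txt"}, "agent-a", key)

# An attacker rewrites the payload in transit but cannot recompute the tag.
tampered = dict(signed, payload={"action": "delete", "doc": "report.txt"})

genuine_ok = verify_message(signed, key)     # True
tampered_ok = verify_message(tampered, key)  # False
```

A receiving agent would drop any message for which `verify_message` returns `False` before the content ever reaches the model.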

The Agent Planning Module: A Hidden Architectural Seam

· 10 min read
Tian Pan
Software Engineer

Most agentic systems are built with a single architectural assumption that goes unstated: the LLM handles both planning and execution in the same inference call. Ask it to complete a ten-step task, and the model decides what to do, does it, checks the result, decides what to do next—all in one continuous ReAct loop. This feels elegant. It also collapses under real workloads in a way that's hard to diagnose because the failure mode looks like a model quality problem rather than a design problem.

The agent planning module—the component responsible purely for task decomposition, dependency modeling, and sequencing—is the seam most practitioners skip. It shows up only when things get hard enough that you can't ignore it.
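
To illustrate what dependency modeling and sequencing can look like once they are pulled out of the inference loop, here is a small sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The step names and the plan itself are hypothetical; in a real system the planner would produce this graph and a separate executor would consume the ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step maps to the set of steps it depends on.
plan = {
    "fetch_repo": set(),
    "run_build": {"fetch_repo"},
    "run_tests": {"run_build"},
    "open_pr": {"run_tests"},
}

# static_order() yields each step only after all of its dependencies.
execution_order = list(TopologicalSorter(plan).static_order())
print(execution_order)
```

Because the plan is explicit data, a cycle or a missing dependency surfaces as a planning error (`graphlib.CycleError`) instead of as a confusing failure halfway through execution.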

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

· 11 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.
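
Two of the failures above, the ambiguous "partial" status and the retry that fires twice, can be guarded against at the interface. The sketch below is one way to do it, assuming a hypothetical `Worker` agent and message shape: status values are parsed into an enum so unknown strings fail loudly, and results are cached by idempotency key so a retried message does not re-execute.

```python
from enum import Enum


class Status(str, Enum):
    PARTIAL = "partial"
    COMPLETE = "complete"


class Worker:
    """Hypothetical downstream agent that enforces the message contract."""

    def __init__(self):
        self._results = {}   # idempotency_key -> cached result
        self.executions = 0  # how many times work was actually performed

    def handle(self, message):
        # Unknown status strings raise ValueError instead of being guessed at.
        status = Status(message["status"])
        key = message["idempotency_key"]
        if key in self._results:
            return self._results[key]  # retry: return the earlier result
        self.executions += 1
        result = {"task": message["task"], "done": status is Status.COMPLETE}
        self._results[key] = result
        return result


worker = Worker()
msg = {"task": "index-docs", "status": "partial", "idempotency_key": "task-42"}
first = worker.handle(msg)
second = worker.handle(msg)  # simulated retry of the same message

try:
    worker.handle({"task": "x", "status": "finished", "idempotency_key": "task-43"})
    rejected_unknown_status = False
except ValueError:
    rejected_unknown_status = True
```

The retry returns the cached result without incrementing `executions`, "partial" is never read as completion, and the unrecognized "finished" status is rejected at the boundary where it can be logged.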

Agentic Coding in Production: What SWE-bench Scores Don't Tell You

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model, on SWE-bench Pro — a benchmark specifically designed to resist contamination with long-horizon tasks from proprietary codebases — scores 23%. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.

These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's the reason prompt changes are the primary source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.