266 posts tagged with "llm"

Context Engineering: Memory, Compaction, and Tool Clearing for Production Agents

· 10 min read
Tian Pan
Software Engineer

Most production AI agent failures don't happen because the model ran out of context. They happen because the model drifted long before it hit the limit. Forrester has called "agent drift" the silent killer of AI-accelerated development, and its 2025 research shows that nearly 65% of enterprise AI failures trace back to context drift or memory loss during multi-step reasoning, not raw token exhaustion.

The distinction matters. A hard context limit is clean: the API rejects the request, the agent stops, you get an error you can handle. Context rot is insidious: the model keeps running, keeps generating output, but performance quietly degrades. In long-context benchmarks, GPT-4's accuracy drops from 98.1% to 64.1% depending solely on where in the context window the relevant information is positioned. You don't get an error signal — you get subtly wrong answers.

This post covers the three primary tools for managing context in production agents — compaction, tool-result clearing, and external memory — along with the practical strategies for applying them before your agent drifts.
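Of the three tools named above, tool-result clearing is the simplest to sketch. This is a minimal illustration, assuming an OpenAI-style message list; the `KEEP_RECENT` threshold and the placeholder text are illustrative choices, not a prescribed API.

```python
# Sketch of tool-result clearing: old tool outputs are usually the bulkiest
# and least-referenced part of an agent transcript, so stub them out while
# keeping the most recent few verbatim.
KEEP_RECENT = 3  # how many most-recent tool results to keep in full (assumed)

def clear_old_tool_results(messages):
    """Replace all but the last KEEP_RECENT tool results with a short stub."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    for i in tool_indices[:-KEEP_RECENT]:
        messages[i] = {
            "role": "tool",
            "tool_call_id": messages[i].get("tool_call_id"),
            "content": "[tool result cleared to save context]",
        }
    return messages
```

The tool-call structure stays in place, so the model can still see *that* a tool ran and what it was called with; only the bulky payload is dropped.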

Effective Context Engineering for AI Agents

· 11 min read
Tian Pan
Software Engineer

Nearly 65% of enterprise AI failures in 2025 traced back to context drift or memory loss during multi-step reasoning — not model capability issues. If your agent is making poor decisions or losing coherence across a long task, the most likely cause is not the model. It is what is sitting in the context window.

The term "context engineering" is proliferating fast, but the underlying discipline is concrete: active, deliberate management of what enters and exits the LLM's context window at every inference step in an agent's trajectory. Not a prompt. A dynamic information architecture that the engineer designs and the agent traverses. The context window functions as RAM — finite, expensive, and subject to thrashing if you don't manage it deliberately.
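The RAM analogy can be made concrete with a budget allocator. This is a sketch under stated assumptions: the 4-characters-per-token heuristic and the 8,000-token budget are stand-ins, not a real tokenizer or any model's actual limit.

```python
# Treat the context window as a fixed budget: the system prompt and memory
# are pinned, and the oldest conversation turns are evicted first when the
# budget is exceeded.
MAX_TOKENS = 8000  # illustrative budget

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def assemble_context(system_prompt, memory, history):
    """Keep system prompt and memory; evict oldest history turns to fit."""
    used = count_tokens(system_prompt) + count_tokens(memory)
    kept = []
    for turn in reversed(history):       # walk newest-first
        cost = count_tokens(turn)
        if used + cost > MAX_TOKENS:
            break                        # everything older is evicted
        kept.append(turn)
        used += cost
    return [system_prompt, memory] + list(reversed(kept))
```

The point of the sketch is the priority ordering, not the arithmetic: deciding *what* is pinned and *what* is evictable is the information architecture the post describes.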

Mastering AI Agent Observability: Why Your Dashboards Are Lying to You

· 9 min read
Tian Pan
Software Engineer

Your agent is returning HTTP 200s. Latency is within SLA. Error rates are flat. Everything on the dashboard looks green — and your users are getting confidently wrong answers.

This is the core observability gap in AI systems: the metrics that traditionally signal system health are almost entirely irrelevant to whether your agent is actually doing its job. An agent can fluently hallucinate, skip required tools, use stale retrieval results, or reason itself into logical contradictions — all while your monitoring shows zero anomalies. The standard playbook for service observability doesn't transfer to agentic systems, and teams that don't understand this gap ship agents they can't trust, debug, or improve.

The 80% Problem: Why AI Coding Agents Stall and How to Break Through

· 10 min read
Tian Pan
Software Engineer

A team ships 98% more pull requests after adopting AI coding agents. Sounds like a success story — until you notice that review times grew 91% and PR sizes ballooned 154%. The code was arriving faster than anyone could verify it.

This is the 80% problem. AI coding agents are remarkably good at generating plausible-looking code. They stall, or quietly fail, when the remaining 20% requires architectural judgment, edge case awareness, or any feedback loop more sophisticated than "did it compile?" The teams winning with coding agents aren't the ones who prompted most aggressively. They're the ones who built better feedback loops, shorter context windows, and more deliberate workflows.

Systematic Debugging for AI Agents: From Guesswork to Root Cause

· 9 min read
Tian Pan
Software Engineer

When an AI agent fails in production, you rarely know exactly when it went wrong. You see the final output — a hallucinated answer, a skipped step, a tool called with the wrong arguments — but the actual failure could have happened three steps earlier. This is the core debugging problem that software engineering hasn't solved yet: agents execute as a sequence of decisions, and by the time you notice something is wrong, the evidence is buried in a long trace of interleaved LLM calls, tool invocations, and state mutations.

Traditional debugging assumes determinism. You can reproduce the bug, set a breakpoint, inspect the state. Agent debugging breaks all three of those assumptions simultaneously. The same input can produce different execution paths. Reproducing a failure requires capturing the exact context, model temperature, and external state at the moment it happened. And "setting a breakpoint" in a live reasoning loop is not something most agent frameworks even support.
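If breakpoints are off the table, the fallback is capturing enough per-step state to replay a failure after the fact. The sketch below is hypothetical: the field names and the global trace list are illustrative, not any framework's API.

```python
# Record, for every agent step, the exact inputs that made the step
# non-reproducible in the first place: the full prompt, the model and
# temperature, and the raw tool results.
import json
import time

TRACE = []

def record_step(step_id, prompt, model, temperature, tool_calls, output):
    TRACE.append({
        "step": step_id,
        "ts": time.time(),
        "prompt": prompt,           # exact context sent to the model
        "model": model,
        "temperature": temperature,
        "tool_calls": tool_calls,   # names, args, and raw results
        "output": output,
    })

def dump_trace(path):
    with open(path, "w") as f:
        json.dump(TRACE, f, indent=2)
```

With a trace like this on disk, "the failure could have happened three steps earlier" becomes a search problem rather than a guessing game.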

Context Engineering: The Discipline That Matters More Than Prompting

· 9 min read
Tian Pan
Software Engineer

Most engineers building LLM systems spend the first few weeks obsessing over their prompts. They A/B test phrasing, argue about whether to use XML tags or JSON, and iterate on system prompt wording until the model outputs something that looks right. Then they hit production, add real data, memory, and tool calls — and the model starts misbehaving in ways that no amount of prompt tuning can fix. The problem was never the prompt.

The real bottleneck in production LLM systems is context — what information is present in the model's input, in what order, how much of it there is, and whether it's relevant to the decision the model is about to make. Context engineering is the discipline of designing and managing that input space as a first-class system concern. It subsumes prompt engineering the same way software architecture subsumes variable naming: the smaller skill still matters, but it doesn't drive outcomes at scale.

Eval Engineering for Production LLM Systems

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM systems start with the wrong question. They ask "how do I evaluate this?" before understanding what actually breaks. Then they spend weeks building eval infrastructure that measures the wrong things, achieve 90%+ pass rates immediately, and ship products that users hate. The evaluations weren't wrong — they just weren't measuring failure.

Effective eval engineering isn't primarily about infrastructure. It's about developing a precise, shared understanding of what "good" means for your specific system. The infrastructure is almost incidental. In mature LLM teams, 60–80% of development time goes toward error analysis and evaluation — not feature work. That ratio surprises most engineers until they've shipped a broken model to production and spent a week debugging what went wrong.
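The "infrastructure is almost incidental" claim can be made literal: a harness that supports error analysis fits in a dozen lines. This is a minimal sketch; the `grade` contract and the failure-mode labels are assumptions for illustration.

```python
# An eval harness that reports not just a pass rate but a histogram of
# failure modes -- the raw material for error analysis.
from collections import Counter

def run_evals(cases, agent, grade):
    """grade(expected, actual) -> (passed: bool, failure_mode: str | None)."""
    modes = Counter()
    passed = 0
    for case in cases:
        actual = agent(case["input"])
        ok, mode = grade(case["expected"], actual)
        passed += ok
        if not ok:
            modes[mode] += 1
    return passed / len(cases), modes
```

The pass rate is the headline number; the `modes` counter is where the 60–80% of time described above actually gets spent.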

Multi-Agent Conversation Frameworks: The Paradigm Shift from Pipelines to Talking Agents

· 11 min read
Tian Pan
Software Engineer

A Google DeepMind study published in late 2025 analyzed 180 multi-agent configurations across five architectures and three LLM families. The finding that got buried in the discussion section: unstructured multi-agent networks amplify errors up to 17.2x compared to single-agent baselines. Not fix errors — amplify them. Agents confidently building on each other's hallucinations, creating echo chambers that make each individual model's failure modes dramatically worse.

This is the paradox at the center of multi-agent conversation frameworks. The same property that makes them powerful — agents negotiating, critiquing, delegating, and revising — is what makes them dangerous without careful design. Understanding the difference between conversation-based orchestration and traditional pipeline chaining is the first step toward using either correctly.

Agentic RAG: When Your Retrieval Pipeline Needs a Brain

· 10 min read
Tian Pan
Software Engineer

Ninety percent of agentic RAG projects failed in production in 2024. Not because the technology was broken, but because engineers wired up vector search, a prompt, and an LLM, called it a retrieval pipeline, and shipped — without accounting for the compounding failure costs at every layer between query and answer.

Classic RAG is a deterministic function: embed query → vector search → stuff context → generate. It runs once, in one direction, with no feedback loop. That works when queries are clean single-hop lookups against a well-chunked corpus. It fails spectacularly when a user asks "compare the liability clauses across these five contracts," or "summarize what's changed in our infra config since the Q3 incident," or any question that requires synthesizing evidence across documents before forming an answer.
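The deterministic function described above fits in a few lines, which is exactly why it gets shipped prematurely. A sketch, with `embed`, `vector_search`, and `llm` as stand-ins for whatever stack you use:

```python
# Classic one-shot RAG: embed query -> vector search -> stuff context ->
# generate. Runs once, in one direction, with no feedback loop.
def classic_rag(query, embed, vector_search, llm, k=5):
    qvec = embed(query)
    docs = vector_search(qvec, top_k=k)   # single retrieval, no retry
    context = "\n\n".join(d["text"] for d in docs)
    return llm(f"Context:\n{context}\n\nQuestion: {query}")
```

Every failure mode the post describes lives in what this function *doesn't* do: it never checks whether the retrieved documents answer the question, never decomposes a multi-hop query, and never retrieves again.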

Token Economics for AI Agents: Cutting Costs Without Cutting Corners

· 10 min read
Tian Pan
Software Engineer

A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.

AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops — consuming roughly 5x more tokens than a direct chat interaction. A ReAct loop running 10 cycles can consume 50x the tokens of a single pass, because each cycle re-sends the growing transcript. At frontier model prices, that math becomes a liability fast.

This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.
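The multiplier comes from transcript growth, which is easy to verify with back-of-envelope arithmetic. The prices below are illustrative assumptions, not a quote for any specific model.

```python
# Why ReAct loops are superlinear in cost: each cycle re-sends the whole
# growing transcript, so input tokens scale roughly with the square of
# the cycle count while output tokens scale linearly.
PRICE_IN = 3.00 / 1_000_000    # $/input token (assumed)
PRICE_OUT = 15.00 / 1_000_000  # $/output token (assumed)

def react_cost(base_in, base_out, cycles):
    # cycle i re-sends ~i copies of the base context
    total_in = sum(base_in * (i + 1) for i in range(cycles))
    total_out = base_out * cycles
    return total_in * PRICE_IN + total_out * PRICE_OUT

single = react_cost(2_000, 500, 1)    # one pass
looped = react_cost(2_000, 500, 10)   # 10-cycle ReAct loop
```

Under these assumptions the 10-cycle loop costs roughly 30x the single pass; with longer tool outputs per cycle the ratio climbs further, which is where figures like 50x come from.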

AlphaEvolve's Architecture: How Evolutionary Search + LLMs Discovered a Better Matrix Algorithm

· 9 min read
Tian Pan
Software Engineer

In 1969, Volker Strassen published an algorithm that multiplies 2×2 matrices with seven scalar multiplications instead of eight; applied recursively, it multiplies 4×4 matrices in 49 rather than the naive 64. For 56 years, no one found anything better. Then an AI coding agent, one whose discoveries already run in production across Google's global infrastructure, found a 48-multiplication method for 4×4 complex-valued matrices. It did so not by being smarter than a human mathematician, but by running a loop: generate a variant, evaluate it, keep what works, repeat.

That loop is the point. The LLM is just one piece. The architecture around it is what made AlphaEvolve work, and understanding that architecture tells you something important about where AI-assisted engineering is heading.
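The generate-evaluate-keep loop can be sketched in a few lines. This is a hypothetical simplification: `mutate` stands in for the LLM proposer and `score` for the automated evaluator, and the pool size is arbitrary.

```python
# Evolutionary search: maintain a small pool of candidates, mutate a
# random parent, keep the child only if it ranks among the best.
import random

def evolve(seed, mutate, score, generations=100, population=8):
    pool = [(score(seed), seed)]
    for _ in range(generations):
        _, parent = random.choice(pool)
        child = mutate(parent)               # LLM proposes a variant
        s = score(child)                     # automated evaluator, no human
        pool.append((s, child))
        pool.sort(key=lambda p: p[0], reverse=True)
        pool = pool[:population]             # keep what works
    return pool[0]                           # (best_score, best_candidate)
```

Note what carries the weight here: not the mutation operator but the evaluator. A fast, trustworthy `score` function is the precondition for the whole loop, which is the architectural lesson the post draws.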

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
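Closing that gap means grading the trace, not just the answer. The sketch below assumes a hypothetical trace shape (a `route`, a list of tool calls, a final answer); the field names are illustrative.

```python
# Step-level grading: score the routing decision and each tool call
# alongside the final answer, so a lucky correct output can't hide a
# broken intermediate step.
def grade_trace(trace, expected_route, expected_answer):
    return {
        "routed_correctly": trace["route"] == expected_route,
        "tool_errors": sum(1 for c in trace["tool_calls"] if c.get("error")),
        "final_correct": trace["answer"] == expected_answer,
    }
```

A case can now pass on `final_correct` while failing on `routed_correctly`, which is precisely the 40%-misrouting failure the final-output eval above never caught.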