Blog

Page 57

12 articles

Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness
Most agent bugs live in the joints between model, tools, and harness — single-layer logs cannot see them. Build a unified trace, an OpenTelemetry GenAI span surface, a cause-hypothesis panel, and a reproducibility envelope to debug agents like the distributed systems they are.
agent-observability, debugging
May 8 · 10 min
The Refusal Audit: Why a Single Refusal Rate Hides Half the Failure Distribution
Refusal rate is a two-sided distribution, but most safety dashboards plot only one side. Here is what to instrument, how to sample, and who should own the calibration.
llm-safety, evals
May 8 · 10 min
Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline
When source documents disappear, their embeddings linger in the vector index and keep returning confidently wrong answers. A field guide to tombstones, cascade invalidation, and retrieval-time freshness checks.
rag, vector-database
May 8 · 9 min
The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory
One session_id column, three meanings — billing, eval, and memory each define a conversation differently, and a single default ships three unrelated bugs with the same root cause.
insider, llm-ops
May 8 · 11 min
The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume
Most AI features ship with a visible reasoning trace because the model emits one and hiding it feels wasteful. It is a product decision the team never made — and a measurable source of trust loss.
insider, ai
May 8 · 11 min
Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More
Switching to a smaller model to cut cost-per-token can quietly raise your LLM bill. The right unit is cost per successful task, and most dashboards never measure it.
llm, cost-optimization
May 8 · 8 min
The Snapshot Trace Test: Production Traces as Your Regression Suite
Hand-curated LLM eval sets decay the moment user behavior shifts. Pin production traces, assert semantic equivalence on outputs, structural equality on tool calls, and latency bands instead of point estimates.
llm-evals, observability
May 8 · 10 min
The Stop-Sequence Footgun: When User Input Collides With Your Delimiter
Stop sequences chosen for clean engineering examples become silent ambient hazards once user content joins the prompt. How the bug manifests, why eval suites miss it, and the reserved-namespace fix that prevents recurrence.
llm, prompt-engineering
May 8 · 10 min
Streaming Structured Output: Why Your Parser Hangs on Token 47
Token-streaming and structured output are architecturally at odds. The naive try/catch JSON.parse loop is O(n²), the is_complete boolean is a lie, and partial enums are how a Delete tool fires on DeleteIfEmpty.
ai-engineering, streaming
May 8 · 11 min
The Summary Tax: When Compaction Eats More Tokens Than It Saves
Long-running agents trigger summarization on overflow or hierarchically, and at scale the compaction passes quietly become the dominant inference cost — and the dashboard never tells you.
insider, llm
May 8 · 10 min
The Thumbs-Down on the Right Answer: When User Feedback Trains Sycophancy
Thumbs-down ratings mix wrongness with unwelcomeness. Optimizing prompts against the raw signal trains agreement, not accuracy — and the math gets worse with scale.
ai-engineering, evaluation
May 8 · 9 min
Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe
Telemetry pipelines for AI agents now eat more budget than the LLM calls they observe. A field-by-field cost model — fingerprinted prompts, outcome-aware sampling, retention tiers — for keeping observability on the right side of the COGS line.
ai-engineering, observability
May 8 · 12 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
