Blog

Page 35

12 articles

The Support Ticket to Eval Case Pipeline Nobody Builds
Support tickets are the highest-signal eval dataset most AI teams own, but they rot in Zendesk while the eval suite drifts in Git. Here's the four-stage pipeline that closes the loop.
evals, ai-engineering
May 13 · 10 min
Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill
Reasoning tokens are billed as output, but they live in a field that most LLM observability stacks predate. Here is why finance finds the regressions first, and how to close the gap.
insider, llm-observability
May 13 · 9 min
Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET
Provider load is not a latency problem with a quality side effect — it is a distribution shift your eval suite never sees, and it ships a feature whose floor your team has not measured.
ai-engineering, evals
May 13 · 9 min
The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior
A new optional parameter on an existing tool description ships clean, breaks no callers, fails no evals — and quietly inflates tool call frequency by double digits because the planner's prior shifted. Why tool schemas need semver, frequency baselines, and the same eval discipline as system prompts.
ai-engineering, agents
May 13 · 10 min
Your PRD Is an Untested Prompt — Until You Eval It
A PRD for an AI feature is a system prompt nobody compiled. Run it through an eval before sign-off and the underspecification surfaces before production does.
insider, ai-engineering
May 13 · 9 min
Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers
Step-count budgets are fuses that blow after the damage is done. Real agent circuit breakers combine semantic loop detection, progress signals, token-velocity ceilings, and halt-with-handoff.
ai-agents, reliability
May 12 · 12 min
Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build
Long-term memory in agentic products is not a feature — it is a records-management system. Provenance, deletion, audit, and residency obligations land the day the first item is written, and retrofitting them under deadline costs more than building them at design time.
ai-agents, compliance
May 12 · 12 min
AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code
An LLM code reviewer is not a stable tool — it's a stack of independently drifting components. Here's why your PR bot's catch rate decays silently and what calibration discipline keeps the safety net from thinning.
insider, ai-code-review
May 12 · 9 min
AI Feature Dependency Graphs: When a Prompt Edit Is a Silent Breaking Change
A prompt edit is a breaking change to every downstream feature that consumes the output. Manifests, live-corpus contract tests, and drift alerts are how teams draw the AI dependency graph before the next outage draws it for them.
insider, ai-engineering
May 12 · 12 min
Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship
An eval score that climbs while the product silently decays is a measurement system whose calibration has slipped. Here is how annotation drift hides in plain sight, why both the rubric and the product move under your feet, and the four moves that keep eval numbers honest.
evals, llm-ops
May 12 · 10 min
Asymmetric Eval Economics: Why One Eval Case Costs More Than the Feature It Tests
A single eval case routinely costs more engineering effort than the feature it tests. Here is why teams underinvest in evals, and why treating them as capital expenditure fixes it.
insider, evals
May 12 · 9 min
Background Agents and the Notification Budget: Why Proactive AI Hits a Hard Ceiling at User Attention
Proactive AI agents collide with a hard daily ceiling of three to five notifications per user. Teams that don't budget attention ship features whose launch metric inverts their retention metric within weeks.
insider, ai-agents
May 12 · 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
