Blog

Page 38

12 articles

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review
Why LLM evals only catch regressions when they live in the PR comment next to the diff. Lessons from how code coverage migrated from nightly job to inline review surface — and the four engineering pieces that turn eval-as-a-job into eval-as-a-merge-gate.
insider, llm
Apr 30 · 11 min
Eval Set Rot: Why Your Score Trends Up While Users Trend Down
Eval scores climb while user complaints climb with them. An eval set built on launch-week traffic quietly stops measuring the product six months later — here is the shadow-set, resampling, and slicing discipline that keeps the dashboard honest.
insider, llm-evals
Apr 30 · 10 min
Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing
Most LLM agent memory collapses four layers into two — a buffer and a vector store. Working, session, episodic, and semantic each need their own tier.
agent-memory, llm
Apr 30 · 11 min
Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel
Multi-step agents look fast at the median and feel slow in the tail. Here is why composition punishes P50 dashboards, and how to design latency budgets that match what users actually experience.
insider, agents
Apr 30 · 10 min
Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring
Inference is 40-60% of an agent's true cost. The other half hides in vector DB, retrieval embeddings, telemetry, retries, evals, and human review — owned by no single team.
insider, ai-agents
Apr 30 · 10 min
The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR
How 'stateless' AI tool calls quietly leak data across tenants through shared caches, vector stores, and memory modules — and the audit protocol that catches it before customers do.
insider, ai-security
Apr 30 · 10 min
Why Your Prompt Library Should Be a Monorepo, Not a Cookbook
Cookbook-style prompt folders break at scale. Apply monorepo discipline — semantic versioning, dependency graphs, atomic refactors, and eval gates — to keep prompt drift, phantom dependencies, and migration paralysis out of production.
insider, prompt-engineering
Apr 30 · 11 min
Tool Call Ordering Is a Partial Order, Not a Set
Most production agents treat their tool set as an unordered bag of capabilities. It's actually a partial order, and the bugs live in the edges nobody declared.
insider, agents
Apr 30 · 10 min
Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops
Most production agents are background jobs cosplaying as chats. Here's why scheduled triggers, checkpointed state, and bounded envelopes outperform conversational loops on cost, reliability, and operability.
ai-agents, architecture
Apr 29 · 10 min
The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor
Provider model bumps carry no behavioral compatibility guarantee, so every version change should run through the same staged rollout as a database migration — pinned eval, shadow traffic, canary, and a real rollback path.
insider, llm-ops
Apr 28 · 11 min
Abstention as a Routing Decision: Why 'I Don't Know' Belongs in the Router, Not the Prompt
Putting 'I don't know' in the system prompt makes abstention untestable, unowned, and unscalable. Move it to the router and you get an SLO, an eval, and a real escalation path.
llm, agents
Apr 27 · 10 min
Agent Credential Blast Radius: The Principal Class Your IAM Model Never Enumerated
Agents inherited the broadest OAuth scopes the platform would issue, then drifted on a prompt — bringing back the privileged service account the security org spent a decade killing. A field guide to per-tool scoping, JIT credentials, action-level audit, and the IAM owner who owns the join.
ai-agents, security
Apr 27 · 11 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
