Blog

Page 47

12 articles

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob
Temperature is not a global knob. Allocate randomness per surface — routing, parsing, synthesis, generation, exploration — or pay for variance you do not need.
insider, agents
Apr 26 · 11 min
The Wiki Has a Second Tenant: Why Docs for AI Agents Are Different from Docs for Humans
Engineering orgs are quietly accumulating a second audience for their internal docs — every developer's AI assistant. The team that writes for only one of those readers is shipping the other broken.
ai-engineering, documentation
Apr 26 · 10 min
Embedding Migrations Are the New Schema Migrations
Swapping an embedding model in production is not a batch job — it's a schema migration with semantic consequences. Why pointwise evals miss the regression, what dual-write windows and neighborhood-stability metrics actually buy you, and where the cost frame surprises teams.
insider, rag
Apr 26 · 12 min
Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait
An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Audit, rotate, and de-monoculture your benchmark before green CI becomes a self-flattering lie.
evals, llmops
Apr 26 · 11 min
The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In
Prompt rewrites are the easy part of switching LLM providers. The eval harness is where the real lock-in lives — and the bill comes due the day you try to renegotiate.
llm, evals
Apr 26 · 10 min
The Eval-Rig Latency Lie: Why Your p95 Doubles in Production
Eval suites measure a quiet machine running serial calls against warm caches; production is a different system. Treat latency as a property of a deployment, not a model, or your p95 will lie.
llm-ops, evaluation
Apr 26 · 10 min
Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It
The 47-criterion rubric your engineers wrote to make the LLM-as-judge work has quietly become your product specification. Every weight, every score boundary, every missing criterion is a product decision the PM never made on the record.
insider, ai-engineering
Apr 26 · 11 min
The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse
An LLM eval suite is a simulator. Skip the recalibration cycle and you ship six green releases against a dataset that stopped resembling production around month three.
insider, llm
Apr 26 · 11 min
When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other
Four eval signals on a prompt change rarely agree. Without a written hierarchy for which signal wins under which conditions, every release week becomes a debate about whose number to trust.
evals, llm-evaluation
Apr 26 · 12 min
Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model
Few-shot examples are tuned to a specific model. After a model upgrade, the demonstrations that lifted accuracy can quietly start dragging it down — here is the audit and provenance discipline that prevents rot.
insider, llm
Apr 26 · 10 min
Found Capabilities: When Users Ship Features Your Team Never Roadmapped
Every sufficiently capable model exposes behaviors your team never roadmapped. Users find them, build workflows on top, and treat the next model upgrade as a regression. Here's the product discipline that turns found capabilities into decisions you actually own.
insider, ai
Apr 26 · 10 min
Generative UI as a Production Discipline: When the Model Renders the Screen
When the model emits component trees instead of text, design review, accessibility audits, and prompt-injection threat models all have to be rebuilt from scratch.
insider, generative-ui
Apr 26 · 12 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
