Traditional provisioning models fail for LLM workloads. Here's a forecasting methodology that accounts for token burstiness and KV cache pressure, and why GPU utilization is a misleading signal.
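A back-of-the-envelope sketch of why KV cache, not compute, often binds concurrency. The model-shape numbers are assumptions loosely modeled on a 70B-class transformer with grouped-query attention; substitute your model's actual config and weight footprint.

```python
# KV cache sizing sketch: illustrative numbers, not a vendor formula.
NUM_LAYERS = 80          # assumed transformer depth
NUM_KV_HEADS = 8         # assumed grouped-query attention KV heads
HEAD_DIM = 128           # assumed per-head dimension
BYTES_PER_ELEM = 2       # fp16/bf16 cache entries

def kv_bytes_per_token() -> int:
    # 2x for the separate key and value tensors, per layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

def max_concurrent_sequences(gpu_mem_gb: float, weights_gb: float,
                             avg_context_tokens: int) -> int:
    # Memory left after weights is what the KV cache competes for.
    free_bytes = (gpu_mem_gb - weights_gb) * 1024**3
    per_seq = kv_bytes_per_token() * avg_context_tokens
    return int(free_bytes // per_seq)

print(f"KV cache: {kv_bytes_per_token() / 1024:.0f} KiB per token")
# 80 GB card, 40 GB of weights (assumed), 8k-token average contexts.
print("max concurrent sequences:", max_concurrent_sequences(80, 40, 8_192))
```

With these numbers the card saturates at 16 concurrent 8k-token sequences while the GPU's compute utilization can still read low, which is exactly why utilization misleads.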
Real-time AI suggestions paradoxically increase cognitive load by shifting work from generation to verification. Here's the research and the design patterns that actually help.
Context compaction silently drops the failure records and constraint information that prevent agents from re-attempting operations they already know won't work. Here's how to design around it.
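One way to design around it: pin failure and constraint records so compaction can never touch them. This is a minimal sketch, not any framework's API; the record schema, tag names, and the summarize() stub are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    tokens: int
    kind: str  # "turn", "failure", or "constraint" (assumed tags)

def summarize(records: list[Record]) -> Record:
    # Stand-in for an LLM summarization call.
    joined = " / ".join(r.text for r in records)
    return Record(f"[summary] {joined[:200]}", tokens=50, kind="turn")

def compact(history: list[Record], budget: int) -> list[Record]:
    # Failure and constraint records are pinned: they survive compaction
    # verbatim, so the agent never re-attempts a known-bad operation.
    pinned = [r for r in history if r.kind in ("failure", "constraint")]
    rest = [r for r in history if r.kind == "turn"]
    kept: list[Record] = []
    used = sum(r.tokens for r in pinned)
    # Keep the most recent ordinary turns that still fit the budget.
    for r in reversed(rest):
        if used + r.tokens > budget:
            break
        kept.append(r)
        used += r.tokens
    evicted = rest[: len(rest) - len(kept)]
    summary = [summarize(evicted)] if evicted else []
    return pinned + summary + list(reversed(kept))
```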
Composing retrievers, rerankers, code interpreters, classifiers, and LLMs into pipelines that reliably outperform any single component — and the emergent failure modes that appear when you don't engineer the seams.
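Engineering the seams mostly means validating at the boundary, so a failure surfaces where it occurs rather than as degraded output three components downstream. A sketch under assumed stage names and checks:

```python
from typing import Callable

# Each stage: (name, transform, precondition on the incoming payload).
Stage = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]

def run_pipeline(stages: list[Stage], payload: dict) -> dict:
    for name, fn, precondition in stages:
        if not precondition(payload):
            raise ValueError(f"seam check failed entering stage '{name}': "
                             f"keys={sorted(payload)}")
        payload = fn(payload)
    return payload

pipeline: list[Stage] = [
    ("retrieve", lambda p: {**p, "chunks": ["..."]},
     lambda p: bool(p.get("query"))),
    ("rerank",   lambda p: {**p, "chunks": p["chunks"][:5]},
     lambda p: len(p.get("chunks", [])) > 0),  # empty retrieval fails loudly
    ("generate", lambda p: {**p, "answer": "..."},
     lambda p: "chunks" in p),
]

result = run_pipeline(pipeline, {"query": "how do I rotate keys?"})
```

The emergent failure mode this catches is the silent one: a retriever that returns nothing, and an LLM downstream that confidently answers anyway.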
Teams routinely pack codebases, histories, and documents into context and absorb the cost and quality degradation without measuring either. Here's why LLM context deserves the same explicit management as CPU registers, and how to build eviction policies that make it work.
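To make the register analogy concrete: score every context item for retention and spill the lowest scorers first when over budget. The scoring weights below are assumptions to tune against your own quality metrics, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    tokens: int
    last_used_step: int   # recency, like a register's last read
    pinned: bool = False  # system prompt, active task spec, etc.

def retention_score(item: ContextItem, now: int) -> float:
    if item.pinned:
        return float("inf")
    recency = 1.0 / (1 + now - item.last_used_step)
    density = 1.0 / max(item.tokens, 1)   # prefer evicting bulky items
    return 0.7 * recency + 0.3 * density  # assumed weights

def evict_to_budget(items: list[ContextItem], budget: int,
                    now: int) -> list[ContextItem]:
    ranked = sorted(items, key=lambda i: retention_score(i, now),
                    reverse=True)
    total, kept = 0, []
    for item in ranked:
        if item.pinned or total + item.tokens <= budget:
            kept.append(item)
            total += item.tokens
    return kept
```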
What actually happens when your LLM context fills up mid-session, why most frameworks handle it badly, and the summarization, selective retention, and externalization patterns that keep long-lived conversations coherent.
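The externalization pattern in miniature: oversized artifacts leave the window and are replaced by a retrievable pointer plus a short preview. The in-memory dict and the 500-token threshold are stand-ins for whatever store and budget you actually use.

```python
import hashlib

EXTERNAL_STORE: dict[str, str] = {}  # stand-in for a real document store

def externalize(text: str, threshold_tokens: int = 500) -> str:
    # Rough token estimate; replace with your tokenizer's count.
    approx_tokens = len(text.split())
    if approx_tokens <= threshold_tokens:
        return text
    key = hashlib.sha256(text.encode()).hexdigest()[:12]
    EXTERNAL_STORE[key] = text
    # The pointer keeps enough of a preview for the model to decide
    # whether fetching the full artifact is worth a tool call.
    return f"[externalized:{key}] {text[:120]}..."

def fetch(key: str) -> str:
    return EXTERNAL_STORE[key]
```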
HTTP error rates can't detect behavioral regression in LLM upgrades. Here's how to run blue/green and canary deployments with behavioral divergence as the real rollback signal.
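One way to operationalize the divergence signal: mirror a sample of prompts to the candidate, embed both models' answers, and trip rollback on drift rate. The 0.85 similarity floor and 5% divergence budget are assumed thresholds, and the embedding pairs come from whatever embedding model you already run.

```python
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def divergence_rate(pairs: list[tuple[list[float], list[float]]],
                    similarity_floor: float = 0.85) -> float:
    # Each pair is (embedding of incumbent answer, embedding of
    # candidate answer) for the same mirrored prompt.
    diverged = sum(1 for old, new in pairs
                   if cosine(old, new) < similarity_floor)
    return diverged / len(pairs)

def should_rollback(pairs, budget: float = 0.05) -> bool:
    # Exceeding the divergence budget trips rollback, even if every
    # response was a well-formed 200.
    return divergence_rate(pairs) > budget
```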
UX writing in system prompts, error messages, and capability disclosures directly shapes model behavior and user trust — in ways most engineering teams never measure.
Most RAG failures are diagnosed at query time but caused at index time. A technical guide to the chunk size, overlap, hierarchy, and metadata decisions that silently determine retrieval quality.
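A sketch of the index-time decisions in code: chunk size, overlap, and the metadata attached before embedding. The 400/80 word sizes are common starting points under assumption, not recommendations, and the source id is hypothetical.

```python
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[dict]:
    words = text.split()
    if not words:
        return []
    step = size - overlap  # overlap preserves context across boundaries
    out = []
    for i, start in enumerate(range(0, len(words), step)):
        if start > 0 and start + overlap >= len(words):
            break  # tail already covered by the previous chunk
        out.append({
            "text": " ".join(words[start:start + size]),
            # Metadata attached now is all that filters and rerankers
            # can see at query time.
            "meta": {"chunk": i, "word_start": start, "source": "doc-42"},
        })
    return out
```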
Vector ANN search finds semantically adjacent chunks, not necessarily the most useful ones. Layer cross-encoder reranking, MMR, and BM25 hybrid scoring to close the retrieval quality gap, with latency math that tells you when it pays off.
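Reciprocal rank fusion is one common way to combine the BM25 and vector rankings before the cross-encoder pass; the k=60 constant is the value from the original RRF paper, and the document ids are illustrative.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Fuse multiple rankings: each doc scores 1/(k + rank) per list
    # it appears in, so agreement across retrievers wins.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]
vector_hits = ["d1", "d4", "d3", "d9"]
candidates = rrf([bm25_hits, vector_hits])[:3]  # hand these to the reranker
```

The latency argument: fusion itself costs microseconds, while the cross-encoder dominates the budget, so fusing first lets you rerank a short candidate list instead of everything both retrievers returned.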
Traditional ML degrades gracefully on noisy data. LLMs hallucinate confidently, corrupt vector stores, and propagate errors downstream with apparent authority. Here's how to measure and mitigate the data quality tax.
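Measuring the tax starts at ingestion: gate documents before they reach the embedder, and trend the reject counts instead of absorbing them. The thresholds below are assumptions, and hash()-based dedup is a within-process sketch, not a durable fingerprint.

```python
import unicodedata
from collections import Counter

rejects: Counter = Counter()

def passes_gate(doc: str, seen_hashes: set[int]) -> bool:
    if len(doc.split()) < 20:             # too short to be useful (assumed)
        rejects["too_short"] += 1
        return False
    if hash(doc) in seen_hashes:          # exact duplicate
        rejects["duplicate"] += 1
        return False
    # Control characters beyond newlines/tabs suggest encoding debris.
    ctrl = sum(unicodedata.category(c) == "Cc"
               for c in doc if c not in "\n\t")
    if ctrl / max(len(doc), 1) > 0.01:    # assumed tolerance
        rejects["garbled"] += 1
        return False
    seen_hashes.add(hash(doc))
    return True
```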
When an agent runs for hours, knowing where it is, and whether it's still on track, becomes a first-class engineering problem. These are the patterns that solve it.
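One such pattern, sketched minimally: a progress ledger that records which subgoals have completed and when progress last advanced, so "where is it?" is a lookup rather than a log dive. The field names and the 15-minute stall heuristic are assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class ProgressLedger:
    plan: list[str]                       # ordered subgoals
    completed: set[int] = field(default_factory=set)
    last_advance: float = field(default_factory=time.monotonic)

    def record(self, step_index: int, succeeded: bool) -> None:
        if succeeded:
            self.completed.add(step_index)
            self.last_advance = time.monotonic()

    def status(self) -> str:
        return f"{len(self.completed)}/{len(self.plan)} subgoals complete"

    def is_stalled(self, max_idle_s: float = 900.0) -> bool:
        # No subgoal completed within the idle window: flag for review.
        return time.monotonic() - self.last_advance > max_idle_s
```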