Blog

Page 148

12 articles

The Self-Modifying Agent Horizon: When Your AI Can Rewrite Its Own Code
Self-modifying AI agents — systems that rewrite their own source code to improve benchmark performance — have jumped from research curiosity to reproducible result. Here is what the benchmark numbers actually mean, the failure modes buried in the papers, and the governance infrastructure you need before deploying any of this in production.
ai-agents, self-improvement
Apr 9 · 10 min
Semantic Caching for LLMs: The Cost Tier Most Teams Skip
Semantic caching eliminates LLM calls for semantically equivalent queries — but real production hit rates range from 10% to 70%. Here's the math, threshold tradeoffs, invalidation pitfalls, and failure modes to evaluate before you build.
llm, caching
Apr 9 · 11 min
The Semantic Failure Mode: When Your AI Runs Perfectly and Does the Wrong Thing
Production AI systems can return valid, confident responses while completely missing user intent. A practical framework for detecting and closing the gap between task completion and task correctness using implicit behavioral signals, trajectory analysis, and intent-alignment scoring.
ai-reliability, production-ai
Apr 9 · 9 min
The Stale World Model Problem in Long-Running Agents
Long-running AI agents silently accumulate stale assumptions about external state—files, APIs, databases—that diverge from reality mid-task. Here's how the failure compounds, why no framework solves it automatically, and five patterns to build in explicit freshness guarantees.
ai-agents, agent-reliability
Apr 9 · 10 min
The Streaming Infrastructure Behind Real-Time Agent UIs
Four ways agent streaming fails in production — and the server-side architecture decisions for SSE transport, backpressure, graceful cancellation, and browser-refresh reconnection that actually make real-time agent UIs reliable.
ai-engineering, streaming
Apr 9 · 12 min
Structured Output Reliability in Production LLM Systems
Naive JSON prompting fails 15–20% of the time in production. Learn how constrained decoding, schema design patterns, and the validate-retry loop eliminate structured output failures before they propagate through your pipeline.
llm, reliability
Apr 9 · 10 min
The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems
LLM sycophancy is present in 58% of production deployments and evades standard evals. Here are the flip test, pressure testing, and architectural patterns that catch it before it undermines your system's integrity.
insider, production-ai
Apr 9 · 9 min
Text-to-SQL in Production: Why Correct SQL Is the Easy Part
LLMs score 86% on SQL benchmarks and 10% on your actual warehouse. The queries that fail don't error—they return wrong data. A taxonomy of silent failure modes and the layered architecture that catches them.
text-to-sql, llm
Apr 9 · 10 min
The Three Attack Surfaces in Multi-Agent Communication
82% of frontier LLMs comply with malicious commands from peer agents even when refusing them from users. Here are the three distinct attack surfaces — prompt injection, agent spoofing, and memory poisoning — and the protocol-level defenses each requires.
insider, security
Apr 9 · 10 min
Why Your Thumbs-Down Data Is Lying to You: Selection Bias in Production AI Feedback Loops
Only 1–3% of users click rating buttons — and they are systematically different from everyone else. How selection bias distorts RLHF training data, amplifies preference collapse, and hides 80% of your quality problems, plus the five implicit behavioral signals that capture ground truth from every user.
feedback-loops, rlhf
Apr 9 · 9 min
The Token Economics of Chain-of-Thought: When Thinking Out Loud Costs More Than It's Worth
Chain-of-thought prompting inflates token costs 2–5x and adds seconds of latency — yet delivers no measurable accuracy gain for most production tasks. A decision framework for when CoT pays off, when it hurts, and cheaper alternatives like Chain-of-Draft that close the gap at a fraction of the cost.
chain-of-thought, llm-cost-optimization
Apr 9 · 8 min
The Tool Result Validation Gap: Why AI Agents Blindly Trust Every API Response
AI agent frameworks validate tool inputs but never check what comes back. Here are the three categories of tool result failures and a tiered validation architecture that catches schema violations, stale data, and semantically wrong results before they reach users.
ai-agent, tool-use
Apr 9 · 10 min

About Tian Pan

I'm Tian Pan, an engineer-founder focused on agentic engineering — building autonomous AI systems and scaling engineering teams. I write practical guides on system design, technical leadership, and shipping with AI agents. Previously an early engineer at Uber, Brex, and IoTeX.
