
578 posts tagged with "insider"


Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

Tian Pan · Software Engineer · 11 min read

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.
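
As a sketch of what closing that gap looks like, here is the standard W3C trace-context pattern applied at an agent boundary with OpenTelemetry. The endpoint, span names, and run_tool dispatch below are illustrative assumptions, not details from the article:

```python
# Minimal sketch: explicit W3C trace-context propagation across an
# agent -> tool-server hop with OpenTelemetry. Endpoint and span names
# are illustrative.
import requests
from opentelemetry import trace, propagate

tracer = trace.get_tracer("agent-a")

def call_tool_server(payload: dict) -> dict:
    # Child span for the outbound call; its context becomes the parent
    # of whatever the tool server records.
    with tracer.start_as_current_span("agent_a.call_tool") as span:
        headers: dict[str, str] = {}
        propagate.inject(headers)  # writes the `traceparent` header
        resp = requests.post("http://tool-server/run", json=payload,
                             headers=headers, timeout=10)
        span.set_attribute("tool.status_code", resp.status_code)
        return resp.json()

# On the tool-server side, the receiving handler restores the context,
# so its spans attach to the same trace instead of starting a new one.
def handle_request(headers: dict, body: dict) -> dict:
    ctx = propagate.extract(headers)  # rebuilds the remote parent context
    tool_tracer = trace.get_tracer("tool-server")
    with tool_tracer.start_as_current_span("tool.execute", context=ctx):
        return run_tool(body)  # hypothetical tool dispatch
```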

Embedding Drift: The Silent Degradation Killing Your Long-Lived RAG System

Tian Pan · Software Engineer · 10 min read

Your RAG system is running fine. Latency is normal. Error rate is zero. But a user asking about "California employment law" keeps getting results about real estate — and your logs show nothing wrong.

This is embedding drift in action: the retrieval failure mode that doesn't throw exceptions, doesn't spike error rates, and doesn't show up in standard observability dashboards. It happens when your vector store accumulates embeddings produced under different conditions — different model versions, different chunking rules, different preprocessing pipelines — and the vectors start pointing in incompatible directions. The system keeps serving requests, but the semantic coordinates are no longer aligned, and retrieval quality erodes quietly over weeks or months.
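
One common mitigation is a sentinel probe: keep a small fixed corpus, store its baseline vectors at index-build time, and periodically check that the current pipeline still reproduces them. A minimal sketch, assuming a generic embed() callable and an illustrative 0.95 similarity threshold:

```python
# Sketch of a drift probe: re-embed a fixed "sentinel" set of texts and
# compare against baseline vectors captured when the index was built.
# If the current pipeline no longer reproduces the baselines, the store
# likely contains mixed-space vectors.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_embedding_drift(sentinels: list[str],
                          baseline: dict[str, np.ndarray],
                          embed,                      # current embedding fn
                          threshold: float = 0.95) -> list[str]:
    """Return sentinel texts whose fresh embedding has drifted."""
    drifted = []
    for text in sentinels:
        fresh = np.asarray(embed(text))
        if cosine(fresh, baseline[text]) < threshold:
            drifted.append(text)
    return drifted
```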

The EU AI Act Features That Silently Trigger High-Risk Compliance — and What You Must Ship Before August 2026

Tian Pan · Software Engineer · 9 min read

An appliedAI study of 106 enterprise AI systems found that 40% had unclear risk classifications. That number is not a reflection of regulatory complexity — it is a reflection of how many engineering teams shipped AI features without asking whether the feature changes their compliance tier. The EU AI Act has a hard enforcement date of August 2, 2026 for high-risk systems. At that point, being in the 40% is not a management problem. It is an architecture problem you will be fixing at four times the original cost, under deadline pressure, with regulators watching.

This article is not a legal overview. It is an engineering read on the specific product decisions that silently trigger high-risk classification, the concrete deliverables those classifications require, and why the retrofit path is so much more expensive than the build-it-in path.

Eval Set Decay: Why Your Benchmark Becomes Misleading Six Months After You Build It

Tian Pan · Software Engineer · 10 min read

You spend three weeks curating a high-quality eval set. You write test cases that cover the edge cases your product manager worries about, sample real queries from beta users, and get a clean accuracy number that the team aligns on. Six months later, that number is still in the weekly dashboard. You just shipped a model update that looked great on evals. Users are filing tickets.

The problem isn't that the model regressed. The problem is that your eval set stopped representing reality months ago—and nobody noticed.

This failure mode has a name: eval set decay. It happens to almost every production AI team, and it's almost never caught until the damage is visible in user behavior.
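
A lightweight decay check is to compare the eval set's category mix against recent production traffic. A sketch using the population stability index (PSI), with an illustrative 0.2 alert threshold; the category labels are whatever taxonomy your team already uses:

```python
# Sketch: a decay check comparing the eval set's category distribution
# against recent production traffic via the population stability index.
# PSI above ~0.2 conventionally signals a significant shift.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, eps: float = 1e-6) -> float:
    """PSI between two category distributions (each sums to 1)."""
    e = np.clip(expected, eps, None)
    a = np.clip(actual, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

def eval_set_decayed(eval_counts: dict[str, int],
                     prod_counts: dict[str, int],
                     threshold: float = 0.2) -> bool:
    cats = sorted(set(eval_counts) | set(prod_counts))
    e = np.array([eval_counts.get(c, 0) for c in cats], dtype=float)
    p = np.array([prod_counts.get(c, 0) for c in cats], dtype=float)
    return psi(e / e.sum(), p / p.sum()) > threshold
```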

Foundation Model Vendor Strategy: What Enterprise SLAs Actually Guarantee

Tian Pan · Software Engineer · 12 min read

Enterprise teams pick LLM vendors based on benchmarks and demos. Then they hit production and discover what the SLA actually says — which is usually much less than they assumed. The 99.9% uptime guarantee you negotiated doesn't cover latency. The data processing agreement your legal team signed doesn't prohibit training on your inputs unless you explicitly added that clause. And the vendor concentration risk that nobody quantified becomes painfully obvious when your core product is down for four hours because a telemetry deployment cascaded through a Kubernetes control plane.

This is not a procurement problem. It's an engineering problem that procurement can't solve alone. The people who build AI systems need to understand what these contracts actually say — and what they don't.
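
Since uptime guarantees rarely cover latency, one defensive pattern is to measure latency client-side and route around a slow primary. A rough sketch, where the vendor client interface, the 8-second p99 budget, and the sample window are all assumptions:

```python
# Sketch: client-side latency SLO enforcement with failover. The SLA
# won't page you about slow responses, so measure them yourself.
import time
from collections import deque

class VendorRouter:
    def __init__(self, primary, fallback, p99_budget_s=8.0, window=500):
        self.primary, self.fallback = primary, fallback
        self.budget = p99_budget_s
        self.samples = deque(maxlen=window)   # rolling primary latencies

    def _p99(self) -> float:
        if len(self.samples) < 50:            # not enough data to judge
            return 0.0
        ordered = sorted(self.samples)
        return ordered[int(0.99 * (len(ordered) - 1))]

    def complete(self, prompt: str) -> str:
        client = self.fallback if self._p99() > self.budget else self.primary
        start = time.monotonic()
        try:
            return client.complete(prompt)    # hypothetical client interface
        finally:
            if client is self.primary:
                self.samples.append(time.monotonic() - start)
        # A production router would also keep probing the primary while
        # failed over, so it can detect recovery.
```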

The Evaluation Paradox: How Goodhart's Law Breaks AI Benchmarks

Tian Pan · Software Engineer · 10 min read

In late 2024, OpenAI's o3 system scored 75.7% on the ARC-AGI benchmark — a test specifically designed to resist optimization. The AI research community celebrated. Then practitioners looked closer: o3 had been trained on 75% of the benchmark's public training set, and the highest-compute configuration used 172 times more resources than the baseline. It wasn't a capability breakthrough dressed up as a score. It was a score dressed up as a capability breakthrough.

This is the evaluation paradox. The moment a benchmark becomes the thing teams optimize for, it stops measuring what it was designed to measure. Goodhart's Law — "when a measure becomes a target, it ceases to be a good measure" — was articulated in 1970s economic policy, but it describes AI benchmarking with eerie precision.

Hallucination Is Not a Root Cause: A Debugging Methodology for AI in Production

Tian Pan · Software Engineer · 10 min read

When a lawyer cited non-existent court cases in a federal filing, the incident was widely reported as "ChatGPT hallucinated." When a consulting firm's government report contained phantom footnotes, the postmortem read "AI fabricated citations." When a healthcare transcription tool inserted violent language into medical notes, the explanation was simply "the model hallucinated." In each case, an expensive failure got a three-word root cause that made remediation impossible.

"The model hallucinated" is the AI equivalent of writing "unknown error" in a stack trace. It describes what happened without telling you why it happened or how to fix it. Every hallucination has a diagnosable cause — usually one of four categories — and each category demands a different engineering response. Teams that understand this distinction ship AI systems that degrade gracefully. Teams that don't keep playing whack-a-mole with prompts.

The Idempotency Problem in Agentic Tool Calling

Tian Pan · Software Engineer · 11 min read

The scenario plays out the same way every time. Your agent is booking a hotel room, and a network timeout occurs right after the payment API call returns 200 but before the confirmation is stored. The agent framework retries. The payment runs again. The customer is charged twice, support escalates, and someone senior says the AI "hallucinated a double charge" — which is wrong but feels right because nobody wants to say their retry logic was broken from the start.

This isn't an AI problem. It's a distributed systems problem that the AI layer imported wholesale, without the decades of hard-won patterns that distributed systems engineers developed to handle it. Standard agent retry logic assumes operations are idempotent. Most tool calls are not.
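
The standard distributed-systems remedy is an idempotency key derived deterministically from the logical step, so a retried call is recognized as a duplicate rather than re-executed. A sketch, assuming a payments client that accepts an idempotency key (as providers like Stripe do) and a generic completed-operations store:

```python
# Sketch: deterministic idempotency keys for agent tool calls. The
# payments client and store interfaces are assumptions.
import hashlib
import json

def idempotency_key(run_id: str, step_id: str, args: dict) -> str:
    """Same agent run + step + arguments => same key on every retry."""
    blob = json.dumps({"run": run_id, "step": step_id, "args": args},
                      sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def charge_once(payments, store, run_id, step_id, args):
    key = idempotency_key(run_id, step_id, args)
    cached = store.get(key)       # completed-operations store (e.g. Redis)
    if cached is not None:
        return cached             # timeout hit after success: no re-charge
    # Passing the key downstream means that even if we crash between the
    # charge and the store.put, the provider dedupes the retried request.
    result = payments.charge(**args, idempotency_key=key)
    store.put(key, result)
    return result
```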

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

Tian Pan · Software Engineer · 9 min read

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.
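
A back-of-envelope model shows how this happens. If a multi-step chain is rerun whenever any step fails, per-step quality compounds, and a model that is 2.5x faster per call can still lose end to end. All numbers below are illustrative:

```python
# Back-of-envelope sketch of why a locally faster model can be globally
# slower: a 5-step chain that is rerun on any step failure.
def chain_latency(per_call_s: float, per_step_pass: float,
                  steps: int = 5) -> float:
    pipeline_pass = per_step_pass ** steps   # all steps must succeed
    attempts = 1 / pipeline_pass             # expected reruns (geometric)
    return steps * per_call_s * attempts

print(f"large model:     {chain_latency(3.0, 0.95):.1f}s end to end")
print(f"distilled model: {chain_latency(1.2, 0.70):.1f}s end to end")
# large: ~19.4s; distilled: ~35.7s, despite 2.5x faster single calls
```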

Invisible Model Drift: How Silent Provider Updates Break Production AI

Tian Pan · Software Engineer · 10 min read

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.
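
The usual defense is a canary: a pinned set of prompts with known expected outputs, run on a schedule against the production model, alerting when accuracy leaves the baseline band. A minimal sketch, where the case file, call_model() interface, baseline, and tolerance are all assumptions:

```python
# Sketch of a drift canary: run a pinned prompt set on a schedule and
# alert when accuracy falls outside the baseline band.
import json

def run_canary(call_model, cases_path="canary_cases.json",
               baseline_acc=0.94, tolerance=0.05) -> bool:
    with open(cases_path) as f:
        cases = json.load(f)      # [{"prompt": ..., "expected": ...}]
    hits = sum(
        1 for c in cases
        if call_model(c["prompt"]).strip() == c["expected"]
    )
    acc = hits / len(cases)
    if acc < baseline_acc - tolerance:
        raise RuntimeError(
            f"canary accuracy {acc:.2%} below baseline {baseline_acc:.2%}: "
            "provider behavior may have changed"
        )
    return True
```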

The Idempotency Crisis: LLM Agents as Event Stream Consumers

Tian Pan · Software Engineer · 11 min read

Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.

LLMs are not deterministic processors. The same prompt with the same input produces different outputs on each run. Even with temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a non-deterministic processor does not give you effectively-once behavior. It gives you unpredictable behavior — and that's a crisis waiting to happen in production.
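
The classical fix still applies, but it has to wrap the model rather than rely on it: key the result on the event ID, so a redelivered event replays the stored output instead of generating a new, different one. A sketch, assuming generic store and llm interfaces and an event with id and payload fields:

```python
# Sketch of an idempotent-consumer guard in front of a non-deterministic
# processor. The store, llm, and event interfaces are assumptions.
def consume(event, llm, store):
    key = f"llm-result:{event.id}"
    cached = store.get(key)
    if cached is not None:
        return cached             # duplicate delivery: replay, don't re-run
    result = llm.complete(event.payload)
    # Write before committing the consumer offset; if we crash first,
    # redelivery regenerates, but downstream consumers never observe two
    # different results for the same event.
    store.put_if_absent(key, result)  # hypothetical atomic first-write-wins
    return store.get(key)             # winner of any concurrent race
```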

LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks

Tian Pan · Software Engineer · 10 min read

Most conversations about LLMs in production orbit around chat interfaces, copilots, and autonomous agents. But if you audit where enterprise LLM tokens are actually being consumed, a different picture emerges: a quiet majority of usage is happening inside batch data pipelines — extracting fields from documents, classifying support tickets, normalizing messy vendor records, enriching raw events with semantic labels. Nobody is writing conference talks about this tier. Nobody is benchmarking it seriously either. And that silence is costing teams real money and real accuracy.

This is the ETL tier that practitioners build first, justify last, and monitor least. It is also, for most organizations, the layer where LLM spend has the highest leverage — and the highest potential for invisible failure.
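
What minimally instrumenting that tier looks like: a batch classification step with schema validation and per-record cost accounting, the two controls these pipelines most often lack. The model interface, prices, and label set below are illustrative assumptions:

```python
# Sketch: an LLM-backed ETL classification step with output validation
# and cost tracking. Prices and the label set are illustrative.
import json

LABELS = {"billing", "bug", "feature_request", "other"}
PRICE_PER_1K_IN, PRICE_PER_1K_OUT = 0.00015, 0.0006   # $/1K tokens, assumed

def classify_batch(records, call_model):
    total_cost, out = 0.0, []
    for rec in records:
        resp = call_model(
            f"Classify this support ticket as one of {sorted(LABELS)}. "
            f"Reply with JSON {{\"label\": ...}}.\n\n{rec['text']}"
        )
        total_cost += (resp.input_tokens * PRICE_PER_1K_IN
                       + resp.output_tokens * PRICE_PER_1K_OUT) / 1000
        try:
            label = json.loads(resp.text)["label"]
            if label not in LABELS:
                raise ValueError(label)
        except (ValueError, KeyError):
            label = "other"       # quarantine bucket, not a silent failure
        out.append({**rec, "label": label})
    return out, total_cost
```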