780 posts tagged with "ai-engineering"

Amortizing Context: Persistent Agent Memory vs. Long-Context Windows

April 20, 2026 · 9 min read

Software Engineer

When 1 million-token context windows became commercially available, a lot of teams quietly decided they'd solved agent memory. Why build a retrieval system, manage a vector database, or design an eviction policy when you can just dump everything in and let the model sort it out? The answer comes back in your infrastructure bill. At 10,000 daily interactions with a 100k-token knowledge base, the brute-force in-context approach costs roughly $5,000/day. A retrieval-augmented memory system handling the same load costs around $333/day — a 15x gap that compounds as your user base grows.

The real problem isn't just cost. It's that longer contexts produce measurably worse answers. Research consistently shows that models lose track of information positioned in the middle of very long inputs, accuracy drops predictably when relevant evidence is buried among irrelevant chunks, and latency climbs in ways that make interactive agents feel broken. The "stuff everything in" approach doesn't just waste money — it trades accuracy for the illusion of simplicity.

Behavioral Signals That Actually Measure User Satisfaction in AI Products

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most AI product teams ship a thumbs-up/thumbs-down widget and call it a satisfaction measurement system. They are measuring something — just not satisfaction.

A developer who presses thumbs-down on a Copilot suggestion because the function signature is wrong, and a developer who presses thumbs-down because the suggestion was excellent but not what they needed right now, are generating the same signal. Meanwhile, the developer who quietly regenerated the response four times before giving up generates no explicit signal at all. That absent signal is a better predictor of churn than anything the rating widget captures.

The implicit behavioral record your users leave while using your AI product is richer, more honest, and more actionable than anything they'll type or tap voluntarily. This post covers which signals to collect, why they outperform explicit feedback, and the instrumentation schema that keeps AI-specific telemetry from poisoning your general product analytics.

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.

The CAP Theorem for AI Agents: Choosing Consistency or Availability When Your LLM Is the Bottleneck

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Every engineer who has shipped a distributed system has stared at the CAP theorem and made a choice: when the network partitions, do you keep serving stale data (availability) or do you refuse to serve until you have a consistent answer (consistency)? The theorem tells you that you cannot have both.

AI agents face an identical tradeoff, and almost nobody is making it explicitly. When your LLM call times out, when a tool returns garbage, when a downstream API is unavailable — what does your agent do? In most production systems, the answer is: it guesses. Quietly. Confidently. And often wrong.

The failure mode isn't dramatic. There's no exception in the logs. The agent "answered" the user. You only find out two weeks later when someone asks why the system booked the wrong flight, extracted the wrong entity, or confidently told a customer a price that no longer exists.

Chunking Strategy Is the Hidden Load-Bearing Decision in Your RAG Pipeline

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most RAG quality conversations focus on the wrong things. Teams debate embedding model selection, tweak retrieval top-K, and experiment with prompt templates — while a single architectural decision made during ingestion quietly caps how good the system can ever be. That decision is chunking strategy: how you cut documents into pieces before indexing them.

A 2025 benchmark study found that chunking configuration has as much or more influence on retrieval quality as embedding model choice. And yet teams routinely pick a default — 512 tokens with RecursiveCharacterTextSplitter, usually — and then spend months wondering why their retrieval precision keeps disappointing them. The problem was baked in at index time. Swapping models cannot fix it.

Communicating AI Limitations Across the Organization: A Framework for Engineering Leaders

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

The demo worked perfectly. Legal had signed off. Sales was already promising customers the feature would ship next quarter. Then the first production failure happened — the model confidently drafted a clause that cited a contract term that didn't exist, sales forwarded it to a customer, and legal spent three weeks in damage control.

This is not a story about a bad model. It's a story about miscommunication. The engineering team knew the model could hallucinate. Legal assumed it wouldn't. Sales assumed any failure would be caught before reaching customers. Ops assumed someone else was monitoring for exactly this. Nobody was lying. Everyone was working from a different mental model of the same system.

The root cause of most AI project failures isn't the AI. According to RAND Corporation's analysis of failed AI initiatives, "misunderstood problem definition" — which includes miscommunication about capability limits — is the single most common cause. Between 70 and 95% of enterprise AI initiatives fail to deliver their intended outcomes, and the technology is rarely the limiting factor. The limiting factor is that every team in your organization is quietly building a different theory of what your AI system does, and nobody has explicitly corrected any of them.

The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most AI pipeline failures aren't model failures. The model fires fine. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month—each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.

Conversation State Is Not a Chat Array: Multi-Turn Session Design for Production

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most multi-turn LLM applications store conversation history as an array of messages. It works fine in demos. It breaks in production in ways that take days to diagnose because the failures look like model problems, not infrastructure problems.

A user disconnects mid-conversation and reconnects to a different server instance—session gone. An agent reaches turn 47 in a complex task and the payload quietly exceeds the context window—no error, just wrong answers. A product manager asks "can we let users try a different approach from step 3?"—and the engineering answer is "no, not with how we built this." These are not edge cases. They are the predictable consequences of treating conversation state as a transient array rather than a first-class resource.

The Data Quality Ceiling That Prompt Engineering Can't Break Through

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

A telecommunications company spent months tuning prompts on their customer service chatbot. They iterated on system instructions, few-shot examples, chain-of-thought formatting. The hallucination rate stayed stubbornly above 50%. Then they audited their knowledge base and found it was filled with retired service plans, outdated billing information, and duplicate policy documents that contradicted each other. After fixing the data — not the prompts — hallucinations dropped to near zero. The fix that prompt engineering couldn't deliver took three weeks of data cleanup.

This is the data quality ceiling: a hard performance wall that blocks every LLM system fed on noisy, stale, or inconsistent data, and that no amount of prompt iteration can breach. It's one of the most common failure modes in production AI, and one of the most systematically underdiagnosed. Teams that hit this wall keep turning the prompt knobs when the problem is upstream.

EU AI Act Compliance Is an Engineering Problem: The Audit Trail You Have to Ship

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most engineering teams building AI systems in 2026 understand that the EU AI Act exists. Very few understand what it actually requires them to build. The regulation's core obligations for high-risk AI systems — automatic event logging, human oversight mechanisms, risk management systems, technical documentation — are not policy artifacts that a legal team can produce on a deadline. They are engineering deliverables that require architectural decisions made at the start of a project, not in the final sprint before a compliance audit.

The hard deadline is August 2, 2026. High-risk AI systems deployed in the EU must be in full compliance with Articles 9 through 15. Organizations deploying AI in employment screening, credit scoring, benefits allocation, healthcare prioritization, biometric identification, or critical infrastructure management are in scope. If your system makes decisions that materially affect people in those domains and touches EU residents, it is almost certainly high-risk. And realistic compliance implementation timelines run 8 to 14 months — which means if you haven't started, you're already late.

The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.

Six months later, the eval suite reports 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.

This is the golden dataset decay problem, and it's more common than most teams admit.

About Tian Pan