Skip to main content

861 posts tagged with "insider"

View all tags

Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 9 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.

The AI Adoption Paradox: Why the Highest-Value Domains Get AI Last

· 8 min read
Tian Pan
Software Engineer

The teams that stand to gain the most from AI are often the last ones deploying it. A healthcare organization that could use AI to catch medication errors in real time sits at 39% AI adoption, while a software company running AI-powered code review ships at 92%. The ROI differential is not even close — yet the adoption rates are inverted. This is the AI adoption paradox, and it's not an accident.

The instinct is to explain this gap as risk aversion, regulatory fear, or bureaucratic inertia. Those factors exist. But the deeper cause is structural: the accuracy threshold required to unlock value in high-stakes domains is fundamentally higher than what justifies autonomous deployment, and most teams haven't built the architecture to bridge that gap.

The Curriculum Trap: Why Fine-Tuning on Your Best Examples Produces Mediocre Models

· 10 min read
Tian Pan
Software Engineer

Every fine-tuning effort eventually hits the same intuition: better data means better models, and better data means higher-quality examples. So teams build elaborate annotation pipelines to filter out the mediocre outputs, keep only the gold-standard responses, and train on a dataset they're proud of. The resulting model then underperforms on the exact use cases that motivated the project. This failure is so common it deserves a name: the curriculum trap.

The trap is this — curating only your best, most confident, most authoritative outputs doesn't teach the model to be better. It teaches the model to perform confidence regardless of whether confidence is warranted. You produce something that looks impressive in demos and falls apart in production, because production is full of the messy edge cases your curation process systematically excluded.

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.

Tokenizer Arithmetic: The Hidden Layer That Bites You in Production

· 10 min read
Tian Pan
Software Engineer

A team ships a JSON extraction pipeline. It works perfectly in development: 98% accuracy, clean structured output, predictable token counts. They push to production. The model starts hallucinating extra whitespace, the JSON parser chokes on malformed keys, and the API bill is 2.3x what the prototype suggested. The model hasn't changed. The prompts haven't changed.

The tokenizer changed — or more precisely, their assumptions about it were wrong from the start.

Tokenization is the first transformation your input undergoes and the last one engineers think about when debugging. Most teams treat it as a solved problem: text goes in, tokens come out, the model does its thing. But Byte Pair Encoding (BPE), the tokenization algorithm behind most production LLMs, makes decisions that cascade through structured output generation, prefix caching, cost estimation, and multilingual deployment in ways that are entirely predictable once you know to look.

The Trust Calibration Gap: Why AI Features Get Ignored or Blindly Followed

· 9 min read
Tian Pan
Software Engineer

You shipped an AI feature. The model is good — you measured it. Precision is 91%, recall is solid, the P99 latency is under 400ms. Three months later, product analytics tell a grim story: power users have turned it off entirely, while a different cohort is accepting every suggestion without changing a word, including the ones that are clearly wrong.

This is the trust calibration gap. It's not a model problem. It's a design problem — and it's more common than most AI product teams admit.

When the Prompt Engineer Leaves: The AI Knowledge Transfer Problem

· 9 min read
Tian Pan
Software Engineer

Six months after your best prompt engineer rotates off to a new project, a customer-facing AI feature starts misbehaving. Response quality has degraded, the output format occasionally breaks, and there's a subtle but persistent tone problem you can't quite name. You open the prompt file. It's 800 words of natural language. There's no changelog, no comments, no test cases. The person who wrote it knew exactly why every phrase was there. That knowledge is gone.

This is the prompt archaeology problem, and it's already costing teams real money. A national mortgage lender recently traced an 18% accuracy drop in document classification to a single sentence added to a prompt three weeks earlier during what someone labeled "routine workflow optimization." Two weeks of investigation, approximately $340,000 in operational losses. The author of that change had already moved on.

Corpus Curation at Scale: Why Your RAG Quality Ceiling Is Your Document Quality Floor

· 10 min read
Tian Pan
Software Engineer

There's a belief embedded in most RAG architectures that goes something like this: if retrieval returns the right chunks, the LLM will produce correct answers. Teams invest heavily in embedding model selection, hybrid retrieval strategies, and reranking pipelines. Then, three months after deploying to production, answer quality quietly degrades — not because the model changed, not because query patterns shifted dramatically, but because the underlying corpus rotted.

Enterprise RAG implementations fail at a roughly 40% rate, and the failure mode that practitioners underestimate most isn't hallucination or poor retrieval recall. It's document quality. One analysis found that a single implementation improved search accuracy from 62% to 89% by introducing document quality scoring — with no changes to the embedding model or retrieval algorithm. The corpus was the variable. The corpus was always the variable.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.

This is the data provenance gap, and most AI teams don't notice it until they're already in it.

GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

· 10 min read
Tian Pan
Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.

The Institutional Knowledge Drain: How AI Agents Absorb Decisions Without Transferring Understanding

· 10 min read
Tian Pan
Software Engineer

Three months after a fintech team rolled out an AI coding agent to handle their routine backend tasks, a senior engineer left for another company. When the team tried to reconstruct why certain authentication decisions had been made six weeks earlier, nobody could. The PR descriptions said "implemented as discussed." The commit messages said "per requirements." The AI agent had made the choices, the code worked, and the reasoning had evaporated.

This is not a documentation failure. It is what happens when the channel through which understanding normally flows — the back-and-forth between engineers, the friction of explanation, the pressure of justifying a decision to another human — is replaced by a system that optimizes for output rather than comprehension.