Skip to main content

861 posts tagged with "insider"

View all tags

The User Adaptation Trap: Why Rolling Back an AI Model Can Break Things Twice

· 9 min read
Tian Pan
Software Engineer

You shipped a model update. It looked fine in offline evals. Then, two weeks later, you notice your power users are writing longer, more qualified prompts — hedging in ways they never used to. Your support queue fills with vague complaints like "the AI feels off." You dig in and realize the update introduced a subtle behavior shift: the model has been over-confirming user ideas, validating bad plans, and softening its pushback. You decide to roll back.

Here is where it gets worse. When you roll back, a new wave of complaints arrives. Users say the model feels cold, terse, unhelpful — the opposite of what the original rollback complainers said. What happened? The users who interacted with the broken version long enough built new workflows around it. They learned to drive harder, push back more, frame questions more aggressively. The rollback removed the behavior they had adapted to, leaving them stranded.

This is the user adaptation trap. A subtly wrong behavior, left in production long enough, gets baked into user habits. Rolling it back doesn't restore the status quo — it creates a second disruption on top of the first.

Why Vision Models Ace Benchmarks but Fail on Your Enterprise PDFs

· 9 min read
Tian Pan
Software Engineer

A benchmark result of 97% accuracy on a document understanding dataset looks compelling until you run it against your company's actual invoice archive and realize it's quietly garbling 30% of the line items. The model doesn't throw an error. It doesn't return low confidence. It just produces output that looks plausible and is wrong.

This is the defining failure mode of production document AI: silent corruption. Unlike a crash or an exception, silent corruption propagates. The garbled table cell flows into the downstream aggregation, the aggregation feeds a report, the report drives a decision. By the time you notice, tracing the root cause is archaeology.

The gap between benchmark performance and production performance in document AI is real, persistent, and poorly understood by teams evaluating these models. Understanding why it exists — and how to defend against it — is the engineering problem this post addresses.

AI-Native API Design: Why REST Breaks When Your Backend Thinks Probabilistically

· 11 min read
Tian Pan
Software Engineer

Most backend engineers can recite the REST contract from memory: client sends a request, server processes it, server returns a status code and body. A 200 means success. A 4xx means the client did something wrong. A 5xx means the server broke. The response is deterministic, the timeout is predictable, and idempotency keys guarantee safe retries.

LLM backends violate every one of those assumptions. A 200 OK can mean your model hallucinated the entire response. A successful request can take twelve minutes instead of twelve milliseconds. Two identical requests with identical parameters will return different results. And if your server times out mid-inference, you have no idea whether the model finished or not.

Teams that bolt LLMs onto conventional REST APIs end up with a graveyard of hacks: timeouts that kill live agent tasks, clients that treat hallucinated 200s as success, retry logic that charges a user's credit card three times because idempotency keys weren't designed for probabilistic operations. This post walks through where the mismatch bites hardest and what the interface patterns that actually hold up in production look like.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The AI Ops Dashboard Nobody Builds Until It's Too Late

· 11 min read
Tian Pan
Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The other 80% hides in five failure modes that most teams discover only after users have already noticed.

Chatbot, Copilot, or Agent: The Taxonomy That Changes Your Architecture

· 10 min read
Tian Pan
Software Engineer

The most expensive architectural mistake in AI engineering is not picking the wrong model. It's picking the wrong interaction paradigm. Teams that should be building an agent spend six months refining a chatbot, then wonder why users can't get anything done. Teams that should be building a copilot wire up full agentic autonomy and spend the next quarter firefighting unauthorized actions and runaway costs.

The taxonomy matters before you write a single line of code, because chatbots, copilots, and agents have fundamentally different trust models, context-window strategies, and error-recovery requirements. Getting this wrong doesn't just produce a worse product — it produces a product that cannot be fixed by tuning prompts or swapping models.

The Cold Start Problem in AI Personalization: Being Useful Before You Have Data

· 11 min read
Tian Pan
Software Engineer

Most personalization systems are built around a flywheel: users interact, you learn their preferences, you show better recommendations, they interact more. The flywheel spins faster as data accumulates. The problem is the flywheel needs velocity to generate lift — and a new user has none.

This is the cold start problem. And it's more dangerous than most teams recognize when they first ship personalization. A new user arrives with no history, no signal, and often a skeptical prior: "AI doesn't know me." You have roughly 5–15 minutes to prove otherwise before they form an opinion that determines whether they'll stay long enough to generate the data that would let you actually help them. Up to 75% of new users abandon products in the first week if that window goes badly.

The cold start problem isn't a data problem. It's an initialization problem. The engineering question is: what do you inject in place of history?

Why '92% Accurate' Is Almost Always a Lie

· 8 min read
Tian Pan
Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.

The Data Flywheel Is Not Free: Engineering Feedback Loops That Actually Improve Your AI Product

· 11 min read
Tian Pan
Software Engineer

There is a pattern that plays out in nearly every AI product team: the team ships an initial model, users start interacting with it, and someone adds a thumbs-up/thumbs-down widget at the bottom of responses. They call it their feedback loop. Three months later, the model has not improved. The team wonders why the flywheel isn't spinning.

The problem isn't execution. It's that explicit ratings are not a feedback loop — they're a survey. Less than 1% of production interactions yield explicit user feedback. The 99% who never clicked anything are sending you far richer signals; you're just not collecting them. Building a real feedback loop means instrumenting your system to capture behavioral traces, label them efficiently at scale, and route them back into training and evaluation in a way that compounds over time.

The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality

· 8 min read
Tian Pan
Software Engineer

A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A court later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.

This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.

Knowledge Graph vs. Vector Store: Choosing Your Retrieval Primitive

· 9 min read
Tian Pan
Software Engineer

Most teams stumble into vector stores because they're easy to start with, then discover a category of queries that simply won't work no matter how well they tune chunk size or embedding model. That's not a tuning problem — it's an architectural mismatch. Vector similarity and graph traversal are fundamentally different retrieval mechanisms, and the gap matters more as your queries get harder.

This is not a "use both" post. There are real trade-offs, and getting the choice wrong costs months of engineering time. Here's what the decision actually looks like in practice.

The LLM Local Development Loop: Fast Iteration Without Burning Your API Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications discover the same problem around week three: every time someone runs the test suite, it fires live API calls, costs real money, takes 30+ seconds, and returns different results on each run. The "just hit the API" approach that felt fine during the prototype phase becomes a serious tax on iteration speed — and a meaningful line item on the bill. One engineering team audited their monthly API spend and found $1,240 out of $2,847 (43%) was pure waste from development and test traffic hitting live endpoints unnecessarily.

The solution is not to stop testing. It is to build the right kind of development loop from the start — one where the fast path is cheap and deterministic, and the slow path (real API calls) is reserved for when it actually matters.