
369 posts tagged with "ai-engineering"


The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending right — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes having a stronger model rank your own model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't write it that way. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
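To make the contrast concrete, here is the same hypothetical tool written both ways. The tool name, parameters, and wording are invented for illustration, and the schema shape follows the common JSON-Schema-in-`parameters` convention rather than any specific vendor's API.

```python
# Hypothetical "search_orders" tool, shown two ways.

# API-doc style: accurate, but gives the model no basis for choosing this tool
# over an adjacent one, or for filling the parameters sensibly.
search_orders_doc_style = {
    "name": "search_orders",
    "description": "Searches orders.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "status": {"type": "string"},
        },
        "required": ["query"],
    },
}

# Prompt style: says when to fire, when not to, and what good arguments look like.
search_orders_prompt_style = {
    "name": "search_orders",
    "description": (
        "Look up a customer's existing orders by free-text query. "
        "Use this when the user references a past purchase ('my last order', "
        "'the headphones I bought'). Do NOT use it for product catalog searches; "
        "use search_catalog for those. Prefer a short query of 2-5 keywords "
        "taken from the user's own words."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "2-5 keywords from the user's message, e.g. 'wireless headphones March'.",
            },
            "status": {
                "type": "string",
                "enum": ["open", "shipped", "delivered", "cancelled"],
                "description": "Only set this when the user explicitly mentions an order status.",
            },
        },
        "required": ["query"],
    },
}
```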

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the lost trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in the consuming application's git diff. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.
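One way out is to treat the prompt as a versioned artifact with its own changelog entry, so consumers can pin a version, diff it, and get notified when it moves. The manifest and fingerprint below are a minimal sketch under invented field names, not an existing standard.

```python
import hashlib
import json

# Hypothetical prompt manifest: the prompt is a versioned, diff-able artifact,
# so downstream teams can see a changelog entry instead of a silent behavior shift.
PROMPT_MANIFEST = {
    "service": "summarization",
    "prompt_version": "2024.06.3",
    "model": "provider/model-name",  # placeholder, not a real endpoint
    "system_prompt_path": "prompts/summarize_v3.txt",
    "breaking": False,
    "changelog": "Tightened length instruction; redaction-sensitive consumers should re-run their evals.",
}

def prompt_fingerprint(system_prompt: str, manifest: dict) -> str:
    """Stable hash over everything that can change behavior, not just application code."""
    payload = json.dumps(
        {
            "prompt": system_prompt,
            "model": manifest["model"],
            "version": manifest["prompt_version"],
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```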

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.
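What "design the state machine first" can look like in practice is roughly this: states and legal transitions as plain data, with the loop enforcing them. The state names and transitions below are invented for illustration, not a prescription.

```python
from enum import Enum, auto

class AgentState(Enum):
    PLAN = auto()
    CALL_TOOL = auto()
    SYNTHESIZE = auto()
    CLARIFY = auto()   # ask the user a question instead of guessing
    DONE = auto()
    FAILED = auto()

# Legal transitions live in data, not in prompt text: the loop refuses anything
# else, so a confused model output cannot wander the agent into an undefined state.
TRANSITIONS = {
    AgentState.PLAN: {AgentState.CALL_TOOL, AgentState.CLARIFY, AgentState.SYNTHESIZE},
    AgentState.CALL_TOOL: {AgentState.PLAN, AgentState.SYNTHESIZE, AgentState.FAILED},
    AgentState.SYNTHESIZE: {AgentState.DONE, AgentState.PLAN},
    AgentState.CLARIFY: {AgentState.PLAN},
}

def advance(current: AgentState, proposed: AgentState) -> AgentState:
    """Apply a transition only if the design allows it."""
    if proposed not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {proposed.name}")
    return proposed
```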

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
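A minimal sketch of what a per-step trace record might carry, with invented field names; the point is that every node of the tree pins the versions and inputs that produced it, so a complaint can be replayed rather than guessed at.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LLMTraceSpan:
    """One node in the decision tree behind a single user-visible answer."""
    request_id: str
    parent_span_id: Optional[str]
    step: str                       # "retrieval" | "prompt_assembly" | "tool_call" | "inference" | "postprocess"
    prompt_version: Optional[str] = None
    model_version: Optional[str] = None
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    tool_name: Optional[str] = None
    input_digest: Optional[str] = None    # hash of the exact text sent at this step
    output_digest: Optional[str] = None   # hash of the exact text returned
```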

AI Code Review in Practice: What Automated PR Analysis Actually Catches and Consistently Misses

· 9 min read
Tian Pan
Software Engineer

Forty-seven percent of professional developers now use AI code review tools, up from 22% two years ago. Yet in the same period, AI-coauthored PRs have accumulated 1.7 times more post-merge bugs than human-written code, and change failure rates across the industry have climbed 30%. Something is wrong with how teams are deploying these tools, and the problem isn't the tools themselves.

The core issue is that engineers adopted AI review without understanding its capability profile. These systems operate at a 50–60% effectiveness ceiling on realistic codebases, excel at a narrow class of surface-level problems, and fail silently on exactly the errors that cause production incidents. Teams that treat AI review as a general-purpose quality gate get false confidence instead of actual coverage.

Why AI Feature Flags Are Not Regular Feature Flags

· 11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.
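A rough sketch of what a behavior-aware rollout gate adds on top of the classic canary checks; the metric names and thresholds are invented for illustration and would come from your own eval suite and baselines.

```python
# A rollout gate that checks behavioral metrics next to the classic ones.
# "No 5xx spike" alone does not clear a model change.

def canary_passes(metrics: dict) -> bool:
    classic_ok = (
        metrics["error_rate"] <= 0.01
        and metrics["p99_latency_ms"] <= 2000
    )
    behavioral_ok = (
        metrics["eval_win_rate_vs_baseline"] >= 0.48          # side-by-side evals on a fixed suite
        and metrics["refusal_rate"] <= metrics["baseline_refusal_rate"] * 1.2
        and metrics["mean_answer_length_drift"] <= 0.3        # relative drift vs. the old model
    )
    return classic_ok and behavioral_ok
```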

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

· 10 min read
Tian Pan
Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

Amortizing Context: Persistent Agent Memory vs. Long-Context Windows

· 9 min read
Tian Pan
Software Engineer

When million-token context windows became commercially available, a lot of teams quietly decided they'd solved agent memory. Why build a retrieval system, manage a vector database, or design an eviction policy when you can just dump everything in and let the model sort it out? The answer comes back in your infrastructure bill. At 10,000 daily interactions with a 100k-token knowledge base, the brute-force in-context approach costs roughly $5,000/day. A retrieval-augmented memory system handling the same load costs around $333/day — a 15x gap that compounds as your user base grows.

The real problem isn't just cost. It's that longer contexts produce measurably worse answers. Research consistently shows that models lose track of information positioned in the middle of very long inputs, accuracy drops predictably when relevant evidence is buried among irrelevant chunks, and latency climbs in ways that make interactive agents feel broken. The "stuff everything in" approach doesn't just waste money — it trades accuracy for the illusion of simplicity.
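The gap is straightforward arithmetic once you fix a per-token price. The sketch below assumes a $5-per-million input-token price purely for illustration; it is not any provider's actual rate.

```python
# Back-of-the-envelope version of the numbers above.
PRICE_PER_TOKEN = 5.0 / 1_000_000     # assumed $5 per million input tokens
DAILY_INTERACTIONS = 10_000

# Brute force: ship the whole 100k-token knowledge base on every call.
brute_force_tokens = 100_000
brute_force_daily = DAILY_INTERACTIONS * brute_force_tokens * PRICE_PER_TOKEN   # $5,000/day

# Retrieval: send only the handful of chunks the query actually needs (~6.7k tokens here).
retrieval_tokens = 6_667
retrieval_daily = DAILY_INTERACTIONS * retrieval_tokens * PRICE_PER_TOKEN       # ~$333/day

print(
    f"brute force: ${brute_force_daily:,.0f}/day, "
    f"retrieval: ${retrieval_daily:,.0f}/day, "
    f"ratio: {brute_force_daily / retrieval_daily:.0f}x"
)
```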

Behavioral Signals That Actually Measure User Satisfaction in AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI product teams ship a thumbs-up/thumbs-down widget and call it a satisfaction measurement system. They are measuring something — just not satisfaction.

A developer who presses thumbs-down on a Copilot suggestion because the function signature is wrong, and a developer who presses thumbs-down because the suggestion was excellent but not what they needed right now, are generating the same signal. Meanwhile, the developer who quietly regenerated the response four times before giving up generates no explicit signal at all. That absent signal is a better predictor of churn than anything the rating widget captures.

The implicit behavioral record your users leave while using your AI product is richer, more honest, and more actionable than anything they'll type or tap voluntarily. This post covers which signals to collect, why they outperform explicit feedback, and the instrumentation schema that keeps AI-specific telemetry from poisoning your general product analytics.
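As a sketch of what that instrumentation can look like, here is one possible event record; the field and event-type names are illustrative, and the key design choice is keeping this stream separate from general product analytics.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AIInteractionEvent:
    """One implicit-feedback event, kept in its own stream so AI telemetry
    doesn't pollute general product analytics."""
    session_id: str
    response_id: str
    event_type: str                  # "regenerate" | "copy" | "edit_after_copy" | "abandon" | "follow_up"
    seconds_since_response: float
    regeneration_count: int = 0                     # how many times this turn was retried
    edit_distance_ratio: Optional[float] = None     # how much the user rewrote what they copied
    model_version: Optional[str] = None
    prompt_version: Optional[str] = None
```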

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

· 10 min read
Tian Pan
Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.
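At the response-cache layer, one practical consequence is that the cache key has to include everything that can change the answer, not just the query. The sketch below uses invented parameter names to make that point.

```python
import hashlib
import json

# If any of these inputs is missing from the key, "invalidation" quietly becomes
# "serve an old answer produced by a different model, prompt, or corpus".

def ai_cache_key(
    normalized_query: str,
    model_version: str,
    prompt_version: str,
    corpus_snapshot_id: str,   # id of the document snapshot the retriever searched
) -> str:
    payload = json.dumps(
        {
            "q": normalized_query,
            "model": model_version,
            "prompt": prompt_version,
            "corpus": corpus_snapshot_id,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```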