
553 posts tagged with "ai-engineering"


The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

· 11 min read
Tian Pan
Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.
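One pragmatic defense is to treat utilization itself as a budget rather than a ceiling to be reached. Here is a minimal sketch in that spirit; the per-model cutoffs, the encoding choice, and the fit_context helper are illustrative assumptions, not published vendor numbers.

```python
# Sketch: cap context utilization below a model-specific threshold
# instead of filling the advertised window. Cutoffs here are invented
# placeholders; calibrate them against your own long-context evals.
import tiktoken

CLAIMED_WINDOW = {
    "gpt-4o-2024-08-06": 128_000,
    "claude-opus-4-0": 200_000,
}
UTILIZATION_CEILING = {          # hypothetical fractions of the claimed window
    "gpt-4o-2024-08-06": 0.60,
    "claude-opus-4-0": 0.70,
}

def fit_context(model: str, chunks: list[str]) -> list[str]:
    """Keep the highest-priority chunks until the utilization budget runs out."""
    enc = tiktoken.get_encoding("o200k_base")   # stand-in tokenizer
    budget = int(CLAIMED_WINDOW[model] * UTILIZATION_CEILING[model])
    kept, used = [], 0
    for chunk in chunks:                        # chunks pre-sorted by relevance
        n = len(enc.encode(chunk))
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept
```

The point is not the exact numbers; it is that the cutoff is an explicit, evaled constant instead of whatever the repo dump happened to total.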

The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

· 10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.
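A hypothetical reconstruction makes the point. Nothing below is a syntax smell; the model, fields, and cohort function are invented for illustration, and the one-token bug survives any review that lacks a statement of intent.

```python
# Both fields are timestamps, the row shape is identical, and the diff
# reads as a plausible date filter. Only intent separates right from wrong.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class User:
    id: int
    created_at: datetime   # when the account was opened
    updated_at: datetime   # last profile edit

def signup_cohort(users: list[User], start: datetime) -> list[User]:
    # Bug: filtering on updated_at means any profile edit re-enrolls the
    # user, so the "signup" cohort quietly absorbs long-tenured accounts.
    return [u for u in users if u.updated_at >= start]   # meant: created_at
```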

Semantic Cache Is a Safety Problem, Not a Perf Win

· 12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is [email protected]" and "I can't log into my account, my email is [email protected]" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link she had requested.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."
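To preview the "by construction" part: similarity search should only ever narrow matches inside a namespace that was already matched exactly. A minimal sketch; the cache.search call and its parameters are assumptions standing in for whatever vector store you run.

```python
# Exact-match fields (tenant, user, prompt version) form the namespace;
# the embedding is only compared against entries inside it. Two users can
# land within 0.03 cosine of each other and still never share a hit.
import hashlib

def cache_namespace(tenant_id: str, user_id: str, prompt_version: str) -> str:
    raw = f"{tenant_id}:{user_id}:{prompt_version}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_response(cache, tenant_id, user_id, prompt_version,
                    query_embedding, threshold=0.97):
    namespace = cache_namespace(tenant_id, user_id, prompt_version)
    # `cache.search` is a placeholder for your vector store's scoped query.
    hits = cache.search(namespace=namespace,
                        vector=query_embedding,
                        min_score=threshold)
    return hits[0] if hits else None
```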

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints' worth of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.
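A cheap hedge is to make the pin carry its own expiry, so the timer is yours instead of the provider's. A minimal sketch, assuming the pin lives in config you control; the date and model string are examples, not a provider's published schedule.

```python
from datetime import date

PINNED_MODEL = "gpt-4o-2024-08-06"
REQUALIFY_BY = date(2025, 8, 1)   # assumed deadline, recorded at pin time

def model_for_request(today: date | None = None) -> str:
    """Return the pinned model, but fail loudly once the pin goes stale."""
    today = today or date.today()
    if today >= REQUALIFY_BY:
        # Trip in CI and staging months before the deprecation email does.
        raise RuntimeError(
            f"{PINNED_MODEL} passed its re-qualification date; "
            "re-run evals against the successor model before shipping."
        )
    return PINNED_MODEL
```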

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending down — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.
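The trap is at least measurable if every preference pair records who ranked it. A minimal sketch with invented field names; the 50% budget is an arbitrary example, not a recommendation.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str
    ranker: str   # "human" or a teacher model id, e.g. "gpt-4o"

def synthetic_fraction(pairs: list[PreferencePair]) -> float:
    if not pairs:
        return 0.0
    return sum(p.ranker != "human" for p in pairs) / len(pairs)

def check_label_budget(pairs: list[PreferencePair], max_synthetic: float = 0.5):
    frac = synthetic_fraction(pairs)
    if frac > max_synthetic:
        raise ValueError(
            f"{frac:.0%} of preferences were ranked by a teacher model; "
            "the reward model is being trained toward the teacher's voice."
        )
```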

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't write for that reader. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
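To make the contrast concrete, here are two registrations of the same hypothetical tool in the OpenAI function-calling shape. The schema is identical; the only change is the description the model actually reads.

```python
# Shared JSON Schema: the developer-facing contract is the same either way.
params = {
    "type": "object",
    "properties": {"invoice_id": {"type": "string"}},
    "required": ["invoice_id"],
}

# Version 1: the OpenAPI summary, pasted. True, and useless for tool choice.
pasted_from_openapi = {
    "name": "get_invoice",
    "description": "Returns the invoice object for a given invoice ID.",
    "parameters": params,
}

# Version 2: written for the stochastic reader choosing among twenty tools.
written_for_the_model = {
    "name": "get_invoice",
    "description": (
        "Fetch one invoice when the user references a specific invoice, "
        "order, or billing charge. Use only if you already have the "
        "invoice_id; if the user gave an email or a date range, call "
        "search_invoices first. Amounts are returned in cents, not dollars."
    ),
    "parameters": params,
}
```

The second description answers the three questions the model is actually deciding: when to fire, when not to, and what the output means.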

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.
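One way out is to keep the guard's feedback in the transcript and the logs rather than letting it calcify into the system prompt. A minimal sketch, where call_model and the JSON-only contract are assumptions for illustration:

```python
import json
import logging

log = logging.getLogger("guards")

def guarded_call(call_model, prompt: str, max_retries: int = 1) -> dict:
    """Validate post-hoc; feed failures back as one-off turns, not prompt scar tissue."""
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return json.loads(raw)   # the post-hoc guard
        except json.JSONDecodeError as err:
            # The failure is logged and counted per prompt version, so the
            # pressure shows up in a dashboard instead of paragraph twelve.
            log.warning("guard_fail attempt=%d err=%s", attempt, err)
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"That was not valid JSON ({err}). "
                           "Return only the JSON object.",
            })
    raise ValueError("output failed validation after retries")
```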

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in a git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.
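A lightweight fix is to give behavior a version of its own, one that changes when any behavioral input changes. A sketch of such a manifest; the fields and naming are illustrative, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class BehaviorManifest:
    prompt_version: str      # e.g. "summarizer/2024-11-03"
    model: str               # exact pinned model string
    retriever_index: str     # embedding index snapshot id
    tool_schema_hash: str    # hash of the tool definitions

    def fingerprint(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

A one-line prompt tweak now produces a new fingerprint, which is what gets announced, diffed, and pinned by the downstream teams whose redaction logic depends on it.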

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.
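What that discipline looks like in code is small and boring, which is the point. A minimal sketch with invented state names, not a framework API:

```python
from enum import Enum, auto

class AgentState(Enum):
    PLAN = auto()
    CALL_TOOL = auto()
    REFLECT = auto()
    RESPOND = auto()
    FAIL = auto()

# Legal transitions are data. "Can the agent retry a tool from REFLECT?"
# becomes a dictionary lookup instead of an archaeology dig through f-strings.
TRANSITIONS = {
    AgentState.PLAN:      {AgentState.CALL_TOOL, AgentState.RESPOND},
    AgentState.CALL_TOOL: {AgentState.REFLECT, AgentState.FAIL},
    AgentState.REFLECT:   {AgentState.CALL_TOOL, AgentState.RESPOND},
    AgentState.RESPOND:   set(),
    AgentState.FAIL:      set(),
}

def transition(current: AgentState, nxt: AgentState) -> AgentState:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {nxt.name}")
    return nxt
```

Prompts then get written per state, and every production trace reads as a path through the machine instead of a blob of tokens.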

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
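Closing the gap starts with a record shaped like the tree. A minimal sketch; the field names are assumptions about what your pipeline would capture at the moment each decision is made:

```python
from dataclasses import dataclass, field

@dataclass
class LLMTrace:
    request_id: str
    user_id: str
    prompt_version: str                 # which template assembled the prompt
    model_version: str                  # exact model string that served this call
    retrieved_chunks: list[str] = field(default_factory=list)  # chunk ids, in order
    tool_calls: list[dict] = field(default_factory=list)       # name, args, result
    final_output: str = ""

# With one of these per request, "which of the three deployed model versions
# handled that ticket?" is a field read, not a forensic reconstruction.
```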

AI Code Review in Practice: What Automated PR Analysis Actually Catches and Consistently Misses

· 9 min read
Tian Pan
Software Engineer

Forty-seven percent of professional developers now use AI code review tools—up from 22% two years ago. Yet in the same period, AI-coauthored PRs have accumulated 1.7 times more post-merge bugs than human-written code, and change failure rates across the industry have climbed 30%. Something is wrong with how teams are deploying these tools, and the problem isn't the tools themselves.

The core issue is that engineers adopted AI review without understanding its capability profile. These systems operate at a 50–60% effectiveness ceiling on realistic codebases, excel at a narrow class of surface-level problems, and fail silently on exactly the errors that cause production incidents. Teams that treat AI review as a general-purpose quality gate get false confidence instead of actual coverage.

Why AI Feature Flags Are Not Regular Feature Flags

· 11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.