
369 posts tagged with "ai-engineering"


Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.
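
A minimal sketch of the alternative affordance, with hypothetical `plan_next_action` and `execute` hooks standing in for the model and tool layers: corrections go into a steering queue and get spliced into the transcript at the next tool-call boundary, so the run absorbs the adjustment without dying.

```python
import queue
from dataclasses import dataclass, field

@dataclass
class AgentRun:
    """A run that accepts corrections between tool calls instead of
    offering abort as the only intervention."""
    messages: list = field(default_factory=list)
    steering: queue.Queue = field(default_factory=queue.Queue)

    def steer(self, correction: str) -> None:
        # Called from the UI. Queued, not interruptive: the run keeps going.
        self.steering.put(correction)

    def step(self, plan_next_action, execute):
        # Drain corrections before committing to the next tool call, so
        # "use Vitest, not Jest" lands at a boundary instead of mid-edit.
        while not self.steering.empty():
            self.messages.append({
                "role": "user",
                "content": f"[mid-run correction] {self.steering.get()}",
            })
        action = plan_next_action(self.messages)  # model decides the next move
        result = execute(action)                  # tool call runs to completion
        self.messages.append({"role": "tool", "content": result})
        return action
```

The stop button still exists; it just stops being the only lever.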

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

· 9 min read
Tian Pan
Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.
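
For concreteness, here is the shape of such an eval, sketched with hypothetical harness names (`run_agent` stands in for whatever entry point your CI calls). Everything that pages you at 3 a.m. is mocked away before the assertion runs.

```python
# Every tool response is pinned, so the test can check final wording but
# never tool selection, latency, rate limits, partial failures, or retries.
CANNED = {
    "search_orders": {"orders": [{"id": "o_123", "status": "shipped"}]},
    "charge_card":   {"ok": True},
    "send_email":    {"queued": True},
}

def mocked_tool(name: str, args: dict) -> dict:
    return CANNED[name]  # always succeeds, always instant, always in order

def test_order_status(run_agent):  # run_agent: your harness entry point
    reply = run_agent(
        user_turn="Where is my order o_123?",
        tool_executor=mocked_tool,
    )
    assert "shipped" in reply.lower()
    # Passing this says nothing about what happens when search_orders
    # times out or charge_card returns a 429.
```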

The Orphan Adapter Problem: When Your Fine-Tune Outlives Its Base Model

· 12 min read
Tian Pan
Software Engineer

A senior engineer left six months ago. She owned the classifier adapter that routes customer support tickets — a rank-32 LoRA trained on 847 hand-labeled examples, pinned to a base model that hits end-of-life in 43 days. Nobody remembers why those 847 examples were chosen over the 2,000 they started with. The training data sits in an S3 bucket whose lifecycle policy purges objects older than one year. Her laptop was wiped. The fine-tuning notebook has a cell that calls a preprocessing function she imported from her personal dotfiles repo, now private.

This is the orphan adapter — a fine-tune that outlived its maintainers, outlived its data, and is about to outlive the base model it was trained on. It sits in your production stack, routing real user traffic, and nobody left on the team can rebuild it. The deprecation email didn't create this crisis. It just exposed it.
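
None of this is exotic to prevent. Below is a sketch of the kind of provenance manifest that would have defused the scenario above; every field name is illustrative, and the elided values are deliberate placeholders.

```python
# Checked into the repo next to the adapter weights, so the adapter can be
# rebuilt (or consciously retired) when the base model hits end-of-life.
ADAPTER_MANIFEST = {
    "adapter": "support-ticket-router",
    "method": "lora",
    "rank": 32,
    "base_model": "base-model-2024-06",          # the EOL clock starts here
    "training_data": {
        "uri": "s3://ml-data/tickets/labeled-v3.jsonl",  # illustrative path
        "sha256": "…",                           # snapshot hash, not just a path
        "selection_note": "847 of 2000; rationale in docs/selection.md",
    },
    "preprocessing": {
        "repo": "github.com/org/ticket-preprocess",  # not personal dotfiles
        "commit": "…",
    },
    "owner": "support-ml@…",                     # a team alias, not a person
}
```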

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

· 11 min read
Tian Pan
Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was an artifact of how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Your Prompt Is Competing With What the Model Already Knows

· 11 min read
Tian Pan
Software Engineer

The frontier model you just wired up has opinions about your competitors. It has a default answer to the hard question your product was built to disagree with. It has a "best practice" for your domain that came from whatever happened to dominate the training corpus, and a quiet preference for the conventional take on every controversial call your team agonized over in the design doc. None of that is in your system prompt. You did not write any of it. And on the queries where your differentiation actually lives, the model will reach for those defaults before it reaches for what you told it.

Most teams ship as if the model is a configurable blank slate. Write the persona, list the rules, paste the brand voice guidelines, run a few QA prompts that produce the right shape of answer, and call it done. The prompts that get reviewed are the prompts that hit easy queries — the ones where the model's prior happens to align with what you wanted anyway. The interesting queries, the ones where your product would lose badly if it produced the generic answer, almost never make it into the prompt-iteration loop. Those are the queries where the prior wins silently.
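
One way to force those queries into the loop is to maintain them as an explicit test tier. A sketch, with a hypothetical `ask` fixture wrapping your deployed system prompt; the cases themselves are invented placeholders.

```python
# Queries where the model's corpus default is known to conflict with the
# product's position. These live in CI, not in anyone's memory.
PRIOR_CONFLICT_CASES = [
    {
        "query": "What's the best way to run schema migrations?",
        "must_contain": "expand-contract",   # the product's documented position
        "must_not_contain": "big-bang",      # the corpus default we disagree with
    },
    # one case per place the product is built to disagree with the corpus
]

def test_prior_conflicts(ask):  # ask(query) -> answer under the shipped prompt
    for case in PRIOR_CONFLICT_CASES:
        answer = ask(case["query"]).lower()
        assert case["must_contain"] in answer, case["query"]
        assert case["must_not_contain"] not in answer, case["query"]
```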

Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

· 11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The fix touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.
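
Treating it as a schema suggests the same mitigation databases use: version it, and refuse mismatches. A sketch, with a hypothetical `index` object carrying its own build metadata.

```python
import hashlib
import json

# Everything that shapes a chunk belongs in the "schema".
CHUNKER_PARAMS = {
    "tokenizer": "cl100k_base",
    "max_tokens": 512,
    "stride": 128,
    "boundary_rule": "paragraph",
}

def chunker_schema_version(params: dict) -> str:
    # Canonical serialization so key order can't change the version.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def query_index(index, text: str):
    # The same guard a database applies to a schema mismatch: fail loudly
    # instead of serving rows with a silently different shape.
    if index.metadata["chunker_schema"] != chunker_schema_version(CHUNKER_PARAMS):
        raise RuntimeError(
            "index was built under different chunker params; rebuild required"
        )
    return index.search(text)
```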

Why Your RAG Citations Are Lying: Post-Hoc Rationalization in Source Attribution

· 10 min read
Tian Pan
Software Engineer

Show a user an AI answer with a link at the end of each sentence, and the needle on their trust meter swings halfway across the dial before they have read a single cited passage. That is the whole marketing pitch of enterprise RAG: "grounded," "sourced," "verifiable." It is also the most-shipped, least-tested claim in AI engineering. Recent benchmarks find that between 50% and 90% of LLM responses are not fully supported — and sometimes contradicted — by the sources they cite. On adversarial evaluation sets, up to 57% of citations from state-of-the-art models are unfaithful: the model never actually used the document it is pointing at. The citation was attached after the fact, to rationalize an answer the model had already decided to give.

This is not a retrieval bug. You can have perfect retrieval and still get lying citations, because the failure is architectural. The generator writes prose first and stitches links on second. The links look like evidence. They are decoration.
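
You can measure the gap with a post-hoc faithfulness probe. The sketch below uses lexical overlap as a cheap stand-in for a real entailment check, and the 0.3 threshold is a placeholder, not a recommendation.

```python
def support_score(sentence: str, cited_chunk: str) -> float:
    # Fraction of the sentence's words that appear in the cited chunk.
    # A real system would swap this for an entailment-model call.
    s = set(sentence.lower().split())
    c = set(cited_chunk.lower().split())
    return len(s & c) / max(len(s), 1)

def audit_citations(answer_sentences, chunks_by_id, threshold=0.3):
    """answer_sentences: [(sentence, cited_chunk_id)] pairs from RAG output."""
    flagged = []
    for sentence, chunk_id in answer_sentences:
        if support_score(sentence, chunks_by_id[chunk_id]) < threshold:
            flagged.append((sentence, chunk_id))  # likely decorative citation
    return flagged
```

Run this over sampled production traffic and the 50 to 90 percent numbers above stop being someone else's benchmark and become your dashboard.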

The Reflection Placebo: Why Plan-Reflect-Replan Loops Return Version One

· 9 min read
Tian Pan
Software Engineer

Open an agent's trace during a long-horizon planning task and count the number of times the model writes "let me reconsider," "on reflection," or "a better approach would be." Now compare the plan it finally commits to with the one it drafted first. In the majority of traces I've audited, the second plan is the first plan wearing a different hat — the same decomposition, the same tool calls, the same order of operations, with some renamed step labels and a reworded rationale. The reflection ran. The model emitted tokens that looked like reconsideration. The plan did not move.

This matters because "with reflection" has quietly become a quality tier. Teams ship planners with one, two, or three reflection rounds and bill themselves for the difference. The inference spend is real and measurable. Whether anything on the plan side actually changed is a question almost nobody instruments for, and the answer is frequently: no.
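
The instrumentation is cheap enough that there is no excuse to skip it. A stdlib-only sketch: the plan-step shape (`tool`, `target` fields) is an assumption, and `normalize` exists precisely to discard the renamed labels and reworded rationales.

```python
from difflib import SequenceMatcher

def normalize(plan_steps):
    # Keep only operational content: which tool hits which target.
    return [s["tool"].lower() + ":" + s["target"].lower() for s in plan_steps]

def plan_drift(plan_v1, plan_final) -> float:
    """0.0 = identical plans, 1.0 = fully rewritten."""
    sm = SequenceMatcher(None, normalize(plan_v1), normalize(plan_final))
    return 1.0 - sm.ratio()

# Log plan_drift() next to the reflection-round count. If drift sits near
# zero across your traces, the reflection spend is buying tokens, not
# revisions.
```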

The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

· 11 min read
Tian Pan
Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.
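
The cheap defense is to stop treating the advertised window as usable budget. A sketch: the 0.6 ceiling is a placeholder that has to come from your own per-model long-context evals, and `count_tokens` is whatever tokenizer you already use.

```python
ADVERTISED_WINDOW = 200_000
MAX_UTILIZATION = 0.6  # measured against your evals, not aspirational

def fit_context(chunks, count_tokens, reserved_for_output=4_000):
    """Pack ranked chunks into the usable window, not the advertised one."""
    budget = int(ADVERTISED_WINDOW * MAX_UTILIZATION) - reserved_for_output
    kept, used = [], 0
    for chunk in chunks:  # assumes chunks arrive ranked by relevance
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept  # retrieval does the triage the big window was sold to avoid
```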

The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

· 10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.

Semantic Cache Is a Safety Problem, Not a Perf Win

· 12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine distance of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is jane@…" and "I can't log into my account, my email is bob@…" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link her account had asked for.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."
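
To make the second of those concrete, here is a sketch of a cache key that scopes by construction: similarity search never crosses a tenant boundary, and account-scoped queries never cross a user boundary. The threshold and the in-memory `cache` shape are placeholders.

```python
import numpy as np

SIM_THRESHOLD = 0.97  # tune against your own traffic; 0.97 is a placeholder

def cache_lookup(cache, tenant_id, user_id, query_embedding, scope):
    # Namespace first, similarity second. A hit can only come from the
    # same tenant and, for account-scoped queries, the same user, so
    # Bob can never receive Jane's response no matter how close the
    # embeddings land.
    namespace = (tenant_id, user_id) if scope == "account" else (tenant_id,)
    for entry in cache.get(namespace, []):
        sim = float(np.dot(query_embedding, entry["embedding"]))
        if sim >= SIM_THRESHOLD:  # assumes unit-normalized embeddings
            return entry["response"]
    return None  # miss: fall through to the model
```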

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.