553 posts tagged with "ai-engineering"

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.
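To make that concrete, here is a minimal sketch of the kind of instrumentation the post argues for: tracking wait time, staleness, and backlog per approval rather than just recording that a human was in the loop. The field names and the four-hour staleness window are illustrative assumptions, not anything from a real system.

```python
# Minimal sketch of approval-queue instrumentation; field names and the
# staleness window are illustrative assumptions, not a production system.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class ApprovalItem:
    item_id: str
    enqueued_at: datetime
    decided_at: datetime | None = None
    approved: bool | None = None
    stale_after: timedelta = timedelta(hours=4)  # assumed: item loses relevance past this age

    @property
    def wait(self) -> timedelta | None:
        return (self.decided_at - self.enqueued_at) if self.decided_at else None

    @property
    def stale_at_decision(self) -> bool:
        return self.wait is not None and self.wait > self.stale_after

def queue_report(items: list[ApprovalItem]) -> dict:
    decided = [i for i in items if i.decided_at]
    return {
        "backlog": len(items) - len(decided),
        "median_wait_s": median(i.wait.total_seconds() for i in decided) if decided else None,
        "stale_decisions": sum(i.stale_at_decision for i in decided),
        "approval_rate": (sum(bool(i.approved) for i in decided) / len(decided)) if decided else None,
    }
```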

LLM-as-Compiler Is a Metaphor Your Codebase Can't Survive

· 10 min read
Tian Pan
Software Engineer

The pitch is seductive: describe the behavior in English, the model emits the code, ship it. Prompts become the source, artifacts become the target, and the LLM sits between them like gcc with a friendlier front-end. If that framing held, the rest of software engineering — review, refactoring, architecture — would be downstream of prompt quality. It does not hold. And the codebases built on the assumption that it does start failing in a pattern that is now boring to diagnose: around month six, nobody can explain why a particular function looks the way it does, and every incremental change produces a wave of duplicates.

The compiler metaphor is the root cause, not vibe coding, not model quality, not prompt skill. It is a category error that quietly excuses teams from doing the work that keeps a codebase coherent over years. When you believe the model is a compiler, the generated code is an implementation detail, the same way assembly is an implementation detail of a C program. When you are actually running a team of non-deterministic, context-limited collaborators, the generated code is the asset — and the prompts are closer to Slack messages than to source.

LLM-as-Judge Drift: When Your Evaluator Upgrades and All Your Numbers Move

· 11 min read
Tian Pan
Software Engineer

A regression suite that flips green-to-red without a single prompt change is usually one of three things: a broken test harness, a flaky retrieval store, or a judge that learned new taste over the weekend. The third one is the most common and the least debugged, because no commit in your repo caused it. The scoring model got a silent quality refresh, and every score you compare against last month's dashboard is now denominated in a different currency.

This is the uncomfortable part of LLM-as-judge: you have two moving models, not one. The candidate model is the thing you ship; the judge model is the thing that tells you how the candidate is doing. When both evolve independently, score deltas stop meaning what they used to, and the dashboard that your PM refreshes every morning quietly lies.
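One way to see the drift at all, sketched below: stamp every score with the judge version, and keep a frozen anchor set of candidate outputs that gets re-scored whenever the judge changes. Any movement on the anchors is the judge, not the candidate you ship. The record shape and the judge call are placeholders, not a specific API.

```python
# Sketch: separate judge drift from candidate movement by re-scoring a frozen
# anchor set whenever the judge version changes. `call_judge` is a placeholder.
import statistics

def call_judge(judge_model: str, response: dict) -> float:
    raise NotImplementedError  # your judge call: returns a score for one candidate output

def score_anchors(judge_model: str, anchor_set: list[dict]) -> dict:
    # anchor_set: frozen candidate outputs that never change between judge versions
    scores = [call_judge(judge_model, r) for r in anchor_set]
    return {"judge_model": judge_model, "mean": statistics.mean(scores)}

def judge_drift(old_run: dict, new_run: dict) -> float:
    # Same frozen responses, two judge versions: any delta here is the judge
    # moving, not the candidate you ship.
    return new_run["mean"] - old_run["mean"]
```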

Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring

· 11 min read
Tian Pan
Software Engineer

Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.

The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.

This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.
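A sketch of how you might measure the tax on your own workload: run the same eval cases twice, once with constrained JSON output and once with free-form prose plus a cheap extraction pass, and compare task accuracy rather than parseability. `generate`, `extract_answer`, and the grader are placeholders, not any particular API.

```python
# Sketch: paired comparison of the same eval set with and without constrained
# JSON output. All three callables are placeholders for your own stack.
def generate(prompt: str, json_mode: bool) -> str:
    raise NotImplementedError  # your model call, with or without constrained decoding

def extract_answer(raw: str) -> str:
    raise NotImplementedError  # parse the JSON field, or pull the answer out of prose

def accuracy(eval_set: list[dict], json_mode: bool) -> float:
    correct = 0
    for case in eval_set:
        raw = generate(case["prompt"], json_mode=json_mode)
        correct += case["check"](extract_answer(raw))  # task-specific grader, not "is it valid JSON"
    return correct / len(eval_set)

def format_tax(eval_set: list[dict]) -> float:
    # Positive number: you are paying for the schema in task accuracy.
    return accuracy(eval_set, json_mode=False) - accuracy(eval_set, json_mode=True)
```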

Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.
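What the steering wheel looks like in code is not complicated. A minimal sketch, assuming a standard tool-calling loop: corrections go into a queue, and the loop drains it into the transcript between tool calls instead of killing the run. `run_model` and `execute_tool` stand in for whatever your agent framework provides.

```python
# Sketch: a steering channel drained between tool calls, so a correction becomes
# a new message in the transcript rather than an abort. Loop shape is assumed.
import queue

steering: "queue.Queue[str]" = queue.Queue()  # the UI pushes corrections here instead of aborting

def run_model(messages: list[dict]) -> dict:
    raise NotImplementedError  # returns {"type": "final", ...} or {"type": "tool", "tool": ..., "args": ...}

def execute_tool(name: str, args: dict) -> str:
    raise NotImplementedError

def agent_loop(messages: list[dict]) -> str:
    while True:
        # Drain mid-flight corrections before the next model call: the run keeps
        # its context, and the correction is a steering input, not an ejector seat.
        while not steering.empty():
            messages.append({"role": "user", "content": steering.get()})
        step = run_model(messages)
        if step["type"] == "final":
            return step["content"]
        result = execute_tool(step["tool"], step["args"])
        messages.append({"role": "tool", "content": result})
```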

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

· 9 min read
Tian Pan
Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.
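You don't have to give up determinism to close part of this gap. One hedged sketch: keep the canned responses, but wrap them in a stub that injects timeouts, rate limits, and partial results on a seeded schedule, so the harness at least exercises the retry and fallback paths. The failure mix below is an assumed example, not a recommendation.

```python
# Sketch: a fault-injecting tool stub that stays reproducible (seeded RNG) while
# exercising failure handling. The failure modes and rates are assumptions.
import random

class FaultInjectingStub:
    def __init__(self, canned_response: dict, seed: int = 0, failure_rate: float = 0.3):
        self.canned = canned_response
        self.rng = random.Random(seed)        # seeded, so CI runs stay deterministic
        self.failure_rate = failure_rate

    def __call__(self, **kwargs) -> dict:
        if self.rng.random() < self.failure_rate:
            mode = self.rng.choice(["timeout", "rate_limit", "partial"])
            if mode == "timeout":
                raise TimeoutError("tool call timed out")
            if mode == "rate_limit":
                return {"error": "rate_limited", "retry_after_s": 5}
            return {**self.canned, "truncated": True}  # partial result
        return self.canned

# Example: wrap the canned payload you already use for search_orders.
search_orders = FaultInjectingStub({"orders": [{"id": "A-1", "status": "shipped"}]})
```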

The Orphan Adapter Problem: When Your Fine-Tune Outlives Its Base Model

· 12 min read
Tian Pan
Software Engineer

A senior engineer left six months ago. She owned the classifier adapter that routes customer support tickets — a 32-rank LoRA trained on 847 hand-labeled examples, pinned to a base model that hits end-of-life in 43 days. Nobody remembers why those 847 examples were chosen over the 2,000 they started with. The training data sits in an S3 bucket whose lifecycle policy purges objects older than one year. Her laptop was wiped. The fine-tuning notebook has a cell that calls a preprocessing function she imported from her personal dotfiles repo, now private.

This is the orphan adapter — a fine-tune that outlived its maintainers, outlived its data, and is about to outlive the base model it was trained on. It sits in your production stack, routing real user traffic, and nobody left on the team can rebuild it. The deprecation email didn't create this crisis. It just exposed it.
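The fix is unglamorous: metadata that travels with the weights and is reviewed like code. A minimal sketch of what such a manifest might record — every value below is an illustrative placeholder, not a prescription.

```python
# Sketch: the minimum metadata that keeps an adapter rebuildable. Field names
# and all values are illustrative placeholders.
from dataclasses import dataclass, asdict
import json

@dataclass
class AdapterManifest:
    adapter_name: str
    base_model: str               # exact base checkpoint the LoRA was trained against
    base_model_eol: str           # date the provider retires that checkpoint
    lora_rank: int
    training_data_uri: str        # durable location, not a bucket with a purge policy
    training_data_sha256: str     # hash of the exact examples used
    training_script_commit: str   # commit in a shared repo, not a personal notebook
    owner: str                    # a team alias, not an individual

manifest = AdapterManifest(
    adapter_name="ticket-router",
    base_model="example-base-7b@2024-06",
    base_model_eol="2025-08-01",
    lora_rank=32,
    training_data_uri="s3://ml-artifacts/ticket-router/train-v3.jsonl",
    training_data_sha256="<sha256 of the training file>",
    training_script_commit="<commit hash>",
    owner="support-ml@company.example",
)
print(json.dumps(asdict(manifest), indent=2))
```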

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

· 11 min read
Tian Pan
Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was incidental to how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Your Prompt Is Competing With What the Model Already Knows

· 11 min read
Tian Pan
Software Engineer

The frontier model you just wired up has opinions about your competitors. It has a default answer to the hard question your product was built to disagree with. It has a "best practice" for your domain that came from whatever happened to dominate the training corpus, and a quiet preference for the conventional take on every controversial call your team agonized over in the design doc. None of that is in your system prompt. You did not write any of it. And on the queries where your differentiation actually lives, the model will reach for those defaults before it reaches for what you told it.

Most teams ship as if the model is a configurable blank slate. Write the persona, list the rules, paste the brand voice guidelines, run a few QA prompts that produce the right shape of answer, and call it done. The prompts that get reviewed are the prompts that hit easy queries — the ones where the model's prior happens to align with what you wanted anyway. The interesting queries, the ones where your product would lose badly if it produced the generic answer, almost never make it into the prompt-iteration loop. Those are the queries where the prior wins silently.

Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

· 11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The change touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.
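Treating it as a schema can be mechanical. A sketch under assumed parameter names: hash the settings that define chunk shape, stamp that hash on the index at build time, and refuse to serve queries against an index built under a different hash — the same way a migration tool refuses to run against the wrong schema version.

```python
# Sketch: version the chunker config like a schema. Parameter names are
# illustrative assumptions.
import hashlib, json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkerConfig:
    tokenizer: str = "cl100k_base"
    max_tokens: int = 512
    stride: int = 128
    split_on: str = "headings"     # boundary rule

    def schema_hash(self) -> str:
        canon = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canon.encode()).hexdigest()[:12]

def assert_compatible(index_metadata: dict, config: ChunkerConfig) -> None:
    built_with = index_metadata["chunker_schema_hash"]
    if built_with != config.schema_hash():
        raise RuntimeError(
            f"Index was built with chunker schema {built_with}, "
            f"but the query path is running {config.schema_hash()}: rebuild or pin."
        )
```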

Why Your RAG Citations Are Lying: Post-Hoc Rationalization in Source Attribution

· 10 min read
Tian Pan
Software Engineer

Show a user an AI answer with a link at the end of each sentence, and the needle on their trust meter swings halfway across the dial before they have read a single cited passage. That is the whole marketing pitch of enterprise RAG: "grounded," "sourced," "verifiable." It is also the most-shipped, least-tested claim in AI engineering. Recent benchmarks find that between 50% and 90% of LLM responses are not fully supported — and sometimes contradicted — by the sources they cite. On adversarial evaluation sets, up to 57% of citations from state-of-the-art models are unfaithful: the model never actually used the document it is pointing at. The citation was attached after the fact, to rationalize an answer the model had already decided to give.

This is not a retrieval bug. You can have perfect retrieval and still get lying citations, because the failure is architectural. The generator writes prose first and stitches links on second. The links look like evidence. They are decoration.
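If citations are attached after the fact, you can at least audit them after the fact. A sketch of the shape of that check — `supports` is a placeholder for an NLI model or a second judge call, and the sentence and corpus structures are assumptions.

```python
# Sketch: post-hoc citation audit. `supports` and the data shapes are placeholders.
def supports(claim: str, passage: str) -> bool:
    raise NotImplementedError  # NLI model or judge prompt: does `passage` entail `claim`?

def audit_citations(sentences: list[dict], corpus: dict[str, str]) -> list[dict]:
    """Each sentence dict: {"text": ..., "cited_doc_ids": [...]}. Returns flagged sentences."""
    flagged = []
    for s in sentences:
        supported = any(supports(s["text"], corpus[doc_id]) for doc_id in s["cited_doc_ids"])
        if s["cited_doc_ids"] and not supported:
            flagged.append({**s, "verdict": "citation_unsupported"})
    return flagged
```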

The Reflection Placebo: Why Plan-Reflect-Replan Loops Return Version One

· 9 min read
Tian Pan
Software Engineer

Open an agent's trace during a long-horizon planning task and count the number of times the model writes "let me reconsider," "on reflection," or "a better approach would be." Now compare the plan it finally commits to with the one it drafted first. In the majority of traces I've audited, the second plan is the first plan wearing a different hat — the same decomposition, the same tool calls, the same order of operations, with some renamed step labels and a reworded rationale. The reflection ran. The model emitted tokens that looked like reconsideration. The plan did not move.

This matters because "with reflection" has quietly become a quality tier. Teams ship planners with one, two, or three reflection rounds and bill themselves for the difference. The inference spend is real and measurable. Whether anything on the plan side actually changed is a question almost nobody instruments for, and the answer is frequently: no.
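Instrumenting this is cheaper than the reflection rounds themselves. A sketch, assuming plans are lists of tool-call steps: normalize away the labels and rationale that reflection likes to reword, then measure how much the step sequence actually moved between rounds.

```python
# Sketch: measure whether reflection changed the plan at all. The plan and step
# shapes are assumptions.
import difflib, json

def normalize(plan: list[dict]) -> list[str]:
    # Keep only what determines behavior: the tool and its canonicalized arguments.
    return [f'{step["tool"]}({json.dumps(step.get("args", {}), sort_keys=True)})' for step in plan]

def plan_change_ratio(before: list[dict], after: list[dict]) -> float:
    a, b = normalize(before), normalize(after)
    # 0.0 means the "reconsidered" plan is the first plan wearing a different hat.
    return 1.0 - difflib.SequenceMatcher(None, a, b).ratio()
```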