The AI Engineering Perf Packet: Making Stochastic Work Legible at Promotion Review

11 min read
Tian Pan
Software Engineer

A senior engineer walks into the promotion calibration meeting. They shipped a fine-tuned reranker that lifted retrieval quality eight points. They built the eval harness that turned a two-week QA cycle into a one-hour CI gate. They authored the prompt change that drove a two-point conversion lift. By any reasonable measure, they had a defining year.

They don't get promoted. The packet, as written, reads like "I tuned some numbers." The colleague next to them — who shipped a CRUD feature behind a launch banner with QPS, latency, and a Friday demo — gets the nod instead. The committee is not malicious. It is using the only vocabulary it has, applied to a packet that never translated the work into that vocabulary.

This failure mode is now common enough to be a pattern. AI engineering work doesn't decompose cleanly into the artifacts that calibration committees were trained to evaluate. The packet template was written for deterministic systems shipped in deterministic ways, and the engineers who do the most leveraged work in the AI stack are paying the tax.

The Packet Template Was Written for a Different Kind of Work

The standard senior-engineer packet has three load-bearing artifacts: a shipped feature with a launch date, a system you owned with QPS or SLO numbers, and a counterfactual the committee can verify ("if you hadn't done this, X would have broken"). It assumes a world where work compiles cleanly into demos, dashboards, and incidents.

Now look at the AI engineer's year:

  • The reranker isn't a feature with a launch date. It shipped behind a flag and ramped over six weeks because nobody trusted any single-day cutover with stochastic output. There is no banner moment. The dashboards show a smooth curve that bends slightly upward.
  • The eval harness isn't a system with QPS and latency numbers. It's a CI-time tool nobody outside the team uses directly. It has no production footprint. Its impact is the velocity of every other engineer on the team — exactly the kind of second-order leverage the rubric struggles to weigh.
  • The conversion-lifting prompt was authored by three people across two months. Two of them have already moved teams. The attribution conversation, if anyone bothers to have it, devolves into "they all helped."
  • The counterfactual question — "what would have happened without you?" — has no clean answer in a stochastic system. The committee asks it expecting a number. The honest answer is "the curve would have been somewhere lower, by an amount we'd have to re-run the A/B to estimate."

The committee does what committees do under ambiguity: they default to the framing the packet uses. If the packet says "improved retrieval quality," the committee hears "tweaked some hyperparameters." If the packet says "owned the eval harness," the committee hears "wrote some scripts." This is not because the work was small. It's because the packet didn't translate the work into the committee's currency.

Frame the Artifact as the System, Not the Model

The single largest leverage point in an AI engineering packet is the framing of what was shipped. Engineers default to naming the model — the reranker, the prompt, the fine-tune — because that's what they spent their time inside. Committees parse the packet looking for systems.

Compare:

  • "I improved eval coverage."
  • "I built the eval harness that turned prompt changes from a two-week QA cycle into a one-hour CI gate."

The work is the same. The first reads as housekeeping. The second is a system, with a before-and-after, that the committee can place against any other infrastructure project they've ever calibrated.
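The before-and-after framing also has a concrete shape the committee recognizes: a gate that blocks a merge. A minimal sketch of such a CI eval gate, where `run_eval`, the metric names, and the numbers are all illustrative stand-ins rather than anything from a real harness:

```python
# Hypothetical sketch of a CI eval gate. `run_eval` stands in for the
# real harness that scores an eval set under a candidate change; the
# baseline metrics and tolerance here are illustrative.
BASELINE = {"hit_at_1": 0.627, "faithfulness": 0.91}
TOLERANCE = 0.005  # absorb small eval noise; block real regressions

def run_eval(candidate: str) -> dict:
    # Stand-in for the real harness: score the eval set under `candidate`.
    return {"hit_at_1": 0.630, "faithfulness": 0.912}

def gate(candidate: str) -> list:
    """Return the metrics that regressed past tolerance (empty = merge OK)."""
    results = run_eval(candidate)
    return [m for m, base in BASELINE.items() if results[m] < base - TOLERANCE]

failing = gate("prompt-v7")
print("eval gate:", "FAILED " + str(failing) if failing else "passed")
```

The point of the sketch is the contract, not the implementation: a prompt change either clears the gate in an hour or it doesn't, which is exactly the before-and-after a committee can weigh.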

Compare:

  • "I improved the system prompt."
  • "The prompt change required isolating one of seventeen interacting instructions whose effects only show up at long context lengths — every prior attempt had moved the wrong instruction and regressed an upstream metric."

The first sounds like a typo fix. The second names a difficulty curve other engineers had hit and you crossed. That's the language calibration committees actually use to draw senior-vs-staff lines.

The discipline is to write the packet for a reader who has never seen AI work calibrated before. Don't assume they know that "shipped a reranker" implies an offline eval harness, an online A/B, a feature flag, a rollback playbook, a cost analysis, and a production monitoring story. List those things. Each is an artifact a committee already knows how to value.

Quantify Counterfactuals With A/B Rigor

The committee's counterfactual instinct — "what would have happened without you?" — is sharper than it looks. They are not asking for hypotheticals. They are asking for a falsifiable comparison. Product teams answer this question every quarter using A/B tests. AI engineers should hold themselves to the same standard.

The weak version: "the new reranker improved retrieval quality."

The strong version: "the reranker shipped behind a flag with a 50/50 split for three weeks against the prior model. Retrieval quality improved by 8.2 points (Hit@1 from 62.7% to 70.9%) with N=2.3M queries, p < 0.001. Downstream conversion lifted 1.4 points in the same window. We held the experiment open for an additional two weeks at 90/10 to confirm no degradation in long-tail latency before full rollout."

The strong version reads as engineering. The weak version reads as a vibe. Same work, three orders of magnitude difference in committee credibility.
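The significance claim in the strong version is itself checkable with a standard two-proportion z-test. A sketch using the quoted numbers, assuming the 2.3M queries split evenly across the two arms:

```python
import math

# Two-proportion z-test on the quoted Hit@1 figures, assuming an even
# 50/50 split of the 2.3M queries (1.15M per arm).
n1 = n2 = 1_150_000
p1, p2 = 0.627, 0.709  # Hit@1: prior model vs. new reranker

pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se
p_value = math.erfc(z / math.sqrt(2))  # two-sided p-value

print(f"z = {z:.1f}, p < 0.001: {p_value < 0.001}")
```

At this sample size an 8.2-point gap is overwhelming; the value of writing the test into the packet is that the committee can audit the claim instead of taking "improved" on faith.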

The same rigor applies to eval results. "Coverage improved" is not a sentence in a senior packet. "Eval coverage on the regulated-content slice went from 41 examples to 1,820 examples spanning 14 failure modes we'd previously been blind to; three of those modes have caught regressions in CI in the last quarter that would have shipped under the old harness" is a sentence in a senior packet.

When a counterfactual genuinely is fuzzy — "we don't know what would have happened" — say so explicitly and bound it. "We don't have a clean counterfactual for the prompt change because we shipped it without a holdout, but the conversion curve broke from its prior trend the week of the change with no other concurrent shipments; the most conservative attribution is 60% of the lift, the most generous is 100%, the midpoint is the number I'm using." Committees respect engineers who name uncertainty. They distrust engineers who hide it.

Surface Second-Order Leverage Explicitly

The eval harness is the canonical example. It runs in CI. It blocks merges. It produces a number that goes into a dashboard nobody outside the team looks at. By every traditional metric, it is invisible work. By the metric that actually matters — how much faster the team ships safe AI changes — it is the load-bearing artifact of the entire quarter.

Engineers consistently bury this kind of work. They mention the harness in a sub-bullet. They list the four projects it enabled as if they were separate from it. They let the committee assume the harness happened by accident, in someone's spare time.

The translation:

  • Bury: "I also built some eval infrastructure that the team uses."
  • Surface: "The eval harness I built was the prerequisite for the next four engineers' projects this year. Three of them — the safety-classifier upgrade, the long-context routing, and the customer-support agent — have explicitly named the harness in their own packets as what made their work possible. Without it, each of those projects would have shipped without a quality gate or would have been delayed by the manual QA cycle the harness replaced."

The second framing names a chain of leverage the committee can verify by reading other packets. It also creates a positive feedback loop: every engineer downstream who credits the harness in their own packet is now an unsolicited witness in your favor.

The same logic applies to prompt libraries, golden-trace test suites, attribution dashboards, model-routing layers, and any other piece of platform work the AI team builds for itself. If you built the thing every other AI project at the company depends on, your packet should make that the headline, not the footnote.

Name the Difficulty Curve Other Engineers Hit

A consistent pattern in AI work is that the change that finally landed was the seventh attempt, and the prior six had failed in different ways. The packet that says "shipped the change" undersells the work. The packet that says "isolated the right intervention after six prior attempts had moved adjacent levers and regressed an upstream metric" tells the committee what difficulty class the work was in.

Naming the difficulty does not mean dramatizing the struggle. It means writing one sentence that a calibration committee can use to distinguish "the engineer turned a knob" from "the engineer diagnosed a non-obvious interaction in a system where most interventions make things worse." The latter is a senior-vs-staff signal in any engineering domain. AI work tends to have it in spades — the systems are interaction-heavy, the failure modes are emergent, and the wrong intervention is often indistinguishable from the right intervention until you measure it.

Concrete frames that travel well:

  • "Three teammates had attempted this and reverted; the prior failures were instructive about which axis to actually move on."
  • "The prompt has seventeen instructions that interact at long context lengths; isolating which one was responsible required a per-instruction ablation across the eval set."
  • "The fine-tune required curating a 4K-example dataset where every example was labeled by a domain expert, because the obvious synthetic-data approach had failed under our adversarial probes."

Each of these is one sentence. Each transforms a line item from "did the thing" to "crossed the difficulty bar that defines the level."
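The per-instruction ablation in the second frame is a loop any committee member can picture. A toy sketch, where `score_prompt` stands in for a real eval-harness run and the instruction names are hypothetical:

```python
# Hypothetical per-instruction ablation: drop one instruction at a time,
# re-score on the eval set, and attribute the metric drop to the
# instruction removed. `score_prompt` is a toy stand-in for the harness.
INSTRUCTIONS = [f"instruction_{i}" for i in range(17)]

def score_prompt(instructions: list) -> float:
    # Stand-in scorer: in this toy, only instruction_4 carries the metric.
    return 0.70 if "instruction_4" in instructions else 0.62

baseline = score_prompt(INSTRUCTIONS)
ablations = {}
for inst in INSTRUCTIONS:
    reduced = [i for i in INSTRUCTIONS if i != inst]
    ablations[inst] = baseline - score_prompt(reduced)  # drop attributed to inst

culprit = max(ablations, key=ablations.get)
print(culprit, round(ablations[culprit], 3))
```

The real version is seventeen harness runs at long context lengths, which is precisely the kind of systematic, expensive diagnosis a one-line "improved the system prompt" erases.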

The Manager's Half of the Job

No engineer can rescue a packet from a calibration committee that has never seen AI work calibrated before. The first AI engineer up for promotion in any room is calibrating not just themselves but the room's vocabulary for every AI engineer who will follow. Their packet becomes the reference point. If it framed the work as "tuned some numbers," the next packet that uses the language of systems and counterfactuals will be read against the prior framing and look like inflation.

The manager who lets this happen is the manager whose senior AI engineer doesn't get promoted, and whose next AI engineer also doesn't get promoted, and who eventually loses both to a company whose calibration committee has done the work.

The manager's job is to pre-brief the committee. Not at the calibration meeting itself — by then it's too late — but in the weeks before: during the writing of the rubric, in the informal conversations where how-do-we-talk-about-this-class-of-work gets settled. The brief is short:

  • AI work ships incrementally behind flags; the launch date is a ramp, not a banner.
  • Eval harnesses are platform work and should be credited like any other platform contribution.
  • Counterfactuals in stochastic systems are A/B reads, not feature-shipped/not-shipped binaries — judge them by the rigor of the experimental design, not the certainty of the conclusion.
  • Prompt and fine-tune authorship is often shared; treat it the way the rubric already treats shared codebase ownership.

A manager who walks into calibration without having done this groundwork is asking the committee to invent the rubric in real time, under social pressure, against a packet that doesn't fit the template. The committee will fall back on the template. The engineer will pay the tax.

The Org-Level Realization

Career-ladder rubrics written for deterministic systems systematically under-credit AI engineering work. The work is real. The leverage is large. The artifacts don't fit the template. Committees default to the template under ambiguity. Engineers who don't translate get under-leveled. Engineers who get under-leveled leave for companies that have updated their rubric.

The arbitrage is real and it is happening now. Companies that update their calibration vocabulary keep their best AI engineers. Companies that don't watch them walk to companies that have. The packet is the proximate artifact, but the deeper failure is rubric design — and the deeper opportunity is an engineering organization that takes the time to update the rubric before the loss is visible in attrition data.

If you are the engineer up for promotion, write the packet the committee can read. If you are the manager, pre-brief the room. If you are in the room, ask the question your rubric doesn't yet have language for: what is the right currency for an artifact that ramped, that has no QPS, that other engineers depend on, that improved a probability distribution rather than shipped a feature? The teams that answer this in 2026 keep the engineers who matter. The teams that don't are paying for exit interviews that haven't been scheduled yet.
