The Promotion Packet for AI Engineers Who Didn't Ship a Feature
The AI engineer with the strongest case for promotion on your team has a promotion packet that looks empty. Two quarters of work and the impact graph is a flat line. The eval-regression rate that used to spike to 12% on every model swap now sits at 4%. The $40k/month cost spike that would have had finance escalating never materialized, because somebody added a budget guard to the gateway. The P0 incident that would have made the company's status page never happened, because a kill-switch tripped and routed traffic to the previous prompt version.
The packet has nothing to write in the "shipped X" column. The calibration committee sits down with two engineers side by side: one who shipped two visible features this half, one who quietly absorbed the load that made those features possible. The committee, doing what it has always done, rates the shipper higher. The infra-shaped engineer either takes a "meets expectations" rating they don't deserve and quits inside a quarter, or learns to write the packet in a language the committee can actually read.
This post is about the second option. It's a working theory of how to write the AI engineering promotion case when the impact lives in negative space — incidents that didn't happen, costs that didn't materialize, regressions that never reached users — and a parallel theory of why the calibration rubric written for SaaS-era feature shipping is now the single biggest org bug in companies running production AI.
The shape of the impact you can't put in a packet
Every traditional impact frame the calibration committee knows how to score has a positive delta on the y-axis. Revenue went up. Latency went down. A new product surface launched. Adoption climbed by 14%. The engineers who do that work walk into the room with a chart and the chart goes the right way and the committee nods.
The AI engineer's chart goes flat. Or, more accurately, it goes flat at a level it was never going to reach without their work, and the committee can't see the missing alternate timeline. Consider the actual portfolio of an AI engineer carrying a quarterly load:
- Eval regression rate dropped from 12% per model swap to 4%. The committee sees a flat eval pass rate. The engineer sees: the team can now do model upgrades twice as often, every product launch ships against a known baseline, and the customer-visible incident count from "the model got worse" went to zero.
- Cost per active user stayed flat against a 3x growth in feature complexity. The committee sees a flat unit economics line. The engineer sees: caching, prompt compression, model routing, and prefill optimization, each of which averted roughly $10k/month in added spend, all delivered in a single half while three new features shipped on top.
- Incident count stayed at one per quarter. The committee sees no change. The engineer sees: an automated rollback for prompt deploys, a per-tool blast-radius cap that contained a misbehaving agent before it ran a destructive operation, a circuit breaker that saved the team from a vendor outage they didn't even notice.
- Tail latency held at 2.1s p99 through a doubling of traffic. The committee sees no improvement. The engineer sees: a retry policy rewrite, a connection pool tuning pass, a streaming-first refactor of the hottest endpoint, all of which prevented the latency cliff that competitors hit publicly that quarter.
The pattern is consistent. The work compounds, the load it absorbed scaled, the alternate timeline got measurably worse for everyone else who didn't do this work, and the chart on the slide is a flat line. Reading flat as "did nothing" is the calibration bug.
Counterfactual is the only honest unit of impact
Outside of engineering promotion processes, mature fields already have a name for the right way to measure this kind of work. Counterfactual impact: the difference between what actually happened and what would have happened in the absence of the intervention. Public health uses it to evaluate vaccines. Economics uses it to evaluate policy. Operations research uses it to evaluate process changes. The whole point is that prevention work cannot be measured by direct observation, because the thing it prevented is by definition not in the data.
In engineering promotion, counterfactual reasoning is treated as suspect. "How do we know the incident would have happened without you?" is the polite version of the calibration committee's discomfort with the frame. The answer is the same answer epidemiology gives: build the counterfactual carefully, anchor it to comparable units, and quantify the gap.
For an AI engineering promotion narrative, the comparable units are usually right there:
- Other features in the company that didn't get the same investment. If the recommender team's eval regression rate is 11% and yours is 4%, the gap is the counterfactual. If the search team's incident count this half is six and yours is one, the gap is the counterfactual. The committee doesn't need a fictional alternate universe — they need a peer team they're already calibrating other engineers against.
- Your own team's metrics from the prior period. If cost per active user was trending +18% quarter over quarter before this work and it's flat now, the counterfactual is the projected line. Walk the committee through the projection (a worked version is sketched below): "Here's the trend from before. Here's where the trend would have put us. Here's where we actually landed. The delta is $52k/month, annualizing to $624k."
- External baselines from comparable companies. If three competitors had publicly disclosed AI incidents in the same period and yours did not, that's a counterfactual the committee can verify themselves.
The discipline is to write each of these into the packet as an explicit comparison, not as an implicit claim. "Eval regression rate dropped to 4%" is not the impact statement. "Eval regression rate dropped to 4% versus a peer-team average of 11%, eliminating an estimated 8 customer-facing model degradations this half" is the impact statement. The first one is invisible. The second one is the exact frame the committee uses for shipped-feature engineers, just pointed at prevented harm instead of delivered features.
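The trend-projection math behind that second comparison is mechanical enough to script before the meeting. Here is a minimal sketch in Python; the baseline spend, the growth rate, and the single elapsed quarter are illustrative assumptions chosen to reproduce the numbers above, not figures from any real packet:

```python
def counterfactual_cost_delta(baseline_monthly, qoq_growth, quarters_elapsed, actual_monthly):
    """Project the pre-intervention cost trend forward and compare it to actuals.

    baseline_monthly: monthly spend when the trend was last unchecked
    qoq_growth:       prior quarter-over-quarter growth rate (0.18 for +18%)
    quarters_elapsed: quarters since the cost work landed
    actual_monthly:   observed monthly spend today
    Returns (monthly_delta, annualized_delta).
    """
    projected_monthly = baseline_monthly * (1 + qoq_growth) ** quarters_elapsed
    monthly_delta = projected_monthly - actual_monthly
    return monthly_delta, monthly_delta * 12


# Illustrative inputs only: ~$289k/month baseline, +18% QoQ trend,
# one quarter elapsed, spend held flat by the cost work.
monthly, annualized = counterfactual_cost_delta(289_000, 0.18, 1, 289_000)
print(f"~${monthly:,.0f}/month avoided, ~${annualized:,.0f} annualized")
# -> ~$52,020/month avoided, ~$624,240 annualized
```

The function doesn't care whether the y-axis is dollars, latency, or regression rate; any metric with a pre-intervention trend can be projected the same way and handed to the committee as an explicit line-versus-line comparison.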
Infrastructure as compounding leverage, with the receipts attached
The second move in the packet is to stop framing infrastructure work as infrastructure. Frame it as the precondition for every shipped feature in its blast radius, with a named list of those features and the specific way each of them depended on the infrastructure work.
The shipping engineer wrote: "Shipped the new conversation export feature." The supporting engineer's packet should not say "Built export-friendly streaming primitives." It should say: "Built export-friendly streaming primitives that the ship dates of the conversation export feature, the analyst summary view, and the bulk-data API all depended on. Without this work, the conversation export ship date slips by an estimated 4–6 weeks, the analyst summary view requires a separate streaming implementation, and the bulk-data API blocks on a different team's roadmap."
This is not name-dropping. It is correctly attributing the load-bearing dependency. The committee's instinct is to credit the shipper because the shipper's name is on the launch announcement. The packet's job is to show that three launches share a foundation, and the foundation has one author.
The eval and observability story should land the same way. Frame the eval suite as the moat that lets next quarter's features ship faster, and prove it with a concrete forecast. "The eval suite I built reduced the average time from prompt change to confident deploy from 11 days to 2 days. Three product launches scheduled for next half are budgeted at 11 days each in the original plan; the actual cost will be 2 days each. That's 27 engineering-days returned to feature work next half, equivalent to one additional feature shipping that wouldn't have." Calibration committees love numbers like that. The trick is that you have to do the math before walking in.
Same move for kill-switches and rollback automation. "The rollback automation reduced mean time to recover from 47 minutes to 4 minutes. At our current incident rate that's roughly 6 hours of downtime avoided this half. At our gross margin and active-user count that's $X of revenue protected." Either you write that math or the committee writes a flat impact line by default. The math is yours to do.
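Both of those statements reduce to a few lines of arithmetic, and the packet is stronger when the arithmetic is shown rather than asserted. A minimal sketch of both calculations, where every input (launch count, incident rate, the dollars-per-hour figure) is an illustrative assumption:

```python
# Engineering-days returned by the faster eval cycle (illustrative inputs).
days_before, days_after = 11, 2        # prompt change -> confident deploy
planned_launches = 3                   # next-half launches budgeted at the old cycle time
days_returned = (days_before - days_after) * planned_launches
print(f"{days_returned} engineering-days returned next half")   # -> 27

# Downtime avoided by faster rollback, and the revenue that protects.
mttr_before_min, mttr_after_min = 47, 4
incidents_per_half = 8                 # assumed incident rate, for illustration only
revenue_per_hour = 9_000               # assumed gross-margin-weighted $/hour at risk

hours_avoided = (mttr_before_min - mttr_after_min) * incidents_per_half / 60
print(f"~{hours_avoided:.1f} hours of downtime avoided this half")        # -> ~5.7
print(f"~${hours_avoided * revenue_per_hour:,.0f} of revenue protected")  # -> ~$51,600
```

The incident rate and dollar figure here are placeholders for the arithmetic; plug in your own, and the "roughly 6 hours" claim either survives or gets corrected before someone in the calibration room does it for you.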
The manager translation problem
Even with a perfectly written packet, the engineer who writes it cannot get themselves promoted. Calibration is a peer-to-peer process and the packet is filtered through a manager who has to defend it in a room of other managers. If the manager cannot translate non-feature impact into the language the rest of the room speaks, the packet dies in calibration regardless of what's in it.
The manager's job in this room is harder than it is for a feature-shipping engineer. For the shipper, the manager points at the launch and the room nods. For the AI infrastructure engineer, the manager has to walk the room through the counterfactual frame, defend it against "but how do we know," and re-anchor the rubric every single time. Doing this well is itself a leadership skill, and managers who can't do it should not be running AI teams. The cost of getting it wrong is not theoretical: the load-bearing engineers of your AI program are the ones most easily poached, because every other AI team is hiring for exactly that profile. The manager who lets a genuinely high-impact engineer take a "meets expectations" rating is functionally writing a referral letter to a competitor.
The translation pattern that tends to land in the calibration room:
- Open with the counterfactual. "Without this work, our half would have looked like X." X is concrete: a specific incident, a specific cost line, a specific feature ship date that would have slipped.
- Anchor against a peer team. "Team Y, who didn't make this investment, hit X this half." The committee already calibrates against Team Y's engineers; now they're calibrating against Team Y's outcomes too.
- Quantify the dependency graph. "Three features the committee already credited to other engineers depended on this work." The shipping engineers' impact does not get diminished. The supporting engineer's impact gets correctly added to the total.
- Forecast the next half's leverage. "This investment compounds. Next half's roadmap has N features that ship faster because of it." The packet is not just retrospective; it sets expectations for what the engineer will be credited for in the next cycle.
If the manager walks in with those four moves rehearsed, the calibration outcome usually flips. If they walk in with "I think they did really good infrastructure work," it does not.
The rubric is the bug
The deeper issue is that none of this is the engineer's fault, and none of this is the manager's fault either. The leveling rubric most companies are using was written when "shipping a feature" meant "writing the code that becomes the user-visible product surface." In the AI era that's no longer true. The user-visible product surface is a thin shell on top of a stack — eval suites, observability, cost guards, prompt management, model routers, kill-switches, rollback automation — that is doing most of the load-bearing work. The engineers maintaining that stack are the reason the surface is reliable. Treating them as second-class because their work doesn't translate to a launch announcement is an organizational pricing error.
The companies that fix the rubric earliest will end up with the AI engineers everyone else is trying to hire. The signal is not subtle: when an AI org's calibration cycle ends and the engineers who leave in the following quarter are the eval and infrastructure people, the rubric was the cause and the rubric is the fix. The shipped-feature proxy was acceptable when shipping was the bottleneck. In production AI, shipping is not the bottleneck — keeping the thing reliable, evaluable, and within budget is. The rubric should reflect that, and any leadership team running AI at scale should treat fixing the rubric as a Q1 priority on par with any technical bet.
For engineers reading this who are about to enter a calibration cycle without a feature to point at: the work is not invisible because the work doesn't matter. It's invisible because the packet is written wrong and the rubric is wrong. Neither of those is a verdict on the engineer. The packet is a writing exercise that has a known answer: counterfactual impact, peer-team comparison, dependency graph, forward leverage. Do the math, run the comparisons, name the features that depended on you, and forecast the leverage you'll generate next half. The flat line is not a flat impact. It's a chart that needs a different y-axis.