The Perf Review Template That Cannot See AI Work
Your strongest AI engineer spent the cycle curating an eval set, calibrating a judge prompt, and killing two features whose task shape turned out not to fit the model. None of that fits a single line on the review template. So the calibration meeting either inflates the artifacts the engineer cares least about (PR count, design docs, on-call shifts) or invents prose to justify a high rating the framework cannot defend. Either way, the rubric and the reality are pulling in different directions, and the engineer can tell.
The template was written for deterministic software. It rewards what you can count: lines of code shipped, services owned, incidents resolved, hours spent on-call. The AI roadmap is moved by a different shape of work: curating a representative eval slice, defending a behavioral envelope under model drift, refusing to ship a feature whose task shape doesn't fit the model, and patiently shrinking the gap between a judge prompt and human intent. Almost none of that produces the artifacts the rubric was built to count.
This is not a new problem in the abstract — every wave of engineering has had work that didn't fit the previous template. What's new is the velocity. The AI engineer who built your most-loved feature this cycle and the AI engineer who killed two features that would have shipped badly may have produced almost identical performance evidence by the rubric's accounting, and the framework cannot tell you which one to promote.
The Rubric Was Written for a Different Shape of Work
Most engineering ladders ask three questions, dressed up in different language: how complex is the system you own, how much do you ship, and how much do you mentor. The implicit mental model is that the work product is code, the code goes to production, and production behavior is deterministic enough that "it works" and "it doesn't" are the operative categories.
AI-feature work breaks every assumption in that chain.
The work product is not code — it's a behavioral envelope: an eval set that represents what users actually do, a judge that scores outputs against that envelope, a prompt that pulls model behavior into the envelope, and a release gate that holds the line as the model drifts. The code is the smallest part of it, and often the part most likely to be deleted next quarter when the next model lands.
"It works" is not binary. A feature can pass its eval suite while quietly degrading on a slice the eval set didn't represent. A feature can fail its eval suite because the judge calibration drifted, not because the underlying behavior changed. A feature can ship at a score the team agreed was acceptable and turn out to be unacceptable in front of users a month later. There is no "passing tests" line you can point to in the rubric.
Mentorship in AI-feature work doesn't look like teaching someone the codebase. It looks like teaching someone to read a confusion matrix as a product question, to recognize when a regression on slice A is acceptable because slice A is 0.3% of traffic, and to feel in their gut when a judge prompt has started rewarding the wrong thing. None of those skills have a line on the rubric, and most of them transfer poorly through code review.
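The slice arithmetic behind that second skill can be made concrete. What follows is a minimal sketch in Python, not a prescribed method: the slice names, traffic shares, and pass rates are hypothetical numbers chosen for illustration, and `SliceResult` and `weighted_delta` are made-up helpers. It weights each slice's change in pass rate by its share of traffic, so a sharp drop on a 0.3% slice can be read next to a small gain everywhere else.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    name: str
    traffic_share: float     # fraction of production traffic, between 0 and 1
    pass_rate_before: float  # eval pass rate before the change
    pass_rate_after: float   # eval pass rate after the change


def weighted_delta(slices):
    """Traffic-weighted change in pass rate across all slices."""
    return sum(
        s.traffic_share * (s.pass_rate_after - s.pass_rate_before)
        for s in slices
    )


# Hypothetical numbers: slice A regresses hard but is 0.3% of traffic.
slices = [
    SliceResult("slice_a", 0.003, 0.90, 0.70),
    SliceResult("everything_else", 0.997, 0.80, 0.82),
]

for s in slices:
    change = s.pass_rate_after - s.pass_rate_before
    print(f"{s.name}: share={s.traffic_share:.1%} change={change:+.2f}")

overall = weighted_delta(slices)
print(f"traffic-weighted change: {overall:+.4f}")
```

The weighted number alone does not settle the question. A 0.3% slice that contains your largest customer is still a product decision, which is why the per-slice rows are printed alongside the aggregate.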
The honest summary: a rubric trained on shipping a payments service is being asked to evaluate someone whose primary artifact is a calibrated judge. The categories don't map, and the calibration meeting has nowhere to put the work.
Peer Feedback Cannot See Eval Quality
The way most companies fill in the rubric's blanks is peer feedback — three to five colleagues write a paragraph each about what the engineer did well and where they could grow. The assumption is that peers, having worked alongside the engineer, can speak credibly to the quality of the work.
For deterministic software, this is roughly true. A peer can read a pull request, evaluate the architecture, judge the code quality, notice the absence of tests. They are competent to review the work because the work is legible in artifacts they are trained to read.
For AI-feature work, the peer is often not competent to review the work, and does not know they are not competent. Reading someone else's eval set requires being able to tell the difference between a representative sample and a flattering one. Reading someone else's judge prompt requires being able to notice the verbal tics the judge will reward. Reading someone else's kill decision requires being able to reconstruct the task-shape analysis that justified it. A peer who edits prompts confidently but has never built an eval cannot evaluate eval quality any more than someone who can use a database can evaluate index design.
What you get instead is reviewable surrogates. The peer writes about communication, responsiveness, helpfulness in design discussions, willingness to pair — the things they can observe. The actual artifacts — eval coverage, judge calibration, prompt archaeology — get a generic sentence like "owns the AI quality story for the team" that means nothing under scrutiny.
The result is a quiet inversion: the engineer whose AI work is best-loved by their peers is often the one who is most communicative about AI work, not the one whose AI work is most rigorous. Communication is real value, and it should be rewarded, but it should not be the only AI signal a perf packet contains.
Calibration Drifts Because Managers Grade Against Effort the Org Doesn't Recognize
Calibration is the meeting where managers in a band compare their reports and force the ratings into a distribution. It is also where the rubric mismatch becomes visible — and gets papered over.
The AI-feature manager comes in with a report whose evidence is eval coverage curves, judge calibration deltas, documented behavioral envelopes, and features killed before launch. They argue for a high rating because they can see the work and know what it cost. The other managers in the room, who run platform, infra, and product-surface teams, cannot evaluate that evidence. They can only see whether the prose sounds compelling.
What happens next decides whether the AI-feature engineer gets a fair review. The honest version: the AI-feature manager learns to translate the evidence into rubric-friendly artifacts — "led a quality initiative that reduced regression incidents 40%," "owned cross-team alignment on model migration," "drove the framework for evaluating LLM features." The translation is not false. It is also not what actually happened. The actual work was sitting with three thousand traces and discovering that the judge was rewarding hedged answers on questions that needed crisp ones. There is no rubric line for that.
The less honest version: the AI-feature manager learns that the inflated story works, and the un-inflated story doesn't, and starts grading their reports against an effort the rest of the org does not recognize. The bar quietly diverges. After three cycles, the AI-feature engineer at level N has produced evidence that would not get a non-AI engineer to level N — not because the work was less real, but because the rubric never let it count in the first place. When the org notices the gap, the correction often falls on the AI engineers, who lose ground they earned doing work the framework refused to see.
The way out is not to push harder on translation. It is to fix the rubric.
What the Template Has to Add to See This Work
A review template is a policy document. What you ask for is what you reward, and what you don't ask for fades. If the template can be amended deliberately — not as a one-off carve-out, but as a permanent extension — these are the lines that need to exist.
Eval ownership and quality. Not "did this engineer write evals," which is a checkbox anyone can pass. The actual question: are the eval sets they own representative of production traffic, do the failure cases they include actually correspond to user-visible regressions, do the slices isolate the behaviors leadership cares about, and would another engineer be able to use the suite to triage a real incident without rebuilding context. This is the AI-work analogue of "owns the system end-to-end." It is the work that prevents shipped quality from drifting.
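One way a reviewer could put a number on "representative of production traffic" is to compare how eval cases and production requests are distributed across slices. The sketch below is an illustration under assumptions, not a standard the rubric requires: it assumes every eval case and every sampled production request already carries a slice label, the labels and counts are invented, and total variation distance is simply one convenient measure (0 means identical mixes, 1 means no overlap).

```python
from collections import Counter


def slice_shares(labels):
    """Fraction of items falling into each slice label."""
    if not labels:
        raise ValueError("need at least one labeled item")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(eval_labels, prod_labels):
    """Total variation distance between the two slice distributions."""
    p = slice_shares(eval_labels)
    q = slice_shares(prod_labels)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical slice labels for an eval set and a production sample.
eval_labels = ["billing"] * 50 + ["search"] * 50
prod_labels = ["billing"] * 20 + ["search"] * 60 + ["refunds"] * 20

gap = total_variation(eval_labels, prod_labels)
missing = sorted(set(prod_labels) - set(eval_labels))

print(f"distribution gap: {gap:.2f}")
print(f"production slices with no eval coverage: {missing}")
```

A low gap does not prove the eval set is good, since the cases inside each slice can still be flattering ones. It does give a calibration room a concrete artifact to ask for, and the list of uncovered slices is often the more useful half of the output.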
Prompt and rubric provenance. When the engineer changed a prompt, can you tell from the artifact why, what the alternatives were, and what was learned in the iteration? When the engineer changed a judge rubric, can you reconstruct the calibration data behind the change? Prompt archaeology — the ability to read someone else's prompt history the way you read code history — is the version control discipline of AI work, and engineers who do it well are doing leadership-grade work even at mid-levels.
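As an illustration of what "calibration data behind the change" might look like, here is a small sketch that compares two versions of a judge against the same human labels. Everything in it is assumed for the example: the labels are invented, the version names are hypothetical, and Cohen's kappa is one common agreement statistic chosen here for convenience, not something the original workflow is known to use.

```python
def cohens_kappa(judge, human):
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if the two raters labeled independently.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration set: human verdicts on eight traces.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]

# Two hypothetical judge rubric versions scored on the same traces.
judge_v1 = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_v2 = ["pass", "pass", "pass", "fail", "pass", "pass", "pass", "pass"]

kappa_v1 = cohens_kappa(judge_v1, human)
kappa_v2 = cohens_kappa(judge_v2, human)

print(f"judge v1 agreement with humans: {kappa_v1:.2f}")
print(f"judge v2 agreement with humans: {kappa_v2:.2f}")
```

Stored next to the rubric change, a pair of numbers like these lets a later reader see whether an edit moved the judge toward or away from human intent. In this invented data the second version agrees with humans less, mostly by passing traces humans failed, which is the kind of drift a provenance record exists to catch.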
Model-migration leadership. A model migration is the AI equivalent of a database migration: silent regressions, a long tail of edge cases, asymmetric downside, and work that stays invisible until it isn't. The engineer who quietly led the team through three model upgrades without a customer-visible incident has done load-bearing work that deserves a line, separate from the "ships features" line.
Kill decisions defended with data. The engineer who killed two features this cycle saved you launches you would have regretted. If your rubric only rewards shipped features, you have built a framework that punishes the most important judgment your AI team exercises. A line for kill decisions, with the evidence that supported them — eval suite said no, task shape did not fit, behavior was unrecoverable with the available models — is how you reward refusal as a competence.
Sustained-quality custodianship. The first six weeks after launch are when most AI features quietly degrade — eval drift, judge drift, prompt rot from model upgrades, traffic shifts that violate the implicit assumptions of the eval set. The engineer who treats post-launch quality as ongoing work rather than a closed ticket is doing the work that turns demos into durable products. There needs to be a line for it.
You will notice these lines do not require new evidence — they require the rubric to ask for evidence the engineer is already producing. The artifacts exist. The template just has no slot to receive them.
The Leadership Frame
The realization for leadership is uncomfortable: the team you built to ship AI features is being measured against a framework that does not yet know how to see them. Every cycle the rubric stays unchanged, the gap widens. The engineers who can translate their work into the legacy template get promoted. The engineers who do the deepest AI work but cannot or will not translate get under-leveled. The bar in your AI org quietly diverges from the bar in the rest of engineering, and you do not find out until someone leaves and the work breaks.
This is not solved by hiring an "AI rubric consultant" or appending a four-bullet AI Addendum to the existing template. It is solved by treating the rubric as a product surface that needs to be iterated on as fast as the work it is measuring. The first time you do a calibration with the new lines in place, the discussion will be harder — managers without AI reports will push back, the rubric language will need revisions, the bar at each level will have to be re-anchored. That difficulty is the rubric finally doing its job.
The rubric is not the enemy. The version of the rubric that pretended AI work fit the old slots was the enemy, and it survived because the alternative was uncomfortable. The version of the rubric that names the work explicitly is the one that lets the people doing the work be seen, paid, and promoted for it. That is the only version that scales past the first cycle.
