The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'
The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote strong hire: this is exactly how good engineers work in 2026. Your second-most-senior interviewer wrote no hire: handing the problem to a chatbot is not engineering. Same four words. Same forty-minute window. A forty-point gap on the same scorecard.
The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.
This isn't a candidate-quality problem. It's a rubric-integrity problem dressed up as one, and the longer it goes uncalibrated the more your hiring bar becomes a function of which interviewers were on the panel that week instead of what the role actually requires.
The rubric you copied from 2022 is grading a job that no longer exists
The interview loop you're running was almost certainly built around a definition of engineering competence that predates the working pattern your team actually uses every day. The loop checks whether the candidate can implement a small algorithm from scratch, reason about time complexity, and walk through a system-design problem on a whiteboard. None of those checks are wrong — they're just no longer load-bearing for what an AI engineer at your company will spend Tuesday morning doing.
The Tuesday-morning reality is closer to: read a 4,000-line module that nobody on the team wrote, decide what the LLM-generated first draft of the change got wrong, push back on the parts that are subtly broken, accept the parts that are subtly correct, and own the result whether the model wrote it or not. The 2022 rubric grades the "implement from scratch" muscle that the 2026 job rarely uses, and gives no points for the "review and edit AI output" muscle that the 2026 job uses constantly.
So when a candidate skips to "I'd just prompt it" without showing the underlying reasoning, your loop has no agreed-upon way to disambiguate two very different signals: the senior who is correctly identifying that this is a solved problem the model handles well, and the junior who is hiding the fact that they cannot reason about the problem at all. Both produce the same four-word answer. Only one of them is the candidate you want to hire.
The disagreement is the data — not the noise
The standard reaction to a forty-point spread in a debrief is to argue harder, average the scores, or defer to the most senior voice. All three are wrong responses. The spread is the most valuable artifact your loop produced this week, and treating it as a vote-counting problem rather than a signal-extraction problem is how hiring bars quietly drift for years before anyone notices.
Inter-rater reliability is the boring statistical name for the thing that's broken. When structured-interview research reports inter-rater reliability climbing from around 0.37 to around 0.67 after calibration work, what it's really saying is: before calibration, interviewers agreed only weakly, and after calibration they agreed well enough that the panel's decision means something. A forty-point gap on "I'd just prompt it" is what an IRR below 0.4 looks like from inside the debrief.
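For a sense of what such a number measures, here is a minimal sketch of Cohen's kappa, one common inter-rater statistic for two raters. The figures quoted above may come from a different statistic; the function name and the ten sample decisions are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same candidates."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty candidate list")
    n = len(rater_a)
    # Share of candidates on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected if each rater kept their own label frequencies
    # but assigned labels independently of the other rater.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions from two interviewers on the same ten candidates.
senior = ["hire", "hire", "hire", "no", "no", "hire", "no", "hire", "no", "no"]
second = ["hire", "no", "hire", "no", "hire", "hire", "no", "no", "no", "hire"]

kappa = cohens_kappa(senior, second)
print(round(kappa, 2))  # 0.2: they agree on 6 of 10, but 5 of 10 is expected by chance
```

Raw agreement of 60% sounds respectable. A kappa of 0.2 shows how little of it survives once chance agreement is removed.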
The fix isn't more interviewers. It isn't a more detailed rubric. It isn't a longer debrief. It's a calibration session where the panel sits down with the same recorded candidate answer and surfaces the reasons their scores diverged. Not "I felt strong hire" versus "I felt no hire" — but "I gave strong hire because pushing back on AI output requires the same judgment as writing it from scratch, and the candidate showed that judgment in their follow-up" versus "I gave no hire because the candidate didn't explain what they'd prompt or how they'd verify it, and I can't tell whether they have the judgment from this answer alone." Those are two different rubric items. Right now they're collapsed into one score.
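One way to keep those two items from collapsing is to record a score per item rather than per candidate. The sketch below assumes a 1-to-4 scale and invents the item names and scores; it illustrates the idea and is not a recommended schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemScore:
    interviewer: str
    item: str
    score: int  # assumed scale: 1 (no evidence) to 4 (strong evidence)


def spread_by_item(scores):
    """Return max-minus-min score per rubric item across interviewers."""
    by_item = {}
    for s in scores:
        by_item.setdefault(s.item, []).append(s.score)
    return {item: max(vals) - min(vals) for item, vals in by_item.items()}


# Hypothetical scores for one answer, split into the two items named above.
scores = [
    ItemScore("interviewer_a", "judgment_shown_in_follow_up", 4),
    ItemScore("interviewer_b", "judgment_shown_in_follow_up", 3),
    ItemScore("interviewer_a", "explains_prompt_and_verification", 1),
    ItemScore("interviewer_b", "explains_prompt_and_verification", 1),
]

for item, spread in spread_by_item(scores).items():
    print(item, spread)
```

In this made-up case the interviewers differ by at most one point on either item. The gap in their overall recommendation came from which item each one treated as decisive, and that is a rubric question the panel can settle once.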
What you're actually hiring for in 2026
The "AI engineer" job title is doing too much work. The same posting is being used to recruit at least three different roles that share a stack but require materially different skills, and your interview loop is likely scoring all three against the same rubric.
The first role is the product-AI engineer who composes existing models into a working product surface. They spend their time on retrieval design, prompt iteration, eval construction, latency budgets, and the unglamorous integration work of making the model behave in a specific business context. They need taste, system thinking, and the discipline to write evals before they write features.
The second role is the AI-platform engineer who builds the inference, training, or RAG infrastructure that other engineers consume. They need depth in distributed systems, observability, and the unsexy plumbing of running GPU workloads reliably. The "I'd just prompt it" answer is a red flag for this role almost regardless of who said it, because their job is to build the layer that makes prompting work at all.
The third role is the ML/research engineer who is closer to the model itself — fine-tuning, evaluation methodology, or original training work. They still need to know the math, can usually implement a transformer block from scratch, and would treat "I'd just prompt it" as an admission that the candidate has no opinion about model behavior.
If your panel is interviewing all three roles against one rubric, "I'd just prompt it" is strong hire for role one, a red flag for role two, and no hire for role three. The forty-point spread isn't disagreement about the candidate. It's interviewers each scoring against the rubric they wish they were using.
A calibration protocol that actually surfaces the disagreement before the offer
The pattern that works is uncomfortable because it requires the panel to admit they don't already agree. Most interview loops skip calibration because the senior people on the panel have been doing this for years and assume their judgment is the standard. The 0.37 IRR number says otherwise.
A workable protocol has four pieces. First, before the loop runs, the panel agrees on the job-shaped rubric — which of the three roles above this requisition is hiring for, and which competencies are load-bearing versus nice-to-have. The output is one page, not a deck, and the disagreements that surface during this conversation are exactly the disagreements the debrief was previously surfacing too late.
Second, the panel reviews two or three recorded or transcribed candidate sessions together and scores them independently before discussing. The point is not to agree — it's to discover where the disagreements are. A pattern of disagreement that recurs across multiple sessions (one interviewer always rates the "I'd just prompt it" answer higher than another) is a rubric ambiguity, not a personality clash.
Third, the rubric grows anchor examples for the ambiguous competencies. "Demonstrates judgment about AI output" is too abstract to score consistently. "Identifies a specific failure mode in the model's first draft and proposes a verification step before merging" is concrete enough that two interviewers will land on similar scores. The anchors are written from the disagreements surfaced in step two.
Fourth, the panel runs a periodic re-calibration on the candidates they've already hired and the ones they've already rejected. The question is not "did we get this right" but "would this scorecard land on the same decision today, and if not, has the rubric drifted or has the bar drifted?" When the senior interviewer who pushed for strong hire on the "I'd just prompt it" candidate is now frustrated with that hire's inability to reason about edge cases, the calibration session is where that signal becomes actionable rather than personal.
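The second step's test for a recurring pattern can be sketched as a mean signed gap per interviewer pair across sessions. The session data, the interviewer names, the 1-to-4 scale, and the one-point threshold below are all assumptions chosen for illustration; a real panel would pick its own.

```python
from itertools import combinations
from statistics import mean


def recurring_gaps(sessions, threshold=1.0):
    """Find interviewer pairs whose scores differ in the same direction every time.

    sessions maps a session id to {interviewer: score} for one competency.
    Returns {(a, b): mean of (a's score minus b's score)} for flagged pairs.
    """
    interviewers = sorted({name for scores in sessions.values() for name in scores})
    flagged = {}
    for a, b in combinations(interviewers, 2):
        diffs = [s[a] - s[b] for s in sessions.values() if a in s and b in s]
        if len(diffs) < 2:
            continue  # one session cannot show a pattern
        same_direction = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
        if same_direction and abs(mean(diffs)) >= threshold:
            flagged[(a, b)] = mean(diffs)
    return flagged


# Hypothetical independent scores on "judgment about AI output", scale 1 to 4.
sessions = {
    "session_1": {"ana": 4, "ben": 2, "chen": 3},
    "session_2": {"ana": 4, "ben": 1, "chen": 2},
    "session_3": {"ana": 3, "ben": 2, "chen": 3},
}

print(recurring_gaps(sessions))
```

A flagged pair points at wording in the rubric that the two people read differently, which is the raw material for the anchor examples in step three.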
The leadership move when your own team can't agree
The hardest part of fixing this is not the protocol. It's admitting, in front of the engineers you respect, that you cannot defend the hiring rubric for the function you are responsible for staffing. The instinct is to keep running the loop and hope the disagreements average out across enough panels. They don't. They drift. The bar your loop is actually enforcing is set by whichever interviewers happened to be available that week, and the function you build over twelve months is shaped by that accident of scheduling rather than by any explicit decision.
The leadership move is to pause the loop long enough to run the calibration session, even though the requisition is open and the recruiter is impatient and the engineering manager wants headcount before the next OKR cycle. A panel that cannot agree on what "I'd just prompt it" means is a panel that will keep making hiring decisions that neither the strong-hire interviewer nor the no-hire interviewer can later defend. The cost of staffing a function whose hiring rubric your own team cannot agree on is not a one-time miss. It's a year of mis-leveled offers, a year of debriefs that turn into AI-philosophy debates, and a year of post-hoc rationalization about whether the people you brought in are doing the job you needed.
The candidate's four-word answer isn't the ambiguity. Your rubric is. Fix that before you run the next loop, and the debriefs start being about candidates again.
