
The AI Engineer Interview Is Broken: Stop Testing Implementation, Start Probing Eval-Design

10 min read
Tian Pan
Software Engineer

A team I worked with last quarter rejected three candidates in a row from their AI engineer pipeline. All three failed the coding screen — the kind of problem where you implement a sliding-window deduplicator under a 35-minute timer. The team then hired the candidate who passed it. Four months later that engineer was the one who shipped the feature where the eval scored 92% in CI and the support queue lit up the day after launch. The eval was measuring exact-match against a curated test set. Production users phrased their queries differently. Nobody on the hiring panel had asked the candidate how they would have caught that gap.

That's the shape of the bug. The interview pipeline was screening for the skill that mattered least to the job and was blind to the skill that mattered most. The team did not have a "judgment" round. They had a coding round, a system-design round, and a behavioral round, and they were running the same loop they had run in 2021 — the one calibrated for engineers who were going to write deterministic code against stable libraries.

The day-to-day of an AI engineer in 2026 looks almost nothing like that. The work is shaped by a dozen judgment calls per week that the interview never probes: when is fine-tuning the wrong answer, how do you write an eval that won't be gamed, when do you trust the model's output enough to skip the human-in-the-loop, when do you tell a stakeholder "we shouldn't ship this feature." Implementation is the cheap part. Recognizing that the system is wrong when the metric says it's right — that's the bottleneck.

The interview is calibrated for the wrong distribution of bugs

Classical software interviews probe two skills: can you produce correct code under a tight constraint, and can you reason about a system at scale. Both are still valuable. Neither is sufficient. The bugs an AI engineer creates cluster differently than the bugs the interview is screening for.

A deterministic engineer's bugs come from incomplete mental models — off-by-one errors, missed edge cases, race conditions. The leetcode-style screen is a noisy but real proxy for that skill, and the system-design round catches the architecture-level version. The bugs an AI engineer creates come from stochastic-system mismatch — a prompt that works on the eval set and fails on a distribution shift, a tool call the model invokes confidently and incorrectly, a cost regression that nobody notices because the request count looks normal but the token count tripled.
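
As one illustration of that last failure mode, here is a minimal sketch of a tokens-per-request check that would surface it; the metric names and the 1.5× threshold are placeholders, not from any particular monitoring stack.

```python
# A minimal sketch of the check that catches the third failure mode above:
# request volume looks flat, but tokens per request (and therefore cost) has drifted.
from dataclasses import dataclass


@dataclass
class DailyUsage:
    requests: int
    total_tokens: int

    @property
    def tokens_per_request(self) -> float:
        return self.total_tokens / max(self.requests, 1)


def cost_regression(baseline: DailyUsage, today: DailyUsage, max_ratio: float = 1.5) -> bool:
    """Flag days where per-request token usage grew even though traffic did not."""
    return today.tokens_per_request > baseline.tokens_per_request * max_ratio


baseline = DailyUsage(requests=10_000, total_tokens=4_000_000)   # ~400 tokens/request
today = DailyUsage(requests=10_200, total_tokens=12_500_000)     # ~1,225 tokens/request
assert cost_regression(baseline, today)  # request count looks normal; cost roughly tripled
```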

You cannot screen for the second category by testing the first. The skills overlap less than the interview pipeline assumes. The skill of solving sliding-window dedup in 35 minutes is uncorrelated with the skill of looking at a green CI eval and asking "is this measuring the failure mode users actually hit?"

What an eval-design round actually looks like

The single highest-leverage round to add is eval-design — and the trick is that the candidate should design the eval before writing the prompt. Most interview formats invert this. They give the candidate a prompt-engineering exercise and ask "make this work better." The result is a rationalization round: candidates iterate on the prompt against an eyeballed sample of three or four examples and you learn nothing about how they would catch the failure mode in the other 99%.

Flip it. Hand the candidate a feature spec — "build the part of customer support triage that decides which queue a ticket goes to, including a confidence score" — and let them spend the first 40 minutes doing exactly two things: writing an eval, and arguing for it. Don't let them touch a prompt yet. The artifact you're scoring is the eval design. Specifically:

  • Did they identify failure modes the spec didn't name? A strong candidate will surface tail risks the spec was silent on — adversarial inputs, multilingual queries, ambiguous tickets that should go to human review, cases where the cost of a wrong routing is asymmetric.
  • Did they specify the rubric precisely enough that two graders would agree? The eval is only as good as the inter-annotator agreement on its rubric. Watch for "the output should be helpful" vs. "the output should name the correct queue from this list of seven, and an output that hedges between two queues is graded as a miss."
  • Did they think about which slice of the data the eval has to cover? A static gold set scored once is a 2023 artifact. The candidate should reach for cohort sampling — does the eval grade against the live distribution, or against a curated set that was fresh six months ago? If the latter, what's the refresh cadence?
  • Did they validate the judge against a small human-labeled set? If the eval uses an LLM judge, the candidate should know that the judge's agreement with human labels has to be measured before the judge is allowed to gate releases. Practitioners typically aim for 75–90% agreement before scaling.
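
To make the rubric and judge-validation points concrete, here is a minimal sketch built on the seven-queue triage example from the spec above. The queue names and `judge_label` wrapper are hypothetical, and the 0.85 threshold is just the middle of the rule-of-thumb band, not a standard.

```python
# A minimal sketch of a strict rubric plus a judge-vs-human agreement check.
# `judge_label` is a hypothetical wrapper around whatever model acts as the judge.
from typing import Callable

QUEUES = ["billing", "refunds", "shipping", "account", "bug", "abuse", "human_review"]


def grade_output(model_output: str, gold_queue: str) -> bool:
    """Strict rubric: the output must name exactly one queue from the list, and it
    must be the gold one. An output that hedges between two queues is a miss."""
    named = [q for q in QUEUES if q in model_output.lower()]
    return named == [gold_queue]


def judge_agreement(
    tickets: list[str],
    human_labels: list[str],
    judge_label: Callable[[str], str],
) -> float:
    """Fraction of tickets where the LLM judge agrees with the human label."""
    assert len(tickets) == len(human_labels)
    hits = sum(
        judge_label(ticket) == human
        for ticket, human in zip(tickets, human_labels)
    )
    return hits / len(tickets)


def judge_is_trusted(agreement: float, threshold: float = 0.85) -> bool:
    """The judge is not allowed to gate releases until it clears the bar on a
    human-labeled sample drawn from the live distribution, not a curated demo set."""
    return agreement >= threshold
```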

The candidate who skips straight to "I'll prompt the model to output JSON" without any of this has just told you they cannot tell when their own system is wrong. That is the most expensive thing you can hire.

The trade-off interview replaces the system-design round

System design as it's currently run is a vestigial round. The candidate draws boxes on a whiteboard, walks through QPS and sharding, and the interviewer scores them on whether they remembered to mention a CDN. None of this maps to the trade-offs an AI engineer makes weekly.

The replacement is a trade-off interview — present the candidate with a real conflict between two axes and watch how they reason about it. Examples:

  • Quality vs. cost. The premium model wins your eval by 3 points. It's 6× the inference cost. The PM wants to ship the premium tier. What do you ask before deciding, and what's the smallest experiment that decides it?
  • Latency vs. capability. Adding a second tool call to the agent loop fixes 40% of the failures, but adds 800ms to the median response. The product is voice. Walk me through your decision.
  • Coverage vs. drift. You have a fairness audit that passes with the model frozen at v3. Migrating to v4 unblocks three roadmap items but invalidates the audit. What do you do?

The signal you're looking for is not a "right answer" — there isn't one. The signal is whether the candidate decomposes the trade-off into measurable axes, names the experiment that would resolve the uncertainty, and surfaces the second-order effects that make the easy answer wrong. A strong candidate will refuse to answer until they've extracted constraints from the interviewer. A weak candidate will pick an axis and rationalize.
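
As an illustration of what "decompose into measurable axes" means for the quality-vs-cost bullet, here is a minimal sketch with placeholder numbers; none of the prices are real.

```python
# A minimal sketch of turning "wins the eval by 3 points at 6x the cost" into a
# number the team can argue about. All figures are illustrative placeholders.

def cost_per_extra_success(
    base_quality: float,      # eval pass rate of the cheaper model, e.g. 0.82
    premium_quality: float,   # eval pass rate of the premium model, e.g. 0.85
    base_cost: float,         # dollars per 1,000 requests on the cheaper model
    cost_multiplier: float,   # premium is this many times more expensive
) -> float:
    """Dollars spent per additional successful request out of every 1,000."""
    extra_successes = (premium_quality - base_quality) * 1000
    extra_cost = base_cost * (cost_multiplier - 1)
    return extra_cost / extra_successes


price = cost_per_extra_success(0.82, 0.85, base_cost=20.0, cost_multiplier=6.0)
print(f"${price:.2f} per additional successful request")  # ~$3.33
```

The smallest experiment is then routing a small traffic slice to the premium model and checking whether those extra successes show up in the downstream metric the PM actually cares about.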

The debugging-replay round catches the taste signal

Debugging-replay is a session where the candidate is handed a real production trace — the user query, the intermediate reasoning, the tool calls, the model output, the user's eventual reaction — and asked to reason about what went wrong and what would have caught it earlier. This is closer to a code review than to a coding round, and it probes a specific skill that interview formats almost never test: can the candidate look at an output and recognize it's bad even when the metric says it's fine.

A trace is the complete record of what happened, and reasoning about a stochastic failure from a trace is the daily work of any AI engineer past the first six months of the job. Pick a trace where the failure is subtle: the model produced a confidently wrong answer that scored high on the automated rubric, or a tool call that succeeded but was the wrong tool for the user's actual intent. Watch for whether the candidate:

  • Forms a hypothesis about the root cause and proposes the cheapest experiment to falsify it
  • Distinguishes "the model got confused" (a prompt or model problem) from "the system gave it the wrong context" (an architecture problem)
  • Notices the failure modes the existing eval would not catch — and proposes the eval extension that would have flagged it
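
Here is a minimal sketch of the trace shape described above, and of the kind of eval extension the last bullet asks for; the field names are assumptions, not any particular tracing framework's schema.

```python
# A minimal sketch of a production trace, plus one check that flags the
# "confidently wrong" case the existing eval would not catch.
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str


@dataclass
class Trace:
    user_query: str
    reasoning: str                      # intermediate model reasoning, if logged
    tool_calls: list[ToolCall] = field(default_factory=list)
    model_output: str = ""
    confidence: float = 0.0             # the score the system shipped to the user
    rubric_score: float = 0.0           # what the automated eval gave this output
    user_escalated: bool = False        # did the user end up in the support queue?


def confidently_wrong(trace: Trace, confidence_floor: float = 0.9) -> bool:
    """Flag traces where the system was sure of itself, the eval agreed, and the
    user still escalated -- the failure the automated rubric scored as a pass."""
    return (
        trace.confidence >= confidence_floor
        and trace.rubric_score >= 0.9
        and trace.user_escalated
    )
```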

This is the round where you find out whether the candidate has taste. Taste is not a vibe. It's accumulated judgment about what good looks like, calibrated against thousands of traces. You cannot synthesize it in a 45-minute session, but you can detect its presence or absence in about ten minutes of real reasoning.

What to drop, and why dropping it is harder than adding rounds

The eval-design, trade-off, and debugging-replay rounds are additive — they add three new signals. The harder organizational move is dropping the rounds that no longer carry their weight. Specifically: the leetcode screen and the algorithmic component of the system-design round.

This is hard because the leetcode screen is the cheapest round to run. It scales, it's defensible, candidates expect it. The system-design round is sticky for the same reason. Cutting them feels like lowering the bar. It isn't — it's removing a noisy filter that was selecting for the wrong skill while imposing a tax on the candidates you actually want.

Two patterns to consider as a transition:

  • Replace the leetcode screen with a take-home eval-design exercise. The candidate gets a feature spec and four hours. They submit an eval design (no implementation needed). You grade it on the same axes as the live round. Candidates who can't tell a good eval from a bad one self-select out.
  • Replace the system-design round with a hybrid trade-off-and-architecture session. Keep the architectural muscle — concurrency, data flow, failure modes — but anchor it to a real AI-shaped system rather than a generic "design Twitter" prompt. The candidate is reasoning about the same skills, but on a substrate that maps to the job.

The teams that resist this transition are usually the same teams that hire coding-screen-strong candidates and then complain six months later that the engineers can't tell when the model is wrong. The pipeline is the proximate cause. The hiring rubric is the root cause.

The bottleneck is judgment, and the interview should be measuring it

The structural shift underneath all of this is that implementation throughput is no longer the constraint on AI teams. Code generation is cheap. Prompt iteration is cheap. The bottleneck is the rate at which engineers can correctly judge whether the system is working — whether the eval is measuring the right thing, whether the trace shows a real failure or a curated one, whether the trade-off the team is making is the one they think they're making. That skill is the rate-limiting step on the team's velocity. It's also the skill the interview pipeline is least equipped to measure.

If you change one thing about your interview loop in the next quarter, make it the eval-design round, and put it before any coding round. The candidates who do well on that round will do well on the job. The candidates who do well on the leetcode screen and badly on the eval-design round are the ones who will ship the feature that scores 92% in CI and lights up the support queue.

The interview is the cheapest place to fix this gap. The expensive place is the production trace six months later, when somebody has to explain to the stakeholder why the eval was green when the system was wrong.
