The AI Engineer Interview Is Broken: Stop Testing Implementation, Start Probing Eval-Design
A team I worked with last quarter rejected three candidates in a row from their AI engineer pipeline. All three failed the coding screen — the kind of problem where you implement a sliding-window deduplicator under a 35-minute timer. The team then hired the candidate who passed it. Four months later that engineer was the one who shipped the feature where the eval scored 92% in CI and the support queue lit up the day after launch. The eval was measuring exact-match against a curated test set. Production users phrased their queries differently. Nobody on the hiring panel had asked the candidate how they would have caught that gap.
That's the shape of the bug. The interview pipeline was screening for the skill that mattered least to the job and was blind to the skill that mattered most. The team did not have a "judgment" round. They had a coding round, a system-design round, and a behavioral round, and they were running the same loop they had run in 2021 — the one calibrated for engineers who were going to write deterministic code against stable libraries.
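To make the gap in that anecdote concrete, here is a minimal sketch of the check nobody asked about: score the same system on the curated set and on a slice of paraphrased queries, then compare. Everything in it is hypothetical and for illustration only (`toy_model`, the two example sets, the answers); it is not the team's eval, and a real paraphrase slice would be sampled from production logs rather than written by hand.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (query, expected answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the prediction equals the expected answer."""
    hits = sum(normalize(predict(q)) == normalize(expected) for q, expected in examples)
    return hits / len(examples)


# Hypothetical system under test: it only handles the curated phrasings.
CURATED_ANSWERS = {
    "how do i reset my password": "Use the reset link on the login page.",
    "how do i cancel my subscription": "Go to Billing and choose Cancel.",
}


def toy_model(query: str) -> str:
    return CURATED_ANSWERS.get(normalize(query), "Sorry, I don't know.")


# The curated set mirrors the phrasings the system was built around.
curated_set: List[Example] = list(CURATED_ANSWERS.items())

# Hand-written stand-ins for how production users might phrase the same intents.
paraphrased_set: List[Example] = [
    ("forgot my password, what now", "Use the reset link on the login page."),
    ("stop billing me", "Go to Billing and choose Cancel."),
]

curated_score = exact_match_accuracy(toy_model, curated_set)
paraphrased_score = exact_match_accuracy(toy_model, paraphrased_set)
gap = curated_score - paraphrased_score

print(f"curated: {curated_score:.0%}  paraphrased: {paraphrased_score:.0%}  gap: {gap:.0%}")
```

The number to look at is the gap, not either score on its own. A high curated score paired with a large gap says the eval is measuring familiarity with the test set's phrasing, which is the failure the CI score hid.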
