Hiring for AI Roles That Have No Career Ladder Yet

· 9 min read
Tian Pan
Software Engineer

You open a requisition for an "eval engineer." A week later your recruiter asks the obvious question: what level is this, and what does a good resume look like? You don't have an answer. The title didn't exist two years ago. There is no leveling rubric, no canonical interview loop, no pool of people with the words "eval engineer" already on their LinkedIn. You are hiring for a job the industry has not agreed exists.

This is the quiet bottleneck in shipping AI systems. The model is available. The infrastructure is rentable. What you cannot buy off the shelf is the person whose actual job is keeping a prompt-driven system honest — and your hiring machinery, built for roles with decades of precedent, has no slot for them.

The instinct is to wait. Wait for the title to standardize, for the bootcamps to mint candidates, for someone else to write the leveling guide you can copy. That instinct is wrong. The work exists now whether or not the title does, and the teams staffing it now are the ones learning what "good" looks like before their competitors even open the req.

The roles are real even when the titles are noise

Walk through any company that shipped an agent system in 2025 and you will find the same unstaffed work. Someone needs to own evaluation: deciding whether the new prompt is actually better than the old one, building the harness that answers that question, and keeping it from rotting. Someone needs to own the prompt surface itself — the sprawling collection of system prompts, tool descriptions, and few-shot examples that no single person currently understands. Someone needs to own reliability: the retries, the timeouts, the model-version pins, the slow drift in output quality that no dashboard is watching.

Job boards have invented a dozen names for these: eval engineer, agent reliability engineer, context engineer, prompt-systems owner, AI reliability engineer, trust engineer. The titles are noise. They are marketing labels applied after the fact, and they do not translate cleanly between companies. Do not hire the title. Hire the unstaffed work.

The practical move is to write the job around a concrete failure you are already living with. Not "eval engineer, level TBD" but "the person who can tell us, with evidence, whether shipping this prompt change will make our support agent worse." That framing does two things. It tells the recruiter what to screen for, and it tells you whether the role is even a full-time job yet or a responsibility you should bolt onto an existing engineer for two quarters. Plenty of these roles start as the latter. That is fine. What is not fine is leaving the work owned by nobody because the title felt premature.
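To make "with evidence" concrete, here is a minimal sketch of one way such a person might compare two prompt versions. It assumes each version has already been run on the same test cases and graded pass/fail; the function names and the choice of an exact sign test are illustrative assumptions, not a method this article prescribes.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test on the cases where the two prompts disagree.

    Returns the probability of a split at least this lopsided if the old and
    new prompts were equally good.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # the prompts never disagreed, so there is no evidence either way
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def compare_prompts(old_pass: list[bool], new_pass: list[bool]) -> dict:
    """Compare per-case pass/fail results for the same test cases (hypothetical helper)."""
    if len(old_pass) != len(new_pass):
        raise ValueError("both prompts must be graded on the same test cases")
    pairs = list(zip(old_pass, new_pass))
    fixed = sum(1 for old, new in pairs if not old and new)   # new prompt fixed the case
    broken = sum(1 for old, new in pairs if old and not new)  # new prompt broke the case
    return {
        "fixed": fixed,
        "broken": broken,
        "p_value": sign_test_p_value(fixed, broken),
    }


if __name__ == "__main__":
    old = [True, True, False, False, True, False]
    new = [True, False, True, True, True, True]
    print(compare_prompts(old, new))
```

The point of the sketch is the shape of the answer: not "the new prompt scored higher" but "it fixed this many cases, broke this many, and here is how likely that split is by chance." On a handful of cases the honest answer is often "not enough evidence yet."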

Screen for judgment and taste, not a checklist of tools

The reflex when hiring for an unfamiliar role is to anchor on tools. You write a job description demanding experience with a specific eval framework, a specific tracing platform, a specific agent library. This filters hard — and it filters for exactly the wrong thing.

The tools in this space have a shelf life measured in months. The framework you list today may well be deprecated before the new hire's first performance review. Worse, requiring named tools selects for people who chase tooling rather than people who understand the underlying problem. Someone who has built an evaluation harness from scratch with nothing but a spreadsheet and stubbornness will often outperform someone who can configure a fashionable platform but cannot tell you what to measure.

What you are actually screening for is judgment and taste. In an eval-focused role, the prompt is the easy part. The hard part is knowing whether the new version is genuinely better than the old one — choosing the right metric, designing a test set that mirrors production, recognizing when a benchmark has gone stale, and resisting the number that looks good but means nothing. That is taste. It does not show up on a resume as a keyword.

So interview for it directly. Give the candidate a real, messy artifact — a flawed eval set, a prompt that passes its tests but fails in production, a metric that improved while users complained — and ask them to critique it. Watch whether they verify or assume, whether they ask what the system is actually for before proposing a fix, whether they can articulate the tradeoff between two imperfect options. Strong candidates treat the AI's output as an intern's work to be checked, not an oracle's verdict to be trusted. That instinct — skeptical, detail-oriented, allergic to unexamined numbers — is the whole job. A checklist of tool names tells you none of it.

The best candidates arrive sideways

Here is the part that breaks most hiring pipelines: the strongest people for these roles often do not come from machine learning.

The reflex is to route every AI req to ML engineers and researchers. But most of this work is not modeling. Nobody on an eval or reliability team is training a foundation model. They are designing measurement systems, building test infrastructure, reasoning about failure modes, and treating a nondeterministic black box as something to be characterized and contained. Those are not ML skills. They are QA skills, data-engineering skills, and platform-engineering skills.

Think about who already does the core of an eval engineer's job. A senior QA engineer has spent a career deciding what to test, building harnesses, and distinguishing a real regression from noise — they just have not pointed those skills at a language model yet. A data engineer who has wrangled dirty datasets and built pipelines understands eval-set construction and drift better than most researchers. A platform engineer who has run production systems already thinks in retries, timeouts, version pinning, and graceful degradation — exactly the agent-reliability skill set, applied to a new kind of dependency.
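For readers who have not worked on the platform side, the reflexes described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production client: `PINNED_MODEL` is a made-up identifier, and `call` stands in for whatever function actually talks to the model and raises `TimeoutError` when it takes too long.

```python
import time
from typing import Callable

# Hypothetical identifier. The point is that the version is pinned explicitly
# rather than left as a floating "latest" alias.
PINNED_MODEL = "example-model-2025-01-01"


def call_with_retries(
    call: Callable[[str], str],
    fallback: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call a model with retries and exponential backoff, then degrade gracefully.

    `call` receives the pinned model name and is assumed to raise TimeoutError
    when the request exceeds its own timeout.
    """
    for attempt in range(max_attempts):
        try:
            return call(PINNED_MODEL)
        except TimeoutError:
            # Back off before the next attempt; skip the wait after the last one.
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Graceful degradation: return a safe canned answer instead of crashing.
    return fallback
```

None of this is specific to language models, which is the argument: an engineer who writes this wrapper by habit for any flaky dependency already has the instincts the role needs.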

These people apply for your AI roles and get filtered out in the first pass because their resume says "QA" instead of "ML." The fix is structural. Tell your recruiter explicitly to source from QA, data, and platform backgrounds, not just ML. Strip the modeling jargon out of the job description so it does not scare off the right people. And add one screening question that settles it fast: ask the candidate to design an evaluation for a system they have never seen. The ML researcher who has never shipped a product will often flounder. The QA lead will light up. Hire the one who lights up.

Build the leveling story before the first hire, not after

The most expensive mistake is not a bad hire. It is hiring a great one into a role with no future.

Because the title is new, it has no level. Because it has no level, it does not appear in your promotion criteria, your compensation bands, or your calibration meetings. The new hire does excellent, load-bearing work — and then comes review season, and there is no rubric that describes what they did, no peer group to calibrate against, and no obvious next step. They get a vague rating and a vague raise. Within a year they leave, because a role with no ladder is a job with a ceiling, and good engineers can feel a ceiling.

You prevent this before you open the req, not after. Map the new role onto your existing ladder. You almost certainly do not need a parallel hierarchy — engineering ladders already run from junior to staff and beyond, with levels defined by scope, autonomy, and impact rather than by domain. An agent-reliability engineer is a reliability engineer; the impact is just measured in eval coverage and incident reduction instead of feature delivery. An eval-systems owner is a senior engineer whose surface area is measurement infrastructure. Write one paragraph per level describing what the work looks like at that scope, and you have a leveling story.

Then make sure the leveling story is legible to the people who never touch the system. The manager running calibration needs to understand what "good" looks like for a role they have never managed. The promotion committee needs a sentence they can defend. If the only person who can explain the new hire's impact is the new hire, the role is structurally unpromotable, and you have built a trap. Spend the hour to write the rubric. It is far cheaper than re-running the search in eighteen months.

Staff the work now

The pattern across all of this is the same: do not wait for the industry to catch up before you act.

The title will standardize eventually. The leveling guides will get written, the interview loops will converge, the bootcamps will mint candidates with the right words on their resumes. When that happens, hiring for these roles becomes well-defined, competitive, and expensive, like hiring for any mature specialty. The advantage available right now is that it is not yet competitive or expensive.

So define the role around a failure you are already living with. Screen for judgment over a tooling checklist. Source sideways from QA, data, and platform, not just ML. And write the leveling story before you make the offer, so the person you hire has somewhere to go. The teams doing this today are not just filling seats. They are accumulating the institutional knowledge of what these roles actually are — and by the time the rest of the industry agrees on the title, they will already know how to hire, level, and grow the people behind it.

The work is real now. Staff it now.
