The Three Tastes of an AI Engineer: Why Prompts, Evals, and Guardrails Don't Live in the Same Head
The three best AI engineers I have hired this year would all fail each other's interviews. The one who writes prompts that survive a model upgrade has never written a useful eval case in her life. The one who designs eval sets that catch the failures that matter writes prompts that other engineers refuse to extend. The one who designs guardrails that fail closed without choking the happy path has opinions about the other two that I cannot print here.
The job ladder calls all three of them "AI engineer." The calibration committee compares their promo packets as if they had been doing the same job. They have not.
What follows is the argument that "AI engineering" is not one skill but at least three, that the three skills draw on different intuitions and reward different reflexes, and that hiring or promoting as if they were the same produces lopsided systems where every owner's dashboard is green and the user-visible quality is sliding into a ditch nobody owns.
Three Skills That Look Identical From the Outside
From the outside, all three engineers spend their days in a notebook, an eval dashboard, and a chat window with a model. The artifacts they produce — a prompt, an eval set, a guardrail layer — even live in adjacent files in the same repo. Hiring managers who have never built any of the three will tell you they are "the same job, different lenses." They are not. They are three separate jobs that happen to share a tool surface.
Prompt taste is the intuition that lets you write instructions a model will follow today and continue to follow when the underlying weights change next quarter. It includes a working theory of how the model interprets ambiguity, how it weights examples against system instructions, and which constructions are stable across versions versus which ones are clever scaffolding around a specific checkpoint's quirks. The senior prompt engineer reads a misbehaving prompt and immediately knows three ablations to try; the junior one rewrites the whole thing from scratch and calls the result an improvement.
Eval taste is the intuition that lets you write test cases that catch the failures that actually hurt users, distinguish signal from sampling noise, and recognize when a metric has detached from what it was supposed to measure. The senior eval engineer looks at a 200-case test suite and immediately knows which fifteen cases are doing all the work and which are noise; the junior one will defend every test case on the grounds that it once caught a bug. This is not the same instinct that writes good prompts. The 2026 industry norm of three-layer evals — automated metrics, LLM-as-judge, human review — only works when somebody has the taste to decide what each layer should and should not be asked to measure. That decision is upstream of any framework.
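To make the "decide what each layer measures" point concrete, here is a minimal sketch, with hypothetical case fields and a stubbed judge call, of what it looks like when the routing decision is written down next to each case instead of left to whoever adds the next one:

```python
# A minimal sketch, not a framework. The case fields, the judge stub, and the
# review queue are hypothetical; the point is that the layer assignment and
# its rationale live with the case.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str | None  # only meaningful for the automated layer
    layer: str            # "automated" | "judge" | "human"
    rationale: str        # why this layer, recorded with the case

def run_case(case: EvalCase, output: str) -> str:
    if case.layer == "automated":
        # Deterministic checks only: substring, regex, schema validation.
        return "pass" if case.expected is not None and case.expected in output else "fail"
    if case.layer == "judge":
        # LLM-as-judge is for graded qualities (tone, faithfulness), never
        # for anything a string comparison could have settled.
        return llm_judge(case.prompt, output)  # hypothetical judge call
    # Everything else goes to humans, with the rationale attached so the
    # reviewer knows what question they are being asked to answer.
    return enqueue_for_human_review(case, output)  # hypothetical queue

def llm_judge(prompt: str, output: str) -> str:
    return "pass"  # stub; a real judge would apply a written rubric

def enqueue_for_human_review(case: EvalCase, output: str) -> str:
    return "queued"
```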
Guardrail taste is the intuition that lets you design safety layers that fail closed on the cases where uptime is worth less than a wrong answer, and degrade gracefully on the cases where the opposite is true. It is the discipline of asking "what does this layer do when the moderation API is down" before shipping the moderation API. The senior guardrail engineer will tell you, before you ask, which of the four layers between the user and the model fail open and which fail closed and why; the junior one will tell you the layers exist.
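Here is a minimal sketch of that discipline, with a hypothetical moderation call standing in for whatever service you actually use; the only point is that the fail-open versus fail-closed choice is made explicitly at each call site rather than inherited from whatever the HTTP client happens to do on a timeout:

```python
# A minimal sketch of making the fail-open / fail-closed choice explicit.
# `call_moderation_api` is a hypothetical stand-in that simulates an outage.
def call_moderation_api(text: str) -> bool:
    """Returns True if the text is allowed. May raise on timeout or outage."""
    raise TimeoutError("moderation backend unreachable")  # simulated outage

def moderate(text: str, *, fail_closed: bool) -> bool:
    try:
        return call_moderation_api(text)
    except Exception:
        # The whole design question lives on this line: when the layer cannot
        # answer, does the request proceed or not?
        return not fail_closed

# Outbound messages to end users: a wrong answer costs more than downtime.
assert moderate("draft reply", fail_closed=True) is False
# Internal telemetry path: blocking it during an outage helps nobody.
assert moderate("debug trace", fail_closed=False) is True
```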
The three intuitions do not transfer. A great prompt engineer who writes their own evals will tend to write evals that confirm their prompt is good. A great eval engineer who writes their own guardrails will tend to write guardrails that pass their own test suite and nothing else. A great guardrail engineer who writes their own prompts will write the most cautious, least useful prompt that legal will let them ship. Each instinct, applied to a neighboring artifact, produces a recognizable failure mode.
What Lopsided Hiring Actually Ships
When teams hire for one taste and assume the others come along, the result is a recognizable archetype of broken system, and you can usually guess which taste is missing within ten minutes of looking at the metrics.
Great prompts on a benchmark that scores the wrong thing. The prompt is beautiful, the dashboard is green, and the user-visible quality has been declining for two months. The eval set was written by the prompt author, who optimized for the cases the prompt was already good at. The metric that would have caught the regression — say, faithfulness on the long-tail of follow-up questions — does not exist because nobody on the team has the eval taste to know it should.
Great evals on a prompt nobody can extend. The eval suite is enviable: production-mined cases, adversarial inputs, statistical confidence intervals, the lot. The prompt that the suite is testing is a 2,000-token monolith that hasn't been touched in five months because the only person who understands its construction left, and the next engineer who tries to add a new behavior breaks four eval cases for reasons that look unrelated. The team has bought regression coverage at the cost of extensibility.
Great guardrails on a feature that already passed safety review at the wrong layer. The guardrail engineer designed a beautiful PII redaction layer at the model output stage. The PII was already in the retrieval index because the data ingestion pipeline did not redact at write time. The guardrail catches the leakage that the engineer designed it to catch, and misses the leakage that actually happens, because the layer is in the wrong part of the stack. Layer placement is a guardrail-taste question, and the team did not have the person to ask it.
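For readers who want the layer-placement point in code: a minimal sketch, with an illustrative regex rather than a real PII detector, of redacting at write time so the output-stage filter becomes a backstop instead of the only line of defense:

```python
# A minimal sketch of redaction at ingestion time. The regex is illustrative,
# not a real PII detector; the point is that the leak never reaches the index.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)

def ingest(doc: str, index: list[str]) -> None:
    # Redaction happens before the document is ever indexed or embedded.
    index.append(redact(doc))

index: list[str] = []
ingest("Contact alice@example.com for the refund", index)
assert "alice@example.com" not in index[0]  # nothing for retrieval to leak
```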
In all three failure modes, every artifact owner can show their dashboard is green. The user, who experiences the integrated system, cannot show anyone anything; they just leave.
The Interview Signals That Actually Filter
Most AI engineering interviews in 2026 still test the surface skill — can the candidate write a prompt, can they describe an eval framework, can they list guardrail patterns. None of those are taste tests. They are vocabulary tests, and a candidate with three weekends and an LLM tutor will pass them. What you want to filter for is the underlying intuition, and the way to filter for taste is to give the candidate broken artifacts and ask them to act.
For prompt taste, give the candidate a misbehaving prompt and ask them to ablate it. Do not ask them to fix it. Ask them which three things they would change one at a time, in what order, and what they would expect each change to do. The candidate with prompt taste will name a hypothesis for each ablation; the candidate without it will rewrite the whole prompt and call the rewrite an improvement. The discriminating signal is whether they can hold a model of the model in their head and reason about what each construction is doing.
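If it helps to see the shape of the exercise, here is a minimal sketch of an ablation harness; the base prompt, the ablations, and the scoring stub are all hypothetical, and the structure is the point: one named change per run, with the expected effect written down before the run.

```python
# A minimal sketch of "ablate, one change at a time, with a hypothesis".
# `score` is a stub for running a fixed eval set against a candidate prompt.
BASE_PROMPT = "You are a support agent. Answer in two sentences. Cite the doc."

ABLATIONS = [
    ("drop_length_limit", lambda p: p.replace(" Answer in two sentences.", ""),
     "expect longer answers, no change in citation rate"),
    ("drop_citation_rule", lambda p: p.replace(" Cite the doc.", ""),
     "expect citation rate to collapse, length unchanged"),
    ("drop_persona", lambda p: p.replace("You are a support agent. ", ""),
     "expect tone drift, task success roughly unchanged"),
]

def score(prompt: str) -> float:
    return 0.0  # stub: run the fixed eval set against `prompt`, return a score

baseline = score(BASE_PROMPT)
for name, apply, hypothesis in ABLATIONS:
    delta = score(apply(BASE_PROMPT)) - baseline
    print(f"{name}: hypothesis={hypothesis!r}, observed delta={delta:+.2f}")
```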
For eval taste, give the candidate an eval set with thirty cases and ask which test they would delete. The candidate with eval taste will identify a case that is either redundant with another, measuring noise, or measuring a behavior the product no longer cares about, and explain why. The candidate without it will refuse to delete anything, because every test "could catch a bug." Coverage maximalism is a tell; ruthless pruning is the skill.
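One way to put a number behind "redundant with another case", sketched with a made-up pass/fail history; a real history would come out of your eval runner's logs, and redundancy is only one of the three deletion arguments the candidate might make:

```python
# A minimal sketch: if two cases have agreed on pass/fail across every
# historical run, one of them is probably not earning its keep.
from itertools import combinations

# rows: eval cases, columns: pass/fail across past runs (True = pass)
history = {
    "refund_happy_path":     [True, True, False, True],
    "refund_happy_path_2":   [True, True, False, True],  # shadows the case above
    "adversarial_injection": [True, False, True, True],
}

for a, b in combinations(history, 2):
    if history[a] == history[b]:
        print(f"{a} and {b} have never disagreed; consider deleting one")
```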
For guardrail taste, give the candidate a tool catalog with seven tools and ask which composition they would refuse to allow the agent to perform. The candidate with guardrail taste will identify a chain — read-from-untrusted-source plus write-to-privileged-sink, or read-private-data plus exfiltrate-via-side-channel — and articulate the threat model. The candidate without it will say "all the tools have been individually reviewed." The discriminating signal is whether they think about composition rather than enumeration.
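A minimal sketch of what "refuse the composition, not the tool" can look like, with hypothetical tool names and tags; the check runs over the sequence of calls in a session, which is exactly where individually reviewed tools become dangerous together:

```python
# A minimal sketch of a composition policy. Tool names, tags, and forbidden
# chains are hypothetical; the shape of the check is the point.
TOOL_TAGS = {
    "fetch_url":        {"reads_untrusted"},
    "read_customer_db": {"reads_private"},
    "send_email":       {"writes_external"},
    "update_crm":       {"writes_privileged"},
}

FORBIDDEN_CHAINS = [
    ({"reads_untrusted"}, {"writes_privileged"}),  # injected content reaches a privileged write
    ({"reads_private"},   {"writes_external"}),    # private data reaches an exfiltration channel
]

def allowed(call_sequence: list[str]) -> bool:
    seen: set[str] = set()
    for tool in call_sequence:
        tags = TOOL_TAGS[tool]
        for earlier, now in FORBIDDEN_CHAINS:
            if earlier & seen and now & tags:
                return False  # the chain, not the tool, is what gets refused
        seen |= tags
    return True

assert allowed(["read_customer_db", "update_crm"])
assert not allowed(["fetch_url", "update_crm"])
assert not allowed(["read_customer_db", "send_email"])
```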
None of these exercises are gotchas. Each is a five-minute conversation. The signal they produce is not "did the candidate get the right answer" but "does the candidate have a vocabulary for thinking about this artifact at all." Once you watch a few candidates work through them, the taste differences become embarrassingly easy to see.
A Rotation That Surfaces the Gap
The hiring filter only protects against the worst case. The harder problem is that engineers already on the team have asymmetric strengths, and the team usually doesn't know which engineer's taste lives where until something breaks. The cheapest way to find out is to make every senior IC own one of each kind of artifact in their first quarter on the team.
Not "be consulted on" — own. The prompt author commits the eval set. The eval owner ships a guardrail change. The guardrail engineer rewrites a system prompt. Each rotation produces an artifact the engineer's strength does not protect them from, and the team gets to watch how each engineer performs when they are working without their dominant intuition.
The point is not to make every engineer equally good at all three. That is unrealistic and not the goal. The point is to surface, in low-stakes work, the places where each engineer's instincts produce predictable mistakes, so that when the engineer is later trusted with a real artifact in their weak area, the team knows to staff a second pair of eyes from a complementary strength. Calibration committees that have done this rotation can answer "where does this engineer's taste actually live" in a way they otherwise cannot.
A useful side effect: the rotation also defuses the implicit hierarchy that prompt work is "real engineering" and eval or guardrail work is "support." Once everyone has shipped an eval miss and a fail-open incident, the respect for the other two tastes goes up sharply, and the cross-team handoffs get less defensive.
The Job Ladder Is Lying To Calibration
The deepest cost of treating the three tastes as one skill is paid in the calibration room. A senior promo packet is supposed to demonstrate impact, judgment, and craft. When the ladder has a single "AI engineer" track, the packet's evidence is in whatever currency the candidate happens to traffic in: the prompt engineer brings model-upgrade survival stories, the eval engineer brings regression-caught counts, the guardrail engineer brings an incident-prevented narrative. None of those is wrong. None of them is comparable.
The committee, faced with three packets in three different currencies, falls back on whichever currency the most senior committee member personally uses. If the senior person on the panel is a prompt engineer, the eval engineer's packet will read as "didn't ship a feature this half." If the senior person is a guardrail engineer, the prompt engineer's packet will read as "shipped fast, no incident discipline." The promotion outcomes track the panel composition more than the candidates, and the candidates correctly notice.
The fix is not to split "AI engineer" into three separate ladders. That over-corrects, calcifies the specializations, and makes cross-rotation impossible. The fix is to write the ladder so the rubric explicitly names the three currencies and asks the packet author to declare which one their evidence is in, then asks the committee to evaluate the evidence in its declared currency rather than the panel's default one. The committee's job becomes "is this evidence strong in the currency it claims to be in" rather than "does this evidence look like the kind of evidence I would have brought." That is a small editorial change with large calibration consequences.
The ladder change also forces leadership to make explicit the team's portfolio. If half the senior engineers are evaluating in prompt currency and the team has no senior eval owner, the ladder makes that visible at calibration time, not eighteen months later when the eval coverage gap becomes a quality incident.
The Org Takeaway
"AI engineering" is not one skill. The job ladder that pretends it is will produce engineers who are strong on one axis, whose blind spots compound at the seams between the three artifacts, and whose teams ship systems where every dashboard is green and the user is leaving. The hiring filter, the rotation, and the calibration rubric are three independent corrections, and a team that does any one of them will see results; a team that does all three will start noticing the seam bugs before the user does.
If you take one thing from this: the next time you write a job description for an "AI engineer," write three of them. Then look at the team you have and the team you wish you had, and notice which two of the three you have been quietly under-hiring for.
