11 posts tagged with "engineering-management"

Hiring for AI Roles That Have No Career Ladder Yet

· 9 min read
Tian Pan
Software Engineer

You open a requisition for an "eval engineer." A week later your recruiter asks the obvious question: what level is this, and what does a good resume look like? You don't have an answer. The title didn't exist two years ago. There is no leveling rubric, no canonical interview loop, no pool of people with the words "eval engineer" already on their LinkedIn. You are hiring for a job the industry has not agreed exists.

This is the quiet bottleneck in shipping AI systems. The model is available. The infrastructure is rentable. What you cannot buy off the shelf is the person whose actual job is keeping a prompt-driven system honest — and your hiring machinery, built for roles with decades of precedent, has no slot for them.

The instinct is to wait. Wait for the title to standardize, for the bootcamps to mint candidates, for someone else to write the leveling guide you can copy. That instinct is wrong. The work exists now whether or not the title does, and the teams staffing it now are the ones learning what "good" looks like before their competitors even open the req.

The Coding Interview That Agents Quietly Invalidated

· 10 min read
Tian Pan
Software Engineer

A two-hour take-home and a 45-minute algorithm round were never the point. They were proxies. The take-home stood in for "can this person ship a feature," and the whiteboard round stood in for "can this person decompose a problem under pressure." For two decades those proxies held up well enough that most teams stopped questioning them. They were cheap to administer, easy to grade, and roughly correlated with the thing you actually cared about.

Coding agents broke the correlation without breaking the format. The interview still runs. It still produces a score. The score still feels like signal. But the gap between what the interview measures and what the job requires has widened to the point where a green result certifies almost nothing — and most hiring pipelines have not noticed, because nothing visibly failed.

This is the quiet kind of invalidation. Not a process that collapsed, but a process that kept running after its assumptions stopped being true.

What a Coding Interview Measures When the Candidate Has an Agent

· 9 min read
Tian Pan
Software Engineer

The coding interview was built to isolate a single variable. Put a person in a room, give them a problem, take away their references, and watch whether they can turn the problem into working code by themselves. Everything about the format — the whiteboard, the blank editor, the prohibition on looking things up — exists to strip away collaborators and tools so you measure one isolated skill: can this person, alone, write correct code under pressure.

That skill is no longer the one the job exercises. Day-to-day engineering in 2026 is a collaboration between an engineer and an agent. The engineer decides what to build, the agent drafts most of the code, and the engineer's real work is reviewing, correcting, and deciding when the agent is confidently wrong. The interview measures solo code production. The job rewards directing a tireless, fast, occasionally hallucinating collaborator. The proxy and the target have come apart, and most hiring pipelines haven't noticed.

This is not a complaint about cheating, though cheating is the symptom everyone fixates on. It's a measurement problem. When you can no longer observe the variable your test was designed to isolate, the test stops producing signal — and a test that produces no signal while everyone still trusts it is worse than no test at all.

Onboarding an Agent Like a Junior Engineer Is a Category Error

· 9 min read
Tian Pan
Software Engineer

When an agent joins your team, the nearest analogy in every engineering manager's head is the new hire. So the playbook writes itself: give it a sandbox and read-only logs, scope the first tasks small, pair with it, expect a ramp-up period, and grow it into bigger work as trust accumulates. It feels responsible. It feels like the same patient management that turned your last junior into a senior.

It is also a category error — not a slightly imperfect analogy, but a wrong one. A junior engineer is a person who does not yet know your system. An agent is a stateless function that will never know your system, no matter how many times it touches it. Those are different kinds of things, and the management instincts that work for one quietly misallocate your attention on the other.

The reason this matters is that the metaphor doesn't just mislead — it tells you to invest in the wrong place. "Grow the agent" is not a strategy. The agent is fixed. Everything you can actually change lives outside of it.

The 14-Month Half-Life of Your Prompt Expert

· 9 min read
Tian Pan
Software Engineer

Every company shipping AI features in production has one or two engineers it cannot afford to lose, and most of them do not know who those engineers are until the resignation email arrives.

The person in question is rarely the loudest in the room. They are the one who remembers that the customer-support summarizer's tone got fixed by a three-line system-prompt edit after the Q2 escalation, that the eval suite added six cases the week the model provider quietly changed its default sampling, and that the judge calibration drifted the last time someone "cleaned up" the rubric. None of this is written down in a place a successor would find. It lives in one head, and roughly every two weeks a recruiter messages that head with a 25% raise attached.

Asymmetric Eval Economics: Why One Eval Case Costs More Than the Feature It Tests

· 9 min read
Tian Pan
Software Engineer

Here is the awkward truth most AI teams discover six months too late: a single well-designed eval case routinely costs more engineering effort than the feature it is supposed to test. A prompt edit takes an afternoon. The eval case that gives you confidence the prompt edit didn't break something takes a domain expert two days of labeling, a calibration loop with a judge prompt, and a discussion about what "correct" even means for this user surface. The feature ships in a sprint. The eval that lets you ship the next ten features safely takes a quarter to mature.

The asymmetry isn't a bug. It is the structural shape of evaluation work. Labeling, edge-case curation, judge calibration, and rubric design are upfront fixed costs that don't scale with how many features you ship — they scale with how many distinct behaviors you want to verify. Meanwhile the feature side keeps producing what feels like cheap marginal output: "another prompt iteration," "one more tool added to the agent," "swap the model." Each looks individually small. Each silently increases the surface area the eval set must cover.
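As a rough illustration of that asymmetry, the sketch below totals build effort per change against eval effort per distinct behavior. The half-day build cost stands in for the "afternoon" above and the two days per behavior for the labeling estimate; the `Change` type, the function names, and the behavior counts are hypothetical, and judge calibration and rubric design are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One shipped change and the distinct behaviors it newly exposes."""
    name: str
    build_days: float    # effort to ship the change itself
    new_behaviors: int   # hypothetical count of behaviors the eval set must now cover


def total_build_days(changes):
    """Feature-side effort: grows with the number of changes shipped."""
    return sum(c.build_days for c in changes)


def total_eval_days(changes, days_per_behavior):
    """Eval-side effort: grows with distinct behaviors to verify, not with change count."""
    return sum(c.new_behaviors for c in changes) * days_per_behavior


# Assumed numbers for illustration only, not measurements.
changes = [
    Change("another prompt iteration", 0.5, 1),
    Change("one more tool added to the agent", 0.5, 3),
    Change("swap the model", 0.5, 4),
]

print(total_build_days(changes))       # 1.5
print(total_eval_days(changes, 2.0))   # 16.0
```

Under these assumed numbers, three small-looking changes cost a day and a half to build and imply sixteen days of eval work, which is the widening surface area the paragraph describes.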

The Three Tastes of an AI Engineer: Why Prompts, Evals, and Guardrails Don't Live in the Same Head

· 11 min read
Tian Pan
Software Engineer

The three best AI engineers I have hired this year would all fail each other's interviews. The one who writes prompts that survive a model upgrade has never written a useful eval case in her life. The one who designs eval sets that catch the failures that matter writes prompts that other engineers refuse to extend. The one who designs guardrails that fail closed without choking the happy path has opinions about the other two that I cannot print here.

The job ladder calls all three of them "AI engineer." The calibration committee compares their promo packets as if they had been doing the same job. They have not.

The AI Interview Collapse: Engineering Hiring Has Lost Its Signal

· 11 min read
Tian Pan
Software Engineer

The signal is gone. In a recent audit of 19,368 technical interviews, 38.5% of candidates were flagged for AI-assisted cheating, with technical roles hitting 48% and junior candidates cheating at nearly double the rate of senior ones. More damning: 61% of detected cheaters scored above the passing threshold. Without the detection layer, they would have advanced. The interview, as an instrument, is no longer measuring what it was designed to measure.

This is not a moral panic about kids these days. It is a mechanical failure of the instrument. The technical interview was calibrated for a world in which a candidate, under time pressure, in an unfamiliar environment, had to produce correct code from memory and first principles. That constraint — the thing that made the signal legible — has been dissolved by a free-tier chat window running on a second device. Every company that still runs a LeetCode-style screen is now paying to sort candidates on a test the test-taker can trivially outsource.

The AI Engineering Career Ladder: Why Your SWE Leveling Framework Is Lying to You

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized startup recently got a mediocre performance review. Their velocity was inconsistent — some weeks they shipped a ton of code, others almost nothing. Their manager, trained on traditional SWE frameworks, marked them down for output variability. Six weeks later, that engineer left for a competing team. What the manager didn't understand: the engineer's "slow" weeks were spent building evaluation infrastructure that prevented three categories of silent failures. Without it, the product would have been subtly broken in ways nobody would have noticed for months.

This pattern is playing out across engineering orgs right now. Teams that built their career ladders for deterministic software systems are applying those same frameworks to AI engineers — and systematically misidentifying their best people.

The Metrics Translation Problem: Why Technically Successful AI Projects Lose Funding

· 10 min read
Tian Pan
Software Engineer

Your model achieved 91% accuracy on the held-out test set. Latency is under 200ms at p95. You've cut the error rate by 40% compared to the previous rule-based system. By every technical measure, the project is a success. Six months later, leadership cancels it.

This is not a hypothetical. Eighty percent of AI projects fail to deliver intended business value, and the majority of those failures are not caused by model performance. They are caused by the gap between what engineers measure and what decision-makers understand. The technical team speaks a language that executives cannot evaluate — and in the absence of comprehensible signal, leadership defaults to skepticism.

The metrics translation problem is not a communication soft skill. It is an engineering discipline that most teams treat as optional until the funding review.

The AI Skills Inversion: When Junior Engineers Outperform Seniors on the Wrong Metrics

· 8 min read
Tian Pan
Software Engineer

A junior engineer on your team just shipped three features in a week. Your senior engineer shipped half of one. The dashboards say the junior is 6x more productive. The dashboards are lying.

This is the AI skills inversion — a measurement illusion where AI coding assistants make junior engineers look dramatically more productive on surface metrics while masking a deeper problem. The features ship faster, but the architecture degrades. The PRs multiply, but the system coherence erodes. And organizations that trust their dashboards over their judgment are promoting the wrong behaviors and losing the wrong people.