34 posts tagged with "engineering-leadership"

The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'

· 9 min read
Tian Pan
Software Engineer

The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote strong hire: this is exactly how good engineers work in 2026. Your second-most-senior interviewer wrote no hire: handing the problem to a chatbot is not engineering. Same four words. Same forty-minute window. A forty-point gap on the same scorecard.

The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.

The Agent Rollout Cadence Your Customer Success Team Could Not Absorb

· 11 min read
Tian Pan
Software Engineer

The customer pasted the agent's answer into a support chat and asked the human rep to confirm it. The rep, looking at the same product, said the opposite. The customer did not lose trust in the agent that day. They lost trust in the company, because two parts of it told them two different things in the same hour.

Nothing was broken. The AI team had shipped a prompt change on Tuesday behind a feature flag, ramped it to 100% by Thursday, and moved on. The customer success team's enablement cycle is monthly — that is how every other product feature has always landed, and nobody re-negotiated the contract for AI. The macro in the CS rep's queue and the FAQ doc on the public site still described the previous behavior. The agent was correct. The rep was correct against the documentation they had. The company was incoherent.

The AI Feature Your CTO Funded That Your Security Team Will Not Let You Ship

· 11 min read
Tian Pan
Software Engineer

The post-mortem says "we found security too late." The actual finding is that security found you on time. Your process found security too late.

This is the AI feature that cleared the budget gate in January because the CTO and the CFO agreed the company needed an AI moment. It cleared a light legal review in March because it was a prototype. Engineering built against the agreed spec through Q2. In late July, the launch-readiness security review opened, and on day one the threat model came back with blockers on the auth scopes, the data-exfiltration paths, the model provider's residency story, and the prompt-injection surface. The team's quarter is now spent rebuilding to address findings that should have shaped the original spec. Two quarters of slip, an executive memo about "process improvements," and a quiet decision next planning cycle to "deprioritize AI deep-integrations."

The launch did not fail because security was slow. It failed because security entered after the shape of the feature had already been frozen.

The Legal Review Timeline Your AI Feature Roadmap Never Costed

· 10 min read
Tian Pan
Software Engineer

You sketched a six-quarter AI roadmap. The model swap, the new data source, the multilingual launch, and the prompt that now offers advice each got a single row on the Gantt chart, sized by engineering effort. Then the first launch slipped four weeks, and the post-mortem said the same thing three times in three different sections: "waiting on legal." The roadmap had assumed engineering capacity was the binding constraint. The actual binding constraint was a queue of legal reviews, each running its own three-to-six-week SLA, none of them aware of each other, and all of them landing on the same two product counsels.

The mistake was not in any of the individual reviews. Each one was warranted. The mistake was treating four parallel features as four parallel timelines while their legal dependencies serialized through the same upstream resource. By the second slip the org learns the shape of the problem. By the fourth it learns to plan against it. The teams that ship AI features on a predictable cadence have stopped treating legal throughput as an external surprise and started treating it as a planning input on the same footing as headcount and infra capacity.
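
The serialization effect can be made concrete with a rough sketch. Several things here are assumptions rather than details from the post: each counsel handles one review at a time, reviews go to whichever counsel frees up first, and the four durations are illustrative values picked from inside the three-to-six-week SLA range. The function name `makespan_weeks` is hypothetical.

```python
import heapq

def makespan_weeks(review_weeks, reviewers):
    """Weeks until the last review finishes, assuming each reviewer
    handles one review at a time and takes the next one when free."""
    free_at = [0] * reviewers          # week at which each reviewer is next free
    heapq.heapify(free_at)
    finish = 0
    for weeks in review_weeks:
        start = heapq.heappop(free_at)  # earliest available reviewer
        end = start + weeks
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

# Illustrative durations (weeks) for the four features; not real data.
reviews = [3, 6, 4, 5]

# What the roadmap implicitly assumed: every review runs in parallel.
roadmap_assumption = max(reviews)

# What happens when all four land on the same two product counsels.
with_two_counsels = makespan_weeks(reviews, reviewers=2)

print(roadmap_assumption, with_two_counsels)  # prints "6 11"
```

Under these assumptions the last feature clears legal in week 11 rather than week 6, and no individual review ran long. The gap comes entirely from the shared queue, which is why it belongs in the plan next to headcount.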

The AI Standup Where Yesterday's Status Is a Lie

· 9 min read
Tian Pan
Software Engineer

The team meets at 10am. The first engineer reports what their agents finished overnight. Except the eval suite that kicked off at 7am hasn't returned, the PR the agent opened at 3am is waiting on a review from another agent whose queue depth is unknown, and the long-running refactor agent is on hour eleven of an estimated four-hour run with no signal that it's stuck and no signal that it's healthy. Yesterday's status is not "done" and not "in progress." Yesterday's status is unknowable from inside the room.

The standup was a synchronous ritual built for synchronous human work. Each person did a thing, finished it, slept on it, and reported it the next morning. The unit of work was a workday. The unit of reporting was a person. The cadence matched the substrate. None of that holds anymore. The unit of work is now an agent run that started before you went to bed and may finish during the meeting or three hours after. The unit of reporting is a fleet, not a person. And the cadence — a 9- to 15-minute round-robin at 10am sharp — is a frequency the substrate doesn't produce events on.

The Perf Review Template That Cannot See AI Work

· 11 min read
Tian Pan
Software Engineer

Your strongest AI engineer spent the cycle curating an eval set, calibrating a judge prompt, and killing two features that turned out to be task-shape mismatched. None of that fits a single line on the review template. So the calibration meeting either inflates the artifacts the engineer cares least about — PR count, design docs, on-call shifts — or invents prose to justify a high rating the framework cannot defend. Either way, the rubric and the reality are pulling in different directions, and the engineer can tell.

The template was written for deterministic software. It rewards what you can count: lines of code shipped, services owned, incidents resolved, hours spent on-call. The AI roadmap is moved by a different shape of work: curating a representative eval slice, defending a behavioral envelope under model drift, refusing to ship a feature whose task shape doesn't fit the model, and patiently shrinking the gap between a judge prompt and human intent. Almost none of that produces the artifacts the rubric was built to count.

Inference Billing as a P&L Line Item Nobody Owns

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, four people each believe a fifth person owns the inference bill. Engineering treats it as a cloud line item. The AI team treats it as the price of building. Finance treats it as a variable margin input that someone in engineering must already be managing. Product treats it as overhead that engineering absorbs. The bill keeps growing, and the only thing everyone agrees on is that it isn't theirs.

This is not a budgeting problem. It is an ownership vacuum, and it surfaces the first time the line item gets large enough for a CFO to ask about it on a board call. By then, the answers people improvise — "we'll optimize," "we'll cache more," "we'll switch models" — describe interventions without naming an owner. The conversation that should have happened a year earlier was not about how to lower the bill. It was about whose P&L the bill belonged to in the first place.

The shift is structural. Inference moved from 15% of enterprise AI spend in 2024 to roughly 85% in 2026, and the average enterprise AI budget grew from $1.2M to around $7M over the same window. A line item that was once rounding error is now the kind of number a board notices, and the org chart written before that shift has no row for it.

The AI Literacy Gap Inside Your Own Team Is the Biggest Delivery Risk on Your Roadmap

· 10 min read
Tian Pan
Software Engineer

Your hiring page asks for AI experience. Your launch announcement names the AI features. Your roadmap commits to two more this quarter. And on the team that has to ship and maintain all of it, one engineer actually knows how to debug an eval failure, two can edit a prompt confidently, and twelve treat the LLM call as a black box they hand off whenever it misbehaves.

That distribution is the delivery risk nobody on your leadership team has named, because the team's stated AI capability — the thing that goes on the slide — is the maximum of any individual member's skill, and the team's actual delivery velocity is the median. The slide says one number; production runs on the other.
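
A small sketch of the max-versus-median point, using the team described above. The 1-to-3 skill scale is a hypothetical encoding chosen for illustration, not something the post defines.

```python
from statistics import median

# Hypothetical skill scale (an assumption for this sketch):
#   3 = can debug an eval failure
#   2 = can edit a prompt confidently
#   1 = treats the LLM call as a black box
team = [3] * 1 + [2] * 2 + [1] * 12

stated_capability = max(team)       # the number that goes on the slide
delivery_capability = median(team)  # the number production runs on

print(stated_capability, delivery_capability)  # prints "3 1"
```

On this encoding, training the one expert further moves neither number, while moving the twelve moves the median.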

When Two AI Features Compete for the Same Click

· 9 min read
Tian Pan
Software Engineer

A user lands on a search results page. Team A's smart summary fires in the top banner: "Here's the gist — skip the list." Team B's inline assistant pulses on the side: "Stay here, I'll keep reading with you." Both prompts compete for the same 800ms of attention, and the user — annoyed — closes the tab. The next morning, Team A reports a 6% lift in summary clicks; Team B reports a 4% lift in assistant opens; nobody in the room is wrong, and the product is worse than it was a quarter ago.

This is the failure mode that the standard playbook of independent feature teams and per-feature A/B tests cannot see. Each team locally optimized against its own metric. The user — who only has one attention budget, one mental model, and one click to give — paid the bill for the integration both teams declined to do.

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

· 11 min read
Tian Pan
Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code

· 9 min read
Tian Pan
Software Engineer

The PR-review dashboard has shown green for six weeks. Bot catch rate, comment volume, developer "thumbs up" reactions — all steady. Then a security incident lands in production and the post-mortem points at a missing null-check the bot used to catch and quietly stopped catching about two months ago. Nobody changed the bot. Nobody downgraded the model. The dashboard never moved. The standard moved.

This is the failure mode of automated code review that doesn't show up in any product demo. Teams adopt an LLM reviewer for the consistency win — every PR gets the same checklist, no senior engineer's bad-day variance, fast turnaround for junior contributors — and the consistency is real for about a quarter. Then the system prompt evolves, the model bumps, the few-shot library accumulates, and the bot is reviewing a different codebase against a different rubric using a different model than the one the team validated against. The team's mental model of "what the bot catches" decays into "what the bot caught last week."