
Why Your AI Roadmap Shouldn't Have a 12-Month Plan

9 min read
Tian Pan
Software Engineer

A team I worked with last quarter spent six weeks building a "smart document classifier" — fine-tuned model, eval harness, custom UI, the whole production pipeline. It shipped on a Tuesday. The following Monday, a new general-purpose model dropped that beat their fine-tune on the same eval, zero-shot, with no infrastructure investment. Their entire Q2 OKR became a wrapper around a one-line API call. The roadmap had committed twelve months earlier to "owning the classification stack." That commitment was wrong before the ink dried.

This is not an isolated story. Industry trackers logged 255 model releases from major labs in Q1 2026 alone, with roughly three meaningful frontier launches per week through March. Costs have collapsed: API pricing is down 97% since GPT-3, and the gap between top providers has narrowed to within statistical noise on most benchmarks. When the underlying substrate changes this fast, a twelve-month feature roadmap is not a plan — it is a list of bets you cannot revisit, made with information that will be stale before you ship the second item.

The temptation is to read this and conclude "stop planning." That is the wrong lesson. The right lesson is that the unit of planning has to change. A roadmap that promises features by date assumes the cost and capability of the building blocks are stable. They are not. What you actually have is a portfolio of capability bets, each with a validation clock and a kill condition. Treating that portfolio like a feature roadmap is a category error, and it shows up as the same three failure modes in team after team.

The Three Ways Long Roadmaps Fail in AI

The first failure is the moat-free feature. You commit to building something that, by the time you ship it, the platform vendor will offer for free as a default. PDF extraction, basic summarization, simple classification, transcription, embedding-based search — every one of these was a defensible product surface in 2023 and is now a checkbox on a model provider's pricing page. If your eighteen-month roadmap has features that depend on the current capability gap, you are racing a curve that is bending against you. The longer your roadmap, the more of it sits below the rising waterline.

The second failure is the wrong-substrate bet. You commit to a technical approach — fine-tuning, a particular agent framework, a specific embedding model, a custom RAG pipeline — and six months later a new model architecture or capability obsoletes the entire stack. Teams that built elaborate retrieval pipelines in 2024 watched long-context models compress half their work into a system prompt. Teams that built complex tool-orchestration layers watched native tool-use capabilities subsume their custom routing. The bet was not on the feature; it was on the substrate, and the substrate moved.

The third failure is the frozen hypothesis. You commit to solving a user problem in a specific way, but you also lock in your assumptions about what users will tolerate, what latency they will accept, what UX shape the feature should take. Then real usage data arrives at month four and contradicts every assumption. In a normal product, you would refactor the plan. Under an "approved twelve-month roadmap," you finish what you committed to, ship a thing nobody asked for, in a form nobody wanted, and call it execution. More than 60% of AI roadmaps written today are functionally obsolete within nine months — and the org's response is usually to ship them anyway, because the alternative requires admitting the plan was wrong.

What Replaces the Roadmap: A Portfolio of Capability Bets

If features-by-date is the wrong unit, what is the right one? The pattern that holds up is treating AI work as a portfolio of small, time-boxed capability bets, each one structured so you can tell quickly whether it is working and kill it cleanly if it is not.

A capability bet is not "build feature X by Q3." It is a hypothesis: we believe that capability Y, applied to user segment Z, produces measurable outcome W. It has three things a roadmap item does not: a falsifiable claim, a fixed time-box (usually 4–8 weeks), and an explicit kill condition that says when you stop. The portfolio metaphor matters because no single bet has to work — the system is designed for some to fail, and the failure of any one bet is a signal, not a setback. This is closer to how research labs run than how product teams traditionally plan, and that is the point: the work has more in common with applied research than with shipping a known-good feature.

Each bet should answer four questions before it starts. What capability are we testing? (Specific: "long-context reasoning over 200k-token contracts," not "AI for legal.") What outcome would prove it works? (Quantitative: "≥80% extraction accuracy on the held-out test set, with median latency under 8s," not "users like it.") What is the budget? (Both calendar time and dollars — say, six engineering weeks plus $20k of inference credit.) What kills it? (A pre-committed threshold: "if accuracy stays below 65% after three iteration cycles, we shut it down regardless of how much we have spent.") Without the kill condition, every bet drifts into a small cult of sunk cost.
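
To make those four questions harder to skip, some teams write each bet down as a small structured record before any code starts. Here is a minimal sketch of what that might look like, using the contract-extraction example above; the field names and the fallback move are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class CapabilityBet:
    """One time-boxed capability bet, written down before any work begins."""
    capability: str           # what we are testing, stated narrowly
    success_metric: str       # the quantitative outcome that would prove it works
    success_threshold: float  # e.g. 0.80 extraction accuracy on the held-out set
    budget_weeks: int         # calendar time-box
    budget_dollars: int       # inference / labeling spend ceiling
    kill_condition: str       # pre-committed condition under which we stop
    next_move_if_killed: str  # the planned branch, so a kill is not a failure

# The contract-extraction example from the text, written as a bet:
contract_bet = CapabilityBet(
    capability="long-context reasoning over 200k-token contracts",
    success_metric="extraction accuracy on held-out test set, median latency under 8s",
    success_threshold=0.80,
    budget_weeks=6,
    budget_dollars=20_000,
    kill_condition="accuracy still below 0.65 after three iteration cycles",
    next_move_if_killed="fall back to clause-level chunking plus retrieval",
)
```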

Why Kill Criteria Are Load-Bearing

The kill condition is the single most undersupplied piece of AI strategy work. Teams find it easy to write success criteria — they are fun to imagine. They find it nearly impossible to commit, in advance, to the conditions under which they will walk away. So bets that should have died at week six get extended to week sixteen, and the team learns nothing because they cannot tell the difference between "this needs more time" and "this approach is fundamentally wrong."

A useful kill criterion is specific, pre-committed, and uncomfortable. "If our eval suite does not show a 15-point improvement over the zero-shot baseline by end of week six, we abandon fine-tuning and switch to prompting + retrieval." That sentence does several things at once. It defines the comparison (zero-shot baseline, not arbitrary internal metric). It defines the magnitude (15 points, not "meaningful improvement"). It defines the date. And it defines the next move so the kill is not framed as failure but as a planned branch in the decision tree. If you cannot write that sentence at the start of a bet, you are not running a portfolio — you are running a wishlist.
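
Written as a check the team can run at every review, that sentence is only a few lines. The numbers below are the ones from the example, and the comparison is deliberately against the zero-shot baseline rather than an internal metric; this is a sketch of the mechanism, not a tool recommendation.

```python
def kill_or_continue(eval_score: float,
                     zero_shot_baseline: float,
                     weeks_elapsed: int,
                     deadline_weeks: int = 6,
                     required_gain: float = 15.0) -> str:
    """Apply the pre-committed kill criterion: a 15-point improvement
    over the zero-shot baseline by the end of week six."""
    gain = eval_score - zero_shot_baseline
    if weeks_elapsed < deadline_weeks:
        return f"continue: {gain:.1f}-point gain so far, decision at week {deadline_weeks}"
    if gain >= required_gain:
        return f"continue: {gain:.1f}-point gain clears the {required_gain}-point bar"
    # The planned branch, decided before the bet started, not after the miss.
    return "kill: abandon fine-tuning, switch to prompting + retrieval"

print(kill_or_continue(eval_score=71.0, zero_shot_baseline=68.0, weeks_elapsed=6))
# -> kill: abandon fine-tuning, switch to prompting + retrieval
```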

The reason this matters more for AI than for traditional software is that the failure mode of an AI bet is rarely a bug. It is a quiet underperformance: the system mostly works, the demos look fine, but the eval metric is two points lower than the baseline you should have been comparing against. Without kill criteria, that two-point gap turns into a six-month investment in a system that was, all along, slightly worse than the simpler alternative you ruled out in week one.

What to Plan for 12 Months Out

This is not an argument against thinking long-term. It is an argument against committing long-term to specific features and implementation choices. The things that survive a 12-month horizon are not features; they are durable assets that get more valuable as the model layer commoditizes underneath them.

Worth planning a year out:

- Proprietary data flywheels: the labeled examples, user corrections, and outcome signals only your product can collect — these compound regardless of which model you are using.
- Distribution and integration depth: where your product sits in the user's workflow, the partnerships and APIs you have wired in, the trust relationships with enterprise buyers.
- Eval infrastructure: your ability to know, faster than anyone else, whether a new model is actually better for your use case — when capability shifts every quarter, the team that can re-evaluate in 48 hours beats the team that takes six weeks.
- Trust and brand: compliance posture, security certifications, the slow-moving reputation that lets you sell into regulated industries.
- Domain expertise embedded in the team: the people who understand what the user actually needs are the bottleneck on every bet, not the inference cost.

Notice what is not on that list: specific model integrations, specific UI surfaces, specific feature names. Those should live in the rolling 90-day window, get re-evaluated each cycle, and be cheap to throw away. The twelve-month plan should describe the board you are building on, not the moves you will make on it.
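
Of the durable assets on that list, eval infrastructure is the easiest to sketch concretely. Something as small as the following, kept current with a held-out eval set drawn from your own domain, is what turns "is the new model better for us?" into a 48-hour question. The `ask` callables and the `grade` function stand in for whatever client calls and correctness checks your stack actually uses; they are assumptions, not any specific vendor API.

```python
from typing import Callable

def reevaluate_models(
    eval_set: list[tuple[str, str]],              # (input, expected) pairs from your domain
    candidates: dict[str, Callable[[str], str]],  # model name -> callable returning an answer
    grade: Callable[[str, str], bool],            # your own correctness check
) -> dict[str, float]:
    """Run the same held-out eval set against every candidate model and
    report accuracy per model, so a new release can be judged in hours."""
    scores: dict[str, float] = {}
    for name, ask in candidates.items():
        correct = sum(grade(ask(inp), expected) for inp, expected in eval_set)
        scores[name] = correct / len(eval_set)
    return scores
```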

Running the System in Practice

The mechanics are simpler than they sound. Replace the annual roadmap document with two artifacts: a long-horizon "capability thesis" document (what durable assets you are building, what you believe about where the platform layer is going, what kinds of bets you will and will not make) and a rolling 90-day portfolio sheet (the specific bets currently active, their kill criteria, their owners, their budgets, what we have learned from bets that ended). The capability thesis updates quarterly and changes slowly. The portfolio sheet updates weekly and is supposed to churn.
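
The portfolio sheet does not need tooling; a spreadsheet or a checked-in data file is enough, as long as every row carries its kill criterion and every closed row points at its learning write-up. A hypothetical sketch, with invented bet names, dates, and paths, just to show the columns:

```python
# Rolling 90-day portfolio sheet, reduced to its columns. Rows churn weekly;
# closed rows keep a pointer to their write-up so learning compounds.
portfolio = [
    {"bet": "long-context contract extraction", "owner": "ML platform",
     "kill_criterion": "<0.65 accuracy after 3 iterations", "ends": "2026-05-30",
     "status": "active", "learning": None},
    {"bet": "fine-tuned support classifier", "owner": "Support tooling",
     "kill_criterion": "<15-pt gain over zero-shot by week 6", "ends": "2026-04-18",
     "status": "killed", "learning": "docs/learnings/support-classifier.md"},
]

def weekly_review(portfolio: list[dict]) -> None:
    """The whole weekly check: every active bet restates its kill criterion and deadline."""
    for row in portfolio:
        if row["status"] == "active":
            print(f'{row["bet"]}: on track against "{row["kill_criterion"]}" by {row["ends"]}?')
```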

Review cadence matters. Bet kickoff is a 30-minute meeting where the four questions above get answered on a single page. Mid-bet check-ins are short and ruthless: are we on track to clear the kill threshold by the deadline, yes or no? Bet closeout — whether the bet succeeded, failed, or got killed — produces a one-page write-up of what you learned about the capability, the user, the substrate, or your eval. Those write-ups are the most valuable artifact your team produces, because they are the only thing that compounds across cycles when the technology underneath you is changing this fast.

The hardest part is cultural, not procedural. Executives and stakeholders trained on traditional roadmaps will keep asking "what is the AI roadmap for next year?" The answer — "we are running a portfolio of bets in this capability space, with these kill criteria, and we will tell you in 90 days what worked" — sounds, to them, like you don't have a plan. It is, in fact, the only honest plan available given how the underlying technology behaves. The teams that have made this shift are not less rigorous than the ones with twelve-month Gantt charts. They are more rigorous, because they have committed in writing to the conditions under which they will admit they were wrong. That commitment is the actual artifact of strategy. Everything else is decoration.
