Story Points Don't Survive First Contact With an LLM

8 min read
Tian Pan
Software Engineer

Here is a failure mode that happens quietly, at every company with a mature Agile practice that decides to add an LLM feature: the team estimates the work in story points, assigns it to a two-week sprint, and then spends three sprints in a row reporting "70% done" while the engineering manager stares at a burndown chart that refuses to burn down. Nobody lied. The feature is genuinely hard to finish — because the conditions that make story points a useful planning tool don't exist for AI features, and nobody noticed until they were already committed.

The problem is not that engineers are bad at estimating. The problem is that story points encode assumptions about the nature of software work — assumptions that LLM features violate structurally, not accidentally.

The Three Assumptions Story Points Rely On

Story points work when three things are roughly true:

  1. Tasks are deterministic. You can imagine the finished state before you start. The implementation path has uncertainty, but the definition of done does not.
  2. Velocity is stable. Past throughput predicts future throughput, so a team with a 40-point average can commit to 40 points of new work per sprint.
  3. Requirements freeze at done. Once a feature ships, it stays done. You don't re-evaluate it next quarter against a new acceptance threshold.

A standard CRUD feature satisfies all three. An LLM feature satisfies none of them.

Non-Determinism Breaks the Definition of Done

A document classifier built on a rule engine either classifies correctly or it doesn't — you can write a binary acceptance test. A document classifier built on an LLM might have 87% accuracy on your test set, 91% on Tuesdays, 84% after a model provider updates their base model, and completely different error distributions across document length buckets. None of these states is clearly "done" or "not done."
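
To make the contrast concrete, here is a minimal sketch of what checking the LLM version actually involves. Everything in it is illustrative: the `classify()` stub, the JSONL field names, the bucket boundaries. The point is that the output is a sliced report whose numbers move between runs, not a single pass/fail.

```python
# Illustrative only: a rule-engine classifier gets a binary assert; an LLM
# classifier gets an accuracy report sliced by document length. The dataset
# path, field names, and classify() stub are hypothetical.
import json
from collections import defaultdict

def classify(text: str) -> str:
    """LLM-backed classifier call. Stubbed out here."""
    raise NotImplementedError("wire this to your actual model call")

def length_bucket(text: str) -> str:
    n = len(text.split())
    return "short" if n < 200 else "medium" if n < 1000 else "long"

with open("data/test_set.jsonl") as f:
    records = [json.loads(line) for line in f]

correct: dict[str, int] = defaultdict(int)
total: dict[str, int] = defaultdict(int)
for r in records:
    bucket = length_bucket(r["text"])
    total[bucket] += 1
    correct[bucket] += int(classify(r["text"]) == r["label"])

for bucket in ("short", "medium", "long"):
    if total[bucket]:
        print(f"{bucket:>6}: {correct[bucket] / total[bucket]:.1%} of {total[bucket]} docs")

# There is no single True/False here, which is why "done" needs an explicit
# threshold rather than a binary acceptance test.
```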

The consequence is that teams end up with three incompatible framings of the same sprint card. Engineering thinks "done" means the feature is deployed and not crashing. Product thinks "done" means users are satisfied with output quality. QA thinks "done" means the eval suite passes. All three are reasonable. None is the same thing.

The fix that works in practice is an explicit eval threshold as the acceptance criterion — "F1 ≥ 0.87 on the held-out validation set, with regression evals gating every deploy" — written into the ticket before estimation begins. This is not optional polish; it is the only way to make "done" mean something consistent. Until you have it, the ticket should not be estimated at all, because there is nothing stable to estimate.
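
As a sketch of what that gate can look like in practice (the file path, the `classify()` stub, and the exact threshold are all illustrative, not prescribed): a small script that computes F1 on the held-out set and exits non-zero when it falls below the criterion, so CI can block the deploy.

```python
# Minimal sketch of an eval gate, assuming a held-out set at
# data/validation.jsonl with {"text": ..., "label": ...} records and a
# classify() function wrapping the LLM call. All names are illustrative.
import json
import sys

from sklearn.metrics import f1_score

F1_THRESHOLD = 0.87  # the acceptance criterion written into the ticket

def classify(text: str) -> str:
    """LLM-backed classifier call. Stubbed out here."""
    raise NotImplementedError("wire this to your actual model call")

def main() -> None:
    with open("data/validation.jsonl") as f:
        records = [json.loads(line) for line in f]

    y_true = [r["label"] for r in records]
    y_pred = [classify(r["text"]) for r in records]

    f1 = f1_score(y_true, y_pred, average="macro")
    print(f"macro F1 on held-out set: {f1:.3f} (threshold {F1_THRESHOLD})")

    # Non-zero exit fails the CI job, so a regression blocks the deploy.
    sys.exit(0 if f1 >= F1_THRESHOLD else 1)

if __name__ == "__main__":
    main()
```

Run as a required step in the deploy pipeline, this turns the ticket's acceptance criterion into the same check that guards every subsequent release.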

Research Phases Have No Velocity

Sprint velocity is a throughput metric. It works when work is primarily execution — applying known techniques to ship something you could have fully specified in advance. ML and LLM features include a research phase where the path is not known: choosing an approach, running experiments, evaluating which prompt structure or retrieval strategy or fine-tuning technique actually improves the eval, and discarding the ones that don't.

You cannot estimate research throughput with the same unit you use for execution throughput, because the number of experiments you will need to find a working approach is unknown before you start. It is not unknown the way "this ticket might be a 3 or a 5" is unknown. It is unknown the way "I don't know how many hypotheses I'll need to test before one is supported by the data" is unknown. These are categorically different kinds of uncertainty.
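
A rough sketch of why, with a hypothetical `Candidate` type and `evaluate()` stub: research work is a loop over candidate approaches scored against the same held-out eval, and the list of candidates grows as results come in, so the loop has no iteration count you could have sized up front.

```python
# Sketch of a research-phase iteration. evaluate() and the candidate
# definitions are hypothetical; the point is that the loop has no fixed
# iteration count you could have estimated before starting.
from dataclasses import dataclass

TARGET_RECALL = 0.80

@dataclass
class Candidate:
    name: str
    prompt_template: str
    top_k: int  # retrieval depth, one example of a knob to vary

def evaluate(candidate: Candidate) -> float:
    """Run the candidate against the held-out eval set and return recall.
    Stubbed out; in practice this is the slow, expensive part."""
    raise NotImplementedError

candidates = [
    Candidate("baseline", "Answer using the context:\n{context}\n\nQ: {question}", top_k=5),
    Candidate("chunk-rerank", "...", top_k=20),
    # New candidates get added here after seeing earlier results, which is
    # exactly why this backlog cannot be sized in advance.
]

for candidate in candidates:
    recall = evaluate(candidate)
    print(f"{candidate.name}: recall={recall:.2f}")
    if recall >= TARGET_RECALL:
        print(f"'{candidate.name}' clears the bar; promote it to execution work.")
        break
else:
    print("No candidate cleared the bar; the research phase continues.")
```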

A team that spent three sprints unable to move a recall metric past 0.71 was not slow. They were doing research. Calling it "40% of the sprint" and sticking it on a burndown chart does not make it sprint work.

Requirements Drift When the Model Does

Standard Agile assumes requirements are set by product managers, approved by stakeholders, and then frozen until the feature ships. For AI features, requirements also change when the underlying model changes — which happens on a schedule you don't control.

A team that tuned their summarization prompt against GPT-4 Turbo will find that the same prompt behaves differently after a provider silently updates the underlying weights. A retrieval system evaluated against one embedding model will degrade if the embedding model is retrained. A fraud detection classifier will silently drift as the distribution of transactions shifts with new products, new markets, or regulatory changes — none of which appear in any sprint backlog.
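
One way to at least notice this, sketched here with illustrative names and paths rather than any particular tool: a scheduled job that re-runs the same eval suite against the production configuration and flags a drop below the last accepted baseline, turning silent drift into a planned ticket.

```python
# Sketch of a scheduled regression check, assuming the same eval suite used
# to gate deploys and a stored baseline score. Paths, margins, and the
# notify_team() hook are all hypothetical.
import json
from datetime import datetime, timezone

BASELINE_PATH = "evals/baseline.json"
ALERT_MARGIN = 0.02  # how far below the accepted baseline before alerting

def run_eval_suite() -> float:
    """Re-run the full eval suite against the production model configuration.
    Stubbed out; this is the same code path that gates deploys."""
    raise NotImplementedError

def notify_team(message: str) -> None:
    """Hypothetical alert hook (Slack, PagerDuty, etc.)."""
    print(f"ALERT: {message}")

def check_for_drift() -> None:
    with open(BASELINE_PATH) as f:
        baseline = json.load(f)["score"]

    current = run_eval_suite()
    timestamp = datetime.now(timezone.utc).isoformat()
    print(f"[{timestamp}] eval score: {current:.3f} (baseline {baseline:.3f})")

    if current < baseline - ALERT_MARGIN:
        # The prompt, the code, and the tests are unchanged; the model or the
        # traffic underneath them moved. This is planned work, not an incident.
        notify_team(f"Eval regressed to {current:.3f}; re-open the tuning ticket.")

if __name__ == "__main__":
    check_for_drift()  # run from cron or a scheduled CI job, e.g. nightly
```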

These are not edge cases. They are the normal operating condition of production AI systems. Treating LLM features as "done once shipped" is a planning error, not just a maintenance oversight.
