The Two-Speed Organization: Why AI Teams and Product Teams Run on Incompatible Clocks

· 10 min read
Tian Pan
Software Engineer

Your ML team ran a promising experiment. The model beat the baseline by 8 points on your eval set. Stakeholders were excited. Then it took four months to ship — and by the time the feature launched, the product roadmap had moved on, the team that requested it had a different priority, and half the infra work got redone because the deployment target changed mid-flight. Sound familiar?

This is the clock-mismatch problem: AI teams and product teams run on fundamentally different time scales, and most organizations treat this as a coordination failure when it is actually an architectural one. You cannot fix a structural mismatch with a better standup cadence.

The Three Clocks That Never Agree

To understand why this keeps happening, it helps to name the three distinct time scales at play in any AI-powered product.

Model experimentation runs in weeks. A single experiment — define hypothesis, gather training data, run fine-tuning or prompt evaluation, establish eval baseline, run ablations, get approval — takes three to four weeks on a good day. Larger efforts (new model architectures, new task types, retrieval system redesigns) can stretch to three months or more. This is not sloppiness; it is the minimum viable cycle for doing ML work rigorously.

Product shipping runs in days. A feature team operating on two-week sprints can ship a visible UI change in under a week. Even teams with longer planning cycles can deploy a config change or copy update in hours. The tooling and expectations around product velocity have been optimized for this cadence for the last fifteen years.

Embedding and retrieval index updates run on monthly cycles. Rebuilding a production vector index for even a mid-size corpus takes hours. Validating that the rebuild did not break retrieval quality takes more hours. Doing this safely without causing a regression in a live product means coordinating a deployment window — which, in practice, means it happens once a month if you are disciplined, quarterly if you are not.
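In practice, the safe version of that rebuild is a blue/green swap: build the new index alongside the live one, check retrieval quality against a held-out query set, and only then repoint the alias. A minimal sketch of that gate, where build_index, recall_at_k, and swap_alias are hypothetical helpers standing in for whatever your vector store exposes:

```python
# Blue/green index rebuild with a validation gate.
# build_index, recall_at_k, and swap_alias are hypothetical helpers,
# not a real vector-store API.

RECALL_FLOOR = 0.92  # minimum acceptable recall@10 on the held-out query set

def rebuild_and_promote(corpus, holdout_queries, live_alias="prod-index"):
    # Build the candidate index next to the live one; nothing serves from it yet.
    new_index = build_index(corpus, name=f"{live_alias}-candidate")

    # Validate retrieval quality before any traffic touches the new index.
    recall = recall_at_k(new_index, holdout_queries, k=10)
    if recall < RECALL_FLOOR:
        raise RuntimeError(
            f"Rebuild blocked: recall@10={recall:.3f} below floor {RECALL_FLOOR}"
        )

    # Atomic cutover: repoint the alias; keep the old index for instant rollback.
    swap_alias(live_alias, new_index)
```

Keeping the old index around means rollback is a second alias swap, not another multi-hour rebuild — which is what makes the monthly deployment window survivable.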

Three teams. Three cadences. None of them aligned. This is not a coincidence; it reflects the underlying physics of each type of work.

How the Mismatch Produces the "Always in Beta" Failure Mode

The practical consequence of running three clocks in one product organization is a specific failure mode: AI features live in permanent beta.

Here is the typical timeline. The ML team finishes their experiment. The feature is promising enough to ship, but it needs a product wrapper — a UI, an API surface, an integration with the existing data pipeline. The product team is mid-sprint on something else. By the time product is ready to integrate, the ML team has run two more experiments that changed the recommended approach. Meanwhile, the retrieval index that the first experiment was designed against has been updated, so the eval numbers from the original experiment no longer reflect production behavior. The handoff takes another three weeks of re-alignment. By launch, the model is already slightly stale, the product team feels like they are shipping someone else's work, and the ML team is frustrated that the thing took six months.

The numbers that quantify this dysfunction: only 15–22% of AI models that complete development ever reach production. Of those that do, fewer than 40% sustain measurable business value beyond twelve months. These are not numbers about model quality. They are numbers about organizational misalignment.

Shadow Testing: The Pattern That Bridges the Two Speeds

Shadow testing — also called shadow deployment or champion-challenger testing — is the most effective structural solution to the AI-product clock mismatch, and it is still underused.

The core idea is simple: you run the experimental model in parallel with the production model, against live traffic, without exposing the experimental model's outputs to users. The production model returns real predictions; the experimental model processes the same inputs in the background, and all its outputs are logged. No user is affected. But you accumulate a real-world performance record for the challenger before it ever touches a real response.
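In application code, the mirroring can be a fire-and-forget call: serve the champion's answer, send the same inputs to the challenger in the background, and log both sides. A minimal asyncio sketch, where champion, challenger, and log_shadow_result are hypothetical stand-ins for your model clients and logging layer:

```python
import asyncio
import time

async def handle_request(inputs, champion, challenger, log_shadow_result):
    # The champion serves the user; its latency is the only one on the critical path.
    start = time.monotonic()
    response = await champion.predict(inputs)
    champion_ms = (time.monotonic() - start) * 1000

    # Mirror the same inputs to the challenger without awaiting it,
    # so a slow or failing challenger can never affect the user.
    asyncio.create_task(
        _shadow(inputs, response, champion_ms, challenger, log_shadow_result)
    )
    return response

async def _shadow(inputs, champion_response, champion_ms, challenger, log_shadow_result):
    start = time.monotonic()
    try:
        shadow_response = await challenger.predict(inputs)
        await log_shadow_result(
            inputs=inputs,
            champion=champion_response,
            challenger=shadow_response,
            champion_ms=champion_ms,
            challenger_ms=(time.monotonic() - start) * 1000,
        )
    except Exception as exc:
        # Challenger errors are data, not outages: record them and move on.
        await log_shadow_result(
            inputs=inputs,
            champion=champion_response,
            challenger=None,
            champion_ms=champion_ms,
            error=repr(exc),
        )
```

The one design rule that matters here: the challenger is never awaited on the request path, so its latency and failures cannot leak into the user experience.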

This matters for the clock problem in a specific way: shadow testing decouples the experimentation cadence from the shipping cadence. The ML team can run challengers against live traffic continuously, without requiring a product deployment. They accumulate evidence about production performance over days or weeks. When a challenger is ready to graduate, the case for shipping it is already built — the product team does not need to trust an eval on a frozen dataset; they can see how the model behaved on last week's actual queries.

The infrastructure requirement is modest: a load balancer or API gateway that can mirror traffic to a shadow endpoint, and an observability layer that can compare outputs across champion and challenger. The operational overhead is the processing cost of running two models, which is roughly 2x inference cost for the shadowed endpoints. For most organizations, this is worth paying to eliminate the six-week re-alignment cycle.

Key things to instrument in a shadow setup (a scoring sketch follows the list):

  • Latency difference between champion and challenger (regressions here block shipping even if accuracy improves)
  • Output divergence rate (how often do champion and challenger disagree, and on what kinds of inputs)
  • Error rates and unexpected output formats
  • Resource consumption under production traffic distributions (experiments often underestimate token count and tool call fan-out)
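Once those fields are in the shadow logs, the graduation report is mostly aggregation. A sketch that computes the first three of these from records shaped like the ones logged above (field names are assumptions carried over from the earlier sketch):

```python
from statistics import median

def shadow_report(records):
    """records: list of dicts as logged by the shadow path above.
    Field names (champion, challenger, *_ms) are assumptions."""
    ok = [r for r in records if r.get("challenger") is not None]
    errors = len(records) - len(ok)

    # Latency regression: compare medians, not means, to resist tail noise.
    champ_p50 = median(r["champion_ms"] for r in ok)
    chall_p50 = median(r["challenger_ms"] for r in ok)

    # Divergence rate: how often the two models disagree outright.
    disagreements = sum(1 for r in ok if r["champion"] != r["challenger"])

    return {
        "error_rate": errors / len(records),
        "latency_delta_ms": chall_p50 - champ_p50,
        "divergence_rate": disagreements / len(ok),
    }
```

A report like this is what lets a challenger "graduate" on evidence rather than on a frozen eval set.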

Async Batch Experiments: Moving Experimentation Out of the Critical Path

The second pattern that helps with the clock mismatch is moving model experiments out of the real-time serving path entirely and into an async batch processing workflow.

Most AI teams run experiments against production-representative datasets in a notebook or internal eval harness. This works, but it keeps the experimentation cycle tightly coupled to the data pipeline — you need a snapshot of production data, you need compute, and you need someone to monitor the run. More importantly, it keeps experiments on the product team's timeline: every experiment is a potential blocker for the next sprint.

Batch experimentation breaks this coupling. Instead of running a blocking evaluation, you submit a batch job that processes your evaluation set asynchronously, often at a 50%+ cost reduction compared to real-time processing and typically with higher rate limits. The ML team runs experiments on their own cadence without holding a product meeting. Results come back in hours. The product team picks up the results when they are ready, rather than waiting on a synchronous handoff.
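As one concrete instance, OpenAI-style batch endpoints accept a JSONL file of requests and return results within a completion window at roughly half the real-time price; other providers offer similar APIs. A sketch of submitting an eval set this way (the model choice, file names, and eval_set structure are illustrative):

```python
import json
from openai import OpenAI

client = OpenAI()

eval_set = [{"prompt": "Summarize: ..."}]  # stand-in for your evaluation examples

# Write the eval set as one JSONL request per line, per the Batch API format.
with open("eval_batch.jsonl", "w") as f:
    for i, example in enumerate(eval_set):
        f.write(json.dumps({
            "custom_id": f"eval-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",
                "messages": [{"role": "user", "content": example["prompt"]}],
            },
        }) + "\n")

# Upload and submit; the job runs asynchronously, off the serving path.
batch_file = client.files.create(file=open("eval_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll later; results arrive as an output file keyed by custom_id.
print(client.batches.retrieve(batch.id).status)
```

Because nothing in this flow touches the serving path or a deployment window, an ML team can submit an experiment on Monday and have production-representative results before the product team's next planning meeting.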
