
Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.

This middle is where the real engineering work lives. Cohort-aware fine-tuning treats fine-tuning as a segmentation decision dressed as a model decision. You're not really asking "what data should I train on" — you're asking "for whom does this model exist, and how many distinct fors are there." The team that doesn't ask the second question first ends up training against an aggregate distribution that doesn't represent any actual customer.

The Missing Middle Between Aggregate and Per-User

The reason the extremes dominate is that each one has a clean failure mode that's easy to defend in a design doc. The single fine-tune fails the same way for everyone, which feels fair. The per-tenant fine-tune isolates blast radius perfectly, which feels safe. What both miss is that production traffic almost never has a flat distribution. There are usually three to eight clusters that matter, and the gap between the best and worst of them is large enough that averaging over the whole base loses most of the signal.

The classic symptom is an eval score that improves while CSAT or retention numbers stay flat or get worse. This is what a regressed cohort looks like from the outside. The model is winning on the bulk of traffic, which is exactly where the eval is sampled most heavily, and it's losing on a tail that the eval doesn't represent in proportion to its revenue weight. Aggregate metrics under-represent severe failures on rare subpopulations, and the rare subpopulations in your business are usually the high-revenue ones.
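To make the arithmetic concrete, here is a minimal sketch with numbers I made up for illustration. The cohort names, the shares, the scores, and the `CohortStats` and `weighted_delta` names are all hypothetical. It shows one way the same pair of per-cohort score changes can read as a win when weighted by eval share and as a loss when weighted by revenue share.

```python
from dataclasses import dataclass


@dataclass
class CohortStats:
    eval_share: float     # fraction of eval examples drawn from this cohort
    revenue_share: float  # fraction of revenue attributed to this cohort
    base_score: float     # eval score of the base model on this cohort
    tuned_score: float    # eval score of the fine-tune on this cohort


# Hypothetical numbers, invented for illustration only.
cohorts = {
    "small_business": CohortStats(0.80, 0.30, 70.0, 77.0),
    "enterprise_legal": CohortStats(0.20, 0.70, 80.0, 72.0),
}


def weighted_delta(cohorts, weight_attr):
    """Score change (tuned minus base), averaged with the chosen weights."""
    return sum(
        getattr(c, weight_attr) * (c.tuned_score - c.base_score)
        for c in cohorts.values()
    )


eval_delta = weighted_delta(cohorts, "eval_share")
revenue_delta = weighted_delta(cohorts, "revenue_share")

print(f"eval-weighted change:    {eval_delta:+.1f}")     # +4.0
print(f"revenue-weighted change: {revenue_delta:+.1f}")  # -3.5
```

Nothing about the model changes between the two lines of output. Only the weights do, which is why a flat eval number can look healthy while the accounts that pay the bills get a worse product.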

Per-customer fine-tuning solves this by definition, but the operational cost is brutal. Every customer you onboard becomes a training pipeline you owe forever. Every base-model upgrade becomes N retraining jobs. Every regression becomes N investigations. Cohort-aware fine-tuning gets you most of the segmentation benefit at a cost that scales with the number of cohorts, a small fixed number, rather than with the number of customers.

Finding the Right Number of Cohorts

The first decision is how many cohorts you actually have. Too few and you're back to the aggregate problem with extra steps. Too many and you're approximating per-user fine-tuning while pretending you aren't. The useful answer for most products is between three and eight.

Practical signals that define cohort boundaries:

  • Industry vertical — legal, healthcare, finance, retail. Vocabulary, citation expectations, and risk tolerance differ enough that a single model averages them poorly.
  • Account tier — free, pro, enterprise. The query distribution at each tier is genuinely different, not just bigger.
  • Language locale — even within "the model speaks English," British legal English and US technical English ask different things of the same prompt.
  • Task type — summarization vs. extraction vs. open-ended generation. These often dwarf customer identity as a behavior driver.
  • Interaction pattern — single-shot API users vs. agent workflows vs. interactive chat. Tool-use density and turn count change what good completions look like.

The right cohort axis isn't always the one product marketing uses. The marketing segments are often the cohorts that distinguish how customers buy, not how they use the product. The cohorts that matter for fine-tuning are the ones that distinguish how customers query the model. A cohort identification pass on production traffic, clustering on prompt embeddings and response distributions rather than account metadata, almost always shifts at least one boundary.
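A toy version of that identification pass might look like the sketch below. Everything in it is an assumption for illustration: the two-dimensional "embeddings" are hand-written stand-ins for vectors you would get from a real embedding model, the tier labels are invented, and the plain k-means loop (seeded with the first `k` points so the run is deterministic) is just one simple clustering choice. It also clusters on prompts only and leaves response distributions out. The point is the cross-tabulation at the end, which compares the cluster a prompt lands in with the tier its account belongs to.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Hypothetical traffic sample: (account tier, toy prompt embedding).
traffic = [
    ("enterprise", (0.90, 0.10)),
    ("free", (0.20, 0.80)),
    ("enterprise", (0.80, 0.20)),
    ("pro", (0.85, 0.15)),  # a pro account that queries like enterprise
    ("pro", (0.10, 0.90)),  # a pro account that queries like free
    ("free", (0.15, 0.95)),
]

clusters = kmeans([embedding for _, embedding in traffic], k=2)

# Cross-tabulate metadata tier against behavioral cluster.
crosstab = Counter((tier, cluster) for (tier, _), cluster in zip(traffic, clusters))
for (tier, cluster), count in sorted(crosstab.items()):
    print(f"{tier:<10} cluster {cluster}: {count}")
```

In this toy sample the "pro" tier splits across both clusters, which is the kind of signal that suggests the account-metadata boundary is not the one the model actually experiences.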

You will probably end up with a small number of overlapping axes (vertical × tier, for example), and you'll have to decide whether to cross them (say, four verticals × two tiers for eight cohorts) or pick the dominant axis (four). Pick the dominant axis unless you have a clear reason not to. Cohort count compounds operational cost faster than it compounds quality.
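As a rough sketch of how that cost compounds, the snippet below counts cohorts under each choice and multiplies by base-model upgrades, on the assumption that each upgrade means one retraining job per cohort. The two-way tier split, the upgrade cadence, and the `retraining_jobs` helper are all hypothetical.

```python
from itertools import product

# Hypothetical axes: four verticals and a two-way tier split.
verticals = ["legal", "healthcare", "finance", "retail"]
tiers = ["self_serve", "enterprise"]

crossed_cohorts = list(product(verticals, tiers))   # vertical × tier
dominant_axis_cohorts = [(v,) for v in verticals]   # vertical only


def retraining_jobs(cohorts, base_model_upgrades):
    """Assume one retraining job per cohort for every base-model upgrade."""
    return len(cohorts) * base_model_upgrades


upgrades_per_year = 3  # assumed cadence, not a measured figure

for name, cohorts in [("crossed", crossed_cohorts), ("dominant", dominant_axis_cohorts)]:
    jobs = retraining_jobs(cohorts, upgrades_per_year)
    print(f"{name}: {len(cohorts)} cohorts, {jobs} retraining jobs per year")
# crossed: 8 cohorts, 24 retraining jobs per year
# dominant: 4 cohorts, 12 retraining jobs per year
```

The same multiplier applies to eval slices and regression investigations, so doubling the cohort count doubles every recurring cost whether or not quality moves.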

Slicing the Eval Before Training Anything

The single most important discipline is to slice the eval before you start training cohort-aware models, not after. If your current eval is a flat number and you can't reproduce its score broken down by cohort, you cannot tell whether a cohort-aware fine-tune is actually winning. You can only tell whether the average moved.

A per-cohort eval slice means three things in practice. First, every example in the eval set is tagged with its cohort, and the eval reports a score per cohort plus the aggregate. Second, the eval set's cohort mix is tracked as a separate property — if 70 percent of your eval is one cohort, the aggregate is mostly that cohort's score. Third, regressions on any single cohort block a release independently of the aggregate. A model that gains two points overall but loses three on the enterprise cohort doesn't ship.
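Those three properties can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `slice_eval` and `release_blockers` are hypothetical names, each eval example is reduced to a `(cohort, score)` pair, and the zero-tolerance gate is one possible policy.

```python
from collections import defaultdict


def slice_eval(results):
    """results: list of (cohort, score) pairs, one per tagged eval example."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cohort, score in results:
        totals[cohort] += score
        counts[cohort] += 1
    n = sum(counts.values())
    per_cohort = {c: totals[c] / counts[c] for c in counts}  # score per cohort
    mix = {c: counts[c] / n for c in counts}                 # cohort mix of the eval set
    aggregate = sum(totals.values()) / n                     # the usual flat number
    return per_cohort, mix, aggregate


def release_blockers(baseline, candidate, tolerance=0.0):
    """Cohorts where the candidate scores below baseline minus tolerance.

    A cohort missing from the candidate's results is treated as a blocker.
    """
    return sorted(
        c for c in baseline
        if candidate.get(c, float("-inf")) < baseline[c] - tolerance
    )


# Hypothetical eval runs: 1.0 = pass, 0.0 = fail; 70 percent of examples are one cohort.
baseline_run = [("smb", 1.0)] * 4 + [("smb", 0.0)] * 3 + [("enterprise", 1.0)] * 3
candidate_run = [("smb", 1.0)] * 7 + [("enterprise", 1.0)] * 2 + [("enterprise", 0.0)]

base_scores, base_mix, base_agg = slice_eval(baseline_run)
cand_scores, cand_mix, cand_agg = slice_eval(candidate_run)

blockers = release_blockers(base_scores, cand_scores)
print(f"aggregate: {base_agg:.2f} -> {cand_agg:.2f}")  # aggregate: 0.70 -> 0.90
print(f"blocked by: {blockers}")                       # blocked by: ['enterprise']
```

The aggregate rises by twenty points here and the release is still blocked, because the gate looks at each cohort independently instead of at the average.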

Teams that skip this step almost always rediscover it the hard way. They ship a cohort-aware fine-tune, the aggregate goes up, a customer escalates, somebody slices the data after the escalation, and the cohort regression turns out to have been sitting in the offline eval results the whole time. There just wasn't a slice that would have surfaced it.
