Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

· 13 min read
Tian Pan
Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.

What Model Merging Actually Does

The central idea is this: after fine-tuning, a model's weights encode the adaptation as a perturbation on top of the base model. If two fine-tuned models start from the same base and their perturbations don't directly contradict each other, you can average those perturbations and get a model that carries both adaptations.

This is distinct from ensembling. An output-space ensemble runs multiple models at inference time and combines their predictions — it costs proportionally more compute and memory. Model merging produces a single set of weights. At inference time it's identical in cost to running one model. The performance profile is also different: ensembles reduce variance across individual model errors; merging tries to accumulate the strengths of different fine-tuning trajectories into one weight space.

The constraint that matters: models being merged must share the same architecture. You can't merge a 7B model with a 13B model by simple weight arithmetic. Architecture must match layer-for-layer.

The Five Core Methods

Weight Averaging (Model Soup)

The simplest form is a linear average: θ_merged = Σ(wᵢ · θᵢ) / Σwᵢ. The "model soup" paper (Wortsman et al., 2022) showed that averaging the weights of multiple fine-tuned variants of the same model consistently outperforms any individual constituent, and approaches the performance of full output-space ensembles at zero inference cost.

Two variants matter in practice:

  • Naive soup: average everything regardless of quality.
  • Greedy soup: rank models by validation performance, add each one only if it improves the running average. This consistently outperforms naive averaging.

The limitation is that naive linear interpolation ignores interference between models. When two fine-tuned models have modified the same weight in opposite directions, averaging them does something meaningless — it drives that weight toward zero, losing the signal from both.
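Both soup variants reduce to a few lines of code. The sketch below is a deliberate simplification: each "model" is a dict of flat Python lists rather than real tensors, and `score_fn` is a stand-in for whatever validation metric you use.

```python
def average_soup(state_dicts, weights=None):
    """Naive soup: weighted average of matching weights across models."""
    if weights is None:
        weights = [1.0] * len(state_dicts)
    total = sum(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = [
            sum(w * sd[key][i] for w, sd in zip(weights, state_dicts)) / total
            for i in range(len(state_dicts[0][key]))
        ]
    return merged

def greedy_soup(state_dicts, score_fn):
    """Greedy soup (Wortsman et al.): rank by validation score, keep a
    candidate only if adding it does not hurt the running average."""
    ranked = sorted(state_dicts, key=score_fn, reverse=True)
    soup = [ranked[0]]
    best = score_fn(average_soup(soup))
    for candidate in ranked[1:]:
        trial = average_soup(soup + [candidate])
        trial_score = score_fn(trial)
        if trial_score >= best:
            soup.append(candidate)
            best = trial_score
    return average_soup(soup)
```

Real implementations do the same arithmetic per tensor on GPU; the control flow is identical.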

SLERP (Spherical Linear Interpolation)

Originally developed for quaternion-based animation in 1985, SLERP treats weight vectors as points on a high-dimensional sphere and interpolates along the geodesic rather than the chord. The formula:

θ = arccos(v₀ · v₁)    (with v₀, v₁ normalized to unit length)
result = sin((1-t)·θ)/sin(θ) · v₀ + sin(t·θ)/sin(θ) · v₁

The difference from linear interpolation is that SLERP preserves the norm of the weight vectors throughout the path, avoiding the magnitude collapse that can happen when two weight vectors point in partially opposite directions. In practice this means smoother interpolation and better preservation of both models' representational geometry.
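A minimal SLERP on flat Python lists, following the formula above. Like production implementations, it clamps the dot product for numerical safety and falls back to linear interpolation when the vectors are near-parallel (where sin(θ) ≈ 0 makes the formula unstable):

```python
import math

def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    dot = max(-1.0, min(1.0, dot))      # clamp for numerical safety
    theta = math.acos(dot)
    if abs(math.sin(theta)) < eps:      # near-parallel: plain lerp is fine
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

Interpolating two orthogonal unit vectors at t=0.5 returns a vector of norm 1.0, where linear interpolation would collapse the norm to about 0.71 — exactly the magnitude preservation described above.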

SLERP is limited to merging exactly two models at a time. For production use cases where you need to merge three or more specialists, you need to either chain pairwise merges (hierarchically) or move to a different method. It remains the most popular technique for pairwise merges, particularly for combining a general instruction-following model with a domain specialist.

Task Arithmetic (Task Vectors)

The task arithmetic paper (Ilharco et al., ICLR 2023) introduced a more principled framing. Define a task vector as the delta between the fine-tuned and base weights:

τᵢ = θ_finetuned_i - θ_base

Task vectors support arithmetic. Adding them combines capabilities. Negating one subtracts a capability. The multi-task merged model is:

θ_merged = θ_base + α · Σ τᵢ

The key insight is that movement in a task vector's direction through weight space consistently improves performance on that task. The vectors are structured, not random — and this structure makes arithmetic meaningful rather than coincidental.

The paper demonstrated cross-task transfer: applying a task vector for one task sometimes improves a related task for which no fine-tuning was done. This points to genuine semantic structure in how fine-tuning modifies weights, not just overfitting to task-specific patterns.

The failure mode is task interference: when two tasks modify the same weight regions in conflicting ways, summing their task vectors produces noise in exactly those regions.
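The mechanics above reduce to simple tensor arithmetic. A sketch using flat Python lists in place of real tensors (negating a task vector before summing would subtract that capability):

```python
def task_vector(finetuned, base):
    """tau_i = theta_finetuned_i - theta_base, computed per weight."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}

def apply_task_vectors(base, task_vectors, alpha=1.0):
    """theta_merged = theta_base + alpha * sum(tau_i)."""
    merged = {}
    for k in base:
        summed = [sum(tv[k][i] for tv in task_vectors)
                  for i in range(len(base[k]))]
        merged[k] = [b + alpha * s for b, s in zip(base[k], summed)]
    return merged
```

The scaling coefficient alpha is typically tuned on a validation set; values below 1.0 dampen the summed perturbation and often reduce interference.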

TIES-Merging (Trim, Elect Sign, Merge)

TIES (Yadav et al., NeurIPS 2023) directly addresses the interference problem with a three-step process:

  1. Trim: For each task vector, zero out all but the top-k% parameters by magnitude. Most task vector components are redundant — empirically, keeping only 20% of parameters by magnitude maintains nearly full performance. Eliminating low-magnitude parameters reduces noise in the merged result.

  2. Elect Sign: For each weight position, look at the signs across all task vectors. Choose the sign held by the majority (weighted by magnitude). Parameters with minority signs are zeroed out.

  3. Disjoint Merge: Average only the surviving (non-zero) parameters with aligned signs.

This is more expensive to compute than simple averaging, but the resulting merged models are substantially more robust when combining three or more fine-tuned models. The trim step alone — applied before any other merge method — consistently improves results.
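The three steps translate directly to code. A minimal sketch over a single flat task-vector list per model (real implementations operate per tensor; `density` is the fraction of parameters kept by the trim step):

```python
def ties_merge(task_vectors, density=0.2):
    """TIES: trim, elect sign, disjoint merge over flat task vectors."""
    n = len(task_vectors[0])
    k = max(1, int(density * n))
    # 1. Trim: keep only the top-k entries of each vector by magnitude.
    trimmed = []
    for tv in task_vectors:
        keep = set(sorted(range(n), key=lambda i: abs(tv[i]), reverse=True)[:k])
        trimmed.append([v if i in keep else 0.0 for i, v in enumerate(tv)])
    merged = []
    for i in range(n):
        # 2. Elect sign: sign of the magnitude-weighted sum at this position.
        elected = 1.0 if sum(tv[i] for tv in trimmed) >= 0 else -1.0
        # 3. Disjoint merge: average only survivors matching the elected sign.
        survivors = [tv[i] for tv in trimmed
                     if tv[i] != 0.0 and tv[i] * elected > 0]
        merged.append(sum(survivors) / len(survivors) if survivors else 0.0)
    return merged
```

Note how a position where two vectors directly conflict keeps only the majority-sign contributions instead of averaging them toward zero.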

DARE (Drop And REscale)

DARE (Yu et al., 2024) takes a different approach to interference reduction: randomly drop task vector parameters with probability p, then rescale the survivors by 1/(1-p) to maintain the expected value. The drop removes redundant parameters stochastically; the rescaling compensates for the reduction.

The empirical finding is striking: performance is preserved even when dropping 90–99% of task vector updates. Most fine-tuning parameters are redundant. DARE can be combined with TIES (as dare_ties) or used with linear merging (dare_linear). It's particularly useful when merging aggressively different specializations, where interference would otherwise be severe.
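The drop-and-rescale step is only a few lines. A sketch on a flat task-vector list, seeded for reproducibility:

```python
import random

def dare(task_vector, p=0.9, seed=0):
    """DARE: drop each entry with probability p, rescale survivors by
    1/(1-p) so the expected value of the vector is unchanged."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in task_vector]
```

At p=0.9, roughly 90% of entries become zero and each survivor is scaled by 10x, so the mean of the vector stays approximately where it was.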

When Merging Beats Ensembles

The performance comparison depends on what you're optimizing:

  • Inference cost: Merging wins decisively. One model, one forward pass, one serving deployment.
  • Memory: Merging requires storing one model. An ensemble requires storing N.
  • Raw accuracy: Ensembles have a ceiling advantage — combining diverse predictions reduces variance in a way merged weights cannot fully replicate. But the gap is often small.
  • Latency: Merging wins. Ensemble latency scales with the number of constituent models; merged models have fixed single-model latency.

The practical calculus for production: if you're operating at scale and inference cost matters, ensembles become economically prohibitive fast. A 5-model ensemble at 10M daily requests is 5x your inference bill. A merged model costs the same as one. For teams where the ensemble performance gap is within acceptable error bounds, merging is the obvious choice.

The Marcoro14-7B-slerp model — a SLERP merge of two Mistral-7B fine-tunes — briefly held the top position on the Open LLM Leaderboard in February 2024, beating larger purpose-trained models. It ran on a single consumer GPU.

The Quality-vs-Specialization Tradeoff

Merging creates a generalist from specialists. This comes with a predictable tradeoff: the merged model typically performs somewhat below each constituent on its specific domain, while performing substantially above either constituent when averaging across domains.

The shape of the tradeoff depends on:

  • Task similarity: Merging a coding model with an instruction-following model works well because the underlying capabilities reinforce each other. Merging a coding model with a sentiment-analysis model may produce interference because the fine-tuning targets different weight regions in conflicting ways.

  • Number of models merged: All methods show degraded average accuracy as you add more models, but the rate of degradation differs. TIES and DARE degrade more gracefully than naive task arithmetic because they handle interference explicitly. At 5+ models, naive averaging typically becomes unusable.

  • Base model quality: The base model's representations constrain how well merging can work. Models fine-tuned from the same high-quality base tend to merge more cleanly than models with different fine-tuning histories.

The WiSE-FT technique (Weight-Space Ensembles for Fine-Tuning) gives you a knob: it interpolates between the pretrained base and the fine-tuned model, letting you dial between base-model robustness and fine-tuned specialization. Instead of committing to a fixed merged point, you can explore the interpolation path on a validation set and pick the coefficient that hits your performance target.
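The knob itself is a one-line interpolation. A sketch on flat lists, where `alpha=0` recovers the base model and `alpha=1` the fully fine-tuned one:

```python
def wise_ft(base, finetuned, alpha):
    """WiSE-FT-style interpolation between base and fine-tuned weights."""
    return {k: [(1 - alpha) * b + alpha * f
                for b, f in zip(base[k], finetuned[k])]
            for k in base}
```

In practice you sweep alpha over a grid (say 0.0 to 1.0 in steps of 0.1), evaluate each point on a validation set, and pick the coefficient that best balances robustness against specialization.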

Common Failure Modes

Sign conflicts: Different fine-tuned models will push the same weight in opposite directions. When this happens, averaging drives that weight toward zero, erasing the signal from both. This is not a rare edge case — it occurs even when merging just two models from different task families. TIES-Merging addresses this directly by aligning signs before averaging.

Architecture mismatch: Obvious in theory, easy to miss in practice. Tokenizer differences between models with nominally the same architecture can cause subtle degradation that looks like a merge quality problem but is actually a vocabulary mismatch. Always copy the tokenizer from one constituent and verify it matches across all models.

Inheriting the weaknesses of both parents: If Model A has a hallucination pattern on certain topics and Model B has a refusal bias on certain queries, the merged model tends to inherit both failure modes, not average them out. The failure modes are often orthogonal to the capability signals that merging successfully combines. Expect to need targeted red-teaming on the merged model, not just benchmarking.

Benchmark contamination: This is a live problem with publicly published merges. Models fine-tuned on datasets that overlap with benchmark test sets carry that contamination into merged models. A merged model hitting a high leaderboard score may be combining two models that were both slightly contaminated.

Loss landscape incompatibility: Models with substantially different fine-tuning procedures — different learning rates, different data ordering, different regularization — may have settled in different basins of the loss landscape. Linear interpolation between these basins can pass through a high-loss region rather than staying in a stable region.

This manifests as a merged model that performs worse than either constituent across the board — not just on specific tasks. Mode connectivity research suggests this risk is lower for models fine-tuned from the same base with similar hyperparameters.

Merging vs. Fine-Tuning for Continual Learning

Sequential fine-tuning has a well-documented failure mode: catastrophic forgetting. When you fine-tune a model on Task B, its performance on Task A degrades — sometimes severely — because gradient updates optimize for the new task at the expense of previously encoded knowledge.

Model merging addresses this differently. Instead of sequential weight updates, you merge the task-specific knowledge at the parameter level after each adaptation is fully trained. The base model knowledge is preserved in the base weights; task-specific knowledge lives in the delta. Because you're operating in weight space rather than gradient space, you're not subject to the sequential optimization pressure that causes forgetting.

The "model soup" approach for continual learning — train each new capability as a separate fine-tune, then merge incrementally — has shown stronger robustness to task ordering than sequential fine-tuning. Merging is less sensitive to which tasks were learned first, because the merge operation is commutative: merging A then B gives the same result as merging B then A.

The limitation is that this approach requires storing all constituent models until the merge is finalized, which has memory implications if you're managing many specializations.

Tools: mergekit and the Ecosystem

mergekit (github.com/cg123/mergekit) is the dominant open-source tool. It supports SLERP, TIES, DARE, task arithmetic, and passthrough (layer-stacking "frankenmerges"). Configuration is YAML-based and the library runs on CPU — no GPU required for the merge operation itself. A typical TIES merge of two 7B models takes a few minutes on a standard machine.

The t parameter in SLERP configs can be specified per layer type — separately for attention weights (self_attn) and MLP weights — which gives fine-grained control over which constituent model dominates in different parts of the architecture. Practitioners have found that varying t across layers often outperforms a single global interpolation coefficient.
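A representative mergekit SLERP config showing per-layer-type t schedules; the model names below are placeholders, and the value lists are interpolated across layer depth:

```yaml
# Hypothetical SLERP merge of two 32-layer fine-tunes (names illustrative)
slices:
  - sources:
      - model: org/model-a
        layer_range: [0, 32]
      - model: org/model-b
        layer_range: [0, 32]
merge_method: slerp
base_model: org/model-a
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]   # attention: model-a dominates early layers
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]   # MLP: complementary schedule
    - value: 0.5                     # default t for all other tensors
dtype: bfloat16
```

The two schedules are intentionally complementary, so each constituent dominates a different component in each part of the network.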

Evolutionary merging (Sakana AI, March 2024, published in Nature Machine Intelligence) automates the search over merge configurations using evolutionary algorithms operating in both weight space and layer-ordering space. The approach discovered Japanese math-reasoning LLMs by merging a Japanese language model with an English math reasoning model — a cross-domain merge that wouldn't have been obvious to configure manually. The resulting models outperformed both constituents and were competitive with models several times larger.

FedMerge extends merging to federated settings, where multiple tenants fine-tune local copies of a base model on private data, then contribute only weight deltas to a central merge. The merged model benefits from all tenants' training data without any tenant exposing raw data. Microsoft's Azure AI team has published work on this pattern for multi-tenant SLM deployments.

Practical Guidance for ML Engineers

Start by establishing whether your constituent models are merge-compatible. Same architecture, same tokenizer, fine-tuned from the same base. If you're unsure whether they're in connected loss basins, run a quick linear interpolation between them and check if intermediate points maintain reasonable perplexity — a sharp drop in the middle of the interpolation path indicates basin incompatibility.

For two models: SLERP with a grid search over t on a validation set. Vary t separately for attention and MLP layers if you want more control.

For three or more models: TIES with a density parameter between 0.3 and 0.7 for each model, combined with DARE if you're seeing significant interference. Normalize the final result.

Always benchmark the merged model against each constituent on task-specific evals, not just aggregate benchmarks. The aggregate score can look good while one capability has regressed significantly.

Red-team the failure modes of each constituent explicitly. Don't assume merging averages them out — assume they're additive until proven otherwise.

For continual learning workflows: treat each new capability as a separate fine-tune, maintain a library of task vectors, and merge on demand. This preserves flexibility and lets you decompose a bad merge by removing specific task vectors rather than retraining from scratch.

The Field Is Still Evolving Quickly

The survey landscape in 2025 reflects how fast the field is moving. TIES and DARE were both proposed in 2023–2024; evolutionary merging formalized automated search in early 2024; layer-aware methods (Layer-Aware Task Arithmetic, 2025) are now decomposing the interference problem at finer granularity by treating early generalist layers differently from deep task-specialized layers.

The practical upshot is that the best merge method for any given pair of models is still empirically determined. There's no unified theory predicting which method will work for an arbitrary combination of fine-tunes. The tools are cheap to run, the search space is tractable, and the performance ceiling — as demonstrated repeatedly on public leaderboards — is higher than most practitioners expect. For teams managing multiple fine-tuned specializations, model merging deserves a place in the standard toolkit alongside LoRA and RLHF, not as a curiosity but as a production technique.
