Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist
By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained in their final form: they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.
This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.
What Model Merging Actually Does
The central idea is this: after fine-tuning, a model's weights encode the adaptation as a perturbation on top of the base model. If two fine-tuned models start from the same base and their perturbations don't directly contradict each other, you can average those perturbations and get a model that carries both adaptations.
This is distinct from ensembling. An output-space ensemble runs multiple models at inference time and combines their predictions — it costs proportionally more compute and memory. Model merging produces a single set of weights. At inference time it's identical in cost to running one model. The performance profile is also different: ensembles reduce variance across individual model errors; merging tries to accumulate the strengths of different fine-tuning trajectories into one weight space.
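The constraint that matters: models being merged must share the same architecture, with matching tensor names and shapes layer for layer. You can't merge a 7B model with a 13B model by simple weight arithmetic. For the methods below, the models should also descend from the same base checkpoint, as assumed above; two models with identical architectures but independent pretraining runs generally do not average well.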
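A pre-flight check along these lines can catch architecture mismatches before any arithmetic happens. This is a minimal sketch: the `Shapes` mapping, the `find_mismatches` helper, and the parameter names and sizes are all hypothetical, standing in for whatever your checkpoint loader reports.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: parameter name -> tensor shape.
Shapes = Dict[str, Tuple[int, ...]]


def find_mismatches(a: Shapes, b: Shapes) -> List[str]:
    """Return human-readable reasons why two checkpoints cannot be merged."""
    problems = []
    for name in sorted(a.keys() - b.keys()):
        problems.append(f"{name}: missing from second model")
    for name in sorted(b.keys() - a.keys()):
        problems.append(f"{name}: missing from first model")
    for name in sorted(a.keys() & b.keys()):
        if a[name] != b[name]:
            problems.append(f"{name}: shape {a[name]} vs {b[name]}")
    return problems


# Illustrative shape maps only; real checkpoints have hundreds of tensors.
small = {"layers.0.attn.q_proj": (4096, 4096), "lm_head": (32000, 4096)}
large = {"layers.0.attn.q_proj": (5120, 5120), "lm_head": (32000, 5120)}

print(find_mismatches(small, small))
for problem in find_mismatches(small, large):
    print(problem)
```

An empty list only says the shapes line up. It does not establish that the two models share a base checkpoint, which has to be confirmed from their provenance.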
The Five Core Methods
Weight Averaging (Model Soup)
The simplest form is a weighted linear average: θ_merged = Σ(wᵢ · θᵢ) / Σwᵢ. The "model soup" paper (Wortsman et al., 2022) showed that averaging the weights of multiple variants fine-tuned from the same pretrained model with different hyperparameters often matches or beats the best individual constituent, and recovers much of the benefit of a full output-space ensemble at no additional inference cost.
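The averaging formula can be sketched in a few lines. To keep the example dependency-free, plain Python lists stand in for flattened tensors; the `StateDict` alias and the `weighted_average` helper are illustrative names, not part of any library.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a checkpoint: parameter name -> flattened weights.
StateDict = Dict[str, List[float]]


def weighted_average(models: Sequence[StateDict], weights: Sequence[float]) -> StateDict:
    """Compute sum(w_i * theta_i) / sum(w_i) independently for each parameter."""
    if not models or len(models) != len(weights):
        raise ValueError("need at least one model and exactly one weight per model")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    merged: StateDict = {}
    for name in models[0]:
        # zip(*...) walks the same position across every model's tensor.
        columns = zip(*(m[name] for m in models))
        merged[name] = [
            sum(w * x for w, x in zip(weights, col)) / total for col in columns
        ]
    return merged


model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 6.0], "b": [4.0]}

soup = weighted_average([model_a, model_b], [1.0, 1.0])
print(soup)  # {'w': [2.0, 4.0], 'b': [2.0]}
```

With equal weights this is the plain mean. Unequal weights let you bias the merge toward one constituent without changing the code path.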
Two variants matter in practice:
- Naive soup: average everything regardless of quality.
- Greedy soup: rank models by validation performance, then add each one in turn only if it improves the running average's held-out score. In the original paper this was the more reliable variant, since a single poor ingredient cannot drag the soup down.
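The greedy procedure is easier to see in code than in prose. The sketch below assumes a validation function is available; here `toy_score` and `TARGET` are made-up stand-ins (negative squared distance to an imagined optimum), and each model is a flat list of floats.

```python
from typing import Callable, List, Sequence

Weights = List[float]


def average(models: Sequence[Weights]) -> Weights:
    """Uniform (naive) soup: element-wise mean of all models."""
    return [sum(col) / len(models) for col in zip(*models)]


def greedy_soup(models: Sequence[Weights], score: Callable[[Weights], float]) -> Weights:
    """Add models best-first, keeping each only if the averaged score does not drop."""
    ranked = sorted(models, key=score, reverse=True)
    ingredients = [ranked[0]]
    best = score(ranked[0])
    for candidate in ranked[1:]:
        trial = score(average(ingredients + [candidate]))
        if trial >= best:
            ingredients.append(candidate)
            best = trial
    return average(ingredients)


# Hypothetical validation metric: higher is better, maximum at TARGET.
TARGET = [1.0, 1.0]


def toy_score(w: Weights) -> float:
    return -sum((x - t) ** 2 for x, t in zip(w, TARGET))


candidates = [[1.2, 0.8], [0.8, 1.2], [3.0, 3.0]]  # the last one is a bad run
soup = greedy_soup(candidates, toy_score)
naive = average(candidates)

print("greedy:", soup, "score:", round(toy_score(soup), 3))
print("naive: ", [round(x, 3) for x in naive], "score:", round(toy_score(naive), 3))
```

In this toy setup the greedy soup rejects the outlier that the naive average has to absorb. In a real pipeline, `score` would be an evaluation on held-out data, which is where most of the cost of greedy souping comes from.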
The limitation is that naive linear interpolation ignores interference between models. When two fine-tuned models have moved the same weight in opposite directions relative to the base, averaging cancels the two updates: the merged weight drifts back toward its base value and the signal from both fine-tunes is lost.
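A tiny numeric illustration of that cancellation, using made-up numbers: at position 0 the two fine-tunes move the weight in opposite directions, at position 2 they agree.

```python
# All values are illustrative; three weights stand in for a whole tensor.
base = [0.50, -0.20, 0.10]
model_a = [0.80, -0.20, 0.30]  # position 0 moved up by 0.30
model_b = [0.20, -0.20, 0.30]  # position 0 moved down by 0.30

merged = [(a + b) / 2 for a, b in zip(model_a, model_b)]
net_update = [m - w for m, w in zip(merged, base)]

for i, (m, u) in enumerate(zip(merged, net_update)):
    print(f"position {i}: merged={m:.2f} net update vs base={u:+.2f}")
```

Position 0 ends up exactly where the base model had it, so neither fine-tune's change survives there, while the agreed-upon change at position 2 comes through intact.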
SLERP (Spherical Linear Interpolation)
Originally developed for quaternion-based animation in 1985, SLERP treats weight vectors as points on a high-dimensional sphere and interpolates along the geodesic rather than the chord. The formula:
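Ω = arccos( (v₀ · v₁) / (‖v₀‖ ‖v₁‖) )
result = sin((1-t)·Ω)/sin(Ω) · v₀ + sin(t·Ω)/sin(Ω) · v₁
Here Ω is the angle between the two weight vectors and t ∈ [0, 1] is the interpolation factor: t = 0 returns v₀ and t = 1 returns v₁.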
The difference from linear interpolation is that SLERP keeps the norm of the interpolated vector close to that of its endpoints (exactly so when the two have equal norm), avoiding the magnitude collapse that can happen when two weight vectors point in partially opposite directions. In practice this means smoother interpolation and better preservation of both models' representational geometry.
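Here is a minimal sketch of the interpolation on flat vectors. It assumes each tensor has been flattened to a list of floats, and the fallback to linear interpolation for degenerate or near-parallel inputs is a common safeguard rather than part of the formula itself.

```python
import math
from typing import List


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def lerp(v0: List[float], v1: List[float], t: float) -> List[float]:
    return [(1 - t) * a + t * b for a, b in zip(v0, v1)]


def slerp(v0: List[float], v1: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Interpolate along the arc between v0 and v1 instead of the straight chord."""
    n0, n1 = norm(v0), norm(v1)
    if n0 < eps or n1 < eps:
        return lerp(v0, v1, t)  # angle is undefined for a zero vector
    cos_angle = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    angle = math.acos(cos_angle)
    sin_angle = math.sin(angle)
    if sin_angle < eps:
        return lerp(v0, v1, t)  # nearly parallel or antiparallel: formula is unstable
    c0 = math.sin((1 - t) * angle) / sin_angle
    c1 = math.sin(t * angle) / sin_angle
    return [c0 * a + c1 * b for a, b in zip(v0, v1)]


v0, v1 = [1.0, 0.0], [0.0, 1.0]
mid_slerp = slerp(v0, v1, 0.5)
mid_lerp = lerp(v0, v1, 0.5)

print("slerp midpoint norm:", round(norm(mid_slerp), 4))
print("lerp midpoint norm: ", round(norm(mid_lerp), 4))
```

For these two orthogonal unit vectors the SLERP midpoint keeps unit length, while the linear midpoint shrinks to about 0.71 of it. That shrinkage is the magnitude collapse described above.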
SLERP is limited to merging exactly two models at a time. For production use cases where you need to merge three or more specialists, you need to either chain pairwise merges (hierarchically) or move to a different method. It remains the most popular technique for pairwise merges, particularly for combining a general instruction-following model with a domain specialist.
Task Arithmetic (Task Vectors)
The task arithmetic paper (Ilharco et al., ICLR 2023) introduced a more principled framing. Define a task vector as the delta between the fine-tuned and base weights:
τᵢ = θ_finetuned_i - θ_base
Task vectors support arithmetic. Adding them combines capabilities. Negating one subtracts a capability. With a scaling coefficient α, typically tuned on held-out data, the multi-task merged model is:
θ_merged = θ_base + α · Σ τᵢ
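Both operations, addition and negation, fit in a short sketch. The model names and the three-element weight lists are invented for illustration, and `task_vector` and `apply_task_vectors` are hypothetical helpers.

```python
from typing import List, Sequence

Weights = List[float]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """tau = theta_finetuned - theta_base, element-wise."""
    return [f - b for f, b in zip(finetuned, base)]


def apply_task_vectors(base: Weights, vectors: Sequence[Weights], alpha: float = 1.0) -> Weights:
    """theta_merged = theta_base + alpha * sum(tau_i)."""
    summed = [sum(col) for col in zip(*vectors)]
    return [b + alpha * s for b, s in zip(base, summed)]


# Illustrative weights only.
base = [1.0, 1.0, 1.0]
math_model = [1.5, 1.0, 1.0]
code_model = [1.0, 1.0, 0.5]

tau_math = task_vector(math_model, base)
tau_code = task_vector(code_model, base)

multi_task = apply_task_vectors(base, [tau_math, tau_code], alpha=1.0)
print(multi_task)  # [1.5, 1.0, 0.5]

# Negation: move away from a capability by subtracting its task vector.
forget_math = apply_task_vectors(base, [tau_math], alpha=-1.0)
print(forget_math)  # [0.5, 1.0, 1.0]
```

In this example the two task vectors touch different positions, so the sum carries both changes cleanly. The next paragraphs cover what happens when they overlap.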
The key insight is that movement in a task vector's direction through weight space consistently improves performance on that task. The vectors are structured, not random — and this structure makes arithmetic meaningful rather than coincidental.
The paper demonstrated cross-task transfer: applying a task vector for one task sometimes improves a related task for which no fine-tuning was done. This points to genuine semantic structure in how fine-tuning modifies weights, not just overfitting to task-specific patterns.
The failure mode is task interference: when two tasks modify the same weight regions in conflicting ways, summing their task vectors produces noise in exactly those regions.
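One rough way to gauge how much interference to expect is to count sign disagreements between two task vectors. The `sign_conflict_rate` helper below is a hypothetical diagnostic for illustration, not something the task arithmetic paper prescribes, and the vectors are made up.

```python
from typing import List


def sign_conflict_rate(tau_a: List[float], tau_b: List[float]) -> float:
    """Fraction of positions, among those both vectors modify, where the signs disagree."""
    shared = [(a, b) for a, b in zip(tau_a, tau_b) if a != 0 and b != 0]
    if not shared:
        return 0.0  # the vectors touch disjoint positions, so nothing can conflict
    conflicts = sum(1 for a, b in shared if (a > 0) != (b > 0))
    return conflicts / len(shared)


tau_a = [0.5, -0.2, 0.0, 0.3]
tau_b = [0.4, 0.2, 0.1, -0.3]

print(round(sign_conflict_rate(tau_a, tau_b), 3))
```

A high rate suggests that plain summation will cancel a meaningful share of both updates, which is the situation the next two methods are designed for.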
TIES-Merging (Trim, Elect Sign, Merge)
TIES (Yadav et al., NeurIPS 2023) directly addresses the interference problem with a three-step process:
- Trim: For each task vector, zero out all but the top-k% parameters by magnitude. Most task vector components are redundant; empirically, keeping only 20% of parameters by magnitude maintains nearly full performance. Eliminating low-magnitude parameters reduces noise in the merged result.
- Elect Sign: For each weight position, look at the signs across all trimmed task vectors and choose the sign with the greater total magnitude. Parameters whose sign disagrees with the elected one are zeroed out.
- Disjoint Merge: For each position, average only the surviving (non-zero) parameters whose sign matches the elected sign.
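The three steps above can be sketched on flat task vectors as follows. This is a simplified illustration under stated assumptions: `trim` and `ties_merge` are hypothetical helpers, `density` is the fraction of parameters kept, and the example uses 0.5 rather than 0.2 only because the toy vectors have four elements.

```python
from typing import List, Sequence


def trim(tau: List[float], density: float) -> List[float]:
    """Keep the top `density` fraction of entries by magnitude; zero the rest."""
    k = max(1, int(round(density * len(tau))))
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(tau)]


def ties_merge(task_vectors: Sequence[List[float]], density: float = 0.2) -> List[float]:
    """Trim, elect a sign per position, then average only the agreeing entries."""
    trimmed = [trim(t, density) for t in task_vectors]
    merged = []
    for values in zip(*trimmed):
        # The sign of the sum is the sign carrying the larger total magnitude.
        total = sum(values)
        if total == 0:
            merged.append(0.0)
            continue
        agreeing = [v for v in values if v != 0 and (v > 0) == (total > 0)]
        merged.append(sum(agreeing) / len(agreeing))
    return merged


tau_a = [0.9, -0.1, 0.6, 0.05]
tau_b = [-0.4, 0.8, 0.02, 0.03]

print(ties_merge([tau_a, tau_b], density=0.5))  # [0.9, 0.8, 0.6, 0.0]
```

At position 0 the two vectors conflict after trimming (0.9 against -0.4). The positive sign wins, so the result keeps 0.9, where a plain average would have produced 0.25. The merged task vector would then be added back to the base weights, typically with a scaling factor.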
This is more expensive to compute than simple averaging, but the resulting merged models are substantially more robust when combining three or more fine-tuned models. The trim step is useful on its own: applying it before another merge method often improves results.
DARE (Drop And REscale)
DARE (Yu et al., 2024) takes a different approach to interference reduction: randomly drop task vector parameters with probability p, then rescale the survivors by 1/(1-p) to maintain the expected value. The drop removes redundant parameters stochastically; the rescaling compensates for the reduction.
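The drop-and-rescale step is short enough to sketch directly. The `dare` helper, the seed, and the constant toy task vector are assumptions made for illustration. A seeded `random.Random` instance keeps the example reproducible.

```python
import random
from typing import List


def dare(tau: List[float], drop_prob: float, seed: int = 0) -> List[float]:
    """Zero each entry with probability drop_prob; scale survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_prob)
    return [0.0 if rng.random() < drop_prob else x * scale for x in tau]


# Toy task vector: 10,000 identical small updates.
tau = [0.01] * 10_000
sparse = dare(tau, drop_prob=0.9)

kept = sum(1 for x in sparse if x != 0.0)
mean_before = sum(tau) / len(tau)
mean_after = sum(sparse) / len(sparse)

print("fraction kept:", kept / len(tau))
print("mean before:", round(mean_before, 5), "mean after:", round(mean_after, 5))
```

Roughly one entry in ten survives, each scaled up tenfold, so the mean update stays close to the original. The sparsified vector is then added to the base weights or passed to another merge method.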
The empirical finding is striking: performance is largely preserved even when dropping 90–99% of task vector updates, which suggests that most of the fine-tuning delta is redundant. The result holds best when the deltas are small in magnitude, as is typical after supervised fine-tuning. DARE can be combined with TIES (as dare_ties) or used with linear merging (dare_linear). It's particularly useful when merging aggressively different specializations, where interference would otherwise be severe.
References
- https://developer.nvidia.com/blog/an-introduction-to-model-merging-for-llms/
- https://huggingface.co/blog/mlabonne/merge-models
- https://arxiv.org/abs/2212.04089
- https://arxiv.org/pdf/2306.01708
- https://sakana.ai/evolutionary-model-merge/
- https://arxiv.org/html/2403.13187v1
- https://cameronrwolfe.substack.com/p/model-merging
- https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/slm-model-weight-merging-for-federated-multi-tenant-requirements/4407315
- https://arxiv.org/html/2408.07666v4
- https://arxiv.org/html/2603.09938
- https://arxiv.org/html/2511.21437v1
- https://arxiv.org/html/2501.05559
