Mixture-of-Agents (MoA): Improving LLM Quality through Multi-Agent Collaboration

The Mixture-of-Agents (MoA) framework is redefining how we push large language models (LLMs) to higher levels of accuracy, reasoning depth, and reliability—without the prohibitive cost of scaling a single massive model.
Instead of relying on one “jack-of-all-trades” LLM, MoA orchestrates a team of specialized models that collaborate in structured layers, refining outputs step by step. The approach has achieved state-of-the-art (SOTA) results even with open-source models alone, surpassing top proprietary LLMs like GPT-4 Omni on multiple benchmarks.
Collaborativeness among LLMs. Why combine models at all? The MoA team found that many off-the-shelf LLMs improve when they can consult each other’s answers. In experiments on the AlpacaEval 2.0 benchmark, models like LLaMA, WizardLM, and Qwen achieved higher win rates against a GPT-4 reference when given peer-model answers in addition to the prompt.
In Figure 1, each model’s win rate jumps (red bars vs blue) when it sees others’ responses – evidence that LLMs “inherently collaborate” and can refine or validate answers based on each other. Crucially, this holds even if the peer answers are worse than what the model would do alone. In other words, multiple perspectives help an LLM avoid blind spots. This insight prompted MoA’s design: a framework to harness the collective expertise of multiple models.
*Figure 1: The “collaborativeness” effect – LLMs score higher on AlpacaEval 2.0 when provided with other models’ answers (red) versus alone (blue). Even top models (e.g. Qwen 110B) benefit from collaborating with peers, motivating MoA.*

The Mixture-of-Agents Architecture
The MoA Advantage
MoA tackles the limitations of single-model inference with a structured, multi-agent approach (a minimal configuration sketch follows the list):
- Layered design – Multiple agents per layer, each taking all previous outputs as input.
- Role specialization – Proposers: Generate diverse candidate answers. Aggregators: Merge and refine into a single, higher-quality output.
- Iterative improvement – Each layer builds on the previous, gradually boosting accuracy and coherence.
- Model diversity – Combining varied architectures reduces shared weaknesses.
- No fine-tuning required – Works entirely via prompt engineering.
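
To make the layered structure concrete, here is a minimal sketch of how a MoA configuration might be represented in Python. The `Agent` and `MoAConfig` types are illustrative, not part of the paper’s released code:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Agent:
    """One LLM participant, identified by model name and role."""
    model: str                               # e.g. an API model identifier
    role: Literal["proposer", "aggregator"]

@dataclass
class MoAConfig:
    """Layers of proposers followed by a final aggregator layer."""
    layers: list[list[Agent]]

# A hypothetical two-layer setup in the spirit of the paper:
# diverse open models propose, the strongest model aggregates.
config = MoAConfig(layers=[
    [Agent("qwen1.5-72b-chat", "proposer"),
     Agent("llama-3-70b-instruct", "proposer"),
     Agent("mixtral-8x22b-instruct", "proposer")],
    [Agent("qwen1.5-110b-chat", "aggregator")],
])
```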
Each agent is an LLM assigned one of two roles: Proposers or Aggregators.
Proposer agents
These agents generate candidate answers – they “excel at producing useful reference responses” that add context and diverse perspectives. A proposer may not give the best final answer itself, but it contributes valuable pieces of the puzzle.
Aggregator agents
These agents, by contrast, specialize in synthesizing and improving upon others’ outputs. A good aggregator can take a set of rough answers and merge them into a single high-quality response, maintaining or enhancing quality even when some inputs are weak.
Many models can act in either role – e.g. GPT-4, Qwen-1.5, and LLaMA showed strong performance both proposing and aggregating – while some (WizardLM) were notably better as proposers than aggregators. MoA leverages these strengths by assigning each model to the role where it excels.
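
Role behavior is driven entirely by prompting. The snippet below paraphrases the spirit of the paper’s aggregate-and-synthesize prompt (see arXiv:2406.04692 for the exact wording); the helper function is illustrative:

```python
# Paraphrase of the paper's aggregate-and-synthesize system prompt,
# not the verbatim text.
AGGREGATE_PROMPT = """\
You have been provided with a set of responses from various models to the
latest user query. Synthesize these responses into a single, high-quality
response. Critically evaluate the information, recognizing that some of it
may be biased or incorrect. Do not simply copy one answer; offer a refined,
accurate, and comprehensive reply.

Responses from models:
{responses}"""

def build_aggregator_input(responses: list[str]) -> str:
    """Number each proposer response and splice it into the prompt."""
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(responses))
    return AGGREGATE_PROMPT.format(responses=numbered)
```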
Layered iterative refinement
MoA organizes agents into multiple layers (think of it like a small pipeline of models).
Figure 2 illustrates an example with 4 layers and 3 agents per layer. In Layer 1, n proposer agents independently generate answers to the user’s prompt. Their outputs are passed to Layer 2, where another set of agents (the same models or different ones) sees all previous answers as additional context. Each layer’s agents thus have more material to work with – effectively performing iterative refinement of the response.
This process continues for a few layers, and finally an aggregator agent produces the single consolidated answer. Intuitively, earlier layers contribute ideas and partial solutions, while later layers combine and polish them. By the final layer, the answer is far more comprehensive and robust than any first-pass attempt.
Figure 2: Mixture-of-Agents architecture (simplified to 3 agents × 4 layers).
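
In code, the whole pipeline reduces to a short loop. This is a minimal sketch, assuming a `call_model(model, system, user)` stub that wraps whatever chat-completion client you use; the prompt wording is illustrative:

```python
def call_model(model: str, system: str, user: str) -> str:
    """Stand-in for your chat-completion client (OpenAI, Together, vLLM, ...)."""
    raise NotImplementedError

def mixture_of_agents(prompt: str,
                      layers: list[list[str]],
                      aggregator: str) -> str:
    """Layered refinement: each layer sees the previous layer's answers."""
    previous: list[str] = []
    for layer in layers:
        if previous:
            refs = "\n\n".join(f"Response {i + 1}: {r}"
                               for i, r in enumerate(previous))
            system = "Improve on these reference responses:\n" + refs
        else:
            system = "Answer the user's query."  # Layer 1: no references yet
        previous = [call_model(model, system, prompt) for model in layer]
    # Final aggregator merges the last layer's answers into one response.
    refs = "\n\n".join(f"Response {i + 1}: {r}" for i, r in enumerate(previous))
    return call_model(aggregator,
                      "Synthesize these responses into one high-quality answer:\n"
                      + refs, prompt)
```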
Proposers vs. aggregators in practice
A key design question is how to assign models to layers. The MoA paper suggests two criteria:
- (a) Performance – stronger models (higher single-model win-rates) are prime candidates for later layers
- (b) Diversity – use a mix of model types so each brings something unique.
In fact, they found heterogeneous models contribute far more than clones of the same model.
In MoA’s implementation, the final layer often has the single best model acting as aggregator, whereas earlier layers can be filled with a diverse set of proposers. Interestingly, experiments showed many top models can do both roles well, but some are much stronger in one role.
For example, WizardLM (a fine-tuned LLaMA variant) excelled as a proposer generating creative answers, but struggled as an aggregator to combine others’ content. GPT-4 (OpenAI) and Qwen-1.5 (Alibaba) were more versatile, performing well as both aggregator and proposer.
These insights can guide developers to choose an appropriate mix – e.g. use an open-source GPT-4-like model as aggregator, and have specialized smaller models propose answers (perhaps a code-specialized model, a reasoning-specialized model, etc., depending on the query domain).
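
One hedged way to encode criteria (a) and (b) in code: rank candidate models by measured single-model win-rate, reserve the strongest for the aggregator slot, and fill proposer layers with the rest. The scores here are placeholders, not benchmark numbers:

```python
def assign_roles(win_rates: dict[str, float]) -> tuple[str, list[str]]:
    """Strongest model aggregates; the rest propose (criterion (a)).
    Keeping the candidate pool heterogeneous covers criterion (b)."""
    ranked = sorted(win_rates, key=win_rates.get, reverse=True)
    return ranked[0], ranked[1:]

aggregator, proposers = assign_roles({
    "qwen1.5-110b-chat": 0.42,     # hypothetical win-rate
    "llama-3-70b-instruct": 0.34,  # hypothetical win-rate
    "wizardlm-2-8x22b": 0.30,      # hypothetical win-rate
})
# aggregator == "qwen1.5-110b-chat"; the other two become proposers
```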
Benchmarks: MoA Outperforms GPT-4 (with Only Open Models)
The MoA architecture was evaluated on several tough benchmarks, and the results are striking. Using only open-source models (no GPT-4 at all), MoA matched or beat the mighty GPT-4 in overall quality.
AlpacaEval 2.0 (Length-Controlled Win Rate)
- MoA w/ GPT-4o: 65.7%
- MoA (open-source only): 65.1%
- MoA-Lite (cost-optimized): 59.3%
- GPT-4 Omni: 57.5%
- GPT-4 Turbo: 55.0%
MT-Bench (Avg. Score)
- MoA w/ GPT-4o: 9.40
- MoA (open-source): 9.25
- GPT-4 Turbo: 9.31
- GPT-4 Omni: 9.19
FLASK (Skill-Based Evaluation) – MoA outperforms GPT-4 Omni in:
- Robustness
- Correctness
- Factuality
- Insightfulness
- Completeness
- Metacognition
Figure 3: Fine-grained evaluation (FLASK) radar chart. MoA (red dashed) vs GPT-4 (blue) across 12 skill dimensions. MoA outperforms GPT-4 on multiple fronts (e.g. factuality, insightfulness), with a mild verbosity penalty (lower conciseness). Qwen-110B (red solid) as the MoA aggregator alone trails behind on several skills, showing the multi-agent synergy boosts overall performance.
It’s important to emphasize MoA’s efficiency: those gains were achieved with open models that are collectively much cheaper than GPT-4. For example, one MoA configuration used 6 open models (Qwen-110B, LLaMA-70B, and others) across 3 layers at only a fraction of GPT-4’s API cost.
The team also devised a lighter variant called MoA-Lite – using just 2 layers and a smaller aggregator (Qwen-72B) – which still slightly beat GPT-4 Omni on AlpacaEval (59.3% vs 57.5%) while being more cost-effective. In other words, even a pared-down MoA can surpass GPT-4 quality at lower cost.
How is this possible?
Essentially, MoA taps into the wisdom of crowds among models. Each agent contributes unique strengths – one might add knowledge, another checks consistency, another improves phrasing. The final result benefits from all their expertise.
An illustrative comparison was made between MoA and a naive LLM-ranker ensemble. The ranker simply generates multiple answers and has an LLM (like GPT-4 or Qwen) pick the best one, without synthesizing them. MoA significantly outperformed that approach, confirming that MoA’s aggregator isn’t just picking one of the inputs: BLEU-overlap analysis in the paper showed the aggregator’s answer incorporates the best parts of the proposals. Collaboration, not just selection, is the key.
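
If you want a similar sanity check on your own aggregator (synthesis versus mere selection), one rough approach is to measure the final answer’s n-gram overlap with each proposal, loosely following the paper’s BLEU analysis. A sketch using NLTK and naive whitespace tokenization:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def overlap_with_proposals(final: str, proposals: list[str]) -> list[float]:
    """BLEU score of the aggregated answer against each proposer response."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    hypothesis = final.split()
    return [sentence_bleu([p.split()], hypothesis, smoothing_function=smooth)
            for p in proposals]

# If no single proposal scores near 1.0 but several share moderate overlap,
# the aggregator is likely combining content rather than copying one answer.
```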
Cost, Flexibility, and Practical Insights
For developers, a major appeal of MoA is cost-effectiveness. By orchestrating smaller open models, you can achieve GPT-4-level output without paying API fees or running a 175B-parameter model for every query. The MoA authors provide a detailed cost analysis (see Figure 5).
MoA configurations lie on a Pareto frontier of quality vs cost – delivering high win-rates at much lower cost than GPT-4. For instance, one MoA run produced a 4% higher win-rate than GPT-4 Turbo while being 2× cheaper in terms of inference cost.
Even MoA-Lite (2 layers) slightly exceeded GPT-4 Omni’s win-rate at comparable cost, essentially matching GPT-4’s quality-per-dollar, and beat GPT-4 Turbo’s quality at half the cost. This opens the door for budget-conscious applications: you could deploy a set of fine-tuned 7B–70B open models that collectively rival or surpass a closed 175B model.
Figure 4: Performance vs cost and latency trade-offs. Left: LC win-rate (quality) vs API cost per query. MoA (blue/orange points along the dashed gray frontier) achieves ~60–65% win-rate at a cost far below GPT-4 (red stars). Right: win-rate vs inference compute (TFLOPs, a proxy for latency). MoA again sits on the Pareto frontier – combinations of smaller models efficiently reach high quality. “Single Proposer” uses one model to generate multiple answers; “Multi Proposer” (MoA) uses different models per layer, which is more compute-efficient because agents run in parallel.
Another advantage is flexibility. Since MoA works via prompting, you can dynamically scale the number of agents or layers based on the query complexity or available compute. Need a quick, cheap answer? Run MoA-Lite with fewer agents. Need maximum quality? Add a layer of a very large aggregator (even GPT-4 itself could be used in MoA as a final aggregator to push quality further).
The framework lets you mix-and-match open models as long as you can prompt them. This also means you can specialize agents: e.g. add a code-specific LLM in Layer 1 to propose a coding solution, a math-specific LLM to check calculations, etc., and have the aggregator merge their outputs. In the paper’s ablations, using diverse model types yielded significantly better answers than homogeneous agents – so diversity is worth leveraging.
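
A hedged sketch of that specialization idea: choose the proposer set from the query’s domain. The model names and the keyword heuristic below are illustrative assumptions, not from the paper:

```python
# Illustrative specialist pools; swap in whatever models you actually run.
SPECIALIST_PROPOSERS = {
    "code": ["deepseek-coder-33b-instruct", "codellama-70b-instruct"],
    "math": ["qwen1.5-72b-chat", "wizardmath-70b"],
    "general": ["llama-3-70b-instruct", "mixtral-8x22b-instruct"],
}

def pick_proposers(query: str) -> list[str]:
    """Naive keyword routing; a classifier would be more robust."""
    q = query.lower()
    if any(k in q for k in ("def ", "class ", "bug", "compile")):
        return SPECIALIST_PROPOSERS["code"]
    if any(k in q for k in ("prove", "integral", "equation")):
        return SPECIALIST_PROPOSERS["math"]
    return SPECIALIST_PROPOSERS["general"]
```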
Implementation tips
The authors have released their MoA code (prompt scripts and model configs) on GitHub, making it easy to reproduce and adapt: https://github.com/togethercomputer/moa
To implement MoA, you would run each layer’s agents in parallel (to minimize latency), gather their outputs, and feed them (with an “aggregate” system prompt) into the next layer’s agents. No fine-tuning is required – just careful prompt engineering.
It’s wise to use length-controlled generation for agents (ensuring none rambles too long) to give the aggregator balanced inputs.
Also, when choosing models for each layer, consider using your strongest model as the final aggregator (since it produces the final answer) and smaller, more diverse models as proposers in earlier layers. The paper’s default MoA used 6 agents per layer for 3 layers: Qwen1.5-110B as aggregator, with Qwen1.5-72B, WizardLM-2 8x22B, LLaMA-3 70B, Mixtral 8x22B, and Databricks’ DBRX as proposers. That mix was chosen for strong base performance and heterogeneity.
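
Since agents within a layer are independent, they can be queried concurrently. A minimal sketch with a thread pool, assuming a blocking `call_model` stub as before:

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(model: str, system: str, user: str) -> str:
    """Stand-in for a blocking chat-completion call."""
    raise NotImplementedError

def run_layer(models: list[str], system: str, prompt: str) -> list[str]:
    """Query every agent in a layer in parallel; results keep model order."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(call_model, m, system, prompt) for m in models]
        return [f.result() for f in futures]
```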
Conclusion
Looking ahead, Mixture-of-Agents points to a new way of building AI systems. Instead of relying on one huge, all-purpose model, we can create a team of specialized models that work together in natural language. This is similar to how human teams operate. For example, in a medical setting:
- One agent might suggest possible diagnoses.
- Another could verify the findings against medical databases.
- A third (the aggregator) would combine everything into the final recommendation.
These agent ecosystems are often more robust and transparent. You can track each agent’s contribution, making it easier to understand and trust the final output.
Research shows that even today’s models, without extra training, can collaborate effectively. When they do, they can exceed the performance of any single model working alone.
For production AI, MoA offers a practical, cost-efficient path to GPT-4-level quality by combining open models instead of paying for one large, proprietary model.
As open-source LLMs keep improving, MoA-style architectures are likely to become the norm—scaling quality through collaboration rather than size. The era of “LLMs as team players” is just beginning.
Key Takeaways
- MoA boosts quality through collaboration – Multiple LLMs exchange and refine each other’s outputs, even when some inputs are weaker, leveraging the “collaborativeness” effect.
- Layered refinement – Each layer of agents sees prior outputs and the original prompt, enabling step-by-step improvement.
- Proven benchmark gains – With only open-source models, MoA surpasses GPT-4 Omni on AlpacaEval 2.0 and FLASK and is competitive on MT-Bench.
- Cost-effective – Matches or beats GPT-4 quality using cheaper open models; MoA-Lite offers strong results with lower compute.
- Flexibility – Easily swap in specialized models for domain tasks or adjust layers for speed vs quality.
- Future-ready – Represents a shift toward multi-agent AI systems that resemble expert teams, potentially becoming a standard approach for production-grade LLM deployments.
References: The Mixture-of-Agents architecture was introduced in Wang et al., 2024, “Mixture-of-Agents Enhances Large Language Model Capabilities” (arXiv:2406.04692). https://arxiv.org/pdf/2406.04692