Why Single Agents Often Beat Multi-Agent Systems
~ 7 min read
Multi-agent systems can feel like the obvious next step in AI engineering.
If one agent is useful, then surely five specialised agents, plus a planner, a critic, and an orchestrator, must be better.
Sometimes they are. Quite often, they are not.
The mistake is treating “more agents” as the source of the improvement when the real source may simply be more test-time compute: more model calls, more tokens, more search, more retries, and more chances to land on the right answer.
This matters in production because the extra agents are not free. They add cost, latency, state management, evaluation complexity, and new failure modes.
The benchmark trap
Many agent comparisons quietly test different compute budgets.
A single-agent baseline gets one pass at the task. A multi-agent system gets several agents, an aggregation step, maybe a critic, and sometimes a final repair pass. If that system wins, it is tempting to conclude that the architecture is better.
But the comparison may really be:
- one model call
- versus several model calls with far more reasoning tokens
That is not automatically an architectural win. It may just be a compute win.
Tran and Kiela’s 2026 paper makes this point directly. Under matched thinking-token budgets on multi-hop reasoning tasks, single-agent systems often matched or outperformed multi-agent systems. Their argument is not that multi-agent systems are useless. It is that you have to control for compute before you can claim the coordination structure is doing the work.
Before crediting the architecture, check whether the winning system simply spent more tokens and made more model calls.
That is a useful correction to how teams often evaluate agent prototypes.
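One way to keep the comparison honest is to meter compute inside the evaluation harness itself. A minimal sketch, assuming each system under test is a callable that returns an answer (or None, signalling a retry) plus the tokens it consumed; `ComputeLedger` and `run_with_budget` are illustrative names, not any framework's API:

```python
from dataclasses import dataclass

@dataclass
class ComputeLedger:
    """Running totals for one system under evaluation."""
    model_calls: int = 0
    total_tokens: int = 0

    def record(self, tokens: int) -> None:
        self.model_calls += 1
        self.total_tokens += tokens

def run_with_budget(system, task, max_tokens: int):
    """Let a system retry until it answers or exhausts a shared token budget."""
    ledger = ComputeLedger()
    answer = None
    while answer is None and ledger.total_tokens < max_tokens:
        answer, tokens = system(task)  # assumed to return (answer_or_None, tokens_used)
        ledger.record(tokens)
    return answer, ledger

# Give both architectures the same budget, then compare accuracy:
# a win at equal spend is evidence about the architecture, not the compute.
# single_answer, single_cost = run_with_budget(single_agent, task, max_tokens=20_000)
# multi_answer, multi_cost = run_with_budget(multi_agent, task, max_tokens=20_000)
```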
Coordination is real overhead
A multi-agent system is not just “more intelligence”. It is also more coordination.
Every extra agent introduces another prompt, another model call, another partial interpretation of the task, and another handoff. The orchestrator then has to merge partial results into a coherent answer or action plan.
Those handoffs are lossy.
A single agent can keep the task, constraints, tool results, assumptions, and prior reasoning in one working context. A multi-agent system fragments that state. Agent A investigates one thing. Agent B investigates another. The lead agent receives summaries, not the full path each agent took through the problem.
That can be fine for broad research. It is less fine when small details decide correctness.
In coding, operations, data work, and incident response, a missing constraint can be the difference between a useful patch and a subtle regression. If the relevant detail gets lost in a summary, the next agent cannot reason from it.
Multi-agent systems are strongest when handoffs preserve the evidence that actually decides correctness.
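One mitigation is to make the handoff payload carry raw evidence alongside the summary. A minimal sketch; the field names are illustrative, not any framework's schema:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    summary: str                 # compressed view for quick orientation
    raw_tool_outputs: list[str]  # verbatim evidence the next agent can re-check
    assumptions: list[str]       # what the subagent took for granted
    open_questions: list[str] = field(default_factory=list)  # known gaps, stated rather than dropped
```

The lead agent can fall back to `raw_tool_outputs` when a summary turns out to have dropped the detail that decides correctness.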
More agents can amplify mistakes
Adding agents does not automatically create verification.
Sometimes it just creates more confident failure.
If three agents all misunderstand the same ambiguous requirement, the final answer may look stronger because multiple paths converged. But convergence is not evidence if all paths inherited the same weak framing.
The same thing happens with tool use. Multi-agent systems can duplicate searches, read stale state, race against each other, or produce incompatible intermediate outputs. The final merge step then becomes a reconciliation problem that the system may not be instrumented to solve.
This is why “add a critic agent” is not a complete safety plan. A critic that sees only a compressed answer may miss the tool result, edge case, or original constraint that mattered.
Verification needs access to the evidence, not just another role name.
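In code terms, that means the critic's input includes the raw evidence and the original constraints, not just the answer under review. A sketch, with `llm` standing in for any completion callable:

```python
def critique(llm, answer: str, evidence: list[str], constraints: list[str]) -> str:
    """Ask the critic to check the answer against raw evidence, not a summary."""
    evidence_block = "\n".join(evidence)
    constraint_block = "\n".join(constraints)
    prompt = (
        "Review the proposed answer against the raw evidence and the original "
        "constraints. Flag any claim the evidence does not support.\n\n"
        f"Constraints:\n{constraint_block}\n\n"
        f"Evidence:\n{evidence_block}\n\n"
        f"Proposed answer:\n{answer}"
    )
    return llm(prompt)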
The better baseline is a stronger single agent
Before adding agents, build a stronger single-agent baseline.
That usually means:
- a clear task boundary
- well-described tools
- enough context to make the right decision
- explicit assumptions before action
- structured planning for non-trivial work
- validation before final output
- logging for tokens, tool calls, latency, and failure modes
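Put together, a minimal instrumented loop might look like the sketch below. `llm` and `tools` are assumed callables rather than a specific SDK, and the `FINAL:` stop signal is a convention invented for the example; the logged fields mirror the list above.

```python
import json
import time

def run_agent(llm, tools: dict, task: str, max_steps: int = 8) -> str:
    """Single-agent loop with per-step logging for latency and tool usage."""
    log = []
    context = [f"Task: {task}", "State your assumptions before acting."]
    action = ""
    for step in range(max_steps):
        start = time.monotonic()
        action = llm("\n".join(context))  # plan, call a tool, or finish
        log.append({"step": step, "latency_s": round(time.monotonic() - start, 3)})
        if action.startswith("FINAL:"):   # agent has validated and is done
            break
        tool_name, _, arg = action.partition(" ")
        tool = tools.get(tool_name, lambda a: f"unknown tool: {tool_name}")
        context.append(f"{action} -> {tool(arg)}")  # keep evidence in one context
    print(json.dumps(log, indent=2))  # token counts would be logged the same way
    return action.removeprefix("FINAL:").strip()
```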
This is also where test-time compute is worth spending deliberately.
Snell et al.’s 2024 paper on test-time compute is not specifically about multi-agent systems, but it supports the broader point: inference-time effort can be allocated in better or worse ways, and the best strategy depends on the problem. More computation helps most when it is applied to the right cases, with the right search or verification method.
In practice, that means a single agent with more thinking time, better tools, and a real verification loop may beat a messy collection of agents that mostly spend tokens coordinating with each other.
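One concrete shape for that deliberate spend is best-of-n sampling with a verifier, one of the allocation strategies Snell et al. compare. A sketch, with `generate` and `score` as assumed callables:

```python
def best_of_n(generate, score, task: str, n: int = 4) -> str:
    """Sample n candidate answers and keep the one the verifier scores highest."""
    candidates = [generate(task) for _ in range(n)]
    return max(candidates, key=lambda answer: score(task, answer))
```

Same single-agent architecture, more compute, spent where a verifier can actually separate good candidates from bad.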
When a single agent fails
A single agent often fails because it answers too soon.
That does not mean the architecture is wrong. It may mean the prompt does not force enough pre-answer work, or that the tool surface is unclear.
Before reaching for multiple agents, try making the single agent do the work you expected the group to do:
- Identify ambiguities.
- State assumptions.
- Break the task into steps.
- Inspect tool outputs before deciding.
- Check for contradictions.
- Run the relevant validation.
- Only then produce the final answer or patch.
That keeps the reasoning in one context while recovering many of the benefits people expect from collaboration.
It is less fashionable than an agent swarm, but it is often easier to debug.
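One way to force that pre-answer work is to put the checklist into the prompt itself. The wording below is illustrative rather than a canonical template, and it reuses the `FINAL:` convention from the earlier sketch:

```python
PRE_ANSWER_CHECKLIST = """\
Before giving a final answer:
1. List any ambiguities in the task.
2. State the assumptions you are making.
3. Break the task into steps.
4. Inspect each tool output before relying on it.
5. Check your intermediate conclusions for contradictions.
6. Run the relevant validation.
Only then write the final answer, prefixed with FINAL:.
"""

def build_prompt(task: str) -> str:
    return f"{PRE_ANSWER_CHECKLIST}\nTask: {task}"
```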
When multiple agents are worth it
The right lesson is not “never use multiple agents”.
Multi-agent systems are useful when the problem genuinely breaks down into independent parts.
They make sense when independent work can happen in parallel, when the context is too large or noisy for one agent, or when specialist roles give you real coverage rather than theatre.
Large research tasks are the cleanest example. Anthropic’s production research system uses an orchestrator-worker pattern so subagents can explore different directions in parallel. That is a good fit because the task is broad, the search space is large, and parallelism can reduce wall-clock time.
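The shape of that pattern is easy to sketch with plain `asyncio` fan-out; `explore` stands in for a real subagent call, and the point is the structure, not the API:

```python
import asyncio

async def explore(direction: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a subagent's actual work
    return f"findings for {direction!r}"

async def orchestrate(question: str, directions: list[str]) -> list[str]:
    # Subagents run in parallel because the directions are independent;
    # wall-clock time is bounded by the slowest worker, not the sum.
    return await asyncio.gather(*(explore(d) for d in directions))

findings = asyncio.run(orchestrate(
    "What changed in the ecosystem this year?",
    ["papers", "production systems", "tooling"],
))
```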
Other good use cases include:
- evidence gathering across independent sources
- security review where separate threat paths can be explored in parallel
- legal or financial analysis where specialists inspect different documents or assumptions
- generator and verifier setups where the verifier sees the original evidence
- long-running workflows where separate agents own clearly bounded subtasks
The common thread is independence. If the agents are mostly waiting on each other, summarising each other’s work, or re-reading the same evidence, you may be paying coordination overhead without buying much capability.
The useful question is not whether multiple agents sound more advanced, but whether the workload proves it needs them.
A practical decision rule
Use a single agent by default when:
- the task fits in one context window
- the workflow is mostly sequential
- tools are stateful or expensive
- latency and cost matter
- correctness depends on small details
- failures look like shallow reasoning rather than missing parallelism
Use multiple agents when:
- subtasks are genuinely independent
- parallelism materially reduces elapsed time
- context is too large for one working memory
- specialist roles have distinct evidence or validation criteria
- you can measure a quality gain at equal or acceptable cost
- the orchestration layer is observable enough to debug
That last point matters. If you cannot explain why the multi-agent system beat the single-agent baseline, you do not yet have an architecture. You have a more expensive experiment.
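For teams that want the rule in executable form, here is one illustrative encoding; the flags are design-review questions, not runtime signals:

```python
def choose_architecture(
    subtasks_independent: bool,
    parallelism_cuts_latency: bool,
    context_exceeds_one_window: bool,
    measured_quality_gain: bool,
) -> str:
    """Encode the default: single agent unless the workload proves otherwise."""
    if not subtasks_independent:
        return "single agent"
    if not (parallelism_cuts_latency or context_exceeds_one_window):
        return "single agent"
    if not measured_quality_gain:
        return "single agent, until an experiment shows the gain"
    return "multi-agent"
```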
The engineering principle
The best agent architecture is the one with the best ratio of usefulness to complexity.
For production systems, that means looking at:
- accuracy
- reliability
- latency
- cost
- observability
- ease of debugging
- operational failure modes
Multi-agent systems can be powerful, but they are not a free upgrade. They are a trade-off.
Start with the simplest agent that can do the job. Instrument it. Improve the prompt, tools, context, and validation loop. Then add agents only when the workload proves it needs parallelism, specialisation, or independent verification.
Most teams will get further by building one careful agent than by building five busy ones.
References
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets, Dat Tran and Douwe Kiela, 2026.
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar, 2024.
- Building effective agents, Anthropic, 2024.
- How we built our multi-agent research system, Anthropic, 2025.
- A practical guide to building agents, OpenAI.