
Multi-agent systems: when they're useful, when they're pure overhead

Everyone is orchestrating agent swarms in 2026. Research shows it often makes things worse. Concrete criteria and a decision tree to get the architecture right.

Published April 27, 2026 by Anthony Cohen
Multi-agents · Architecture · AI Agents · CrewAI · Orchestration

In 2026, building an AI system often means building multiple agents. An orchestrator, specialized sub-agents, a validation agent, a memory agent. Frameworks come with names that sell the dream: CrewAI, AutoGen, LangGraph, Google ADK. The demos show five agents collaborating on a complex problem and producing a polished result in seconds.

The problem: production reality rarely looks like the demo. Teams that deployed multi-agent systems in 2024-2025 are coming back with mixed reports. Token costs three to ten times over budget. Latency that makes the system unusable for end users. Debugging that takes twice as long because no one can tell which agent started the drift. And cases where a single, well-prompted agent would have done exactly the same job.

This article starts from a simple question: when does a multi-agent system deliver real value, and when is it added complexity with no measurable benefit? We lay out concrete criteria, debunk a few myths, and offer a decision tree we actually use in our own engagements.

Why the multi-agent hype exists (and why it is partly justified)

Two real limitations of LLM agents made multi-agent architectures attractive.

The first is context window size. Even with 200,000-token contexts available today, some tasks naturally exceed what a single agent can hold in active memory: analyzing thousands of documents, traversing a million-line codebase, aggregating web research across a hundred sources in parallel. Distributing that work across multiple agents with separate context windows is a legitimate solution to a real structural constraint.

The second is parallelization. Some problems have independent sub-tasks that can run simultaneously. If you have five markets to analyze and each analysis is independent of the others, five parallel agents finish five times faster. That is sound engineering.

That is where the solid justification ends. The rest is mostly hype or over-engineering.

What research actually says

In December 2025, researchers from Google and MIT published Towards a Science of Scaling Agent Systems, the first serious attempt to derive quantitative scaling laws for agent systems. They ran 180 controlled experiments across four agentic benchmarks (Finance, BrowseComp, PlanCraft, Workbench) with three model families (OpenAI, Google, Anthropic).

The main finding: multi-agent coordination does not reliably improve outcomes and can actively degrade them. The paper quantifies four empirical effects, namely coordination efficiency, overhead, error amplification, and redundancy. Their model correctly predicts the best architecture for approximately 87% of task configurations, with an R² of 0.513. The key variable is what they call the "tool-coordination tradeoff": the more tools and variety a task requires, the more inter-agent coordination becomes a drag rather than an accelerator. Google summarized these findings in a dedicated research post noting that "adding agents does not produce monotonic performance improvements."

Anthropic documented a similar observation in their guide Building Effective Agents: multi-agent systems consume roughly 15 times more tokens than standard single-agent interactions. This is not a billing footnote. It is the direct consequence of how orchestration works. Every agent call in a chain carries the full conversation history up to that point. A four-agent debate over five rounds means at least twenty complete LLM calls, with contexts growing at every step.
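To make the arithmetic concrete, here is a crude back-of-the-envelope model (our own illustration, assuming roughly 800 tokens per turn) of how input tokens compound when every call resends the full history:

```python
# Crude token model for a multi-agent debate (illustrative numbers only).
# Assumption: each turn adds ~800 tokens, and every call resends the
# full conversation history accumulated so far.

def debate_input_tokens(agents: int, rounds: int, tokens_per_turn: int = 800) -> int:
    history = 0
    total = 0
    for _ in range(rounds):
        for _ in range(agents):
            total += history + tokens_per_turn  # full context resent on each call
            history += tokens_per_turn          # the new turn joins the history
    return total

single = debate_input_tokens(agents=1, rounds=1)   # one well-prompted call: 800
debate = debate_input_tokens(agents=4, rounds=5)   # 20 calls: 168,000
print(debate / single)                             # 210x under this toy model
```

The exact multiplier depends on turn lengths and prompt caching; the point is the quadratic shape of the curve.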

Anthropic's recommendation is unambiguous: always find the simplest solution possible. A single, well-tooled agent is the default starting point. You add complexity only when a real constraint demands it.

The three situations where multi-agent is justified

1. Parallelization of genuinely independent sub-tasks

This is the strongest case. You have a problem decomposable into N sub-problems that do not need to communicate with each other to make progress. Each sub-agent runs in parallel, an orchestrator aggregates the results.

Concrete examples: parallel analysis of multiple markets or segments, simultaneous monitoring of several information sources, generation of content variants, distributed LLM regression testing. In these cases, the latency gain is real and the additional cost is justified.

The test: if sub-tasks need to pass intermediate results to each other, they are not truly independent. If they can run in any order without changing the outcome, they are.
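A minimal fan-out sketch, assuming the per-market call already exists (analyze_market below is a hypothetical stand-in for whatever single-agent call you have):

```python
import asyncio

async def analyze_market(market: str) -> str:
    # Stand-in for one LLM-backed analysis call with market-specific context.
    await asyncio.sleep(1)  # placeholder for a ~1 s agent call
    return f"report for {market}"

async def analyze_all(markets: list[str]) -> list[str]:
    # Independent sub-tasks: no intermediate results flow between them,
    # so total latency stays close to that of a single call.
    return list(await asyncio.gather(*(analyze_market(m) for m in markets)))

print(asyncio.run(analyze_all(["FR", "DE", "ES", "IT", "PL"])))
```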

2. Context window overflow on intensive processing tasks

For a task that requires the agent to process a volume of data exceeding what a single context can hold, distributing across multiple agents with separate windows is the only option. One agent that indexes and summarizes a hundred 40-page reports, another that synthesizes those summaries: that is a sound architecture.

Distinguish this from cases where the context window is comfortable and everything could be done in a single call; there, distribution is over-engineering.
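In code, this is a plain map-reduce; summarize and synthesize below are hypothetical wrappers around whatever LLM client you already use:

```python
def summarize(report: str) -> str:
    # Map step: one agent call per report, each in its own fresh context.
    return report[:200]  # stand-in for a real summarization call

def synthesize(summaries: list[str]) -> str:
    # Reduce step: a second agent sees only the summaries,
    # which together fit comfortably in a single context.
    return "\n---\n".join(summaries)  # stand-in for a real synthesis call

def digest(reports: list[str]) -> str:
    return synthesize([summarize(r) for r in reports])
```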

3. Regulatory or security isolation

In some contexts, two different trust levels cannot share the same execution context. An agent that processes sensitive data and an agent that communicates with an external system must be physically separated for compliance reasons (GDPR, high-risk AI Act categories, financial sector). This is not an AI architectural choice: it is a regulatory constraint that imposes the separation.

Outside these three cases, the default should favor a single, well-designed agent.

The five traps where multi-agent is pure overhead

Trap 1: specialization that specializes nothing

The classic CrewAI pattern: a "Researcher" agent, an "Analyst" agent, a "Writer" agent. Three well-named roles, three distinct prompts. But if all three agents use the same base model and the same tools, the specialization is cosmetic. The "Researcher" does not search better than a single agent with a good research prompt. The structure exists for the demo, not for performance.

Real specialization happens when each agent has different tool access (a distinct database, a proprietary API, a domain-fine-tuned model) or when it needs a fundamentally different context that is incompatible with the others.
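A blunt litmus test (our own heuristic, with hypothetical tool names): if every agent's tool set is identical, the specialization lives in the role names, not in the system.

```python
def specialization_is_real(agents: dict[str, set[str]]) -> bool:
    # Real specialization requires at least two genuinely different tool sets.
    tool_sets = list(agents.values())
    return any(a != b for a in tool_sets for b in tool_sets)

cosmetic = {"researcher": {"web_search"}, "analyst": {"web_search"}, "writer": {"web_search"}}
real = {"researcher": {"web_search", "internal_wiki"}, "analyst": {"sql_warehouse"}}

print(specialization_is_real(cosmetic))  # False: three names, one agent
print(specialization_is_real(real))      # True: distinct tool access
```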

Trap 2: sequential tasks dressed up as pipelines

A pipeline where agent B systematically waits for agent A before starting, which itself waits for agent C, is not parallelism. It is a sequential workflow with additional coordination complexity. Total latency is the sum of individual latencies plus orchestration overhead, which is worse than a single agent doing the whole thing.

How to recognize this pattern: if your agent graph has no parallel branches, it is a sequential pipeline. A single agent with multiple steps in its prompt is simpler, faster, and easier to debug.
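As a quick structural check (our own heuristic, on a toy adjacency-list representation): a graph where no node fans out to two or more successors is a chain, whatever the framework calls it.

```python
def has_parallel_branches(graph: dict[str, list[str]]) -> bool:
    # A node with two or more successors opens a parallel branch.
    return any(len(successors) > 1 for successors in graph.values())

# C feeds A, A feeds B: a three-agent "pipeline" that is really a chain.
pipeline = {"C": ["A"], "A": ["B"], "B": []}
print(has_parallel_branches(pipeline))  # False: a sequential workflow in disguise
```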

Trap 3: opaque debugging and invisible drift

In a multi-agent system, when the final output is wrong, locating the cause is a problem in itself. Did agent A misunderstand the task? Did agent B receive a truncated context? Did the orchestrator aggregate incorrectly? Drift can amplify at each step in a way that is invisible. The Google/MIT paper explicitly measures this "error amplification" in decentralized architectures, where errors propagate and compound from one agent to the next.

A single agent with structured logs is auditable. A chain of five agents requires explicit instrumentation at every node to be debuggable. That is doable, but it is additional work that is not free.
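One way to pay that cost explicitly: wrap every node so each hop emits a structured, correlatable log line. A minimal sketch using only the standard library (node names and payload shape are ours):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agents")

def traced(node_name: str):
    # Decorator: every agent node logs a run id, its inputs, and its latency.
    def wrap(fn):
        def inner(payload: dict, run_id: str) -> dict:
            start = time.perf_counter()
            result = fn(payload)
            log.info(json.dumps({
                "run_id": run_id,
                "node": node_name,
                "latency_s": round(time.perf_counter() - start, 3),
                "input_keys": sorted(payload),
            }))
            return result
        return inner
    return wrap

@traced("researcher")
def research(payload: dict) -> dict:
    return {"findings": f"notes on {payload['topic']}"}  # stand-in agent node

print(research({"topic": "multi-agent overhead"}, str(uuid.uuid4())))
```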

Trap 4: costs that blow up in production

Expect roughly 18% token overhead for a basic CrewAI crew on well-defined tasks, and 200% or more for conversational AutoGen systems with many debate rounds. For low-volume internal usage, that is manageable. For an agent answering thousands of requests per day, the cost difference between an optimized single agent and a five-agent crew can reach tens of thousands of dollars per year.

The economic evaluation must happen before the architecture decision, not after.
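A back-of-the-envelope comparison is enough to start; every number below is a placeholder to replace with your measured token counts and your provider's pricing:

```python
PRICE_PER_MTOK = 3.00      # $ per million input tokens (assumed pricing)
MONTHLY_REQUESTS = 50_000  # representative production volume (assumed)

single_tokens = 4_000              # measured tokens per request, single agent
multi_tokens = single_tokens * 15  # the ~15x multi-agent multiplier

def monthly_cost(tokens_per_request: int) -> float:
    return MONTHLY_REQUESTS * tokens_per_request / 1e6 * PRICE_PER_MTOK

delta = monthly_cost(multi_tokens) - monthly_cost(single_tokens)
print(f"annual differential: ${delta * 12:,.0f}")  # ~$100,800 at these numbers
```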

Trap 5: latency that breaks the user experience

LLM calls take two to six seconds on average depending on model and context length. A chain of five sequential agents accumulates ten to thirty seconds. For an internal tool used by experts who expect a deep response, that may be acceptable. For a real-time user-facing assistant, it is a dealbreaker. Single-agent systems respond 30 to 50% faster in comparable configurations.

CrewAI and AutoGen: what they do well, what they hide badly

CrewAI and AutoGen are the two most widely used multi-agent frameworks in 2025-2026. They have genuine merit, but they also sell abstractions that can mask underlying problems.

CrewAI's real strength is time-to-prototype. In under two hours, you have a working crew with well-defined roles, an orchestrator, and tools. That is useful for quickly validating whether the multi-agent paradigm actually adds something on your use case. The problem: if the demo works but token benchmarks blow up, migrating to a leaner architecture is expensive because business logic is baked into CrewAI abstractions.
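For reference, the prototype really is this short. A sketch in the spirit of CrewAI's quickstart (check the current docs, as the API moves quickly, and assume a model is configured via environment variables):

```python
from crewai import Agent, Task, Crew

researcher = Agent(
    role="Researcher",
    goal="Collect facts on the topic",
    backstory="You dig up primary sources.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short brief",
    backstory="You write tight summaries.",
)

research = Task(
    description="Research multi-agent overhead in production systems.",
    expected_output="Bullet-point notes with sources.",
    agent=researcher,
)
brief = Task(
    description="Write a 200-word brief from the research notes.",
    expected_output="A 200-word brief.",
    agent=writer,
)

print(Crew(agents=[researcher, writer], tasks=[research, brief]).kickoff())
```

The speed is real; the trap is that the roles, tasks, and routing above are now CrewAI objects, not plain functions you can lift out.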

AutoGen is more flexible and more powerful for complex conversational cases, specifically feedback loops and multi-agent debates. The trade-off is high operational overhead: API instability affected approximately 20% of legacy codebases during 2025 updates, and average cost on production projects runs around $0.35 per multi-turn conversation, which rules out high-volume usage.

LangGraph, less hyped but more mature, remains the most defensible option for teams that want control: explicit state graph, clear debugging, no abstraction hiding the LLM calls.
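A minimal example of that explicitness, in the spirit of LangGraph's StateGraph API (verify against the current docs; the node body is a placeholder):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # One visible, debuggable step; nothing hides the underlying LLM call.
    return {"answer": f"draft answer to: {state['question']}"}

graph = StateGraph(State)
graph.add_node("answer", answer_node)
graph.add_edge(START, "answer")
graph.add_edge("answer", END)

app = graph.compile()
print(app.invoke({"question": "single agent or crew?", "answer": ""}))
```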

The operational decision tree

Before building a multi-agent system, answer these six questions in order.

Question 1: can the problem be solved by a single, well-tooled, well-prompted agent? If yes, start there. Add agents later only if a concrete constraint requires it.

Question 2: are there truly independent, parallelizable sub-tasks? If no: stay with a single agent and a sequential workflow. If yes: move to question 3.

Question 3: does the latency or throughput gain justify the additional token cost and maintenance burden? Calculate the token cost of the multi-agent scenario against the single-agent scenario over a representative monthly volume. If the annual cost differential exceeds the measurable business benefit, reconsider.

Question 4: is the context window the actual limiting factor? If yes, and sub-tasks can fit into separate contexts: multi-agent architecture is justified. Otherwise: a single context window is enough.

Question 5: is there a regulatory constraint that mandates context separation? If yes: the architecture is imposed, not chosen.

Question 6: does your team have the instrumentation to debug and monitor multiple agents in production? If no: operational complexity will likely stall the project. Invest in observability first, or stay on a single instrumented agent.
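The whole tree fits in a dozen lines of straight-line code (our paraphrase; the booleans are your answers to the six questions above):

```python
def recommend(single_agent_suffices: bool, independent_subtasks: bool,
              gain_beats_cost: bool, context_overflow: bool,
              regulatory_isolation: bool, can_observe: bool) -> str:
    if regulatory_isolation:
        return "multi-agent: imposed by compliance, not chosen"
    if single_agent_suffices:
        return "single agent: add agents only when a constraint demands it"
    if context_overflow:
        return "multi-agent: separate context windows"
    if independent_subtasks and gain_beats_cost and can_observe:
        return "multi-agent: parallel fan-out with an aggregator"
    return "single agent: the default wins"
```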

What we do in GettIA engagements

When a client comes in with a brief that says "we want multi-agents," our first response is never "sure, let's code." It is "walk us through the use case in detail and we'll figure out together whether that's the right answer."

Across the projects we ran in 2025-2026, roughly one third of the cases where the client asked for multi-agents did not actually require it. A well-tooled agent with a structured prompt and a few domain-specific tools handled the need precisely, with lower latency, predictable cost, and trivial debugging.

The remaining two thirds genuinely justified a distributed architecture: parallel processing of heterogeneous sources, specialized agents with separate domain-specific RAG databases, separation between a data-collection agent and an action agent for traceability reasons.

Our process is systematic: single-agent proof of concept first, quality benchmark on a representative sample of real cases, cost evaluation at scale, identification of sub-tasks that would actually benefit from a distinct agent. If after that test the multi-agent approach delivers a measurable gain on at least one of the three axes (quality, latency, cost), we validate the architecture. Otherwise, we simplify.

Want us to look at your specific case? Book a slot, we'll spend 30 minutes reviewing your use case and determining whether a multi-agent architecture is justified or whether a well-designed single agent gets the job done.
