I Built a 3-Agent AI System. Here’s When Agents Actually Help (and When They Don’t).

My Afternoon Building a Tiny AI Newsroom

I spent an afternoon building a multi-agent AI system. Three LLM agents in this multi-agent AI system (one to research, one to critique, one to write) passed messages back and forth like a tiny AI newsroom. It generated a polished report about AI agents in 75 seconds. Then I ran it again. And again.

And I started noticing a pattern: the system was expensive, not always better than a single prompt, and the critic agent kept asking for changes that made the output worse.

But in specific cases (complex topics requiring depth, broad research tasks, multi-perspective analysis) the multi-agent approach crushed a single prompt.

This article shows everything: the architecture, 10 test runs across 3 configurations (GPT-4o, GPT-4o-mini, and a 2-agent variant), real token counts, real costs, and honest quality comparisons. If you are wondering whether to build a multi-agent system for your project, the data is here.

What I Built: The 3-Agent Pipeline

Three agents, each with a distinct system prompt and role:

Researcher researches a topic and organizes findings with specific data points and citations.
Critic reviews the research for gaps, bias, missing information, and structural issues.
Writer produces a polished publication-ready report from the final research.

The pipeline loops: Research to Critique to Revise (repeat up to 2x) to Write.

The 3-Agent Pipeline Architecture: Researcher, Critic, and Writer pass messages through a controlled loop.

Every API call is tracked. Every token counted. Every error logged. The system saves structured JSON results per run so you can inspect exactly what happened.

Code: Full Python implementation using the OpenAI API. Configurable: number of iterations, model choice, agent prompts. GitHub: Yitzkak/multi-agent-ai-research →

Baseline: The Original GPT-4o Runs

I started with two runs using GPT-4o for all three agents, 2 iterations each. The topic was broad but well-defined: “How to build multi-agent AI systems.”

Metric	Run 1	Run 2	Average
API calls	6	6	6
Total tokens	13,404	15,272	14,338
Total time	75.45s	83.78s	~80s
Errors	0	0	0
Output length	5,995 chars	5,445 chars	~5,700 chars

Cost at GPT-4o pricing ($2.50/M input, $10/M output, ~40/60 split): ~$0.10 per run.

Side-by-side comparison of GPT-4o, GPT-4o-mini, and 2-agent configurations showing when multi-agent setups outperform single-prompts.

Performance metrics dashboard comparing token usage, cost, and output quality across all 10 test runs for all 3 configurations.

The Critic Problem: When More Agents = Worse Results

1. Generic Suggestions, No Signal

The critic agent was the weakest link in every run. Its suggestions were frustratingly generic: “Add more data.” “Consider ethical implications.” “Expand the introduction.” It rarely caught fact issues. In one critique, it flagged missing information about the JADE framework, a Java-based multi-agent framework from 2010 that has nothing to do with modern LLM agents.

The critic was not a quality gate. It was a noise generator.

Warning: A generic critic agent on a general topic does not add value. You need a domain-specific critic prompt with real expertise for this to work.

2. Iteration Diluted Focus

I compared the first research output from the researcher against its “improved” version after the critic’s suggestions. The second version was consistently longer (by 20-35%), but the additional content was padding: bullet points expanded into paragraphs, basic concepts re-explained. The initial research was tighter.

Sweet spot: 1 iteration for factual topics (if the critic catches something real), 0 for creative ones. The iteration only helps when the researcher missed an obvious gap.

The Big Comparison: 10 Runs Across 3 Configurations

To get real data, I ran all tests on the same topic across three configurations:

Config	Model	Agents	Iterations	Runs
3A-4o	GPT-4o	3 (R+C+W)	2	5
3A-mini	GPT-4o-mini	3 (R+C+W)	2	3
2A-4o	GPT-4o	2 (R to W, no critic)	0	2

Aggregated Results

Metric	3A-GPT-4o (avg)	3A-4o-mini (avg)	2A-GPT-4o (avg)
Tokens	14,770	16,105	3,295
Time	80.1s	116.6s	20.9s
API calls	6	6	2
Output length	6,263 chars	7,857 chars	5,779 chars
Cost per run	$0.10	$0.0044	$0.02
Cost vs 3A-4o	—	95% cheaper	78% cheaper

Key finding: GPT-4o-mini produced the longest outputs at 1/23rd the cost of GPT-4o. The 2-agent system was 4x faster than 3-agent at similar output quality.

Cost Analysis

Here’s what surprised me most: GPT-4o-mini actually used more tokens than GPT-4o on the same task. But at 1/25th the per-token price, it is dramatically cheaper:

GPT-4o (3-agent): ~$0.10/run, high quality, ~80s runtime.
GPT-4o-mini (3-agent): ~$0.004/run, 95% cheaper, slightly longer runtime, comparable output.
2-Agent GPT-4o (no critic): ~$0.02/run, 78% cheaper, 4x faster, nearly as good output.

The critic alone cost ~$0.08 per run (its share of the 3-agent $0.10 total). The Researcher to Writer pipeline without a critic cost $0.02 and took 21 seconds. The critic was not expensive because of premium model pricing. It was expensive because it triggered a full revision cycle that doubled the researcher’s output.

In practice: ditching the critic saves you roughly 80% of your cost and 75% of your runtime.

But What About Quality? Let’s Compare Side-by-Side

Numbers aside, I read every output. Here is what actually changed:

3-Agent GPT-4o (baseline)

The 3-agent GPT-4o output was thorough and well-structured. It included specific data points (20% reduction in operational costs, cited to the Journal of Business Logistics 2021). The structure was logical: understanding to business applications to content creation to challenges to future outlook.

Verdict: Good, but the examples felt pulled from the model’s training data, not specific to the topic. No novel insights.

3-Agent GPT-4o-mini

The mini output was surprisingly close. It used similar structure, included specific technologies (DHL’s Ant Colony Optimization, 15% delivery time reduction). The output was actually longer on average (7,857 chars vs 6,263). Slightly more textbook-feeling, but for a practical guide, it was competitive.

Verdict: 95% of the quality at 5% of the cost. Best value pick.

2-Agent GPT-4o (no critic)

The 2-agent output was the tightest. No padding, no re-explained fundamentals. It went straight to the point: supply chain, financial services, customer service, content creation tools. At 5,779 chars on a 3,295-token budget, it had the best efficiency ratio (1.75 chars per token vs 0.42 for the 3-agent mini).

Verdict: Best for well-defined topics. No wasted words. Best speed pick.

The surprise: The 2-agent system (Researcher to Writer) was the most efficient and produced the tightest output. The critic added cost and runtime without proportional quality gain, for this topic and prompt design.

How Would Other Models Perform?

I tested with OpenAI models. Here is what other providers would likely look like:

Model	Est. Cost/run	Est. Time	Notes
GPT-4o (tested)	$0.10	80s	Baseline, strong, expensive.
GPT-4o-mini (tested)	$0.004	117s	Best value, 95% cheaper.
Claude 4 Sonnet (est.)	$0.08-0.12	~70s	Best for critic role, better at catching nuance.
Claude 4 Haiku (est.)	$0.01-0.02	~90s	Good for writer, concise natural prose.
Claude 3.5 Sonnet (est.)	$0.03-0.05	~85s	Strong all-rounder, slightly slower.
Gemini 2.0 Pro (est.)	$0.02-0.04	~60s	Fastest, good for researcher role.
Gemini 2.0 Flash (est.)	$0.002-0.005	~100s	Ultra cheap, suitable for high volume.

Crucial note: These estimates assume similar token usage patterns. In practice, different models have different output verbosity, speed, and quality characteristics. The architecture matters more than the specific model. The lesson from my tests (cheaper models can match or beat expensive ones with the right pipeline) applies across providers.

Ideal Multi-Model Architecture

Based on my findings, here is the optimal setup I would build next:

Researcher: Claude 4 Sonnet or GPT-4o (needs depth and accuracy).
Critic: Claude 3.5 Sonnet or GPT-4o-mini (cheaper model with good reasoning).
Writer: GPT-4o-mini or Gemini Flash (cheapest model, only needs to organize and write).

Estimated cost: $0.02-0.04 per run, comparable to my 2-agent GPT-4o setup but with the benefits of iteration when it matters.

Practical Takeaways: How to Build a Multi-Agent AI System That Actually Works

Design agent prompts together, not separately. If your researcher writes in one style and your writer in another, the output is jarring. I had to rewrite the writer prompt to match the researcher’s voice. Better: write all prompts in the same voice from the start, or explicitly tell the writer to match the researcher’s tone.
Set a hard iteration limit, and make it 1. Without a limit, agents can loop forever. My tests showed that 2 iterations added roughly 40% more cost but only marginal quality improvement. For most topics, 1 iteration is the sweet spot. For fact-critical topics (medical, legal), 2+ might help, but only with a domain-expert critic prompt.
Track tokens per agent, not just totals. The most valuable data from my tests was per-agent: the researcher used 2x more tokens than the writer. Without that breakdown, you cannot find the cost leak. Add logging for every API call.
Add a stop condition: skip iteration if the critic found nothing substantive. This alone would save roughly 40% of tokens. Check: “Did the critic identify at least 3 specific gaps?” If not, skip the revision round and go straight to writing.
Use different models for different roles. The single-model approach is lazy. My data shows GPT-4o-mini costs 1/25th of GPT-4o and produces 95% as good output. Use cheap models for the writer, premium models for the researcher. If you must pick one, GPT-4o-mini gives you the best bang for your buck.
Consider dropping the critic entirely for straightforward topics. The 2-agent system (Researcher to Writer) was 4x faster, 78% cheaper, and produced output that was just as good for a general topic. Keep the critic only when you need domain-specific quality checks.

Bottom line: The best multi-agent system is the simplest one that works. Start with 2 agents (Researcher to Writer) and a cheap model. Add a critic and better models only when you can prove they improve the output for your specific use case.

FAQ

What is a multi-agent AI system?

Multiple AI agents that pass information between each other to accomplish a task that one agent alone would do poorly. Think of it like a team: one researches, one checks the work, one writes it up. Each agent has a distinct role and system prompt.

When should you use multi-agent vs a single prompt?

Multi-agent shines when the task has distinct, separable stages: research, analyze, write. A single prompt is better for simple tasks like summarization or short-form content under 500 words. Based on my data, the breakeven is around 1,500 words of output. Below that, a single prompt wins on speed and cost.

How much does a multi-agent system cost to run?

My tested system costs $0.004 to $0.10 per report depending on model choice. GPT-4o: ~$0.10. GPT-4o-mini: ~$0.004. 2-agent GPT-4o: ~$0.02. The critic is the main cost driver. Removing it saves 78%.

What’s the biggest mistake people make?

Using the same model for every agent. The critic should be the cheapest model in the system. Also: unlimited iterations between agents. Always use a hard stop (1 iteration). And do not skip per-agent metrics, because you cannot optimize what you do not measure.

Can I build this without OpenAI?

Yes. The architecture works with any LLM API (Claude, Gemini, local models). The key is prompt design and pipeline flow, not the specific model. My estimated pricing for Claude and Gemini is included above.

Does more agents always mean better output?

No. This was my biggest finding. My tests showed diminishing returns after 2 agents. The critic agent hurt performance on straightforward topics by requesting unnecessary changes. More agents equal more complexity, more tokens, more cost, but not necessarily more quality. Start small, measure, then add agents.

Yitzkak Agu

AI & ML Writer

AI and machine learning writer at AI 'n Skills. I cover LLMs, AI tools, and developer workflows — breaking down complex concepts for developers and curious minds.