The Complexity Threshold

A solo AI agent beat a 5-agent team on a calculator. The obvious objection: try a harder task. So I did.
Setup: Same model (Gemini Flash). Same 5-agent org (PM, Tech Lead, 2 SWEs, QA). Three tasks at increasing interface complexity.
Judge: Claude Sonnet scored all outputs. Not Gemini judging itself.

The Scores

Task               Scope                      Solo   Team   Token cost (solo vs team)
Calculator         1 file · 0 interfaces      8.9    8.7    3.9K vs 45K (12×)
Project Mgmt App   1 file · 8 interfaces      7.9    8.0    9.5K vs 57K (6×)
Expense Tracker    14 files · 17 interfaces   7.6    8.2    8.4K vs 74K (9×)

The Project Mgmt App is the crossover point: the first task where the team scores higher than the solo agent.

What each task included:
Calculator: keyboard input, history, dark theme
Project Mgmt App: Kanban, calendar, list view, CRUD, drag-and-drop, localStorage
Expense Tracker: Flask API, SQLite, auth, REST endpoints, frontend, tests

Two things happen as interface count rises: the quality margin swings from 0.2 points in the solo agent's favor to 0.6 points in the team's favor, and the cost ratio drops from 12× to 9× (after dipping to 6× on the PM app). The team's coordination overhead amortizes better when there's genuine architecture to coordinate.
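The margins and ratios quoted here can be recomputed from the table. The sketch below is only a convenience for checking the arithmetic: the `results` dictionary and the `summarize` helper are illustrative names, not part of the experiment's tooling, and the numbers are copied from the scores table.

```python
# Scores and token counts (in thousands) copied from the table in this article.
results = {
    "Calculator": {"interfaces": 0, "solo_score": 8.9, "team_score": 8.7,
                   "solo_tokens": 3.9, "team_tokens": 45},
    "Project Mgmt App": {"interfaces": 8, "solo_score": 7.9, "team_score": 8.0,
                         "solo_tokens": 9.5, "team_tokens": 57},
    "Expense Tracker": {"interfaces": 17, "solo_score": 7.6, "team_score": 8.2,
                        "solo_tokens": 8.4, "team_tokens": 74},
}


def summarize(row):
    """Return (team-minus-solo score margin, team/solo token ratio)."""
    margin = round(row["team_score"] - row["solo_score"], 1)
    ratio = row["team_tokens"] / row["solo_tokens"]
    return margin, ratio


for name, row in results.items():
    margin, ratio = summarize(row)
    print(f"{name}: margin {margin:+.1f}, cost ratio {ratio:.1f}x")
# Calculator: margin -0.2, cost ratio 11.5x
# Project Mgmt App: margin +0.1, cost ratio 6.0x
# Expense Tracker: margin +0.6, cost ratio 8.8x
```

Rounded to whole numbers, the ratios come out to the 12×, 6×, and 9× figures used in this article.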

Quality Breakdown (Claude Scores)

Calculator

Criterion     Solo   Team
Functional    9      9
UI/UX         9      9
Errors        8      8
Code          9      8
Features      10     9
Overall       8.9    8.7

Project Mgmt App

Criterion     Solo   Team
Functional    9      9
UI/UX         8      8
Errors        6      6
Code          7      8
Features      9      9
Overall       7.9    8.0

Expense Tracker

Criterion      Solo   Team
Architecture   8      9
API Match      9      8
Data Model     9      9
Features       7      9
Errors         6      7
Security       4      6
Overall        7.6    8.2

Side-by-Side: The Actual Apps

The calculator and PM app were each delivered as a single interactive HTML file; the expense tracker is a 14-file project.

App               Solo Agent                 5-Agent Team
Calculator        19s · 3.9K tokens · 8.9    191s · 45K tokens · 8.7
PM App            40s · 9.5K tokens · 7.9    171s · 57K tokens · 8.0
Expense Tracker   32s · 8.4K tokens · 7.6    233s · 74K tokens · 8.2

What This Means

The Interfaces Don't Break

The solo agent was surprisingly consistent across all 14 files — no wrong column names, no broken imports. But that consistency came at a cost. Keeping 17 cross-file contracts aligned consumed context budget that could have gone to edge cases and security.

The Cost Premium Shrinks

12× on the calculator, 6× on the PM app, 9× on the expense tracker. It's not a clean downward trend — but the team is no longer burning 12× tokens for a worse result. The overhead starts buying something real.

Working Memory Is The Bottleneck

Claude's review flagged three critical bugs in the solo expense tracker: a date calculation that crashes in January, no XSS escaping, and no CSRF protection. In the team's build, QA caught all three. Specialization frees up working memory for quality.
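The solo agent's actual code is not reproduced here, so the following is only a guess at what a "crashes in January" bug often looks like: computing the previous month by subtracting 1 from the month number. Both function names are hypothetical and exist purely for illustration.

```python
from datetime import date


def previous_month_start_buggy(today: date) -> date:
    # Hypothetical bug: in January, today.month - 1 is 0,
    # and date() raises ValueError because months must be 1..12.
    return date(today.year, today.month - 1, 1)


def previous_month_start(today: date) -> date:
    # Roll back to December of the previous year when the month is January.
    if today.month == 1:
        return date(today.year - 1, 12, 1)
    return date(today.year, today.month - 1, 1)


print(previous_month_start(date(2024, 1, 15)))  # 2023-12-01
```

A bug of this shape passes every manual check made between February and December, which is one reason a dedicated QA pass with boundary-date tests is more likely to find it.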

The Decision Framework

Few interfaces? Ship solo — the overhead isn't worth it. Many interfaces? Specialize — not to prevent interface breaks, but to free up working memory for everything else.
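As a rough sketch, the rule of thumb can be written as a one-line check. The article does not define "few" or "many", so the default `crossover=8` is an assumption read off the PM app row (where scores were essentially tied), and `recommend_org` is a hypothetical helper, not something used in the experiment.

```python
def recommend_org(interface_count: int, crossover: int = 8) -> str:
    """Suggest "solo" or "team" from an interface count.

    crossover=8 is an assumption: at 8 interfaces the scores were nearly
    tied (7.9 vs 8.0) while the team cost about 6x the tokens, so this
    sketch only recommends a team above that point.
    """
    if interface_count < 0:
        raise ValueError("interface_count must be non-negative")
    return "team" if interface_count > crossover else "solo"


for count in (0, 8, 17):
    print(count, recommend_org(count))
# 0 solo
# 8 solo
# 17 team
```

Three data points cannot pin down where the threshold really sits, so treat the default as a starting guess to adjust for your own tasks and token budget.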

How We Counted Interfaces

An "interface" = a contract where two components must agree on a name, shape, or behavior. If one side changes and the other doesn't update, the app breaks.
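To make the definition concrete, here is a minimal sketch of one such contract: the column names a data model provides versus the names the API code reads. The field names and the `broken_contracts` helper are invented for illustration and do not come from the generated apps.

```python
def broken_contracts(provided: set[str], required: set[str]) -> set[str]:
    """Return the names a consumer relies on that the producer no longer provides."""
    return required - provided


# Hypothetical column names for a model, and the names the API layer reads.
model_columns = {"id", "amount", "category", "created_at"}
api_reads = {"id", "amount", "category", "created_at"}
print(broken_contracts(model_columns, api_reads))  # set()

# One side renames a column and the other side is not updated.
renamed_columns = (model_columns - {"created_at"}) | {"date"}
print(broken_contracts(renamed_columns, api_reads))  # {'created_at'}
```

Each pair of components that could fail a check like this counts as one interface in the tallies that follow.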

Calculator — 0 interfaces

Single namespace. All functions, variables, and DOM elements share the same scope. Nothing can get out of sync.

PM App — 8 interfaces

1. Task data schema (field names)
2. Kanban ↔ status mapping
3. Calendar ↔ date format
4. List ↔ sort/filter values
5. Modal form ↔ CRUD operations
6. Drag-and-drop ↔ state updates
7. localStorage ↔ serialization
8. Statistics ↔ computed aggregates

Expense Tracker — 17 interfaces

1. app.py ↔ config.py (settings)
2. app.py ↔ models.py (imports)
3. app.py ↔ auth.py (blueprint)
4. app.py ↔ api.py (blueprint)
5. models ↔ auth (User schema)
6. models ↔ api (column names)
7. API routes ↔ JS fetch URLs
8. API response JSON ↔ JS reads
9. API request body ↔ JS POST payloads
10. auth ↔ login form fields
11. auth ↔ register form fields
12. routes ↔ template variables
13. base.html ↔ child template blocks
14. CSS classes ↔ HTML
15. JS ↔ DOM element IDs
16. requirements.txt ↔ imports
17. API ↔ test assertions