The Complexity Threshold

A solo AI agent beat a 5-agent team on a calculator. The obvious objection: try a harder task. So I did.

Setup: Same model (Gemini Flash). Same 5-agent org (PM, Tech Lead, 2 SWEs, QA). Three tasks at increasing interface complexity.
Judge: Claude Sonnet scored all outputs. Not Gemini judging itself.

The Scores

Calculator
1 file · 0 interfaces
Keyboard input, history, dark theme
Solo 8.9 · Team 8.7
Team token cost: 12× (3.9K vs 45K)

Project Mgmt App (crossover)
1 file · 8 interfaces
Kanban, calendar, list view, CRUD, drag-and-drop, localStorage
Solo 7.9 · Team 8.0
Team token cost: 6× (9.5K vs 57K)

Expense Tracker
14 files · 17 interfaces
Flask API, SQLite, auth, REST endpoints, frontend, tests
Solo 7.6 · Team 8.2
Team token cost: 9× (8.4K vs 74K)

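Two things happen as interface count rises. The quality gap flips and then widens in the team's favor: the team trails by 0.2 points on the calculator, leads by 0.1 on the PM app, and leads by 0.6 on the expense tracker. The cost ratio also drops from 12× to 9×, with a low of 6× on the PM app. The team's coordination overhead amortizes better when there's genuine architecture to coordinate.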
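The ratios and score gaps quoted above can be recomputed from the scorecard numbers. The snippet below is a minimal sketch of that arithmetic; the dictionary layout and the `summarize` helper are illustrative names, not part of the original experiment harness.

```python
# Token counts (in thousands) and overall scores copied from the scorecards.
RESULTS = {
    "Calculator": {
        "interfaces": 0, "solo_tokens": 3.9, "team_tokens": 45,
        "solo_score": 8.9, "team_score": 8.7,
    },
    "Project Mgmt App": {
        "interfaces": 8, "solo_tokens": 9.5, "team_tokens": 57,
        "solo_score": 7.9, "team_score": 8.0,
    },
    "Expense Tracker": {
        "interfaces": 17, "solo_tokens": 8.4, "team_tokens": 74,
        "solo_score": 7.6, "team_score": 8.2,
    },
}


def summarize(results):
    """Return (task, interfaces, rounded cost ratio, team score edge) rows."""
    rows = []
    for task, r in results.items():
        cost_ratio = round(r["team_tokens"] / r["solo_tokens"])
        # Positive edge means the team scored higher than the solo agent.
        team_edge = round(r["team_score"] - r["solo_score"], 1)
        rows.append((task, r["interfaces"], cost_ratio, team_edge))
    return rows


for row in summarize(RESULTS):
    print(row)
```

The team edge is negative on the calculator and positive on the other two tasks, which is the crossover the scorecards point at.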

Quality Breakdown (Claude Scores)

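Calculator (Solo 8.9 · Team 8.7)

Functional: Solo 9 · Team 9
UI/UX: Solo 9 · Team 9
Errors: Solo 8 · Team 8
Code: Solo 9 · Team 8
Features: Solo 10 · Team 9

Project Mgmt App (Solo 7.9 · Team 8.0)

Functional: Solo 9 · Team 9
UI/UX: Solo 8 · Team 8
Errors: Solo 6 · Team 6
Code: Solo 7 · Team 8
Features: Solo 9 · Team 9

Expense Tracker (Solo 7.6 · Team 8.2)

Architecture: Solo 8 · Team 9
API Match: Solo 9 · Team 8
Data Model: Solo 9 · Team 9
Features: Solo 7 · Team 9
Errors: Solo 6 · Team 7
Security: Solo 4 · Team 6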

Side-by-Side: The Actual Apps

Calculator and PM App are interactive single-file HTML. Expense Tracker shows the file tree.

Calculator
Solo Agent: 19s · 3.9K tokens · 8.9
5-Agent Team: 191s · 45K tokens · 8.7

PM App
Solo Agent: 40s · 9.5K tokens · 7.9
5-Agent Team: 171s · 57K tokens · 8.0

Expense Tracker
Solo Agent: 32s · 8.4K tokens · 7.6
5-Agent Team: 233s · 74K tokens · 8.2

What This Means

The Interfaces Don't Break

The solo agent was surprisingly consistent across all 14 files — no wrong column names, no broken imports. But that consistency came at a cost. Keeping 17 cross-file contracts aligned consumed context budget that could have gone to edge cases and security.

The Cost Premium Shrinks

12× on the calculator, 6× on the PM app, 9× on the expense tracker. It's not a clean downward trend — but the team is no longer burning 12× tokens for a worse result. The overhead starts buying something real.

Working Memory Is The Bottleneck

Claude's scoring rubric flagged 3 critical bugs in the solo expense tracker: a date calculation that crashes in January, no XSS escaping, no CSRF protection. The team's QA caught all three. Specialization frees up working memory for quality.
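The generated code is not reproduced here, so the following is only a hypothetical illustration of what two of those bug classes commonly look like in a Python/Flask codebase. The function names are invented for the example. CSRF protection is omitted because it depends on the form and session setup.

```python
from datetime import date
from html import escape


def previous_month_start_buggy(today):
    # Hypothetical bug: in January, month - 1 is 0, so date() raises ValueError.
    return date(today.year, today.month - 1, 1)


def previous_month_start(today):
    # Roll back to December of the previous year when the month is January.
    if today.month == 1:
        return date(today.year - 1, 12, 1)
    return date(today.year, today.month - 1, 1)


def render_description(text):
    # html.escape turns <, >, & and quotes into entities, so user input
    # is displayed as text instead of being interpreted as markup.
    return "<td>" + escape(text) + "</td>"


print(previous_month_start(date(2024, 1, 15)))
print(render_description("<script>alert(1)</script>"))
```

Both fixes are a few lines each. They are the kind of edge case that gets skipped when attention is spent elsewhere.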

The Decision Framework

Few interfaces? Ship solo — the overhead isn't worth it. Many interfaces? Specialize — not to prevent interface breaks, but to free up working memory for everything else.
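As a rough sketch, the rule can be written as a single comparison. The crossover value of 8 is an assumption taken from the PM App, where the team first edged ahead (8.0 vs 7.9). It rests on three tasks from one run, so treat it as a starting point rather than a measured constant. `recommend_org` is a hypothetical helper name.

```python
def recommend_org(interface_count, crossover=8):
    """Return "team" at or above the crossover interface count, else "solo".

    The default crossover of 8 is an assumption based on the PM App result.
    """
    if interface_count < 0:
        raise ValueError("interface_count must be non-negative")
    return "team" if interface_count >= crossover else "solo"


for task, interfaces in [
    ("Calculator", 0),
    ("Project Mgmt App", 8),
    ("Expense Tracker", 17),
]:
    print(task, "->", recommend_org(interfaces))
```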

How We Counted Interfaces

An "interface" = a contract where two components must agree on a name, shape, or behavior. If one side changes and the other doesn't update, the app breaks.

Calculator — 0 interfaces

Single namespace. All functions, variables, and DOM elements share the same scope. Nothing can get out of sync.

PM App — 8 interfaces

1. Task data schema (field names)
2. Kanban ↔ status mapping
3. Calendar ↔ date format
4. List ↔ sort/filter values
5. Modal form ↔ CRUD operations
6. Drag-and-drop ↔ state updates
7. localStorage ↔ serialization
8. Statistics ↔ computed aggregates

Expense Tracker — 17 interfaces

1. app.py ↔ config.py (settings)
2. app.py ↔ models.py (imports)
3. app.py ↔ auth.py (blueprint)
4. app.py ↔ api.py (blueprint)
5. models ↔ auth (User schema)
6. models ↔ api (column names)
7. API routes ↔ JS fetch URLs
8. API response JSON ↔ JS reads
9. API request body ↔ JS POST payloads
10. auth ↔ login form fields
11. auth ↔ register form fields
12. routes ↔ template variables
13. base.html ↔ child template blocks
14. CSS classes ↔ HTML
15. JS ↔ DOM element IDs
16. requirements.txt ↔ imports
17. API ↔ test assertions
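To make the definition concrete, here is a small sketch of what a break in one of these contracts looks like, using "API response JSON ↔ JS reads" as the example. The field names are hypothetical and were not taken from the generated apps. The check flags any field the consumer reads that the producer never emits.

```python
def missing_fields(produced, consumed):
    """Return fields the consumer reads that the producer never emits."""
    return sorted(set(consumed) - set(produced))


# Hypothetical field names, for illustration only.
api_response_fields = {"id", "amount", "category", "created_at"}
frontend_reads = {"id", "amount", "category", "date"}

# The frontend expects "date" but the API sends "created_at": one broken interface.
print(missing_fields(api_response_fields, frontend_reads))
```

Every numbered item above is a place where this kind of mismatch can occur, which is why the count works as a proxy for coordination load.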