Setting Up Multi-Agent Workflows with Claude Code
What I learned building a trading platform backoffice with six AI agents — before the playbook existed.
Last August, at Tradu, I needed to build a back-office system for a trading platform. The timeline was tight and the team was small — just me. So I gave myself six AI agents and started treating them like a dev team. No framework told me to. No best practice guide existed (or I couldn’t find one). The single-prompt approach had hit a ceiling, and I needed something that scaled.
Several months later, this became the standard playbook. Andrew Ng codified the four agentic patterns. Anthropic published data showing multi-agent outperforms single-agent by 90%. GitHub wrote about the “17× error trap” of unstructured agent swarms. Anthropic released native agent teams. The workflow I had stumbled into — and nearly abandoned on day 5 — had a name now.
Here’s what I learned before the playbook existed.
Day 5
On day 5, I almost threw it all away. Two thousand lines of wrong code buried in a 200,000-line codebase. Fundamentally, the problem was that the requirements were in my head; what I’d given the agents was an incomplete, hastily assembled PRD. It was wrong — and the agents faithfully executed on it.
This is the thing nobody warns you about. Agents are force multipliers, not mind readers. If you front-load the thinking, they’ll multiply your best work. If you don’t, they’ll multiply garbage — faster than you can catch it.
The industry now calls this the “bag of agents” anti-pattern: flat topology, no hierarchy, agents echoing each other’s mistakes until you’re buried. We didn’t have a name for it. We just knew something was very broken.
The funnel
The workflow that emerged from that disaster looks like a funnel. Heavy thinking at the top — wide exploration, debate, iteration. Narrow execution at the bottom — parallel streams, each agent with a clear scope.
Phase 1: Challenge, then structure. A Business Analyst agent probes the idea. Who are the users? What’s the actual problem? Why would anyone choose this over the alternatives? Only after the BA has challenged every assumption does the PM agent structure the findings into a PRD. The first time we skipped this step, the PM produced a beautifully formatted document that completely missed the actual user need. Lesson: always have agents argue before they agree.
Phase 2: Propose, then push back. The Architect agent designs the system. The Product Owner agent challenges it for scope, cost, and timeline. Our Architect once designed a Kubernetes-based system for a project with 100 users. The PO agent wasn’t configured to push back hard enough. Give your PO agent teeth — it should be sceptical by default.
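Phase 2 maps naturally onto Claude Code subagents. A sketch of what "giving the PO teeth" can look like, assuming the subagent format (a markdown file with YAML frontmatter under `.claude/agents/`); the prompt wording is illustrative, not the one we actually ran:

```markdown
---
name: product-owner
description: Challenges architecture proposals on scope, cost, and timeline. Use after the architect agent produces a design.
tools: Read, Grep
---

You are a sceptical Product Owner reviewing a proposed design.
- Assume the design is over-engineered until proven otherwise.
- Ask what the simplest system serving the stated user count looks like.
- Flag any infrastructure choice that exceeds the scale in the PRD, and
  demand a justification or a cheaper alternative.
- Never approve on the first pass; list at least three concrete pushbacks.
```

The system prompt is where the scepticism lives; a PO agent with a neutral prompt will wave a Kubernetes cluster through for 100 users, exactly as ours did.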
Phase 3: Decompose. An Engineering Lead agent breaks the design into tasks with a dependency graph. If you can’t describe the done state in one sentence, the task is too big. Split it.
Phase 4: Execute in parallel. Each agent reads its task and context, plans, implements, writes tests, and submits for review. More on this below.
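Phases 3 and 4 together amount to topological scheduling: group tasks into waves whose dependencies are already done, then hand each wave to agents in parallel. A minimal sketch, with hypothetical task names standing in for the real ones:

```python
from collections import defaultdict

# Illustrative task graph from Phase 3: each task maps to the tasks
# it depends on. Names are made up for the example.
deps = {
    "schema": [],
    "api": ["schema"],
    "auth": ["schema"],
    "ui": ["api", "auth"],
    "e2e-tests": ["ui"],
}

def execution_waves(deps):
    """Group tasks into waves: every task in a wave has all of its
    dependencies satisfied by earlier waves, so each wave can be
    dispatched to agents in parallel (Phase 4)."""
    indegree = {task: len(d) for task, d in deps.items()}
    dependents = defaultdict(list)
    for task, ds in deps.items():
        for d in ds:
            dependents[d].append(task)
    wave = [t for t, n in indegree.items() if n == 0]
    waves = []
    while wave:
        waves.append(sorted(wave))
        next_wave = []
        for done in wave:
            for t in dependents[done]:
                indegree[t] -= 1
                if indegree[t] == 0:
                    next_wave.append(t)
        wave = next_wave
    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("dependency cycle: the decomposition is broken")
    return waves

print(execution_waves(deps))
# [['schema'], ['api', 'auth'], ['ui'], ['e2e-tests']]
```

The cycle check matters: if the Engineering Lead agent produces a circular dependency, you want to find out before agents start, not after.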
Research now confirms what we found by instinct (and bitter experience): most of the value comes from the first round of debate. Diminishing returns after that, with 10–30× token cost per additional iteration. Our sweet spot was 3–5 rounds on the PRD, then move. Perfectionism in the planning phase is its own kind of waste.
Human checkpoints
Here’s a number that should give you pause. Chain two agents, each 95% reliable, and the combined system is 90.25% reliable. Five agents? 77%. Each handoff compounds the error rate, so every agent you add makes the whole system less reliable unless you build in error correction.
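The arithmetic behind those numbers is simple compounding: with no review step between agents, per-agent reliability multiplies across the chain.

```python
def chain_reliability(per_agent: float, n_agents: int) -> float:
    """Reliability of a chain of agents with no review between them:
    every agent must get its step right, so reliabilities multiply."""
    return per_agent ** n_agents

for n in (2, 5, 10):
    print(f"{n} agents at 95%: {chain_reliability(0.95, n):.2%}")
# 2 agents at 95%: 90.25%
# 5 agents at 95%: 77.38%
# 10 agents at 95%: 59.87%
```

A human checkpoint resets the chain: review after agent three, and the downstream agents compound from your corrected output, not from accumulated drift.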
That error correction is you.
Human checkpoints aren’t bureaucracy. They’re architecture. The pattern we settled on: review everything at first. As trust builds, spot-check high-risk areas. But never — never — drop to zero oversight.
Don’t believe me? Two confessions illustrate why.
First: one agent spent four hours “fixing” failing tests by deleting the assertions. The tests passed. The code was broken. We learned to check what “passing” actually meant.
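One crude guard we could have used from day one: before trusting a green build, count the assertions. This is a sketch, not the tooling we actually ran, but the idea is the whole lesson in ten lines:

```python
import ast

def assertion_count(test_source: str) -> int:
    """Count assert statements (and unittest-style self.assert* calls)
    in a Python test file. A 'passing' file with zero assertions is a
    red flag: the tests may have been gutted rather than fixed."""
    count = 0
    for node in ast.walk(ast.parse(test_source)):
        if isinstance(node, ast.Assert):
            count += 1
        elif (isinstance(node, ast.Call)
              and isinstance(node.func, ast.Attribute)
              and node.func.attr.startswith("assert")):
            count += 1
    return count

gutted = "def test_transfer():\n    transfer(100)\n"
real = "def test_transfer():\n    assert transfer(100) == 'ok'\n"
assert assertion_count(gutted) == 0  # passes CI, checks nothing
assert assertion_count(real) == 1
```

A review checkpoint that diffs assertion counts before and after an agent's "fix" would have caught our four-hour incident in seconds.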
Second: we rubber-stamped agent output because it “looked right”… and shipped a subtle SQL injection vulnerability. “Looks right” isn’t a security review.
What held up
Looking back, we were doing Andrew Ng’s four agentic patterns — reflection, tool use, planning, multi-agent collaboration — without knowing they had names. What mattered wasn’t the taxonomy. It was three principles that survived contact with reality:
- Front-load thinking with specialised agents. The PRD phase isn’t overhead. It’s where you prevent the 17× error trap.
- Agents debate, humans decide. Let them argue. Let them push back on each other. But the final call is yours.
- Human-in-the-loop is sacred. Not as a bottleneck — as a quality gate.
Total cost for the entire 30-step POC? Roughly $40 in API calls. Compare that to the cost of building the wrong thing.
The real shift
The workflow isn’t novel anymore — and that’s the point. It’s becoming the standard because it works. But the gap between knowing the pattern and surviving day 5 is where the real learning lives.
The shift isn’t technical. It’s managerial. You already know how to run a team — how to write clear specs, how to review work, how to catch scope creep early. Those skills transfer directly. The only difference is that your team now happens to be AI. And unlike a human team, they’ll never push back when you ask them to start over for the fifth time.
Whether that’s a feature or a bug is a conversation for another post.