Bridging Strategy and Execution in the Age of AI Agents
AI changed what’s fast. Nobody changed what’s slow. Teams are merging twice as many pull requests while review times double and bug rates climb. The bottleneck moved — and most organisations are only now starting to notice.
Teams using AI coding tools are merging 98% more pull requests than they were a year ago. PR review time on those same teams has increased 91%. Bug rates are up 9% per developer. Pull request sizes have grown 154%.
The inner loop got fast. Everything around it didn’t.
The bottleneck moved
For most of the last decade, the binding constraint in software organisations was engineering capacity — how fast can we build? Sprint planning allocates scarce engineering time. Product managers write detailed specs to reduce build-rework cycles. Roadmaps sequence work because building is slow. The machinery of modern software delivery was designed to manage one bottleneck: we can’t build fast enough.
AI loosened that constraint. Not completely — design, debugging, and integration still need human judgment — but the coding task that took a day takes an hour. The prototype that took a sprint takes a weekend. I’ve written before about where the bottleneck moved to: the outer loop. Deciding what to build, for whom, and why now. Reviewing what was built. Validating that it works. Governing how it’s deployed. All still running at pre-AI speed, with pre-AI processes, staffed at pre-AI levels.
Andrew Ng put it simply: writing software is becoming cheaper, which increases demand for people who can decide what to build. The industry is starting to notice — Atlassian’s Rovo is an explicit bet on the outer loop, bringing AI into project planning and knowledge discovery rather than just code generation. But most companies are still pouring investment into the inner loop while leaving the decision-making layer untouched. They’re optimising the part that’s already fast and ignoring the part that’s now the constraint.
The 91% increase in review time is the clearest signal. You wouldn’t double the throughput of a factory floor without upgrading the quality control line. But that’s exactly what most engineering organisations have done with AI coding tools. The result isn’t faster delivery. It’s faster production of code that takes longer to validate — and a review bottleneck that’s burning out the senior engineers who used to spend their time building.
I’ve seen this directly in multiple projects. A team celebrates their throughput numbers — PRs merged, stories closed, velocity charts trending up. Meanwhile, the two senior engineers who review everything are drowning. One leaves. The other stops reviewing carefully. The metrics stay green. The quality doesn’t. Nobody connects the departure to the tooling change because the dashboards don’t measure what was lost — they measure what was produced.
The pilot graveyard
For every 33 AI agent pilots launched across enterprises, roughly 4 reach production. That’s an 88% failure rate — and it’s not because the pilots didn’t work.
The pilots work beautifully in the sandbox. Demo agents that summarise customer tickets, triage incoming requests, generate code reviews — each impressive in isolation. Leadership greenlights a pilot. The team builds it in a week. It runs for a month in a controlled environment and everyone’s excited.
Then it meets production. Who reviews the agent’s output before it reaches a customer? What happens when it hallucinates? How do you audit its decisions for compliance? What’s the fallback when the model is down? How do you monitor drift — the slow degradation in output quality that nobody notices until a customer complains? How do you version, test, and roll back an agent the way you’d version, test, and roll back a microservice?
These aren’t AI problems. They’re infrastructure problems — the same kind we spent two decades solving for web services. But most teams building AI agents aren’t treating them as production software. They’re treating them as clever scripts that happen to call an API.
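Concretely, treating an agent as production software can be as unglamorous as a wrapper that pins a version, writes an audit record for every decision, and degrades gracefully when the model is down. Here is a minimal Python sketch of that shape; every name is illustrative, and `call_model` stands in for whatever LLM client you actually use:

```python
# Minimal sketch: an agent run as production software, not a clever script.
# All names are illustrative; call_model is a placeholder for a real client.
import json
import logging
import time
from dataclasses import dataclass

audit_log = logging.getLogger("agent.audit")

@dataclass(frozen=True)
class AgentConfig:
    name: str
    version: str         # pin prompt + model together so you can roll back
    model: str
    fallback_reply: str  # what the caller sees when the model is down

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real LLM client call."""
    raise NotImplementedError

def run_agent(cfg: AgentConfig, ticket: str) -> str:
    started = time.time()
    try:
        reply = call_model(cfg.model, f"Summarise this ticket:\n{ticket}")
    except Exception as exc:
        # Fallback path: degrade visibly instead of failing silently.
        audit_log.error(json.dumps({"agent": cfg.name, "version": cfg.version,
                                    "outcome": "fallback", "error": str(exc)}))
        return cfg.fallback_reply
    # Audit trail: every output is attributable to an agent version, which is
    # what compliance review and drift analysis need later.
    audit_log.info(json.dumps({"agent": cfg.name, "version": cfg.version,
                               "outcome": "ok",
                               "latency_s": round(time.time() - started, 3),
                               "output_chars": len(reply)}))
    return reply
```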
Gartner predicts over 40% of agentic AI projects will be cancelled by 2027 — not because the technology doesn’t work, but because of escalating costs, unclear business value, and inadequate risk controls. The gap is widening, not closing: ServiceNow’s enterprise AI maturity index dropped from 44 to 35 year-over-year, with fewer than 1% of organisations scoring above 50 on a 100-point scale. The capability of the models is growing faster than the organisational readiness to use them.
The pattern across my client engagements is consistent. The team builds an impressive agent in a week. It takes three months to figure out how to run it in production — and that’s if they can get the budget. Convincing leadership to fund three months of platform infrastructure before a single agent is deployed is one of the hardest sells in technology. The ROI of governance and observability is invisible until the day you need it. By then it’s a production incident, not a planning conversation.
The gap between demo and production isn’t incremental. It’s structural. And the companies that have crossed it didn’t build better agents. They built the infrastructure to operate them.
What gets you across
The companies that get agents to production have better infrastructure, not better models. Three things separate them.
They build the platform before the agent. Most teams start with the agent — what should it do? how should it reason? — and defer the infrastructure. The teams that reach production start with the platform: how do we enforce policies? how do we monitor behaviour? how do we roll back? A team I worked with built a customer-facing support agent in two weeks. It took four months to build the deployment pipeline, the evaluation framework, and the monitoring that made it production-worthy. The next agent took three days to deploy — because the platform already existed. The first agent is the most expensive. Every one after it is incremental. But you have to survive the first one, and most teams can’t justify the infrastructure investment before they have a production use case. The ones that do are the ones that make it.
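One way to picture why the second agent is cheap: the platform owns deployment, rollback, and the policy hooks, and each new agent supplies only its behaviour. A hypothetical sketch; `Platform`, `Agent`, and `TriageAgent` are names made up for illustration, not any real framework:

```python
# Hypothetical sketch of "platform before agent": the shared, expensive
# machinery lives in one place, and a new agent is just a small class.
from abc import ABC, abstractmethod

class Agent(ABC):
    name: str
    version: str

    @abstractmethod
    def handle(self, request: str) -> str: ...

class Platform:
    """Owns what every agent shares: deployment, rollback, policy hooks."""

    def __init__(self) -> None:
        self.live: dict[str, Agent] = {}       # name -> deployed version
        self.last_good: dict[str, Agent] = {}  # name -> rollback target

    def deploy(self, agent: Agent) -> None:
        # In a real platform, policy validation and the evaluation gate
        # would run here before anything goes live.
        self.last_good[agent.name] = self.live.get(agent.name, agent)
        self.live[agent.name] = agent

    def rollback(self, name: str) -> None:
        self.live[name] = self.last_good[name]

# The second agent costs one class, not four months of infrastructure.
class TriageAgent(Agent):
    name, version = "triage", "1.0.0"

    def handle(self, request: str) -> str:
        return f"routed: {request[:40]}"

platform = Platform()
platform.deploy(TriageAgent())
```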
They treat evaluation as continuous infrastructure. Building meaningful evaluations for AI agents is genuinely hard — you can’t unit-test a system whose outputs are non-deterministic. But the teams that reach production treat evals the way they treat monitoring: always running, always evolving. Anthropic’s own approach — which they’ve written about openly — graduates capability evaluations into regression suites that run continuously to catch drift. The shift is from “can we do this at all?” to “can we still do this reliably?” Most teams test their agents at launch and never again. The teams in production test them every day.
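The shape is easier to show than the evals are to write. A minimal sketch, assuming a fixed regression suite of deterministic checks over non-deterministic outputs; both the cases and the 95% threshold are placeholders, not a recommendation:

```python
# Minimal sketch: launch-day capability evals graduated into a regression
# suite that runs on a schedule. Cases and thresholds are placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # deterministic check of a non-deterministic output

REGRESSION_SUITE = [
    EvalCase("Summarise: customer requests refund for order 1041",
             check=lambda out: "refund" in out.lower()),
    EvalCase("Classify urgency: site is down for all users",
             check=lambda out: "high" in out.lower()),
]

PASS_THRESHOLD = 0.95  # assumed SLO; tune to your own risk tolerance

def run_suite(agent: Callable[[str], str]) -> float:
    passed = sum(1 for case in REGRESSION_SUITE if case.check(agent(case.prompt)))
    return passed / len(REGRESSION_SUITE)

def nightly_eval(agent: Callable[[str], str]) -> None:
    score = run_suite(agent)
    if score < PASS_THRESHOLD:
        # Wire this into alerting: drift becomes a page, not a customer complaint.
        raise RuntimeError(f"eval pass rate {score:.0%} below {PASS_THRESHOLD:.0%}")
```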
They govern with automation, not committees. The governance model that works for AI agents looks like the governance model that works for microservices: automated policy enforcement, runtime monitoring, circuit breakers, rollback capability. It does not look like a review board that meets fortnightly to approve agent deployments. The review board creates the same friction dynamic I described in the dependency security post — and governance that fights the developer loses. Governance embedded in the deployment pipeline wins. The difference between “we review every agent deployment” and “every agent deployment is automatically validated against policy” is the difference between a process that scales and one that becomes the next bottleneck.
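What “automatically validated against policy” can look like in a pipeline: a deployment manifest checked against codified rules, where any violation fails the build. The manifest fields and rules below are assumptions for illustration, not a standard:

```python
# Sketch of governance as pipeline automation: policy is code that runs on
# every deployment, not a meeting. Fields and rules are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeploymentManifest:
    agent_name: str
    eval_pass_rate: float           # from the latest regression suite run
    has_fallback: bool              # degrades gracefully when the model is down
    rollback_target: Optional[str]  # previous known-good version, if any
    autonomy: str                   # e.g. "suggest", "act_with_review", "act"

POLICIES = [
    ("eval gate",      lambda m: m.eval_pass_rate >= 0.95),
    ("fallback",       lambda m: m.has_fallback),
    ("rollback ready", lambda m: m.rollback_target is not None),
    ("human in loop",  lambda m: m.autonomy != "act"),
]

def validate(manifest: DeploymentManifest) -> list[str]:
    """Return the names of violated policies; empty means cleared to deploy."""
    return [name for name, rule in POLICIES if not rule(manifest)]

# In CI, a violation fails the deploy step rather than queueing a committee.
if __name__ == "__main__":
    manifest = DeploymentManifest("support-summariser", 0.97, True, "1.4.2", "suggest")
    violations = validate(manifest)
    if violations:
        raise SystemExit(f"deployment blocked: {violations}")
    print("policy checks passed; deploying")
```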
The bottleneck moved. Building is fast now — the market is continuously solving that with billions of dollars of investment. The hard part is everything the investment isn’t reaching: the review processes designed for human-speed output, the team structures built around engineering-as-bottleneck, the governance models that assume someone has time to check.
The organisations that bridge the gap won’t be the ones with the most agents. They’ll be the ones that restructured around the shift — moved the reviewers and product thinkers to where the constraint actually lives, built the infrastructure before the agents, and read the 91% review time increase as what it is: not a metric to optimise, but a signal that the whole system needs redesigning.
The rest will keep shipping twice as many pull requests, wondering why nothing feels faster.