Chaos Engineering for the Rest of Us

You think your system is resilient. You’ve added circuit breakers, retries, graceful degradation. You’ve designed for failure. But you’ve never actually tested whether any of it works under the conditions it was designed for. Testing failure in production feels reckless, and testing it in staging feels pointless because staging never behaves exactly like prod.

Every engineering team has a mental model of what happens when things break. The database goes down: the cache takes over. A pod dies: the cluster self-heals to the desired replica count, the load balancer reroutes, and the request is retried. The third-party API times out: the circuit breaker opens. Most of those mental models have never been verified. Some of them are wrong. The gap between what you believe about your system and what’s actually true is where your next outage lives.

Confidence, not destruction

Casey Rosenthal and Nora Jones define chaos engineering as “the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.” The key word is confidence: the kind that comes from testing your assumptions rather than trusting them. Netflix built this into its engineering culture more than 15 years ago, and it proved its worth when a major AWS outage in 2011 barely affected Netflix while other AWS-hosted services went dark. But the discipline itself doesn’t require Netflix’s scale. It requires a hypothesis and a way to test it.

The smallest chaos experiment isn’t a tool. It’s a sentence: “we believe the service recovers within 30 seconds when the database connection drops.” That’s a hypothesis. Test it. Time the recovery. See if you’re right. If you are, you’ve earned the confidence. If you’re not, you’ve found the gap before your users did.
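
As a rough sketch of what “time the recovery” could look like in Python, the function below polls a health check until it passes and reports how long that took. The names `measure_recovery` and `is_healthy` are hypothetical, the 30-second timeout simply mirrors the example hypothesis, and what counts as “healthy” for your service is an assumption you have to supply.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll is_healthy until it returns True.

    Returns the seconds elapsed before the first healthy check,
    or None if the service was still unhealthy at the timeout.
    clock and sleep are injectable so the function can be tested
    without real waiting.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            return None  # hypothesis falsified: no recovery in time
        sleep(poll_interval_s)
```

You would drop the database connection, call `measure_recovery(check)` with a check of your own (for instance, one that requests your service’s health endpoint), and compare the result with the 30 seconds you predicted. A `None` means the hypothesis was wrong.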

According to Gartner’s 2026 IT Resilience Survey, 62% of engineering teams don’t do this because they fear it will cause disruption. That fear is the point. If you can’t safely test a failure mode, you can’t safely survive it in production. The experiment you’re afraid to run is the one you need most.

You need to see before you break

Charity Majors put it bluntly at Chaos Conf 2018: “Without observability, you don’t actually have chaos engineering, you just have chaos.”

The prerequisite isn’t Gremlin or LitmusChaos. It’s the ability to answer a single question after something breaks: what just happened? Not “which alert fired” — that’s monitoring. Observability is the ability to ask arbitrary questions about system state: which service was involved, what was the upstream request, what did the downstream dependency do, and what was the user impact? If you can’t answer those, you’re not ready for deliberate failure injection. You’re not even ready for the accidental failures you already have.

This is where small teams have an unexpected advantage. A five-person team with structured logs, distributed traces, and basic dashboards can run a chaos experiment and understand the result in minutes. A fifty-person team with poor observability will inject a failure and spend two hours figuring out whether the impact was from their experiment or from something else entirely.

I learned this the hard way on a trading platform. The architecture was designed for resilience — circuit breakers on every external dependency, graceful degradation as a first principle. What we hadn’t tested was what happened when the audit trail pipeline silently failed. Services were degrading gracefully where they should have been failing loudly. The back-office observability wasn’t wired to trigger an alarm on missing audit data. We found out when the audit database was empty — not because a dashboard told us, but because someone queried it manually and got zero rows back. In a regulated system where the audit trail is a compliance requirement, “graceful degradation” had quietly degraded the one thing that couldn’t be allowed to degrade. The system was designed to survive failure. Nobody had tested whether it would notice failure in the right places.
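
The general remedy for this class of failure is to alert on absence, not just on errors. A minimal sketch, assuming you can cheaply read the timestamp of the most recent audit record: the function name and the five-minute threshold are illustrative assumptions, not a description of what that platform ran.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def audit_trail_is_stale(
    last_event_at: Optional[datetime],
    now: datetime,
    max_silence: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True when the audit trail has been silent for too long.

    last_event_at is the timestamp of the newest audit record, or
    None if the store is empty. An empty store counts as stale:
    "zero rows" should page someone, not wait for a manual query.
    """
    if last_event_at is None:
        return True
    return now - last_event_at > max_silence
```

Run on a schedule and wired to a pager, a check like this turns “nobody queried the table” into “the alarm fired”. The threshold only makes sense if the system normally produces audit events more often than that.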

The experiment nobody runs

The most valuable resilience test for a small company isn’t technical. It’s organisational.

What happens when your lead engineer is unreachable for a week? Not on holiday with Slack notifications on, but genuinely unreachable. “Gavin Belson at a Buddhist retreat” unreachable. Can anyone else deploy? Does anyone else know how the payment integration works? Is the incident response process documented well enough that someone who’s never been on-call can follow it?

Google has been running Disaster Recovery Testing — DiRT — since 2006. It’s a company-wide exercise that tests whether the organisation can respond when key people and their expertise are unavailable. Not just the infrastructure — the humans, the processes, the decision-making chains, the undocumented knowledge that lives in one person’s head. They’ve been doing this for twenty years because they learned that infrastructure resilience means nothing if the team that operates it can’t function under pressure.

I’ve seen the small-company version of this discovery. A team of eight where the founding tech guy handled every production deployment. Nobody else had the credentials, nobody else knew the sequence, nobody else had done it even once. The system had redundancy. The process had a bus factor of one. The infrastructure could survive a pod failure. The organisation couldn’t survive that guy catching the flu.

Gary Klein’s pre-mortem applies the same discipline to decisions: before starting a project, imagine it has failed spectacularly, then work backwards to identify why. It’s the resilience mindset applied to strategy — testing your assumptions about what will go wrong before you’ve committed to a path. The most uncomfortable findings always come from the organisational experiments, not the infrastructure ones. Killing a pod tells you whether your infra works. Testing whether your team can function without one specific person tells you whether your organisation works.

The Friday afternoon version

You don’t need a chaos engineering programme. You need thirty minutes on a Friday afternoon, one hypothesis, and a way to stop the experiment if it goes wrong.

Start in staging or in a non-critical production path — not on the hot path during peak traffic. Define what normal looks like before you break anything: request latency, error rate, throughput. That’s your steady state. The experiment succeeds or fails relative to that baseline.
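
One way to make the baseline concrete is to write it down as numbers and compare against them afterwards. The sketch below is one possible shape for that, not a standard: the `SteadyState` fields, the choice of p95 latency, and the tolerance defaults (1.5x latency, one extra percentage point of errors, 90% of throughput) are all assumptions to replace with your own.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SteadyState:
    p95_latency_ms: float
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    throughput_rps: float   # requests per second


def summarise(latencies_ms: list[float], errors: int, window_s: float) -> SteadyState:
    """Reduce one observation window to the three baseline numbers.

    Needs at least two latency samples (a requirement of quantiles).
    """
    total = len(latencies_ms)
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    return SteadyState(p95, errors / total, total / window_s)


def within_tolerance(
    baseline: SteadyState,
    observed: SteadyState,
    latency_factor: float = 1.5,
    max_error_increase: float = 0.01,
    min_throughput_ratio: float = 0.9,
) -> bool:
    """True if the observed window still counts as steady state."""
    return (
        observed.p95_latency_ms <= baseline.p95_latency_ms * latency_factor
        and observed.error_rate <= baseline.error_rate + max_error_increase
        and observed.throughput_rps >= baseline.throughput_rps * min_throughput_ratio
    )
```

Capture one `SteadyState` before the experiment and one during it. The same comparison can double as the stop condition: if `within_tolerance` returns `False` mid-experiment, end the experiment.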

Kill a pod and watch whether traffic reroutes. Not from the dashboard — from the traces. How long did the reroute take? Did the user see an error? Did anyone get alerted? If the answer to any of these is “I don’t know,” that’s the finding.

Point a staging service at a mock dependency that returns errors. Does the circuit breaker fire? Does the fallback path work? Does the error message make sense to the user, or does it surface a raw exception?
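
To show the shape of that experiment, here is a deliberately small circuit breaker and a mock dependency that always fails. Both classes are hypothetical illustrations, not any particular library’s API, and the threshold of three failures and the 30-second reset are assumed values.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures; retries after a cool-down."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: do not touch the dependency
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


class FailingDependency:
    """Mock dependency that always errors and counts how often it is hit."""

    def __init__(self) -> None:
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        raise ConnectionError("mock dependency is down")
```

The thing to check is the call count on the mock: after the threshold is reached, further requests should be served by the fallback without reaching the dependency at all. If the count keeps climbing, the breaker never opened.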

Revoke a team member’s deploy access for the afternoon (preferably with their knowledge). Can someone else ship? Is the runbook current, or does it reference infrastructure that was replaced six months ago?

Each experiment takes minutes. The conversation that follows is where the value lives — the distance between what you expected and what actually happened. That distance is your risk. And the goal isn’t a one-off Friday experiment but a habit: test one assumption, fix what you find, test the next one. The assumptions don’t run out.

Every system has failure modes nobody’s tested. The question is whether you discover them during a controlled experiment on a Friday afternoon — with a hypothesis, a steady-state baseline, and a kill switch — or during an incident at 3am on a Saturday, without the plan, without the team, and without the option to stop.