Barnett Studios
23 February 2026 · Essay · Lyubomir Bozhinov · 9 min read

Resilient by Design: What I Learned Building for Regulated Finance

Most engineers optimise for uptime. In regulated finance, that instinct will cost you. What I learned about failing hard, recovering deliberately, and building compliance into the architecture.

Most engineers optimise for uptime. Keep the system running. Retry the request. Swallow the error. Move on.

In regulated finance, that instinct will cost you.

Over the last six months, I’ve been building a back-office system for a MiFID II-regulated trading platform — the kind of system where a silent data inconsistency doesn’t just cause a bug report; it causes a regulatory investigation. Where “the system recovered automatically” isn’t reassuring — it’s terrifying, because nobody can explain what state it recovered to.

The first lesson took longer to learn than it should have: resilience isn’t uptime. It’s knowing exactly what happened, even when things went wrong.

Fail hard, recover deliberately

There’s a moment in every distributed systems project where you face a choice. The cluster state has diverged from the database. Do you auto-recover — pick a winner, reconcile, carry on? Or do you stop everything, alert an operator, and force a manual investigation?

Every instinct says auto-recover. Ship the fix. Keep the lights on.

Pat Helland — who spent decades building database systems at Microsoft and Amazon — has argued that “eventual consistency” is a term so vague it means something different to everyone who uses it. In finance, what happens during the inconsistency window is real money moving in the wrong direction.

We chose to fail hard. Deliberately. Not because we couldn’t build auto-recovery — we could — but because the cost of silent corruption dwarfs the cost of downtime.

The scale of the problem is worse than most people think. A study by Bairavasundaram et al. at NetApp found over 400,000 silent data corruption incidents across 1.5 million production disks over 41 months — and the top 1% of affected drives produced more than half of all recorded corruptions. Separately, Meta’s engineering team published research in 2021 on CPU-level corruption: a defect in a specific processor caused math.pow to return incorrect values, which led a Spark decompression pipeline to silently skip files. Database rows went missing. Nobody noticed until the damage was done. And that’s infrastructure — not business logic, not money, not regulation. Just bits on silicon.

In a trading system, a single corrupted position record can cascade into incorrect margin calculations, wrong risk exposure, and trades that should never have been allowed. JPMorgan’s London Whale incident — $6.2 billion in losses — traced back in part to a spreadsheet error in a risk model. The system didn’t fail. It kept running. That was the problem.

So we built for CP — consistency and partition tolerance — over AP. The system chooses to be unavailable rather than inconsistent. When the event log diverges from the projection, the service stops. An operator investigates. The recovery is deliberate, auditable, and provable.
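
To make that stance concrete, here is a minimal sketch of what "fail hard" means in code. The types and the checksum scheme are illustrative, not our actual codebase; the point is that a detected divergence throws and pages a human instead of picking a winner.

```java
// Hypothetical sketch of the "fail hard" stance: if the projection has diverged
// from the event log, stop the service rather than silently reconciling.
// EventLog, Projection and their accessors are illustrative names only.
public final class DivergenceGuard {
    public interface EventLog { long lastSequence(); long checksumUpTo(long sequence); }
    public interface Projection { long lastAppliedSequence(); long checksum(); }

    private final EventLog log;
    private final Projection projection;

    public DivergenceGuard(final EventLog log, final Projection projection) {
        this.log = log;
        this.projection = projection;
    }

    /** Throws instead of "picking a winner"; an operator investigates and replays. */
    public void verifyOrHalt() {
        final long applied = projection.lastAppliedSequence();
        if (applied > log.lastSequence()) {
            throw new IllegalStateException("Projection is ahead of the event log: " + applied);
        }
        if (projection.checksum() != log.checksumUpTo(applied)) {
            // Deliberate failure: no auto-recovery, no silent reconciliation.
            throw new IllegalStateException("Projection diverged from event log at sequence " + applied);
        }
    }
}
```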

Martin Thompson’s foundational principle for the LMAX Exchange — which processes six million orders per second on a single thread — is that the execution of events changing state must be deterministic. If you can’t replay it and get the same answer, you don’t have a system. You have a hope.

Compliance as architecture

Here’s what I wish someone had told me before I started: if you’re building for a regulated industry, don’t bolt compliance on. Build it in. The architecture should make compliance the path of least resistance — like a well-designed building where the fire exits are exactly where you’d expect them, not retrofitted through a window.

Event sourcing does this naturally. Every state change is an immutable event in a log. The log is the audit trail — not a secondary system that might drift, not a reporting layer that reconstructs history from snapshots, but the actual sequence of things that happened. Event-sourced systems make every action traceable — not as a feature you add, but as a property of the design. Greg Young, who coined CQRS and popularised event sourcing, is characteristically direct about where it belongs: apply it selectively, in the places where it earns its complexity. Audit trails in regulated systems are that place.
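
A toy illustration of the idea (the names are made up, and a real system persists the log rather than holding it in a list): current state is only ever a fold over immutable events, so the audit trail and the system of record are the same thing.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of event sourcing: state changes are appended as immutable
// events, and current state is derived by replaying them. Names are illustrative.
public final class PositionLedger {
    /** An immutable fact: nothing in the log is ever updated or deleted. */
    public record PositionChanged(String account, String instrument, long deltaQty, Instant at) {}

    private final List<PositionChanged> log = new ArrayList<>();

    public void append(final PositionChanged event) {
        log.add(event);   // append-only: this sequence *is* the audit trail
    }

    /** Current position is a fold over history, never a mutable field updated in place. */
    public long positionOf(final String account, final String instrument) {
        return log.stream()
            .filter(e -> e.account().equals(account) && e.instrument().equals(instrument))
            .mapToLong(PositionChanged::deltaQty)
            .sum();
    }
}
```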

Most consensus systems force you to choose. Throughput or consistency. Low latency or deterministic replay. Pick two, sacrifice the rest. Aeron Cluster doesn’t make you choose — it gives you all four, which is why the architecture works at all. The determinism isn’t a nice-to-have bolted onto a fast system; it’s intrinsic to how the cluster replicates state.

The in-memory replicated state machine is the source of truth for the entire system — not the database. All three cluster nodes persist the event log through Aeron Archive. A persistence gateway then writes domain events to TimescaleDB for querying and reporting. But if the database disappeared tomorrow, you’d rebuild it from the cluster log. The database is a materialised view. The cluster is the record.

Every node processes the same events in the same order using the cluster’s own clock — not System.currentTimeMillis(), which varies across machines, but a deterministic timestamp agreed by consensus. The result: any node can replay the full history and arrive at the same state. When a regulator asks “what happened at 14:32:07.445 on March 3rd?”, the answer isn’t reconstructed. It’s fully deterministic.
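
The shape of such a service is an Aeron ClusteredService whose only notion of time is the timestamp the cluster passes into each callback. This is a stripped-down sketch with placeholder business logic, not the production code.

```java
import io.aeron.ExclusivePublication;
import io.aeron.Image;
import io.aeron.cluster.codecs.CloseReason;
import io.aeron.cluster.service.ClientSession;
import io.aeron.cluster.service.Cluster;
import io.aeron.cluster.service.ClusteredService;
import io.aeron.logbuffer.Header;
import org.agrona.DirectBuffer;

// Sketch of an Aeron Cluster service: every state change is driven by onSessionMessage,
// and the only clock the logic sees is the consensus timestamp the cluster stamps into
// the log. The class name and business logic are placeholders.
public final class OrderBookService implements ClusteredService {
    private Cluster cluster;

    public void onStart(final Cluster cluster, final Image snapshotImage) {
        this.cluster = cluster;
        // A real service rebuilds its in-memory state from snapshotImage here.
    }

    public void onSessionMessage(
        final ClientSession session,
        final long timestamp,          // cluster time: identical on every node, and on replay
        final DirectBuffer buffer,
        final int offset,
        final int length,
        final Header header) {
        // Never System.currentTimeMillis(): replay must produce the same state.
        applyEvent(timestamp, buffer, offset, length);
    }

    private void applyEvent(final long timestamp, final DirectBuffer buffer, final int offset, final int length) {
        // Placeholder for the deterministic state-machine logic.
    }

    public void onSessionOpen(final ClientSession session, final long timestamp) {}
    public void onSessionClose(final ClientSession session, final long timestamp, final CloseReason closeReason) {}
    public void onTimerEvent(final long correlationId, final long timestamp) {}
    public void onTakeSnapshot(final ExclusivePublication snapshotPublication) {}
    public void onRoleChange(final Cluster.Role newRole) {}
    public void onTerminate(final Cluster cluster) {}
}
```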

The harder problem was GDPR. MiFID II requires five to seven years of transaction data retention. GDPR gives individuals the right to erasure. These two requirements look contradictory — and they are, unless you design around them.

The solution is crypto-shredding, a technique Thoughtworks rates at “Trial” maturity and Mathias Verraes has written about extensively in the context of event sourcing. You encrypt sensitive data with a per-user key. When someone exercises their right to erasure, you destroy the key. The encrypted data remains — satisfying the MiFID II retention obligation, with GDPR Article 6(1)(c) (legal obligation) as the lawful basis for keeping it — but it’s unreadable. Noise. Whether key destruction fully satisfies Article 17 erasure is still debated in legal circles — take advice for your jurisdiction — but the architectural pattern resolves the tension between “keep everything” and “delete on request” without forcing you to choose.
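
In code, the pattern is small. This sketch uses the standard Java crypto APIs, with a plain map standing in for a real key-management service; it illustrates the idea rather than our key layer.

```java
import java.security.SecureRandom;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch of crypto-shredding with standard JCA APIs. A ConcurrentHashMap stands in
// for a real key-management service; the rest is plain AES-GCM.
public final class CryptoShredder {
    private static final int GCM_TAG_BITS = 128;
    private static final int IV_BYTES = 12;

    private final Map<String, SecretKey> keysByUser = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    public byte[] encryptForUser(final String userId, final byte[] plaintext) throws Exception {
        final SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
        final byte[] iv = new byte[IV_BYTES];
        random.nextBytes(iv);
        final Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(GCM_TAG_BITS, iv));
        final byte[] ciphertext = cipher.doFinal(plaintext);
        final byte[] out = new byte[IV_BYTES + ciphertext.length];   // store IV alongside ciphertext
        System.arraycopy(iv, 0, out, 0, IV_BYTES);
        System.arraycopy(ciphertext, 0, out, IV_BYTES, ciphertext.length);
        return out;
    }

    /** Right to erasure: destroy the key; events in the immutable log remain, but read as noise. */
    public void shred(final String userId) {
        keysByUser.remove(userId);
    }

    private SecretKey newKey() {
        try {
            final KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(256);
            return generator.generateKey();
        } catch (final Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```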

It’s elegant, but not free. You need a key management layer, a caching strategy, and careful thought about where keys live. The architecture adds complexity — but the alternative is choosing between regulatory compliance and data protection law. That’s not a trade-off; it’s a trap.

Graceful degradation by severity

Michael Nygard’s circuit breaker pattern — popularised in Release It! and documented by Fowler — is well understood. What’s less discussed is that not all circuit breakers should behave the same way.

In aviation, a single engine failure doesn’t ground the plane — but a hydraulics failure does. The response is proportional to the criticality of the system that failed. Software should work the same way.

In the back-office system, we protected three of our database connections with Resilience4j circuit breakers, each with a different failure policy:

The audit database fails fast. If the circuit opens, the service declares itself not ready and Kubernetes pulls it from traffic. MiFID II requires a complete audit trail, and a gap in the audit log isn’t a degraded experience — it’s a compliance violation. Non-negotiable.

The operational database degrades gracefully. Circuit open? Return sensible defaults. The system continues to function — dealers can still do part of their job — but some queries return cached or default values until the database recovers.

The read-only analytics datasource? Return empty results. Historical queries can wait. Nobody’s position is at risk because a reporting query failed.

This is the part Nygard’s original pattern doesn’t prescribe: the severity of the response should match the criticality of the dependency. A circuit breaker isn’t just a binary switch. It’s a policy decision about what matters most when things go wrong.
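
A sketch of what that looks like with Resilience4j follows; the breaker names, thresholds and fallbacks are illustrative, not our production configuration.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

// "Severity by dependency" with Resilience4j: the same breaker mechanics, but a
// different policy decision when each breaker opens. Names and thresholds are illustrative.
public final class DataSourceBreakers {
    private final CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)
            .slidingWindowSize(20)
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .build());

    public final CircuitBreaker audit = registry.circuitBreaker("auditDb");
    public final CircuitBreaker operational = registry.circuitBreaker("operationalDb");
    public final CircuitBreaker analytics = registry.circuitBreaker("analyticsDb");

    /** Audit writes: no fallback. If the breaker is open, the failure propagates. */
    public void writeAudit(final Runnable write) {
        audit.executeRunnable(write);     // CallNotPermittedException surfaces to the caller
    }

    /** Operational reads: degrade to a caller-supplied default instead of failing. */
    public <T> T readOperational(final Supplier<T> query, final T fallback) {
        try {
            return operational.executeSupplier(query);
        } catch (final CallNotPermittedException open) {
            return fallback;
        }
    }

    /** Analytics reads: an open breaker just means "no history right now". */
    public <T> List<T> readAnalytics(final Supplier<List<T>> query) {
        try {
            return analytics.executeSupplier(query);
        } catch (final CallNotPermittedException open) {
            return List.of();
        }
    }
}
```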

We wired each circuit breaker’s state into the Kubernetes health checks. Audit breaker open → pod not ready (pull it from traffic). Analytics breaker open → pod stays ready but reports degraded health (keep serving, just without history). The orchestrator makes routing decisions based on what the application knows about its own health. That’s the integration point most teams miss.
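
Roughly, the mapping looks like this. How the status is exposed over HTTP depends on your framework, which I’ve left out; the breaker fields refer to the sketch above.

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

// Framework-neutral sketch of the health mapping: the readiness probe fails only
// when the audit breaker is open; an open analytics breaker is reported but keeps
// the pod in service. DataSourceBreakers is the illustrative type from the sketch above.
public final class ClusterHealth {
    public enum Status { READY, DEGRADED, NOT_READY }

    private final DataSourceBreakers breakers;

    public ClusterHealth(final DataSourceBreakers breakers) {
        this.breakers = breakers;
    }

    /** Wired to the Kubernetes readiness probe: NOT_READY pulls the pod from traffic. */
    public Status readiness() {
        if (breakers.audit.getState() == CircuitBreaker.State.OPEN) {
            return Status.NOT_READY;   // an audit gap is a compliance violation, stop taking traffic
        }
        if (breakers.analytics.getState() == CircuitBreaker.State.OPEN) {
            return Status.DEGRADED;    // keep serving, just without history
        }
        return Status.READY;
    }
}
```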

Observability that earns its place

Gil Tene has a talk — “How NOT to Measure Latency” — that changed how I think about observability. His central argument: if your measurement tool introduces more latency than the thing you’re measuring, you’re not observing your system — you’re observing your observer.

His HDR Histogram library records values in 3–6 nanoseconds with zero allocation on the recording path. That’s the bar. In a system where the critical path is measured in microseconds, you can’t reach for a general-purpose telemetry SDK that allocates objects, buffers spans, and batches exports on the hot path.
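
For comparison, this is what recording with HdrHistogram looks like: the hot path is an array increment, and the full distribution (not just an average) is available afterwards. A small example of the library, not our instrumentation.

```java
import java.util.concurrent.TimeUnit;
import org.HdrHistogram.Histogram;

// Measurement that stays out of the way: HdrHistogram recording is an array
// increment, and percentiles are read out after the fact.
public final class LatencyRecorder {
    // Track up to 1 second at 3 significant digits; recording stays allocation-free.
    private final Histogram histogram = new Histogram(TimeUnit.SECONDS.toNanos(1), 3);

    public void record(final long startNanos) {
        histogram.recordValue(System.nanoTime() - startNanos);
    }

    public void report() {
        System.out.printf("p50=%dns p99=%dns p99.99=%dns max=%dns%n",
            histogram.getValueAtPercentile(50.0),
            histogram.getValueAtPercentile(99.0),
            histogram.getValueAtPercentile(99.99),
            histogram.getMaxValue());
    }
}
```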

We built custom tracing — W3C Trace Context compatible, so it integrates with standard tooling downstream, but with a fixed-size binary encoding and pre-allocated attribute arrays. Span creation costs roughly 50 nanoseconds. No allocations. No garbage collection pressure. Thompson popularised the term “Mechanical Sympathy” for this kind of thinking — a phrase borrowed from racing driver Jackie Stewart — and it applies here directly: understand the hardware well enough to write software that doesn’t fight it.
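
To give a feel for the approach (a hypothetical illustration, not our tracer): spans live in arrays sized at startup, so starting and ending one writes a few longs and allocates nothing.

```java
// Hypothetical sketch of the allocation-free span idea: spans live in pre-allocated
// arrays indexed by slot, so the hot path writes longs instead of constructing objects.
public final class SpanRing {
    private static final int CAPACITY = 1 << 16;           // fixed at startup, power of two
    private final long[] traceIdHigh = new long[CAPACITY];
    private final long[] traceIdLow = new long[CAPACITY];
    private final long[] startNanos = new long[CAPACITY];
    private final long[] endNanos = new long[CAPACITY];
    private int next;

    /** Returns a slot index; no objects are created on the hot path. */
    public int startSpan(final long idHigh, final long idLow) {
        final int slot = next++ & (CAPACITY - 1);
        traceIdHigh[slot] = idHigh;                        // a W3C trace-id is 128 bits: two longs
        traceIdLow[slot] = idLow;
        startNanos[slot] = System.nanoTime();
        endNanos[slot] = 0L;
        return slot;
    }

    public void endSpan(final int slot) {
        endNanos[slot] = System.nanoTime();
    }
}
```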

The trade-off is real: you maintain your own tracing implementation instead of adopting OpenTelemetry wholesale. That’s engineering time, documentation, onboarding complexity. Every new hire asks “why don’t we just use the standard SDK?” and you have to explain — again — that the standard SDK allocates on every span, and in your system that’s not a rounding error, it’s a defect. But in a world where adding 100 microseconds of latency to the critical path is unacceptable, the alternative — reaching for a convenient SDK and hoping the overhead doesn’t matter — is the more expensive choice. You just don’t get the invoice until production.

James Hamilton, who designed much of Amazon’s early infrastructure, describes the goal as building services that “detect and recover from all but the most obscure failures without administrative intervention.” That’s the aspiration. But in regulated finance, I’d add a caveat: detect all failures, recover from most automatically, and for the ones that touch money or compliance — stop, alert, and let a human decide.

The system that fails loudly and recovers deliberately is more trustworthy than the one that silently patches itself. Your regulators will agree. More importantly, so will you — at 2 a.m., when something goes wrong and you need to explain exactly what happened.

Resilience isn’t the absence of failure. It’s the presence of a plan.