Observability in the Hot Path: Logging Everything Without Slowing Anything Down

Your system processes orders in microseconds. Your logging framework adds milliseconds. That’s not observability — that’s a performance regression you chose to keep because you were afraid of flying blind.

The mistake most teams make is treating all observability the same. Every log line gets the same async appender. Every metric gets the same scrape interval. Every trace gets the same sampling rate. On a system where latency is measured in microseconds, that uniformity is the problem. The observability itself needs to be architected — tiered by latency budget, not applied uniformly. The principle applies whether you’re running a trading platform or an API that promises sub-100ms responses: not everything deserves the same level of observation.

Tier the observation

We built a trading platform where the hot path — order entry through execution — had a latency budget measured in microseconds. Single-digit milliseconds at worst. The observability couldn’t consume any of it. The solution wasn’t less observability. It was different observability at each layer. And Stefan did a great job with this one.

The hot path gets counters, not logs. On the critical execution path, the only observability is atomic counter increments written to a lock-free ring buffer — the same pattern we use for resilient system design. No string formatting. No serialisation. No memory allocation. One compare-and-swap operation on an uncontended cache line, and the application thread continues. A background thread batches and exports to Prometheus. The hot path never slows down for I/O.

The warm path gets structured logs. Risk checks, health monitoring, Aeron cluster consensus: these have latency budgets in milliseconds, not microseconds. Here we use async structured logging: JSON events written to a lock-free queue, consumed by a background thread that handles disk and network I/O. The key decision: the log pipeline sheds load instead of applying back-pressure. The queue is bounded, and if it fills, events are dropped rather than blocking the warm path. Losing a log line is cheaper than adding latency to a risk check.
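The drop-when-full policy can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: the class name DroppingLogger, its capacity default, and the sink callable are all hypothetical, and Python's queue.Queue uses locks internally, so the sketch shows the shedding policy rather than a lock-free queue.

```python
import json
import queue
import threading


class DroppingLogger:
    """Async structured logger that drops events instead of blocking the caller.

    Hypothetical sketch: queue.Queue is lock-based, so this shows the
    drop policy only, not a lock-free queue.
    """

    def __init__(self, sink, capacity=1024):
        self._queue = queue.Queue(maxsize=capacity)  # bounded on purpose
        self._sink = sink      # callable that receives one JSON string
        self.dropped = 0       # approximate if several threads log at once
        self._worker = threading.Thread(target=self._drain, daemon=True)

    def start(self):
        """Start the background consumer that does the slow I/O."""
        self._worker.start()

    def log(self, event, **fields):
        """Enqueue without blocking. Returns False if the event was dropped."""
        try:
            self._queue.put_nowait({"event": event, **fields})
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        while True:
            record = self._queue.get()
            if record is None:          # shutdown sentinel
                break
            self._sink(json.dumps(record))

    def close(self):
        """Flush what is queued, then stop. Call only after start()."""
        self._queue.put(None)           # may block briefly; fine at shutdown
        self._worker.join()
```

Exporting the dropped count as a metric is a natural companion to this policy, since it turns a silent loss into a visible one.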

The cold path gets full traces. Post-trade compliance, audit trails, and analytics have no latency constraint beyond “before the auditor asks.” These get complete capture — every trade, every state change, every decision, written to a durable log. Batch-processed, stored for years. It’s best practice for regulatory confidence, and in many jurisdictions the expectation is moving toward it. The cold path can afford the overhead that the hot path cannot.

Observe the business, not just the system

The second mistake is observing only infrastructure. CPU usage, memory pressure, pod health — these tell you the system is running. They don’t tell you the system is correct.

The most important alert we built fires with zero delay on risk limit breaches. Not a 30-second evaluation window. Not a 1-minute “for” clause on the alert rule. Instant. A risk breach is a compliance event, the infrastructure equivalent of a fire alarm. It doesn’t wait to see if the fire is “sustained.”

Order processing latency gets a p95 threshold at 1 millisecond (warning) and 5 milliseconds (critical). Cluster leader elections trigger an immediate alert — not because they’re failures, but because they indicate instability that might precede one. Order throughput dropping below baseline triggers within seconds — because in a trading system, a throughput drop isn’t a capacity problem. It’s a revenue problem.
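As a rough sketch of the latency rule, the function below computes a nearest-rank p95 over a window of samples and maps it to a severity. The 1 ms and 5 ms thresholds come from the text. The function names, the microsecond unit, and the choice to treat a value exactly at the threshold as a breach are assumptions made for illustration.

```python
WARNING_US = 1_000    # 1 millisecond, from the text
CRITICAL_US = 5_000   # 5 milliseconds, from the text


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (len(ordered) * pct + 99) // 100   # integer ceiling of n * pct / 100
    return ordered[max(rank, 1) - 1]


def latency_severity(samples_us):
    """Classify a window of order latencies (microseconds) by its p95."""
    p95 = percentile(samples_us, 95)
    if p95 >= CRITICAL_US:    # assumption: at-threshold counts as a breach
        return "critical"
    if p95 >= WARNING_US:
        return "warning"
    return "ok"
```

With 100 samples, 94 at 200 µs and 6 at 2,000 µs, the p95 lands on a 2,000 µs sample and the window is classified as a warning. With only 4 slow samples out of 100, the p95 stays at 200 µs and nothing fires.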

These are business metrics expressed as infrastructure. The dashboard doesn’t just tell you the system is up. It tells you the system is doing its job. If your observability can answer “is the infrastructure healthy?” but not “is the product working?”, you’re monitoring the engine without watching the road. For a trading platform, that’s order throughput and risk compliance. For an e-commerce system, it’s checkout completion rate and payment success. The metric changes. The principle doesn’t.

The failure mode that proves this: a cluster node failover works exactly as designed — traffic reroutes, no downtime, infrastructure dashboard stays green — but the failover triggers a brief window where incoming data is silently dropped. The system is healthy. The data pipeline is broken. If your observability only covers infrastructure, nobody notices until someone queries the output and finds a gap. The dashboard that leads with business metrics — throughput, fill rates, data completeness — catches this in seconds. The one that leads with CPU and memory never catches it at all.

Sample what’s routine, keep what’s interesting

You can’t trace every request on a high-throughput system — the trace backend would collapse under millions of events per second. The question is what you keep and what you discard.

Head-based sampling — deciding at the start of a request whether to trace it — is cheap but blind. It discards anomalies at the same rate as routine requests. Tail-based sampling — deciding after the request completes — catches the interesting cases but requires buffering the entire trace while the decision is made.

We use a hybrid. Head-based sampling at a low base rate for routine traffic. Tail-based sampling at 100% for anything that errors, breaches a latency threshold, or triggers a risk event. The trace backend is dominated by the cases you’ll actually reach for during an incident — not a uniform sample of routine traffic that tells you nothing you didn’t already know.
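The keep-or-discard decision might look like the sketch below. It assumes the trace has already completed and been buffered, so the tail-based checks can see its outcome. The field names (trace_id, error, risk_event, latency_us), the 1% base rate, and the 1,000 µs threshold are hypothetical.

```python
import zlib


def head_sampled(trace_id, base_rate):
    """Deterministic head-based decision: hash the trace ID into one of
    10,000 buckets so every service agrees on the same traces."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < int(base_rate * 10_000)


def keep_trace(trace, base_rate=0.01, latency_threshold_us=1_000):
    """Hybrid sampling: always keep interesting traces, sample the rest."""
    interesting = (
        trace["error"]
        or trace["risk_event"]
        or trace["latency_us"] > latency_threshold_us
    )
    if interesting:
        return True                      # tail-based: kept at 100%
    return head_sampled(trace["trace_id"], base_rate)
```

Hashing the trace ID rather than drawing a random number keeps the routine sample consistent across services, so a sampled trace is complete rather than missing spans.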

Emit wide, query narrow

The last piece was cardinality — and this is where Charity Majors changed how we think about it.

The traditional metrics model asks you to decide upfront which dimensions matter. You tag order latency with service, exchange, and instrument, and those three labels multiply into thousands of time series. Add trader ID and the cardinality explodes. The Prometheus documentation warns against labels with unbounded sets of values, such as user IDs, for exactly this reason. So you pre-aggregate, drop dimensions, and lose the context you’ll need during the next incident.
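The multiplication is easy to make concrete. The label counts below are invented for illustration, not taken from the platform, and the product is a worst case that assumes every combination of label values actually occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical label counts, for illustration only.
labels = {"service": 5, "exchange": 10, "instrument": 200}

base = series_count(labels)                                # 5 * 10 * 200
with_trader = series_count({**labels, "trader_id": 500})   # base * 500

print(base)          # 10000
print(with_trader)   # 5000000
```

One extra label with 500 values turns ten thousand series into five million, and that is for a single metric.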

The alternative, and the model we eventually moved to, is structured events. One wide event per request, carrying every field you might query: order ID, trader ID, instrument, exchange, latency, risk result, fill rate, venue selection. No pre-aggregation. No label cardinality problem. The backend handles aggregation at query time. You emit everything and query what matters.

The trade-off is cost: wide events at high volume require a backend designed for them. Honeycomb is built on this model. Traditional time-series databases are not. The operational shift is real. But the debugging advantage is transformative: you can ask “show me all orders for this trader on this exchange in the last hour with latency above 10ms” and get an answer. With pre-aggregated metrics, that query is impossible because you dropped the dimensions at collection time.
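To show why keeping the dimensions matters, here is that query expressed over a list of wide events held in memory. A real backend would run it over a columnar store. The field names and the function slow_orders are hypothetical stand-ins for whatever the events actually carry.

```python
from datetime import datetime, timedelta, timezone


def slow_orders(events, trader_id, exchange, now,
                window=timedelta(hours=1), min_latency_ms=10.0):
    """Orders for one trader on one exchange, within the window before
    `now`, whose latency exceeded the threshold."""
    cutoff = now - window
    return [
        e for e in events
        if e["trader_id"] == trader_id
        and e["exchange"] == exchange
        and e["timestamp"] >= cutoff
        and e["latency_ms"] > min_latency_ms
    ]


# One wide event per request; every field stays queryable (hypothetical names).
example_event = {
    "order_id": "o-1001",
    "trader_id": "tr-42",
    "instrument": "ABC",
    "exchange": "XLON",
    "latency_ms": 14.2,
    "risk_result": "pass",
    "timestamp": datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
}
```

Nothing in the filter had to be anticipated when the event was emitted. Any field on the event can become a filter or a grouping key later.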

The goal isn’t zero-overhead observability. That doesn’t exist. The goal is observation that’s proportional to what it’s watching — nanoseconds on the hot path, milliseconds on the warm path, and whatever it takes on the cold path where compliance is the constraint. The teams that get this right don’t observe less. They observe differently — and the first dashboard they open shows whether the product is working, not whether the servers are up.