Operating a Live Trading System: What Breaks and What Saves You
Operational failures kill more personal trading systems than bad strategies. What actually goes wrong at 3 a.m. — and the defences that earn their place.
Every hour you spend on trading logic should be matched by two on what breaks at 3 a.m. That ratio sounds wrong until you’ve operated a live system. Then it sounds optimistic.
Operational failures kill more personal trading systems than bad strategies. The strategy can be sound, the backtest profitable, the risk model conservative — and none of it matters when the WebSocket disconnects, the exchange enters maintenance, and your system doesn’t notice for forty-five minutes.
The architecture that prevents those failures is a separate decision. This post is about what happens in operation — what actually breaks, and what saves you.
Orphan legs
You’re running a hedged strategy. Two positions — one long, one short — that cancel each other’s market exposure. The system opens the first leg. The second leg fails. A timeout, a rejected order, a momentary API glitch. You’re now holding a naked directional position in a volatile market.
This is the failure mode that keeps hedged-strategy operators awake. The defence is a strict timeout: if the second leg doesn’t fill within a defined window, the system market-closes the filled leg, halts the strategy, and alerts. No waiting. No retrying. Close the filled exposure first, investigate second.
The instinct is to retry the second leg. Resist it. A retry that succeeds thirty seconds later means you spent thirty seconds unhedged. In crypto, thirty seconds is enough for a 2% move against you. Close first, then figure out what went wrong.
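In code, the policy is small enough to read in one breath. A minimal sketch, assuming a duck-typed `exchange` client; `place_order`, `is_filled`, `cancel_order`, `market_close`, and `halt_and_alert` are hypothetical names standing in for your own execution layer, not Meridian's actual API:

```python
import time

LEG_TIMEOUT_SECONDS = 5  # illustrative; tune to your venue's fill latency

def open_hedged_pair(exchange, long_order, short_order):
    """Open both legs; if the second doesn't fill inside the window,
    close the filled leg at market and halt. No retries."""
    first = exchange.place_order(long_order)    # narrative assumes this fills
    second = exchange.place_order(short_order)

    deadline = time.monotonic() + LEG_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        if exchange.is_filled(second):
            return first, second                # fully hedged
        time.sleep(0.1)

    # Timeout: we are holding naked exposure. Close first, investigate second.
    exchange.cancel_order(second)
    exchange.market_close(first)
    halt_and_alert(f"orphan leg: second leg unfilled after {LEG_TIMEOUT_SECONDS}s")

def halt_and_alert(message: str) -> None:
    # Placeholder: the real version pages you and stops the strategy loop.
    raise RuntimeError(f"HALT: {message}")
```

The property worth noticing is the absence of a retry branch. The only exits are a confirmed hedge or a flat book.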
WebSocket disconnects
Your connection to the exchange will drop. Not if — when. The question is what your system does in the gap.
The cascade should be proportional to the duration. A brief disconnect — seconds — is noise. Cancel open orders as a precaution, but don’t panic. A sustained disconnect — a minute or more — is a different situation entirely. If you can’t see the market and your REST fallback is also degraded, you can’t manage your positions with any degree of accuracy. The system should enter emergency mode: close everything, halt, and alert. If REST remains available, you have a degraded mode — slower, poll-based visibility — but not an emergency. The distinction matters: treating every WebSocket drop as a crisis is expensive in spread costs and fees.
The thresholds between “brief” and “sustained” are a design decision that depends on your strategy’s time horizon. A high-frequency system might treat 500 milliseconds as an emergency. A basis trading system holding positions for weeks has more tolerance. Duration isn’t the only factor. A brief disconnect with no open positions is an inconvenience. The same disconnect with hedged positions approaching a funding rate settlement window is an emergency regardless of how short it is. But every system needs the cascade defined before it goes live. Discovering your disconnect policy during a disconnect is too late.
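One way to make the cascade explicit is a small decision function. A sketch with illustrative thresholds and placeholder names; the real values belong to your strategy's time horizon:

```python
from enum import Enum

class Action(Enum):
    NOISE = "cancel open orders as a precaution, keep running"
    DEGRADED = "fall back to REST polling, trade conservatively"
    EMERGENCY = "close everything, halt, alert"

# Illustrative thresholds. A high-frequency system might treat 0.5s as
# sustained; a basis trade holding for weeks tolerates far more.
BRIEF_SECONDS = 5
SUSTAINED_SECONDS = 60

def disconnect_action(outage_seconds: float, rest_available: bool,
                      has_positions: bool, near_settlement: bool) -> Action:
    # Context escalates even a short gap: hedged positions near a funding
    # settlement window make any blindness an emergency.
    if has_positions and near_settlement:
        return Action.EMERGENCY
    if outage_seconds < BRIEF_SECONDS:
        return Action.NOISE
    if rest_available:
        return Action.DEGRADED       # slower, poll-based visibility
    if outage_seconds < SUSTAINED_SECONDS:
        return Action.DEGRADED       # short enough to ride out
    return Action.EMERGENCY          # blind for a minute or more
```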
Position reconciliation
Your system maintains local state — what it thinks your positions are. The exchange maintains its own state — what your positions actually are. These two will diverge.
Order fills that arrive out of sequence. Partial fills your system didn’t expect. A position closed by the exchange’s own risk engine without your system knowing. The drift is usually small and temporary. Occasionally it isn’t.
Reconcile frequently. Compare your local state against the exchange’s reported positions at a fixed interval — every few minutes, not every few hours. Increase the frequency around funding rate settlement windows — a mismatch minutes before settlement has a different urgency profile than one hours away. When they don’t match beyond a tolerance, halt and alert. The temptation is to auto-correct — update local state to match the exchange. Don’t. A mismatch means you don’t know which source is correct — the exchange could have a reporting lag, or your close order may not have processed. Silently accepting either interpretation is dangerous. Halt until a human has seen the exchange’s order history and confirmed ground truth.
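A minimal reconciliation loop along those lines. `fetch_positions` is an assumed stand-in for your exchange client, `halt_and_alert` is the same hypothetical hook as in the orphan-leg sketch, and the tolerance is venue-dependent:

```python
import time

POSITION_TOLERANCE = 1e-8    # asset units; venue- and instrument-dependent
NORMAL_INTERVAL_S = 180      # every few minutes, not every few hours
SETTLEMENT_INTERVAL_S = 30   # tighter near funding settlement windows

def reconcile_once(local_positions: dict, exchange) -> None:
    """Compare local state to the exchange's report; halt on mismatch."""
    remote = exchange.fetch_positions()  # assumed: {symbol: signed size}
    for sym in set(local_positions) | set(remote):
        drift = abs(local_positions.get(sym, 0.0) - remote.get(sym, 0.0))
        if drift > POSITION_TOLERANCE:
            # Do NOT auto-correct: we don't know which side is wrong.
            halt_and_alert(f"position mismatch on {sym}: "
                           f"local={local_positions.get(sym, 0.0)} "
                           f"remote={remote.get(sym, 0.0)}")

def reconcile_forever(local_positions: dict, exchange, near_settlement) -> None:
    while True:
        reconcile_once(local_positions, exchange)
        interval = SETTLEMENT_INTERVAL_S if near_settlement() else NORMAL_INTERVAL_S
        time.sleep(interval)
```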
Clock drift
Your system timestamps events. The exchange timestamps events. If your clocks disagree, your logs become unreliable — and unreliable logs during an incident are worse than no logs, because they’ll mislead your investigation.
Sync to NTP. Check drift periodically. If drift exceeds an acceptable threshold, halt and alert. Meridian isn’t a high-frequency system — both strategies tolerate delays — but even sub-second drift makes log correlation unreliable during an incident, and that’s when reliable logs matter most.
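A sketch of the periodic check using the third-party `ntplib` package; the threshold is illustrative, and `halt_and_alert` is the same hypothetical hook as above:

```python
import ntplib  # third-party: pip install ntplib

MAX_DRIFT_SECONDS = 0.25  # illustrative; even sub-second drift muddies logs

def check_clock_drift() -> float:
    """Return local clock offset vs NTP; halt if it exceeds the threshold."""
    response = ntplib.NTPClient().request("pool.ntp.org", version=3)
    drift = abs(response.offset)  # seconds the local clock differs from NTP
    if drift > MAX_DRIFT_SECONDS:
        halt_and_alert(f"clock drift {drift:.3f}s exceeds {MAX_DRIFT_SECONDS}s")
    return drift
```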
News sentiment
Markets react to events before your strategy’s signals update. A regulatory announcement, an exchange hack, a geopolitical event — these move prices faster than any technical indicator can register.
Meridian runs a news sentinel — an automated system that monitors news feeds, scores them for relevance and severity, and triggers alerts or emergency procedures for critical events. It’s not predicting the market. It’s detecting the kind of event where the correct response is to close everything and step away, regardless of what your strategy signals say.
Meridian’s news sentinel combines keyword matching on a curated watchlist with an embeddings layer that scores semantic similarity to known crisis patterns. The keyword layer is fast and catches obvious triggers. The embeddings layer catches events that don’t match any keyword but resemble past crises in structure — the kind of article that preceded FTX or Terra. Neither layer is perfect. The events that slip past both are why the other five layers exist. A false positive costs you a paused system and a few minutes of investigation. A missed critical event costs you money.
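A deliberately simplified sketch of the two layers. `embed` stands in for whatever embedding model you run, `crisis_vectors` for precomputed embeddings of past crisis coverage, and the watchlist and threshold are illustrative rather than Meridian's real ones:

```python
import math

WATCHLIST = {"hack", "exploit", "insolvency", "halts withdrawals"}
SIMILARITY_THRESHOLD = 0.85  # illustrative

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assess_headline(text, embed, crisis_vectors):
    """Two-layer check: fast keywords first, then semantic similarity."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in WATCHLIST):
        return "critical"  # keyword layer: fast, catches obvious triggers
    vector = embed(text)
    best = max((cosine(vector, c) for c in crisis_vectors), default=0.0)
    if best > SIMILARITY_THRESHOLD:
        return "critical"  # embeddings layer: resembles a past crisis
    return "routine"
```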
What saves you
Dual alerting
A single notification channel is a single point of failure. If your Slack integration fails during an incident, you hear nothing. Meridian fires alerts through both Slack and email simultaneously — independent channels, independent infrastructure. Both always fire. If one fails, the other still reaches you.
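A sketch of the pattern using a Slack incoming webhook and plain SMTP; the URL, address, and relay are placeholders. The shape matters more than the transport: each channel fires inside its own try block, so one failing cannot block the other:

```python
import json
import smtplib
import urllib.request
from email.message import EmailMessage

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # your webhook
ALERT_EMAIL = "you@example.com"                             # your address

def alert_slack(msg: str) -> None:
    data = json.dumps({"text": msg}).encode()
    req = urllib.request.Request(SLACK_WEBHOOK_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)

def alert_email(msg: str) -> None:
    email = EmailMessage()
    email["Subject"] = "TRADING ALERT"
    email["From"] = ALERT_EMAIL
    email["To"] = ALERT_EMAIL
    email.set_content(msg)
    with smtplib.SMTP("localhost") as smtp:  # or your SMTP relay
        smtp.send_message(email)

def alert(msg: str) -> None:
    """Fire both channels; a failure on one must not block the other."""
    for channel in (alert_slack, alert_email):
        try:
            channel(msg)
        except Exception:
            pass  # the other channel is the backup; log locally if you can
```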
Observability that earns its place
Metrics, logs, and dashboards aren’t optional for a live trading system. Prometheus for metrics, Grafana for dashboards, a log aggregator for everything else. The Meridian status page is the first stop during an incident; Grafana is the second; the logs are the final word, because they tell you what actually happened versus what you think happened.
The rule is simple: if it’s not logged, it didn’t happen. Every state transition, every order, every fill, every alert, every circuit breaker trip.
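One way to make that rule cheap to follow is a single structured-event helper that everything routes through. A sketch, assuming JSON-lines output into whatever aggregator you run; the logger name and example fields are illustrative:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)      # real system: a handler that
log = logging.getLogger("meridian.events")   # ships lines to the aggregator

def log_event(kind: str, **fields) -> None:
    """One JSON line per event. If it's not logged, it didn't happen."""
    record = {"ts": time.time(), "kind": kind, **fields}
    log.info(json.dumps(record, sort_keys=True))

# Every state transition, order, fill, alert, and breaker trip goes through it:
log_event("order_filled", symbol="BTC-PERP", side="sell", qty=0.1)
log_event("circuit_breaker", name="orphan_leg", action="halt")
```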
Monitor the monitoring. If your metrics pipeline goes down silently, the dashboard shows stale data — and stale data during an incident is worse than no data, because you trust it. Meridian runs meta-health checks on every component; if any part of the system goes down, alerts fire to both Slack and email independently of the metrics pipeline.
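A minimal version of that watchdog: components stamp a heartbeat on every loop iteration, and a checker alerts on staleness through the dual channels rather than through the pipeline it is watching. Names here are illustrative, not Meridian's internals:

```python
import time

HEARTBEAT_STALE_SECONDS = 60              # illustrative
heartbeats: dict[str, float] = {}         # component -> last beat (epoch s)

def beat(component: str) -> None:
    """Each component calls this on every loop iteration."""
    heartbeats[component] = time.time()

def watchdog_pass() -> None:
    """Runs on its own schedule; alerts via the dual channels,
    not via the metrics pipeline it is watching."""
    now = time.time()
    for component, last in heartbeats.items():
        if now - last > HEARTBEAT_STALE_SECONDS:
            # alert() is the dual-channel function from the earlier sketch
            alert(f"meta-health: {component} silent for {now - last:.0f}s")
```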
Testing your emergency procedure
This is the most important operational advice I can give. Before going live, trigger your emergency procedure in paper trading. Watch it actually close positions. Watch it actually secure funds. Watch it actually halt and alert. Most people never test this. They assume it works because the code looks right. Code that looks right and code that works at 3 a.m. during a market crash are different things.
Test the full cascade: orders cancelled, positions closed, funds secured, alerts fired on both channels, system halted. Then test it again after every code change that touches the emergency path. The first time I tested Meridian’s emergency procedure in paper trading, the alert fired but the position-close steps ran in the wrong order. That bug cost nothing to find in paper trading. In production, it would have cost money and sleep.
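The drill is easiest to keep honest as an automated test. A pytest-style sketch; `paper_exchange` and `system` are hypothetical fixtures wrapping your paper-trading environment, not a real harness:

```python
def test_emergency_cascade(paper_exchange, system):
    """The drill, as a test. Re-run after any change to the emergency path."""
    system.open_test_positions()
    system.trigger_emergency("drill")

    assert paper_exchange.open_orders() == []         # orders cancelled
    assert paper_exchange.positions() == {}           # positions closed
    assert paper_exchange.funds_secured()             # funds secured
    assert system.alerts_fired == {"slack", "email"}  # both channels fired
    assert system.halted                              # system halted
```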
The strategy is the interesting part. The operations are the important part. Build accordingly.