Every time I read this story, there's one question I've never understood: why couldn't they just shut down the servers themselves? There ought to be some mechanism to do that. I mean, $400 million is a lot of money not to just bash the server with a hammer. It seems like they realized the issue early on and were debugging for at least part of the 45 minutes. I know they might not have had physical access to the server, but wouldn't there have been any way to do a hard reboot?
Violently shutting down trading also exposes the bank to significant risk, e.g. massively leveraged trades on stocks that were meant to be held for 3 milliseconds suddenly hanging around until the system's back up.
In hindsight, this would still have been preferable to losing $400 million, but quite obviously nobody at the time realized just how catastrophic this was going to be.
With the benefit of hindsight, I think a company-wide realtime dashboard would be a high priority. I guess actual deployment procedures would be higher, though.
If the system was not counting its trades, would they show up in the dashboard?
A reconciliation would be necessary, and it would have to come from wherever the orders were being sent (the exchange). With millisecond delays, a realtime dashboard seems necessary mainly for this kind of case (not that it's a bad case to cover). End-of-day reconciliations are clearly needed, but I'd be interested if anyone knows of exchange requirements for intra-day trading reconciliations.
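To make that concrete (purely illustrative, all names made up, and not how any exchange actually delivers its fill data): the reconciliation amounts to comparing the positions implied by your own order log against the fills the exchange reports, and alerting on any break. A system that isn't counting its own trades shows up immediately as a divergence.

    # Illustrative sketch only: reconcile "what we think we traded" against
    # "what the exchange says we traded" and surface any breaks.
    from collections import Counter

    def net_position(fills):
        """Sum signed quantities per symbol: +qty for buys, -qty for sells."""
        pos = Counter()
        for symbol, side, qty in fills:
            pos[symbol] += qty if side == "buy" else -qty
        return pos

    def reconcile(internal_fills, exchange_fills):
        ours, theirs = net_position(internal_fills), net_position(exchange_fills)
        return {s: (ours[s], theirs[s])
                for s in set(ours) | set(theirs)
                if ours[s] != theirs[s]}  # empty dict means the books agree

    # If the system isn't counting its trades, internal_fills is missing rows
    # and every affected symbol shows up here as a break.
    print(reconcile(
        internal_fills=[("XYZ", "buy", 100)],
        exchange_fills=[("XYZ", "buy", 100), ("XYZ", "buy", 5000)],
    ))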
Given the types of transactions Knight was involved in, it's unlikely they had physical access as the servers would be in lower Manhattan to keep latency down. Couple that with the lack of any established procedures to kill their systems, and it's one hell of a nightmare. If the idea of just pulling the plug even came to them, I can't imagine how well that phone call would be received by the datacenter techs even if they believed it wasn't a hoax. And that's assuming that pulling the plug wouldn't cause other problems.
But really, the 45 minutes probably flew by faster than you or I could really imagine. You're in a crisis situation, you tell yourself you just need another five minutes to fix something. Five becomes ten, becomes twenty, and before you know it, your company is looking at a $400M nightmare.
Long ago in a previous life, I worked in a factory that made PVC products, including plastic house siding. One of my co-workers got his arm caught in a pinch roller while trying to start a siding line by himself. There was a kill switch on the pinch roller - six feet away and to his left, when his left arm was the one that was caught. Broke every bone in his arm, right up to his collarbone.
He screamed for help, but no one could hear him over the other noisy machinery. Welcome to the land of kill switches.
It feels more and more like the only responsible way to engineer systems is with a built-in, always-on-in-production chaos monkey that is constantly killing various parts of them. Normally this is done to ensure that random component failure results in no visible service interruption, but in this situation you'd also be able to reuse the same "apoptosis" signal the chaos monkey sends to just kill everything at once.
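A minimal sketch of what I mean, with entirely made-up names: every worker listens on the same "apoptosis" signal the chaos monkey uses, so the routine random kill and the emergency kill-everything case exercise exactly the same code path.

    # Hypothetical sketch: workers share one kill signal with the chaos monkey.
    import random
    import threading

    class Worker:
        def __init__(self, name):
            self.name = name
            self.stop_event = threading.Event()

        def run(self):
            while not self.stop_event.is_set():
                # ... normal work would go here ...
                self.stop_event.wait(0.1)
            print(f"{self.name}: apoptosis signal received, shutting down cleanly")

        def kill(self):
            self.stop_event.set()

    workers = [Worker(f"worker-{i}") for i in range(5)]
    threads = [threading.Thread(target=w.run) for w in workers]
    for t in threads:
        t.start()

    # Chaos monkey: routinely kill one worker at random to prove the system copes.
    random.choice(workers).kill()

    # The emergency "kill everything now" path reuses the very same signal.
    for w in workers:
        w.kill()
    for t in threads:
        t.join()

In a real deployment the signal would be delivered out-of-band (a pub/sub topic, a flag the process polls, whatever), but the point is that the emergency path gets exercised every day instead of being discovered broken during a crisis.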
Crash-only is nice and all but you can't crash the other side of a socket...
Like you couldn't crash a steel mill controller and expect the process equipment to be magically free of solidified metal. It only means the servers will come back up with a consistent state.
"Crash-only engineering" is a method of systems engineering, not device engineering; it only works if you get to design both sides of the socket.
In the case of a system that needs hard-realtime input once it gets going (like milling equipment), the "crash-only" suggestion would be for it to have a watchdog timer to detect disconnections, and automatically switch from a "do what the socket says" state to a "safe auto-clean and shutdown" state.
In other words, crash-only systems act in concert to push the consequences of failure away from the site of the failure (the server) and back to whoever requested the invalid operation be done (the client.) If the milling controller crashes, the result would be a mess of waste metal ejected from the temporarily-locked-up-and-ignoring-commands process equipment. The equipment would be fine; the output product (and the work area, and maybe the operators if they hadn't been trained for the failure case) would not be.
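Roughly like this (all names hypothetical, just a sketch of the watchdog idea): the controller follows commands as long as they keep arriving, and the watchdog flips it into the safe auto-clean-and-shutdown state the moment the link goes quiet.

    # Sketch of the watchdog fallback: follow commands while they arrive,
    # switch to safe shutdown the moment the link goes silent.
    import queue

    WATCHDOG_TIMEOUT = 0.5  # seconds of silence before we assume the link is dead

    def safe_shutdown():
        # Eject whatever is in the machine, park the tooling, cut power.
        print("purging work in progress, parking equipment, powering down")

    def controller_loop(commands):
        while True:
            try:
                cmd = commands.get(timeout=WATCHDOG_TIMEOUT)
            except queue.Empty:
                # Watchdog fired: stop listening to the socket, clean up, shut down.
                print("link lost: entering safe auto-clean and shutdown")
                return safe_shutdown()
            if cmd == "stop":
                return safe_shutdown()
            print(f"executing command: {cmd}")

    # Tiny usage example: one command arrives, then the link goes silent.
    q = queue.Queue()
    q.put("advance feed")
    controller_loop(q)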
The story also points out that there were no emergency procedures. While not as instantaneous as a kill switch, known good procedures could have significantly reduced the final effect.
As a designated market maker, Knight probably had regulatory requirements forcing them to be in the market. Granted, if they'd known the full extent of the damage, they almost certainly would've pulled the plug. But I'm guessing that was a factor.