Every time I read this story, there's one question I've never understood: why couldn't they just shut down the servers themselves? There ought to be some mechanism to do that. I mean, $400 million is a lot of money not to just bash the server with a hammer. It seems like they realized the issue early on and were debugging for at least part of the 45 minutes. I know they might not have had physical access to the server, but wouldn't there have been any way to do a hard reboot?
Violently shutting down trading also exposes the bank to significant risk, e.g. massively leveraged trades on stocks that were meant to be held for 3 milliseconds suddenly hanging around until the system's back up.
In hindsight, this would still have been preferable to losing $400 million, but quite obviously nobody at the time realized just how catastrophic this was going to be.
With the benefit of hindsight, I think a company-wide realtime dashboard would be a high priority. I guess actual deployment procedures would be higher, though.
If the system was not counting its trades, would they show up in the dashboard?
A reconciliation would be necessary, and it would have to come from wherever the orders were being sent (the exchange). With millisecond delays, a realtime dashboard seems necessary mainly for this kind of case (not that it's a bad case to cover). End-of-day reconciliations are clearly needed, but I'd be interested if anyone knows of exchange requirements for intra-day trading reconciliations.
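To make that concrete (purely illustrative, all names made up, and not how any exchange actually delivers its fill data): the reconciliation amounts to comparing the positions implied by your own order log against the fills the exchange reports, and alerting on any break. A system that isn't counting its own trades shows up immediately as a divergence.

    # Illustrative sketch only: reconcile "what we think we traded" against
    # "what the exchange says we traded" and surface any breaks.
    from collections import Counter

    def net_position(fills):
        """Sum signed quantities per symbol: +qty for buys, -qty for sells."""
        pos = Counter()
        for symbol, side, qty in fills:
            pos[symbol] += qty if side == "buy" else -qty
        return pos

    def reconcile(internal_fills, exchange_fills):
        ours, theirs = net_position(internal_fills), net_position(exchange_fills)
        return {s: (ours[s], theirs[s])
                for s in set(ours) | set(theirs)
                if ours[s] != theirs[s]}  # empty dict means the books agree

    # If the system isn't counting its trades, internal_fills is missing rows
    # and every affected symbol shows up here as a break.
    print(reconcile(
        internal_fills=[("XYZ", "buy", 100)],
        exchange_fills=[("XYZ", "buy", 100), ("XYZ", "buy", 5000)],
    ))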
Given the types of transactions Knight was involved in, it's unlikely they had physical access as the servers would be in lower Manhattan to keep latency down. Couple that with the lack of any established procedures to kill their systems, and it's one hell of a nightmare. If the idea of just pulling the plug even came to them, I can't imagine how well that phone call would be received by the datacenter techs even if they believed it wasn't a hoax. And that's assuming that pulling the plug wouldn't cause other problems.
But really, the 45 minutes probably flew by faster than you or I could really imagine. You're in a crisis situation, you tell yourself you just need another five minutes to fix something. Five becomes ten, becomes twenty, and before you know it, your company is looking at a $400M nightmare.
Long ago in a previous life, I worked in a factory that made PVC products, including plastic house siding. One of my co-workers got his arm caught in a pinch roller while trying to start a siding line by himself. There was a kill switch on the pinch roller - six feet away and to his left, when his left arm was the one that was caught. Broke every bone in his arm, right up to his collarbone.
He screamed for help, but no one could hear him over the other noisy machinery. Welcome to the land of kill switches.
It feels more and more like the only responsible way to engineer systems is with a built-in, always-on-in-production chaos monkey that is constantly killing various parts of them. Normally this is done to ensure that random component failure results in no visible service interruption, but in this situation you'd also be able to reuse the same "apoptosis" signal the chaos monkey sends to just kill everything at once.
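A minimal sketch of what I mean, with entirely made-up names: every worker listens on the same "apoptosis" signal the chaos monkey uses, so the routine random kill and the emergency kill-everything case exercise exactly the same code path.

    # Hypothetical sketch: workers share one kill signal with the chaos monkey.
    import random
    import threading

    class Worker:
        def __init__(self, name):
            self.name = name
            self.stop_event = threading.Event()

        def run(self):
            while not self.stop_event.is_set():
                # ... normal work would go here ...
                self.stop_event.wait(0.1)
            print(f"{self.name}: apoptosis signal received, shutting down cleanly")

        def kill(self):
            self.stop_event.set()

    workers = [Worker(f"worker-{i}") for i in range(5)]
    threads = [threading.Thread(target=w.run) for w in workers]
    for t in threads:
        t.start()

    # Chaos monkey: routinely kill one worker at random to prove the system copes.
    random.choice(workers).kill()

    # The emergency "kill everything now" path reuses the very same signal.
    for w in workers:
        w.kill()
    for t in threads:
        t.join()

In a real deployment the signal would be delivered out-of-band (a pub/sub topic, a flag the process polls, whatever), but the point is that the emergency path gets exercised every day instead of being discovered broken during a crisis.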
Crash-only is nice and all but you can't crash the other side of a socket...
Like you couldn't crash a steel mill controller and expect the process equipment to be magically free of solidified metal. It only means the servers will come back up with a consistent state.
"Crash-only engineering" is a method of systems engineering, not device engineering; it only works if you get to design both sides of the socket.
In the case of a system that needs hard-realtime input once it gets going (like milling equipment), the "crash-only" suggestion would be for it to have a watchdog timer to detect disconnections, and automatically switch from a "do what the socket says" state to a "safe auto-clean and shutdown" state.
In other words, crash-only systems act in concert to push the consequences of failure away from the site of the failure (the server) and back to whoever requested the invalid operation be done (the client.) If the milling controller crashes, the result would be a mess of waste metal ejected from the temporarily-locked-up-and-ignoring-commands process equipment. The equipment would be fine; the output product (and the work area, and maybe the operators if they hadn't been trained for the failure case) would not be.
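Roughly like this (all names hypothetical, just a sketch of the watchdog idea): the controller follows commands as long as they keep arriving, and the watchdog flips it into the safe auto-clean-and-shutdown state the moment the link goes quiet.

    # Sketch of the watchdog fallback: follow commands while they arrive,
    # switch to safe shutdown the moment the link goes silent.
    import queue

    WATCHDOG_TIMEOUT = 0.5  # seconds of silence before we assume the link is dead

    def safe_shutdown():
        # Eject whatever is in the machine, park the tooling, cut power.
        print("purging work in progress, parking equipment, powering down")

    def controller_loop(commands):
        while True:
            try:
                cmd = commands.get(timeout=WATCHDOG_TIMEOUT)
            except queue.Empty:
                # Watchdog fired: stop listening to the socket, clean up, shut down.
                print("link lost: entering safe auto-clean and shutdown")
                return safe_shutdown()
            if cmd == "stop":
                return safe_shutdown()
            print(f"executing command: {cmd}")

    # Tiny usage example: one command arrives, then the link goes silent.
    q = queue.Queue()
    q.put("advance feed")
    controller_loop(q)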
The story also points out that there were no emergency procedures. While not as instantaneous as a kill switch, known good procedures could have significantly reduced the final effect.
As a designated market maker, Knight probably had regulatory requirements forcing them to be in the market. Granted, if they'd known the full extent of the damage, they almost certainly would've pulled the plug. But I'm guessing that was a factor.