
As usual in catastrophic failures, a series of bad decisions had to occur:

- They had dead code in the system

- They repurposed a flag from a previous functionality

- They (apparently) didn't have code reviews

- They didn't have a staging environment

- They didn't have a tested deployment process

- They didn't have a contingency plan to revert the deploy

The failure could have been minimized or avoided altogether by fixing just one of these points. Incredible.



> They (apparently) didn't have code reviews

I don't get that. There was no code issue. The old and new code both worked as intended; it was a deployment and deployment-verification problem.

> They didn't have a staging environment

Yes they did. They staged the new code and tested it. They did a slow deployment also.

> They didn't have a contingency plan to revert the deploy

They did revert the deploy within the 45 minutes. It made it worse.

I think you need to re-read the article. Your assessment is strange given the event.


> I don't get that. There was no code issue. The old and new code both worked as intended; it was a deployment and deployment-verification problem.

A code review could have raised the issue of what a repurposed flag would do if they had to revert the deploy. Changing the semantics of a flag is a big no-no anyway, and there are ways to guard against that.
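One possible guard, as a rough sketch: keep a retired flag's bit reserved forever, give the new feature a fresh bit, and have every receiver reject messages that set a retired or unknown bit instead of guessing what it means. The flag names and bit layout below are hypothetical, not taken from the system in the article.

```python
# Hypothetical flag layout, for illustration only.
FLAG_URGENT = 0x01
FLAG_LEGACY_MODE = 0x02   # retired: the bit stays reserved and is never reassigned
FLAG_NEW_ROUTING = 0x04   # the new feature gets a fresh bit

ACTIVE_FLAGS = FLAG_URGENT | FLAG_NEW_ROUTING
RETIRED_FLAGS = FLAG_LEGACY_MODE


def validate_flags(raw: int) -> int:
    """Return raw unchanged if it only uses active bits, else raise ValueError."""
    retired = raw & RETIRED_FLAGS
    if retired:
        raise ValueError(f"retired flag bits set: {retired:#x}")
    unknown = raw & ~(ACTIVE_FLAGS | RETIRED_FLAGS)
    if unknown:
        raise ValueError(f"unknown flag bits set: {unknown:#x}")
    return raw
```

With a check like this, a server still running old code and a server running new code can never silently disagree about one bit: the old code never sees the new bit as something it understands, and the new code refuses the old one.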

> Yes they did. They staged the new code and tested it. They did a slow deployment also.

But they apparently didn't have a staging environment that matched their live environment. You want a staging environment that is 1:1.

> They did revert the deploy within the 45 minutes. It made it worse.

If you think reverting a deploy by simply pushing an older version is the same as a contingency plan, think again.


Code review could have been another set of eyes to predict the problem of reusing a flag.


If the message was as compact and low-level as possible, it was probably a bit flag, so in that context it makes sense to repurpose it.

When you are far removed from binary and bit-level interactions, it is easy to forget things like this.


I agree with the GP; I don't think code reviews or testing were the problem.

I think the best practice they violated is that they deprecated and repurposed a flag within a single release cycle. That sort of activity should take at least two release cycles: one to remove the old functionality and one to add the new functionality.
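The two-cycle rule can be checked mechanically. A minimal sketch, assuming each release publishes a map from flag to meaning (a hypothetical format, as are the flag and meaning names): a flag that appears in two consecutive releases with different meanings was retired and reused in the same cycle.

```python
def flags_reused_too_soon(releases):
    """Return flags whose meaning changed between two consecutive releases.

    releases: list of dicts mapping flag name -> meaning, oldest first.
    A flag that is absent for at least one release before coming back
    with a new meaning is not reported.
    """
    problems = []
    for prev, curr in zip(releases, releases[1:]):
        for flag, meaning in curr.items():
            if flag in prev and prev[flag] != meaning:
                problems.append(flag)
    return problems


# Repurposed in a single cycle: reported.
one_cycle = [{"0x02": "legacy_mode"}, {"0x02": "new_routing"}]
# Removed in one release, reintroduced in the next: allowed.
two_cycles = [{"0x02": "legacy_mode"}, {}, {"0x02": "new_routing"}]
```

Here `flags_reused_too_soon(one_cycle)` reports the flag, while `flags_reused_too_soon(two_cycles)` returns an empty list.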


And if you do it all well, you are paid the average dev salary.

The value of a good dev is realised only when someone screws up.



