The article wasn't about the outage happening; it was about how long it took to even discover what the problem was. Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
> Seems logical to assume that could be because there aren't many people left who know how all the systems connect.
It's only logical presupposing a lot of other conditions, each of which is worthy of healthy skepticism. And even then, it's only a hypothesis. You need evidence to go from "this could have contributed to the problem" to "this caused the problem."
What little the article provides seems to cut strongly against this hypothesis. For example, it links to multiple past findings, going back to 2017, that Amazon's notification times need improvement. If something has been a problem for nearly a decade, it's hard to imagine it is the result of any recent personnel changes.
TFA does not establish how many AWS workers have left or been laid off, let alone how many of those were actually undesirable losses of highly skilled individuals. Even if we take it on faith that a large number of such individuals were lost, it is a further leap to claim that no redundancy in that skill set remained and that the vacancies have been left unfilled since.
No evidence is given that a more experienced team would have identified and resolved the problem faster. The article even states something to the opposite effect:
> AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) causes this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, just because they've already hit similar issues years ago and ironed out the kinks in their resilience story.
Indeed, the article doesn't even provide evidence that the response was unreasonably slow. It offers no comparison to similar outages, either from AWS in the past, before the hypothesized brain drain, or from competitors. Note that the author has no idea what the problem actually was, or what AWS had to do to diagnose the issue.
It's the most plausible, fact-based guess, beating other competing theories.
Understaffing and absences would clearly lead to delayed incident response, but a responsible cloud provider would have avoided such obvious negligence and breach of contract by keeping a supposedly adequate number of people on duty.
An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place.
AWS engineers being formerly competent but currently stupid, without organizational issues, might be explained by brain damage. "RTO" might have caused collective chronic poisoning, e.g. lead in drinking water, but I doubt Amazon is so cheap.
> An exceptionally challenging problem is unlikely to be enough to cause so much fumbling because, regardless of the complex mistakes behind it, a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses) that AWS should have in place
You seem to be misunderstanding the nature of the issue.
The DNS records for DynamoDB's API disappeared. They resolve to a dynamic bunch of IPs that constantly change.
A ton of AWS services that use DynamoDB could no longer do so. Hardcoding IPs wasn't an option. Nor could clients do anything on their side.
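To make the point above concrete, here is a minimal Python sketch of what a client sees when a record set vanishes. Everything in it is an assumption for illustration: the hostname `db.example.test`, the documentation-range IPs, and the two fake resolvers, which stand in for real DNS so the example runs offline. It is not a description of how AWS clients actually resolve DynamoDB.

```python
import socket


def resolve_ips(hostname, port=443, resolver=socket.getaddrinfo):
    """Return the sorted set of IPs a hostname currently resolves to.

    Returns an empty list when resolution fails, for example because
    the record no longer exists.
    """
    try:
        results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address.
    return sorted({info[4][0] for info in results})


# Hypothetical stand-ins for real DNS, so the sketch runs without a network.
def fake_resolver_ok(host, port, proto=0):
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.11", port)),
    ]


def fake_resolver_record_gone(host, port, proto=0):
    raise socket.gaierror("name not known")


before = resolve_ips("db.example.test", resolver=fake_resolver_ok)
after = resolve_ips("db.example.test", resolver=fake_resolver_record_gone)
print(before)  # the addresses that were valid a moment ago
print(after)   # nothing to connect to once the record is gone
```

Once the lookup returns nothing, the client has no fresh addresses, and any it cached earlier go stale as the set behind the name rotates.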
> a DNS misunderstanding doesn't have a particularly large "surface area" for diagnostic purposes and it is supposed to be expeditiously resolvable by standard means (ordering clients to switch to a good DNS server and immediately use it to obtain good addresses)
Did you consider that DNS might’ve been a symptom? If the DynamoDB DNS records use a health-check, switching DNS servers will not resolve the issue and might make it worse by directing an unusually high volume of traffic at static IPs without autoscaling or fault recovery.
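A toy Python model of that scenario, purely as a sketch: whether DynamoDB's records are actually health-checked is the hypothetical raised above, and the fleet size, request count, and function names here are all made up for illustration.

```python
def dns_answer(records, health):
    """Toy health-checked DNS: return only IPs whose health check passes."""
    return [ip for ip in records if health.get(ip, False)]


def requests_per_backend(total_requests, backends):
    """Spread requests evenly and return the load on each backend."""
    if not backends:
        raise ValueError("no backends to send traffic to")
    return total_requests / len(backends)


# Hypothetical fleet of 100 backends (addresses from a documentation range).
fleet = [f"198.51.100.{i}" for i in range(1, 101)]

# Assume the health-check system itself is broken and marks everything down.
# Every resolver serving this record then returns the same empty answer,
# so switching DNS servers changes nothing.
broken_health = {ip: False for ip in fleet}
answer = dns_answer(fleet, broken_health)

# Pinning all traffic to a handful of static IPs concentrates the load.
normal_load = requests_per_backend(100_000, fleet)
pinned_load = requests_per_backend(100_000, fleet[:4])
print(answer, normal_load, pinned_load)
```

In this made-up case the four pinned backends each take 25 times their usual share, with nothing behind them to scale out or fail over.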
The article describes evidence for a concrete, straightforward organizational decay pattern that can explain a large part of this miserable failure. What's "self-serving" about such a theory?
My personal "guess" is that failing to retain knowledge and talent is only one of many components of a well-rounded crisis of bad management and bad company culture that has been eroding Amazon on more fronts than AWS reliability.
What's your theory? Conspiracy within Amazon? Formidable hostile hackers? Epic bad luck? Something even more movie-plot-like? Do you care about making sense of events in general?
We witnessed someone repeatedly shoot themselves in the foot a few months ago. It is indeed a guess that this may be causing their current foot pain, but it is a rather safe one.
Twice I've had to deal with outages where the root cause took a long time to find because there were several distinct root causes interacting in ways that made it difficult or impossible to reproduce the problem in isolation, or even to reason about it, until we started figuring out that there were multiple unrelated root causes. All other outages I've dealt with were the sort where experienced engineers and institutional knowledge were sufficient to quickly find the cause and fix it.
Which is to say: it's entirely possible that the inferences drawn by TFA are just wrong. And it's also possible that TFA is wrong but also right to express concern with how Amazon manages talent.
It's about the time between the announcements about finding the cause. I find that to be thin evidence. There are far too many alternative explanations. It's not even that I find the idea implausible, but I don't think the article's doom-saying confidence level is warranted.