
It was certainly suspicious that actual progress on the outage seemed to start right around U.S. west coast start of day. Updates before that were largely generic "we're monitoring and mitigating" with nothing of substance.


I thought the recovery was early AM Seattle time (like 4am), whereas I think start-of-day is more like 9am. Maybe recovery started early, at 6am New York time?


[09:13 AM PDT] We have taken additional mitigation steps to aid the recovery of the underlying internal subsystem responsible for monitoring the health of our network load balancers and are now seeing connectivity and API recovery for AWS services. We have also identified and are applying next steps to mitigate throttling of new EC2 instance launches. We will provide an update by 10:00 AM PDT.

[08:43 AM PDT] We have narrowed down the source of the network connectivity issues that impacted AWS Services...

[08:04 AM PDT] We continue to investigate the root cause for the network connectivity issues...

[12:11 AM PDT] <declared outage>

They claim not to have known the root cause for ~8hr


Sure, that timeline looks bad when you leave out the 14 updates between 12:11am PDT and 8:04am PDT.

The initial cause appears to be a bad DNS entry that they rolled back at 2:22am PDT. They started seeing recovery across services, but as reports of EC2 failures kept rolling in, they traced the remaining impact to a network load balancer issue at 8:43am.


> Sure, that timeline looks bad when you leave out the 14 updates between 12:11am PDT and 8:04am PDT.

Their 14 updates did not bring my stuff back up.

My nines are not their nines. https://rachelbythebay.com/w/2019/07/15/giant/
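Back-of-the-envelope version of why the two sets of nines diverge: a stack that serially depends on several AWS services only gets the product of their availabilities, and any incident that takes you out counts fully against your nines regardless of how it's scored on their side. (Placeholder numbers below, not real SLA figures.)

    # Compound availability of a request path that needs every dependency.
    deps = {
        "dynamodb": 0.9999,
        "ec2":      0.9999,
        "nlb":      0.9999,
        "route53":  0.9999,
        "s3":       0.9999,
    }

    compound = 1.0
    for availability in deps.values():
        compound *= availability

    print(f"compound availability: {compound:.6f}")                 # ~0.999500
    print(f"implied downtime/year: {(1 - compound) * 8760:.1f} h")  # ~4.4 hours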


I didn't say they fixed everything within those 14 updates. I'm pointing out that it's disingenuous to say they didn't start working on the issue until start of business when there are 14 updates describing what they found and did during that time.


I don't think that's true. There was an initial Dynamo outage, resolved in the wee hours, that ultimately cascaded into the EC2 problem that lasted most of the day.


Was the Dynamo outage separate? My take was that the NLB issue was the root cause and Dynamo was a symptom, and that they flipped some internal switches to mitigate the impact on that dependency.


If their internal NLB monitoring can delete the A record for DynamoDB, that seems like a weird dependency (I can imagine the NLB going missing entirely causing it to clean up via some weird orchestration, but this didn't sound like that).
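Purely speculating about what that kind of orchestration would even look like (this is not AWS's internals; the zone ID, record name, and address below are made up):

    # Hypothetical health-check-driven cleanup: an orchestrator that deletes
    # a DNS record when it decides the target behind it is gone. A false
    # "unhealthy" verdict from a broken monitor would remove resolution for
    # every client. Placeholder zone/record/IP values throughout.
    import boto3

    route53 = boto3.client("route53")

    def reap_record_if_target_gone(target_healthy: bool) -> None:
        if target_healthy:
            return
        route53.change_resource_record_sets(
            HostedZoneId="ZEXAMPLE123",  # placeholder
            ChangeBatch={
                "Changes": [{
                    "Action": "DELETE",
                    "ResourceRecordSet": {
                        "Name": "dynamodb.us-east-1.example.internal.",  # placeholder
                        "Type": "A",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": "10.0.0.10"}],  # placeholder
                    },
                }]
            },
        )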


I was thinking more along the lines of the NLB being in front of DNS servers and dropping resolvers

Or an NLB could also be load balancing by managing DNS records; it's not really clear what an NLB means in this context.

Or there was an overload condition because of the NLB malfunctioning that caused UDP traffic to get dropped

Obviously a lot of reading between the lines is required without a detailed RCA; hopefully they release more info. (A rough way to tell some of these apart from the outside is sketched below.)
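For what it's worth, these failure modes tend to look different from the outside: a deleted record fails resolution outright, dropped or overloaded resolvers time out, and a name that resolves but won't connect points at the NLB or whatever sits behind it. A rough, purely illustrative triage (using the public endpoint name; internal names would differ):

    # Rough triage of where a "DynamoDB is down" symptom actually lives.
    import socket

    def triage(host: str = "dynamodb.us-east-1.amazonaws.com", port: int = 443) -> str:
        try:
            # Fails fast (NXDOMAIN) if the record is gone; hangs or times out
            # if the resolvers themselves are dropping queries.
            socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
        except socket.gaierror as e:
            return f"resolution failed ({e}): missing record or unreachable resolvers"
        try:
            with socket.create_connection((host, port), timeout=3):
                return "resolves and connects: problem is likely further up the stack"
        except OSError as e:
            return f"resolves but won't connect ({e}): load balancer or backend issue"

    print(triage())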


Huh, maybe that's when recovery was publicly communicated. I was seeing knock-on effects hours later and didn't see full recovery until late afternoon EST.


I noticed that too. I think tech culture has to change a bit. Silicon Valley is a great location if you're making hardware or prepackaged software. If you have to support a real economy that is mostly on the East Coast you need a presence there.


[flagged]


context... it's not just for LLMs


what was the post?


This was a funny take on it...

https://archive.ph/o4q5Z


From that thread:

> When you forget to provide the context that you are AWS…

> Claude:

> Ah I see the problem now! You’re creating a DNS record for DynamoDB but you don’t need to do that because AWS handles it. Let me remove it for you!

> I’ll run your tests to verify the change.

> Tests are failing, let me check for the cause.

> The end-to-end tests can’t connect to DynamoDB. I will try to fix the issue.

> There we go! I commented out the failing tests and they’re all passing now.


The one I saw was someone saying they just landed their first PR on AWS. The body said they used AI and can’t wait for their performance review.

The one jftuga posted is a bit more compelling.




