
Sorry, are you saying you worked at Amazon and this is how they handle major outages? Just snooze and wait for a ticket to make its way up from end user support? No monitoring? No global time zone coverage?

Because if so, this seems like about the most damning thing I could learn from this incident.



No, it's just mindless speculation from someone who clearly hasn't worked a critical service's on-call rotation before. It's not at all what it's actually like: all these services have automatic alarms that will start blaring and firing pagers, and once the scope of impact is determined to be large, escalations start happening extremely quickly, paging anyone even possibly able to diagnose the issue. There are also crisis rotations staffed with high-level ICs and incident managers who will join ASAP and start directing the situation; you don't need to wait for some director or VP.


I worked at AWS (EC2 specifically), and the comment is accurate.

Engineers own their alarms, which they set up themselves during working hours. An engineer on call carries a "pager" for a given system they own as part of a small team. If your own alert rules get tripped, you will be automatically paged regardless of time of day. There are a variety of mechanisms to prioritize and delay issues until business hours, and suppress alarms based on various conditions - e.g. the health of your own dependencies.

End user tickets cannot page engineers, but fellow internal teams can. Generally, escalating and paging additional help when you cannot handle the situation yourself is encouraged, and many tenured/senior engineers are very keen to help, even at weird hours.
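
To make the routing described above concrete, here is a minimal sketch in Python. Everything in it (the `Alert` fields, the `Action` names, the `route` function, and the order of the checks) is a hypothetical illustration of the rules as stated in this comment, not an actual AWS tool or API.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE_NOW = "page_now"
    DEFER_TO_BUSINESS_HOURS = "defer_to_business_hours"
    SUPPRESS = "suppress"
    NO_PAGE = "no_page"


@dataclass
class Alert:
    # Hypothetical source labels: "alarm", "internal_team", "end_user_ticket".
    source: str
    customer_impacting: bool
    dependencies_healthy: bool = True


def route(alert: Alert) -> Action:
    """Decide what happens to an alert (illustrative rules only)."""
    # End user tickets never page an engineer directly.
    if alert.source == "end_user_ticket":
        return Action.NO_PAGE
    # A fellow internal team can page the on-call engineer.
    if alert.source == "internal_team":
        return Action.PAGE_NOW
    # The team's own alarm: suppress it when a dependency is already
    # unhealthy (assumed here to mean the root cause lies upstream).
    if not alert.dependencies_healthy:
        return Action.SUPPRESS
    # Customer impact pages immediately, regardless of time of day.
    if alert.customer_impacting:
        return Action.PAGE_NOW
    # Everything else can wait for the owning team's business hours.
    return Action.DEFER_TO_BUSINESS_HOURS
```

In this sketch the suppression check runs before the customer-impact check; a real setup could just as reasonably order them the other way.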


“There are a variety of mechanisms to prioritize and delay issues until business hours”

What are business hours for a global provider of critical tech services?


Business hours for the team receiving the alarm; many issues can wait to be resolved during your own waking hours if they are not impacting customers.
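
A rough sketch of what "business hours for the team receiving the alarm" could look like in code. The function names, the 09:00 to 17:00 weekday window, and the use of a fixed UTC offset per team (instead of a real time zone database) are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta, timezone


def in_team_business_hours(now_utc: datetime, team_utc_offset_hours: float,
                           start: time = time(9), end: time = time(17)) -> bool:
    """True if it is a weekday within [start, end) in the team's local time."""
    team_tz = timezone(timedelta(hours=team_utc_offset_hours))
    local = now_utc.astimezone(team_tz)
    return local.weekday() < 5 and start <= local.time() < end


def handle_now(now_utc: datetime, team_utc_offset_hours: float,
               customer_impacting: bool) -> bool:
    """Customer impact is handled immediately; the rest waits for the team's day."""
    return customer_impacting or in_team_business_hours(now_utc, team_utc_offset_hours)
```

The point is that the window is evaluated per owning team, so the same alert at the same instant can be deferred for a team at UTC-8 and handled right away by a team at UTC+9.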


"This is important enough for someone to work on as soon as their shift starts, but not important enough to page someone out of bed for."



