People should at least be doing some back-of-the-envelope math on this before choosing a strategy.
If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users, and some percentage of those users will become ex-users - you can run a small split test to figure out how many if you don't already have data on that. You'll also lose a direct half day of revenue. This type of thing will happen somewhere between once a year and once every couple of months, as low and high estimates.
Crunch those numbers, and you'll have an order-of-magnitude estimate of what downtime actually costs you, and what you can actually afford to spend to minimize it. Keep in mind that engineering and ops time costs quite a bit of money, and that you'll be slowing down other feature development by spending time on HA.
For instance, let's say you're running a game with 1M DAU, and 5M total active users, making $10k per day (not sure if that's reasonable, but let's pretend), and you've figured out that 12h of downtime makes you lose approximately 10% of the users that log in during that period. In that case, 12h of downtime costs you a one-time "fee" of $5k, and also pushes away ~1% of your total users, which will cost you $100 per day as an ongoing "cost".
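To make that arithmetic easy to rerun with your own numbers, here's a rough sketch in Python. The function and parameter names are made up for illustration, and it assumes revenue is spread evenly across the day and across all active users, which is almost certainly too simple for a real product.

```python
def downtime_cost(dau, total_users, revenue_per_day, outage_hours, churn_rate):
    """Back-of-the-envelope cost of a single outage.

    Assumptions (illustrative, not measured):
    - logins and revenue are spread evenly over the day
    - every active user is worth the same amount of revenue
    - churn_rate is the fraction of affected users who never come back
    """
    fraction_of_day = outage_hours / 24
    # Lower bound: the comment above says "a bit more than N/2" for 12h.
    affected_users = dau * fraction_of_day
    lost_users = affected_users * churn_rate
    # Revenue you simply don't collect while you're down.
    one_time_loss = revenue_per_day * fraction_of_day
    # Revenue you keep losing every day because those users are gone.
    ongoing_loss_per_day = revenue_per_day * lost_users / total_users
    return one_time_loss, lost_users, ongoing_loss_per_day


if __name__ == "__main__":
    one_time, lost, ongoing = downtime_cost(
        dau=1_000_000,
        total_users=5_000_000,
        revenue_per_day=10_000,
        outage_hours=12,
        churn_rate=0.10,
    )
    print(f"one-time loss: ${one_time:,.0f}")
    print(f"users lost: {lost:,.0f}")
    print(f"ongoing loss: ${ongoing:,.0f} per day")
```

With the example inputs this gives the same $5k one-time hit, 50k lost users (1% of the 5M total), and $100 per day ongoing loss as the numbers above.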
If we assume this happens exactly once, and that a mitigation strategy would work with 100% effectiveness, then you should be willing to spend up to $100 extra per day to implement that strategy; the $5k up-front loss is not nothing, but we can probably assume it'll get eaten up by engineering time to implement that strategy. If such a strategy would cost significantly more than $100 per day over your current costs, then by pursuing it you're assuming that "oh shit it's all gone to hell!" AWS events are likely to affect you multiple times over the period in question.
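You can flip that reasoning around and ask how many outages a given mitigation budget is implicitly betting on. This is only a sketch: `implied_outages` is a hypothetical helper, and it assumes each outage adds the same ongoing daily loss and ignores the one-time revenue hit, as the paragraph above does.

```python
def implied_outages(strategy_cost_per_day, ongoing_loss_per_outage, effectiveness=1.0):
    """Number of outages at which the mitigation spend breaks even.

    Assumptions (illustrative only):
    - each outage adds the same ongoing daily loss
    - the one-time revenue loss is ignored (assumed to be eaten up by
      the engineering time needed to build the mitigation)
    - effectiveness is the fraction of the loss the strategy prevents
    """
    if ongoing_loss_per_outage <= 0 or effectiveness <= 0:
        raise ValueError("loss per outage and effectiveness must be positive")
    return strategy_cost_per_day / (ongoing_loss_per_outage * effectiveness)


if __name__ == "__main__":
    # Hypothetical: a strategy costing $450/day extra, against $100/day lost per outage.
    print(implied_outages(450, 100))
```

If the answer is well above the number of bad events you actually expect over the period you care about, the strategy is probably costing more than it saves.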
I'm not saying these numbers are realistic in any way, or that the method I've shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of it); I'm just saying that whether you pursue a mitigation strategy or not, it's not terribly difficult to ground your decision in numbers. They do tend to be right on the edge of reasonable for a lot of people, so it's worth thinking about them (good) or (better) measuring them.