It mentions an issue at an upstream service provider. Is it AWS and their Degraded EBS Volume Performance in Northern Virginia? https://status.aws.amazon.com/
Is there an "uncanny reliability" range where increasing reliability on the part of a service provider makes things worse, by being so close to 100% reliable that any failure is a shock?
Maybe it's better to go with cheaper services that fail more often, thus keeping customers in practice at dealing with failures.
This is along the lines of what I say in the Scaling chapter of my book[0]. If your infra is really simple (like a server or two), you can actually recreate it at a different provider and work around any hard-to-fix issue (a whole AWS region going down, or this Heroku database problem).
Especially with smaller applications, you might be able to beat the provider's time to fix the issue, and you never know when it might be critical for you to be able to do that.
My book also contains a Bash script that configures a PostgreSQL cluster for you in a few minutes, with or without attached storage, with self-signed SSL, SELinux, and more. Great for simple apps and as a start in learning production PostgreSQL.
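(Not the actual script, but the self-signed SSL part boils down to roughly this, assuming the data directory is in $PGDATA and the service name matches your distro:)

    # generate a self-signed certificate for the server (hostname is made up)
    openssl req -new -x509 -days 365 -nodes \
      -subj "/CN=db.example.internal" \
      -keyout "$PGDATA/server.key" -out "$PGDATA/server.crt"
    chmod 600 "$PGDATA/server.key"
    chown postgres:postgres "$PGDATA/server.key" "$PGDATA/server.crt"

    # enable SSL and reload (service name varies by distro/version)
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    ssl = on
    ssl_cert_file = 'server.crt'
    ssl_key_file = 'server.key'
    EOF
    systemctl reload postgresql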
> Especially with smaller applications, you might be able to beat the provider's time to fix the issue, and you never know when it might be critical for you to be able to do that.
This is so true, you have no idea. Several years ago I was working at a Linode customer on the Christmas Eve when Linode started getting DDoSed, which went on for several days.
We had been working for weeks before then to multi-host our applications just to be prepared for outages and suddenly all of that work paid off.
We already had all of our data ready at another provider and the infrastructure hot, so it was just a matter of flipping some configs and waiting for DNS propagation. I still ended up working 20 hours that day just monitoring everything and calming people down, but the alternative would have been working straight through New Year's.
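(In case it helps anyone planning the same thing: the DNS part is only fast if the TTL on the records you intend to flip was lowered ahead of time. Hostname and addresses below are made up:)

    # check the TTL the world sees for the record well before the emergency
    dig +noall +answer app.example.com A
    # app.example.com.  60  IN  A  203.0.113.10
    # with a 60-second TTL, repointing the A record at the standby provider
    # takes effect in about a minute (plus any resolvers that ignore TTLs)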
Yes. There's a nice example of this in the Google SRE book (I think it may have been their internal Paxos service?). If I remember correctly, they ended up building in planned downtime so users would learn to degrade gracefully when the service went down.
Google does this pretty regularly internally. Every system has a published SLO, and for a couple of weeks a year major components will perform exactly to their SLO and not a single request or millisecond better. If you were relying on something performing 10x better than what it's rated for in order to provide your own guarantees, that's on you.
Often that means spending 10x on building failure-tolerant architecture.
For example, software may assume that files get corrupted just sitting on a disk, and work around that. But it turned out to be easier to build the self-healing redundancy checks into the lowest layer possible, the hard drives themselves, and assume the data is clean above that.
Another thing I've heard of: when they make radiation-resistant CPUs for space, instead of making the CPU robust to miscalculations, it's easier to shield it as much as possible and use larger process nodes (110nm and up). Of course, they also add all kinds of checks in the software as well, because they do real engineering.
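(A toy version of the "assume files rot on disk and verify them" idea, done at the file level rather than inside the drive; paths are made up:)

    # record checksums once...
    sha256sum data/*.bin > checksums.sha256
    # ...and verify them later; a flipped bit shows up as FAILED
    sha256sum -c checksums.sha256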
Heroku has been strangely unreliable the past few weeks. Even their ticket response team has been slow, with their support engineers often talking past the issue and just sending a scripted reply.
We have the majority of our client apps hosted with them, but most don't require 24/7 availability. This is still concerning though, and we do have one high-availability app hosted on them now that we're trying to plan contingencies for.
Open to any suggestions for alternatives! Ideally I'd keep things on Heroku, but it would be nice to have failsafes that could be activated relatively quickly in the event of similar issues.
Simple dynos can be replicated with Dokku, with Ledokku as a GUI. Just get an Ubuntu VM on DigitalOcean, Vultr or whatever, install and configure UFW, fail2ban and automatic security updates, install Dokku and you're set.
For managed databases with replication, however, Dokku still leaves much to be desired...
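Roughly, the whole setup is something like this (app name and server address are placeholders; check the Dokku docs for the current bootstrap script URL):

    # basic hardening on a fresh Ubuntu VM
    apt-get update && apt-get install -y ufw fail2ban unattended-upgrades
    ufw allow ssh && ufw allow http && ufw allow https && ufw --force enable

    # install dokku via the official bootstrap script
    # wget -NP . https://dokku.com/install/<version>/bootstrap.sh && sudo bash bootstrap.sh

    # create an app, then deploy from your laptop with a plain git push
    dokku apps:create myapp
    # locally: git remote add dokku dokku@<server-ip>:myapp && git push dokku main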
Heroku provides many features like pipelines and review apps that would be impossible to implement on a single VPS and very time-consuming to implement on multiple VPSes. Anyone who recommends a single VPS as a hosting solution (as lbruder did) is likely a hobbyist or a student.
Maybe it sounded a bit simplistic in the description, but running VMs/servers in a cloud or datacenter with CI/CD pipelines, VM patching, testing, the whole nine yards, is not as extreme, difficult, or ridiculous as most of us think it is (people treat it like making your own flour or growing your own coffee).
There are plenty of professionals doing it this way. Agreed that one machine would not be enough for all of that, and building it will take more time, but being in control has its own advantages.
Curious to hear your opinion on this if you'd like to share.
> It seems no cloud service provider these days is able to offer what was considered an industry standard.
I wonder how many services really had five-nines availability in the pre-cloud era either. Somehow I feel your view of it being an "industry standard" might be slightly rose-tinted.
Five nines of uptime only exists in the mainframe world. Everywhere else it's a requirement set by someone in management, which is "met" by the vendor in their marketing material. It's never achieved over the long term, but enough time passes that the inevitable downtime can be blamed on the previous management. The vendor meets their "guarantee" by paying back less than a point of the yearly bill, and then everyone can reset the clock and pretend that it won't happen again.
The only people who suffer consequences are the staff forced to work overtime performing SEV0 RED ALERT theater. They will work through nights/weekends while the responsible parties tut-tut and "manage" by reading updates they can collate into the post crisis report. After that, everyone participates in the joy of emergency meetings to discuss said report that will be entirely worthless when a completely different part of the system fails the next time. A more reliable HA solution will be worked up by the engineers, finance will estimate implementation costs, and it will be turned down by an executive on the 8th hole green because they don't care about anything except improving profitability so they can hand themselves a bonus.
I have worked on multiple services since the late nineties, running on bare metal across multiple datacentres, that achieved five nines.
With IaaS that is now easier than ever, yet these so-called cloud service providers don't do any of that - they tie themselves to a single AZ and have ZERO redundancy.
It's not about AWS/Azure etc. They are providing IaaS: literally compute services littered around the globe. It is up to these so-called cloud service providers, like Heroku, to utilize that infrastructure to achieve 99.999%.
I even gave a link in my comment to what AWS says about this.
Are people downvoting me because they don't read, or what?
You make it sound like AWS has 100% uptime and services built on top of them are completely to blame.
And for something like Heroku's managed DBs, you can't just achieve 99.99999% availability on a DB without making certain sacrifices. Availability isn't everything past a certain point.
That's not what I'm doing at all. I even gave a link to AWS documentation on achieving five nines by utilizing multiple AZs, etc. I also reiterated this in the comment you responded to above.
What sacrifices are you talking about when synchronously replicating to a backup environment? Write latency? How do you usually deal with that? How much is too much? There are strategies for reducing replication-related latency depending on the level of consistency required.
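For Postgres specifically, the knobs look roughly like this (the standby name is made up); which synchronous_commit level you pick is exactly that latency-vs-consistency trade-off:

    # on the primary: require acknowledgement from one synchronous standby
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    synchronous_standby_names = 'FIRST 1 (standby_dr)'
    # remote_write: ack once the standby has received the WAL (lowest added latency)
    # on:           ack once the standby has flushed it to disk
    # remote_apply: ack once it is visible to reads on the standby (highest latency)
    synchronous_commit = remote_write
    EOF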
Costs and general complexity. It is quite easy to accidentally reduce a system's uptime by introducing the extra complexity that comes with higher availability.
Yeah, I suppose in some circumstances their offering is OK.
I don't think I have ever worked on a system where 10 minutes of data loss would be anywhere near acceptable, though.
I guess it's OK for mostly static pages or self-hosting a blog, although I'd be pissed if I had to rewrite an article. Makes you wonder who their target market is.
Fewer customers, fewer moving parts, less to go wrong. I'm sure a lot of places were basically rolling the dice, but I'd imagine a lot won that bet, while those that lost it had a much more difficult recovery process than today's vendors.
For non-trivial services (in particular ones that need consistency), I'm skeptical that it's realistic to achieve five nines at competitive cost. You'll probably achieve it for several years, and then you run into a complex failure that takes an hour to fix, blowing through the downtime budget of a decade.
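Rough numbers:

    99.999% allows about 0.00001 * 525,960 ≈ 5.3 minutes of downtime per year,
    so a single 60-minute incident burns roughly 60 / 5.3 ≈ 11 years of budget.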
Their "dynos" are ephemeral. They could literally deploy the images to a backup environment hosted elsewhere.
Their data services could all be synchronously replicated to that backup environment.
And that's it - they don't offer any other core services (and their other services run on the same platform).
So for (at most) double their infrastructure cost they would have another network they could immediately switch over to.
And Heroku is already soooo expensive. Even if you used a 1-to-1 mapping from EC2 instances to Heroku dynos (which they don't - it's multiple dynos per backing instance), you would be looking at a 5-10x markup using on-demand instances! Reserved instances are even cheaper, and spot instances can be 5x cheaper again!
I think they could retain their current pricing model and still offer this kind of resiliency - at a minimum.
Fly.io is making strides in this direction, distributing the VMs across multiple availability zones and routing traffic internally from their multiple geographically distributed POPs - but you need to roll your own DB VMs for multi-AZ synchronization.
EDIT: it seems they do provide managed Postgres with synchronous replication now (in beta), neat!
Service providers such as Heroku should easily be able to offer five-nines uptime.
They ONLY offer fully managed services, which can be backed by the multi-cloud, multi-AZ setup I refer to - but instead a single product outage from a single upstream provider in a single datacenter is affecting all their clients.
This is a regular occurrence for Heroku - and they charge a substantial premium for their "service".
It's worth noting that the AWS EC2 99.99% SLA is a regional SLA, i.e. it only covers a situation where multiple AZs are down simultaneously.
One AZ going down is not covered by the 99.99% SLA. AFAIK there isn't any per-AZ SLA, only a single-instance SLA of 99.5%. The effective per-AZ SLA is going to be somewhere between the two.
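In rough numbers (≈525,960 minutes in a year), those SLAs translate to:

    99.99%  (regional)        -> ~53 minutes of allowed downtime per year
    99.5%   (single instance) -> ~44 hours per year
    99.999% (five nines)      -> ~5 minutes per year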
I know they don't, but in the link I gave they tell you how to achieve five nines via redundancy - something these cloud service providers (like Heroku) neglect to implement.
Dokku is great for a single host. If you have a more complicated setup you can go a long way with post-receive hooks, although it won't be as magical without buildpacks.
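A minimal sketch of that post-receive approach (paths, branch, and service name are made up):

    #!/bin/bash
    # hooks/post-receive in a bare repo on the server: check out the pushed
    # revision into the app directory and restart the service
    APP_DIR=/srv/myapp          # hypothetical deploy target
    GIT_DIR=/srv/myapp.git      # the bare repo this hook lives in
    while read oldrev newrev ref; do
      if [ "$ref" = "refs/heads/main" ]; then
        git --work-tree="$APP_DIR" --git-dir="$GIT_DIR" checkout -f main
        systemctl restart myapp   # hypothetical service unit
      fi
    done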
Interestingly, I have an app using Heroku Postgres that seems to have had zero issues during this outage. I can see data that was stored during this period of time and Rollbar doesn't show any DB connection errors.
I have been trying to deploy a fix for a bug we shipped yesterday. I think they have stopped deploys as well, as deploys are being rejected without any explanation.