It mentions an issue at an upstream service provider. Is it AWS and their Degraded EBS Volume Performance in Northern Virginia? https://status.aws.amazon.com/
Is there an "uncanny reliability" range where increasing reliability on the part of a service provider makes things worse, by being so close to 100% reliable that any failure is a shock?
Maybe it's better to go with cheaper services that fail more often, thus keeping customers in practice at dealing with failures.
This is along the lines of what I say in the Scaling chapter of my book[0]. If your infra is really simple (like a server or two), you can actually recreate it at a different provider and work around any hard-to-fix issue (a whole AWS region going down, or this Heroku database problem).
Especially with smaller applications, you might be able to beat the provider's time to fix the issue, and you never know when it might be critical for you to be able to do that.
My book also contains a Bash script that configures a PostgreSQL cluster for you in a few minutes, with or without attached storage, with self-signed SSL, SELinux, and more. Great for simple apps and as a start in learning production PostgreSQL.
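(Not the actual script, but the self-signed SSL part boils down to roughly this, assuming the data directory is in $PGDATA and the service name matches your distro:)

    # generate a self-signed certificate for the server (hostname is made up)
    openssl req -new -x509 -days 365 -nodes \
      -subj "/CN=db.example.internal" \
      -keyout "$PGDATA/server.key" -out "$PGDATA/server.crt"
    chmod 600 "$PGDATA/server.key"
    chown postgres:postgres "$PGDATA/server.key" "$PGDATA/server.crt"

    # enable SSL and reload (service name varies by distro/version)
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    ssl = on
    ssl_cert_file = 'server.crt'
    ssl_key_file = 'server.key'
    EOF
    systemctl reload postgresql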
> Especially with smaller applications, you might be able to beat the provider's time to fix the issue, and you never know when it might be critical for you to be able to do that.
This is so true, you have no idea. Several years ago I was working at a Linode customer on the Christmas Eve when Linode started getting DDoSed, which went on for several days.
We had been working for weeks before then to multi-host our applications just to be prepared for outages and suddenly all of that work paid off.
We already had all of our data ready at another provider and the infrastructure hot, so it was just a matter of flipping some configs and waiting for DNS propagation. I still ended up working 20 hours that day just monitoring everything and calming people down, but the alternative would have been working straight through New Year's.
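(In case it helps anyone planning the same thing: the DNS part is only fast if the TTL on the records you intend to flip was lowered ahead of time. Hostname and addresses below are made up:)

    # check the TTL the world sees for the record well before the emergency
    dig +noall +answer app.example.com A
    # app.example.com.  60  IN  A  203.0.113.10
    # with a 60-second TTL, repointing the A record at the standby provider
    # takes effect in about a minute (plus any resolvers that ignore TTLs)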
Yes. There's a nice example of this in the Google SRE book (I think it may have been their internal Paxos service?). If I remember correctly, they ended up building in planned downtime so users would learn to degrade gracefully when the service went down.
Google does this pretty regularly internally. Every system has a published SLO, and for a couple of weeks a year major components will perform exactly to their SLO and not a single request or millisecond better. If you were relying on something performing 10x better than what it's rated for in order to provide your own guarantees, that's on you.
Often that means spending 10x on building failure-tolerant architecture.
For example, software may assume that files get corrupted just sitting on a disk, and work around that. But it turned out to be easier to build the self-healing redundancy checks into the lowest layer possible, the hard drives themselves, and assume the data is clean above that.
Another thing I've heard of: when they make radiation-resistant CPUs for space, instead of making the CPU robust to miscalculations, it's easier to shield it as much as possible and use larger process nodes (110nm and up). Of course, they also add all kinds of checks in the software as well, because they do real engineering.
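(A toy version of the "assume files rot on disk and verify them" idea, done at the file level rather than inside the drive; paths are made up:)

    # record checksums once...
    sha256sum data/*.bin > checksums.sha256
    # ...and verify them later; a flipped bit shows up as FAILED
    sha256sum -c checksums.sha256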
Heroku has been strangely unreliable the past few weeks. Even their ticket response team has been slow, with their support engineers often talking past the issue and just sending a scripted reply.
We have the majority of our client apps hosted with them, but most don't require 24/7 availability. This is still concerning though, and we do have one high-availability app hosted on them now that we're trying to plan contingencies for.
Open to any suggestions for alternatives! Ideally I'd keep things on Heroku, but it would be nice to have failsafes that could be activated relatively quickly in the event of similar issues.
Simple dynos can be replicated with Dokku, with Ledokku as a GUI. Just get an Ubuntu VM on DigitalOcean, Vultr or whatever, install and configure UFW, fail2ban and automatic security updates, install Dokku and you're set.
For managed databases with replication, however, Dokku still leaves much to be desired...
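Roughly, the whole setup is something like this (app name and server address are placeholders; check the Dokku docs for the current bootstrap script URL):

    # basic hardening on a fresh Ubuntu VM
    apt-get update && apt-get install -y ufw fail2ban unattended-upgrades
    ufw allow ssh && ufw allow http && ufw allow https && ufw --force enable

    # install dokku via the official bootstrap script
    # wget -NP . https://dokku.com/install/<version>/bootstrap.sh && sudo bash bootstrap.sh

    # create an app, then deploy from your laptop with a plain git push
    dokku apps:create myapp
    # locally: git remote add dokku dokku@<server-ip>:myapp && git push dokku main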
Heroku provides many features like pipelines and review apps that would be impossible to implement on a single VPS and very time-consuming to implement on multiple VPSes. Anyone who recommends a single VPS as a hosting solution (as lbruder did) is likely a hobbyist or a student.
Maybe it sounded a bit simplistic in the description, but running VMs/servers in a cloud or datacenter with CI/CD pipelines, VM patching, testing, the whole nine yards, is not as extreme, difficult, or ridiculous as most of us think it is (people treat it like making your own flour or growing your own coffee).
There are plenty of professionals doing it this way. Agreed that one machine would not be enough for all of that, and building it will take more time, but being in control has its own advantages.
Curious to hear your opinion on this if you'd like to share.
> It seems no cloud service provider these days is able to offer what was considered an industry standard.
I wonder how many services really had five-nines availability in the pre-cloud era either. Somehow I feel your view of it being an "industry standard" might be slightly rose-tinted.
Five nines of uptime only exists in the mainframe world. Everywhere else it's a requirement set by someone in management, which is "met" by the vendor in their marketing material. It's never achieved over the long term, but enough time passes that the inevitable downtime can be blamed on the previous management. The vendor meets their "guarantee" by paying back less than a point of the yearly bill, and then everyone can reset the clock and pretend that it won't happen again.
The only people who suffer consequences are the staff forced to work overtime performing SEV0 RED ALERT theater. They will work through nights/weekends while the responsible parties tut-tut and "manage" by reading updates they can collate into the post crisis report. After that, everyone participates in the joy of emergency meetings to discuss said report that will be entirely worthless when a completely different part of the system fails the next time. A more reliable HA solution will be worked up by the engineers, finance will estimate implementation costs, and it will be turned down by an executive on the 8th hole green because they don't care about anything except improving profitability so they can hand themselves a bonus.
I have worked on multiple services since the late nineties, running on bare metal across multiple datacentres, that achieved five nines.
With IaaS that is now easier than ever, yet these so-called cloud service providers don't do any of that - they tie themselves to a single AZ and have ZERO redundancy.
It's not about AWS/Azure etc. They are providing IaaS: literally compute services littered around the globe. It is up to these so-called cloud service providers, like Heroku, to utilize that infrastructure to achieve 99.999%.
I even gave a link in my comment to what AWS says about this.
Are people downvoting me because they don't read, or what?
You make it sound like AWS has 100% uptime and services built on top of them are completely to blame.
And for something like Heroku's managed DBs, you can't just achieve 99.99999% availability on a DB without making certain sacrifices. Availability isn't everything past a certain point.
That's not what I'm doing at all. I even gave a link to AWS documentation on achieving five nines by utilizing multiple AZs, etc. I also reiterated this in the comment you responded to above.
What sacrifices are you talking about when synchronously replicating to a backup environment? Write latency? How do you usually deal with that? How much is too much? There are strategies for reducing replication-related latency depending on the level of consistency required.
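For Postgres specifically, the knobs look roughly like this (the standby name is made up); which synchronous_commit level you pick is exactly that latency-vs-consistency trade-off:

    # on the primary: require acknowledgement from one synchronous standby
    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    synchronous_standby_names = 'FIRST 1 (standby_dr)'
    # remote_write: ack once the standby has received the WAL (lowest added latency)
    # on:           ack once the standby has flushed it to disk
    # remote_apply: ack once it is visible to reads on the standby (highest latency)
    synchronous_commit = remote_write
    EOF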
Costs and general complexity. It is quite easy to accidentally reduce a system's uptime by introducing the extra complexity that comes with higher availability.
Yeah, I suppose in some circumstances their offering is OK.
I don't think I have ever worked on a system where 10 minutes of data loss would be anywhere near acceptable, though.
I guess it's OK for mostly static pages or self-hosting a blog, although I'd be pissed if I had to rewrite an article. Makes you wonder who their target market is.
Fewer customers, fewer moving parts, less to go wrong. I'm sure a lot of places were basically rolling the dice, but I'd imagine a lot won that bet, while those that lost it had a much more difficult recovery process than today's vendors.
For non-trivial services (in particular ones that need consistency), I'm skeptical that it's realistic to achieve five nines at competitive cost. You'll probably achieve it for several years, and then you run into a complex failure that takes an hour to fix, blowing through the downtime budget of a decade.
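Rough numbers:

    99.999% allows about 0.00001 * 525,960 ≈ 5.3 minutes of downtime per year,
    so a single 60-minute incident burns roughly 60 / 5.3 ≈ 11 years of budget.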
Their "dynos" are ephemeral. They could literally deploy the images to a backup environment hosted elsewhere.
Their data services could all be synchronously replicated to that backup environment.
And that's it - they don't offer any other core services (and their other services run on the same platform).
So for (at most) double their infrastructure cost they would have another network they could immediately switch over to.
And Heroku is already soooo expensive. Even if you used a 1-to-1 mapping from EC2 instances to Heroku dynos (which they don't - it's multiple dynos per backing instance), you would be looking at a 5-10x markup using on-demand instances! Reserved instances are even cheaper, and spot instances can be 5x cheaper again!
I think they could retain their current pricing model and still offer this kind of resiliency - at a minimum.
Fly.io is making strides in this direction, distributing the VMs across multiple availability zones and routing traffic internally from their multiple geographically distributed POPs - but you need to roll your own DB VMs for multi-AZ synchronization.
EDIT: it seems they do provide managed Postgres with synchronous replication now (in beta), neat!
Service providers such as Heroku should easily be able to offer five-nines uptime.
They ONLY offer fully managed services, which can be backed by the multi-cloud, multi-AZ setup I refer to - but instead a single product outage from a single upstream provider in a single datacenter is affecting all their clients.
This is a regular occurrence for Heroku - and they charge a substantial premium for their "service".
It's worth noting that the AWS EC2 99.99% SLA is a regional SLA, i.e. it only covers a situation where multiple AZs are down simultaneously.
One AZ going down is not covered by the 99.99% SLA. AFAIK there isn't any per-AZ SLA, only a single-instance SLA of 99.5%. The effective per-AZ SLA is going to be somewhere between the two.
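In rough numbers (≈525,960 minutes in a year), those SLAs translate to:

    99.99%  (regional)        -> ~53 minutes of allowed downtime per year
    99.5%   (single instance) -> ~44 hours per year
    99.999% (five nines)      -> ~5 minutes per year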
I know they don't, but in the link I gave they tell you how to achieve five nines via redundancy - something these cloud service providers (like Heroku) neglect to implement.
Dokku is great for a single host. If you have a more complicated setup you can go a long way with post-receive hooks, although it won't be as magical without buildpacks.
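A minimal sketch of that post-receive approach (paths, branch, and service name are made up):

    #!/bin/bash
    # hooks/post-receive in a bare repo on the server: check out the pushed
    # revision into the app directory and restart the service
    APP_DIR=/srv/myapp          # hypothetical deploy target
    GIT_DIR=/srv/myapp.git      # the bare repo this hook lives in
    while read oldrev newrev ref; do
      if [ "$ref" = "refs/heads/main" ]; then
        git --work-tree="$APP_DIR" --git-dir="$GIT_DIR" checkout -f main
        systemctl restart myapp   # hypothetical service unit
      fi
    done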
Interestingly, I have an app using Heroku Postgres that seems to have had zero issues during this outage. I can see data that was stored during this period of time and Rollbar doesn't show any DB connection errors.
I have been trying to deploy a fix for a bug we shipped yesterday. I think they have stopped deploys as well, as deploys are being rejected without any explanation.