Google Cloud Is Having IO Issues in US-EAST1 (cloud.google.com)
120 points by Fabricio20 on Dec 7, 2019 | 67 comments


This outage is affecting Discord.

https://status.discordapp.com/


Being taken down by slow IO in a single cloud zone seems to implicate their overall architecture.


Looking at their job postings, they use Cassandra and Elasticsearch, so I am somewhat surprised that they didn't survive this outage. (I have not ever run Cassandra at scale, but their website says in the first paragraph that it's designed to handle regional failures.)

If I were on GCP and had paying customers, I would use Cloud Spanner. It's expensive, but it's a good piece of technology.


The gift of Spanner is it’s always slow, so if you start there you will naturally gain experience in either hiding latency or relaxing consistency requirements, which are both good skills.
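
For example, one way to relax consistency in Spanner is stale reads, which can be served by a nearby replica instead of waiting on the leader. A minimal sketch using the google-cloud-spanner Python client; the instance, database, and query here are hypothetical:

    import datetime

    from google.cloud import spanner

    client = spanner.Client()
    database = client.instance("my-instance").database("my-db")

    # A strong read returns the freshest data but may have to wait on
    # the leader replica; an exact-staleness read can be served by a
    # nearby replica, trading up to 15s of freshness for lower latency.
    with database.snapshot(
            exact_staleness=datetime.timedelta(seconds=15)) as snapshot:
        for row in snapshot.execute_sql("SELECT id, name FROM Users"):
            print(row)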


Spanner is pretty fast, but as with any database you have to use it correctly to get the speed. Since it’s a non-standard database, it requires non-standard tricks and skills.


No, that's about right. You get one write per second per entity or something like that and that's as fast as it goes.

The key is to choose your entity wisely. I wrote an app inside Google that did on the order of 10,000 writes per second and it was no problem; each entity group only showed up once every minute or so.
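
That trick generalizes to the classic sharded-counter pattern: spread writes across N entity groups so no single group exceeds its per-second write budget. An illustrative sketch (a plain dict stands in for the datastore; this is not Google's internal API):

    import random

    NUM_SHARDS = 64  # at ~1 write/sec/group, supports ~64 writes/sec

    def _shard_key(name: str) -> str:
        """Pick a random shard so writes spread evenly across groups."""
        return f"{name}:{random.randrange(NUM_SHARDS)}"

    def increment(store: dict, name: str) -> None:
        key = _shard_key(name)
        store[key] = store.get(key, 0) + 1  # one write, one entity group

    def total(store: dict, name: str) -> int:
        """Reads pay the price: they aggregate across all shards."""
        return sum(v for k, v in store.items()
                   if k.startswith(name + ":"))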


AFAIR, entity group is a Megastore concept. Megastore is the ill-conceived predecessor of Spanner; are you confusing them? Spanner is much faster than 1 transaction per second per tablet.


I have run Cassandra in production. A replication factor 3 cluster over 3 AZs should be able to survive loss of a single AZ.

However, it has been my personal experience that Cassandra handles instances going down much better than it handles instances getting very slow but staying online. In that case Cassandra will work extra hard to just spin its wheels. If you know that is what is happening you can be better off manually bringing those nodes offline.
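
For reference, a sketch of that setup with the DataStax Python driver; the contact point, DC name, and table are hypothetical, and it assumes a rack-aware snitch (e.g. GossipingPropertyFileSnitch) mapping each AZ to a rack:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["10.0.0.1"]).connect()

    # RF=3 in one DC; with AZs mapped to racks, the three replicas
    # land in three different AZs.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app
        WITH replication = {'class': 'NetworkTopologyStrategy',
                            'us-east': 3}
    """)

    # LOCAL_QUORUM needs 2 of 3 replicas, so reads and writes stay
    # available when one AZ's replica is down -- or has been manually
    # taken offline, per the advice above.
    stmt = SimpleStatement(
        "SELECT * FROM app.users WHERE id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    rows = session.execute(stmt, ("some-user-id",))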


Most startup companies worry more about building out features than about building reliable infrastructure. Reliability is almost always an afterthought.


Discord seems to be preventing new connections.


We use Discord as an interface for moderating our app and monitoring spam/reported content so this kinda sucks. Not super important though.


[flagged]


And why would that be? Are they a particular problem customer for google?


...


Discord has millions of users. Probably fewer than 1000 people at Google are being bothered to fix the problem.

My sympathies lie with Discord users. I'm sure it's ruining a lot more weekend plans for them.

I'm both a Discord user and someone who has spent years in an on-call rotation that included nights and weekends.


And Discord's third-party client policy is not doing them any favours.


Annoyed at being on call: How dare they do their expected responsibilities in exchange for pay. The nerve of them.


It's very possible they're not "officially" on call and don't get any extra pay for having to work on the weekend. Just because that's the current expectation of a lot of salaried workers doesn't mean it's right.


[flagged]


This type of outage happens with all of the cloud providers from time to time.


Yes. Technically it happens to all companies but politically it’s a lot different.

If you tell your customers that you have an outage because AWS is down, they will shrug and say, things happen. You are in the same boat with a lot of other people, you will no more get blamed for AWS being down than if the whole city lost power. No one ever got fired for choosing AWS.

If you’re on Azure, you might get a funny look from people who only know about AWS, but since it is Microsoft Azure you will get a pass.

But people will judge you for choosing GCP. Any little thing that goes wrong, they will be questioning why you didn’t choose AWS.


Not gonna lie, one of the first things I said to my friends was: why is Discord using Google? It should be on AWS.


Google's ability to recognize and respond rapidly to whole categories of errors has always seemed extraordinarily slow and reticent compared to AWS. It's one of the concerns I'd have when recommending it.


Is it just me or are Discord's status updates legendarily bad to the point of being entirely useless? As a user of the discord app, don't tell me the "API is having issues." Just tell me, "Discord is currently down. We are working on it and apologise for the disruption." Everything else is unnecessary line noise.


The Discordapp status page is written by developers, and as a developer I really like the insight into what is going on. As a user, I expect you to just look at it and take away "having issues, we are working on it".


Mentioning the cause of the issue seems better than just "shits broke yo" to me.


Really? What in their status update read to you as "the problem is on our side, not yours"? Understand that I'm a programmer and couldn't figure that out.


Because if it was a problem on your end, it wouldn't be listed on their status page....?


"Google's Engineering team continues to work on mitigating the issue"

I think that makes it pretty clear where the issue lies.


I am in general a big fan of GCP, especially for startups. In many cases I find I can be productive faster in GCP compared to AWS, in the sense that it's easier to do "wrong" things in AWS if you don't have a top notch devops person.

That said, it seems like GCP has a real, significant problem with overall reliability. While I don't have numbers to compare, it seems like I see frequent outage notifications on HN for GCP. They really need to focus more on reliability than on new features.


In case you're curious, here are some logs of their status history: https://status.cloud.google.com/summary

Looking at Compute Engine, the service responsible for the outage today and, I think, the big one a few months ago (13 hours!!!), they still only achieve 99% uptime (having been down 81 hours total this year).

While 99% uptime "sounds" great, it is paltry in comparison to AWS, which has set the standard of 99.99% ("four nines") reliability, translating to about an hour of downtime per year.

Just to give you some more perspective and concrete numbers. Perhaps one of the reasons you find GCP "faster" is that it devotes more resources to speed than to reliability.
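
For concreteness, the arithmetic behind the nines:

    HOURS_PER_YEAR = 24 * 365  # 8760

    for uptime in (0.99, 0.999, 0.9999):
        downtime = HOURS_PER_YEAR * (1 - uptime)
        print(f"{uptime:.2%} uptime -> {downtime:.1f} hours down/year")

    # 99.00% -> 87.6 h, 99.90% -> 8.8 h, 99.99% -> 0.9 h (~53 min);
    # conversely, 81 hours of downtime works out to ~99.08% uptime.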


Where are AWS's logs for all service outages, partial or not?

Can't seem to find them, only "selected" outages. The recent several-hours-long service degradation in Frankfurt EC2 isn't in the historical list; it seems to be only in the rolling status history?

AWS and GCloud also seem to report disruptions completely differently: AWS reports on tons of smaller pieces, which makes any "aggregate" uptime mean something completely different than if you report on larger aggregates, as GCloud seems to do.

Unless I'm missing something, I don't see how one could reasonably compare service availability without running large numbers of "canary" instances on both providers to actually measure aggregate availability.
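
Something like this, presumably; a minimal canary sketch where the health-check URLs are placeholders:

    import time
    import urllib.request

    def probe(url: str, timeout: int = 5) -> bool:
        """One availability sample: did the endpoint answer 2xx?"""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    targets = {"aws": "https://canary-aws.example.com/health",
               "gcp": "https://canary-gcp.example.com/health"}
    results = {name: [] for name in targets}

    for _ in range(60):  # e.g. one hour of samples at 1/minute
        for name, url in targets.items():
            results[name].append(probe(url))
        time.sleep(60)

    for name, checks in results.items():
        print(name, f"{sum(checks) / len(checks):.4%} available")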


I don't think you can judge uptime for either AWS or GCP from the status pages. Both are slow to update when issues start, and plenty of less major issues (which could still bring down an individual customer) never make the status page at all.


> I am in general a big fan of GCP, especially for startups.

The major disagreement here from me on this is that GCP does not support IPv6 (outside of IPv6 termination on load balancers) and I do not agree with startups launching without IPv6 support in 2019.


I agree it's important, but it doesn't really bother me since I'll put Cloudflare and/or a network load balancer in front of anything external-facing anyway. Internal things being IPv4 is fine. Other benefits aside, you halve your bandwidth bill by putting Cloudflare in front.


As a network person thank you for bootstrapping IPv6 where you can. Most folks still don’t care :-(.


I agree that they seem to have problems more frequently than AWS. I've been told it is frequently due to their network, which is incredibly advanced and efficient but also incredibly complex and therefore more easily broken.


Cloud SQL has been down for me for 45 minutes now. The status page still doesn't mention it, but the Known Issues page for technical support says this:

"We are experiencing an issue with Cloud SQL instances hosted in the us-east1 region, beginning at Saturday, 2019-12-07 10:00 US/Pacific. Symptoms: Some Cloud SQL instances hosted in this region are becoming unavailable, and are refusing connections. Self-diagnosis: Connections to the Cloud SQL instance are rejected. Workaround: None at this time Our engineering team continues to investigate the issue. We will provide an update by Saturday, 2019-12-07 12:04 US/Pacific with current details."


The most recent update indicates that this outage is only affecting a single availability zone. https://groups.google.com/forum/m/#!msg/gce-operations/oYoFJ...


Well, there go the four nines.


No worries, that's what the other datacenters are for. :- )


Interesting, do they use a SAN or is it local storage?


Google Cloud Persistent Disks are implemented on a kind of Storage Area Network (AFAIK it's based on IP networking; IIRC it's iSCSI over Colossus/D, not Fibre Channel or other off-the-shelf SAN tech).


It's a custom log-based block device implemented on top of Colossus/D. As such it connects to storage nodes over the regular data center network. This is why (in some cases) IO- and network-intensive workloads may compete for the same bandwidth on the host.


It's not iSCSI (at least I don't think so).

The closest thing to public literature about Colossus and D that we've ever published is the Procella paper, which describes the abstractions that Colossus provides for it. In some ways it's similar to GFS (RPC interface, writes are generally append/overwrite) but many things are completely different now.


Yeah, that's what Colossus is, but it's not (was not?) suitable for directly implementing a block store such as a (remote) disk.

You need something that provides the block store abstraction on top of the primitives exposed by Colossus/D. Think of something like what modern SSDs do in order to work efficiently with the underlying flash memory.

Then you have to hook that adapter into your virtualization stack (e.g. kvm) so you can boot from the volume and mount it from inside the VM. You could implement a kernel module or do it internally in kvm/qemu somehow, but iSCSI provides a straightforward way to implement this in user space: you have a process on your physical machine that speaks iSCSI upstream and Colossus/D RPC downstream.

(I don't know if they still do this, but I have a vague memory of somebody describing the stack of an early version of GCP while I was working there a long time ago.)


It's a custom log-based block device implemented on top of Colossus/D, in a user-mode library in the virtualizer (Vanadium). The guest OS communicates using NVMe or VirtIO; Vanadium intercepts and calls the PD library.

This design has only one hop to the storage node. Low-latency workloads benefit from this design; high-bandwidth workloads sometimes actually benefit from off-loading PD to another host. To do iSCSI with one hop, you would need to implement an iSCSI interceptor, and basically you would have the same design with less flexibility for guest OSes.

The irony, of course, is that all of this is a pile of legacy technologies needlessly wasting compute power: the guest filesystem talks in 4K blocks to a "block device", which goes through multiple layers of queues, is then re-mapped to another abstraction, which goes over the network to multiple hosts, etc. Not a single cloud customer ever said "we are so excited to manage volume sizes and bandwidth quotas for PD". A better design would be to implement a true data-center-level filesystem to better support container workloads and leave PD for legacy cases, but Google's storage management is so detached from reality that it's impossible to do a cross-organizational project like this.
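
To make "log-based block device" concrete, a toy model (nothing like the real PD code, which isn't public): every write appends to a log, and an index remaps the block number to the offset of its newest copy, much like an SSD's flash translation layer.

    BLOCK = 4096

    class LogBlockDevice:
        """Toy log-structured block store. A real one would also need
        garbage collection of stale log entries, crash recovery, etc."""

        def __init__(self, path: str):
            self.log = open(path, "ab+")  # append-only log file
            self.index = {}               # block number -> log offset

        def write_block(self, blkno: int, data: bytes) -> None:
            assert len(data) == BLOCK
            self.log.seek(0, 2)           # find current end of log
            offset = self.log.tell()
            self.log.write(data)          # append, never overwrite
            self.index[blkno] = offset    # remap block to newest copy

        def read_block(self, blkno: int) -> bytes:
            offset = self.index.get(blkno)
            if offset is None:
                return b"\x00" * BLOCK    # unwritten blocks read as zeros
            self.log.seek(offset)
            return self.log.read(BLOCK)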


What do you mean by "true data-center-level"? An FS with POSIX semantics suitable for normal apps doing file IO, or something else?


I would break it down like this: 1) FS-core: provide FS-like semantics on top of D (or Colossus/D). Apart from the data format and GC challenges, the piece that is missing is a fast DC-level locking service.

2) Then, for VMs, you do an FS driver, jump to the VMM and boom, you are done; multiple legacy levels of re-packing and redirection are gone. For shared-access cases you do an NFS/Samba interceptor and then the same code path as above.

This system would be highly beneficial not only for Cloud but for other Google properties as well: it would provide a normal POSIX FS to Borg jobs. The amount of acrobatics required to use any open-source package is enormous and has by now exceeded the cost of developing such an FS multiple times over; ask the YouTube, MySQL, package management, etc. groups.

Another example of Google's storage craziness is cross-DC storage. This should be a low-level Colossus responsibility. Instead a godzillion teams implement their own: GCS, PD, Spanner, Placer, etc. Crazy.


Haha, yeah, I do wonder about the insanity on a pretty regular basis :)

I will say, though, that most of the time I don't want to use open-source stuff. Observability is pretty crap, I don't want nor need software that uses write() without fsync(), and the assumptions that most OSS makes about filesystems give me nightmares on Borg.


The PD design is in a patent, so it should be publicly available.


Honestly, I'd describe it closer to BFM: black fucking magic.

Some stacks are mostly normal, some are really odd... and then there's storage.


They offer both: https://cloud.google.com/compute/docs/disks/

Discord's description of the issue sounds like a problem with either zonal or regional SAN storage.


The beauty of distributed systems is that you spread your services across lots of different computers, so that when one of them goes down, they all go down.


“A distributed system is the one that prevents you from working because of the failure of a machine that you had never heard of.”


and typically that machine is the authentication server


Joking aside, more typically the server you never heard of will be something in the CSP’s control plane.

But that’s just a distraction. We hear about those outages once in a blue moon, because many rely on them. What we don’t hear about is that any given colo, managed service, or CSP customer’s apps go down on their own, all the time, not because of the colo or CSP.

Such outages are banal, so we forget how much more likely they are, and fail to risk-weight our engineering efforts accordingly.


It's not really distributed if there's a single point of failure?


Redundancy isn't the only reason to make a system distributed. There's also latency, cost to scale, and federated management. DNS is very distributed but redundancy is opt-in.

Many distributed systems try to be reliable by retrying failed nodes' work, but designs aren't always sound.


Not sure what you mean here. Would a similarly architected non-distributed system have higher levels of availability?


He was joking :) A buggy or poorly designed distributed system might behave more like a house of cards than like independent silos.


I think they're being sarcastic? The idea with a distributed system is that if one part goes down it can still mostly function, which Google has managed to fail at somehow.


Is this the beginning of Google's wind-down of "cloud services", as part of the de-emphasis of Alphabet and the refocus on the core ad business?


Yes, they're moving out of a profitable business that they have a lot of knowledge about.

and they start this move with an outage during the busiest time of the year

why not?


> Yes, they're moving out of a profitable business that they have a lot of knowledge about.

Rackspace is trying to get out of the public cloud game. We had a $20k/mo account with them and the rep called me every month asking if we wanted their help (professional services) migrating to AWS. I was baffled, but they admitted they don't want to be in the cloud game anymore and want to get customers off their platform.


The notification on their website is interesting:

Rackspace announced that it has completed the acquisition of Onica, an Amazon Web Services (AWS) Partner Network (APN) Premier Consulting Partner and AWS Managed Service Provider.


The reason they’re moving out is because the cloud business is not profitable for Rackspace.


My point here is that, while this is a big issue for Google Cloud customers, it's not a big deal for Google. Worst case for Google is they issue a credit for more Google Cloud services, not to exceed half the monthly payment.[1] Google Cloud is, what, 3% of volume? AWS is Amazon's big profit center, but for Google, it's not as important. Nothing to bring in top management for on a weekend.

Now if ad revenue collection went down...

[1] https://cloud.google.com/compute/sla


Respectfully Mr. Nagle, I expect more intelligent {commentary,questioning} from you, especially given your breadth of experience and background.

A mere infrastructure issue, which occurs with all providers from time to time, would hardly constitute a service "wind-down". Based on a very cursory review, GCP's official deprecation policy [1] doesn't seem all that dissimilar from everyone else's.

In short, to answer your question: No.

[1] https://cloud.google.com/terms/deprecation



