
> One of the biggest customer-facing effects of this delay was that status.github.com wasn't set to status red until 00:32am UTC, eight minutes after the site became inaccessible. We consider this to be an unacceptably long delay, and will ensure faster communication to our users in the future.

Amazon could learn a thing or two from Github in terms of understanding customer expectations.



I recently stepped into a role with a devops component, and one of my first surprises was just how slow status.aws.amazon.com was to update about ongoing issues. I had to scramble to find confirmation on Twitter and external forums for the client.


What's even worse is that when Amazon finally updates their status page it's usually still a green icon with a little i tick for "information" even if it was a partial outage. It takes a lot for the icons to go red which is what you'd look for if you're experiencing issues.

I do the same thing, often searching Twitter for "aws" or "outage" and finding people complaining about the problem, which confirms my suspicions. It's a sad state of affairs when you have to do this and Amazon doesn't seem interested in fixing it.


The most recent issue that affected me was when all EC2 instances in VPCs couldn't connect to S3. At all.

It wasn't indicated on the status page until after it was fixed. And it was indicated as a green check in a sea of green checks. With a small "i" in the corner to represent the outage.

I love AWS. It's not without fault but overall I think it's been well architected, well documented, and well implemented.

But the status page has got to be the ultimate example of what not to do.


Huh, I wonder if the status page is in fact based on any automated monitoring at all, or just manual updates? I guess probably automated monitoring, just not very good automated monitoring.


IIRC, it's manual and way up the chain.


That is pretty ridiculous.


Considering how slow it is, my guess would be that it's manual.


If you have a support agreement with them then file a ticket requesting better customer communication and link back here as an example of how to do it right.

I think everyone complains in forums and online but doesn't actually file tickets about it. These things are worth tickets too.


I take it you have no experience filing tickets with them. A typical ticket goes something like this:

1. File ticket.

2. Wait. Then wait some more. Even if you pay big money for a support contract, they take a long time to respond (often > 1 hour).

3. Get a response from a first level rep who has no access to anything, has little dev experience, and asks some inane questions which I'm convinced is a purposeful stalling tactic.

4. Play the dumb question/obvious response dance, waiting an hour or more for a response each time.

5. If you are lucky (usually a couple hours in now) they acknowledge there's some problem (but never give you any detail) and escalate your ticket to a higher level internal team. If you are unlucky, you are calling up your account rep (do you even have one??) and getting them to harass tech support.

6. Usually around now the problem "magically" disappears if you haven't already fixed it yourself.

7. If you are lucky, a few hours, days, weeks later you get a response asking if you are still having the problem? You, of course, are NOT having the problem since you long ago solved it yourself. If you are really unlucky they try to schedule a meeting with one of their "solution architects" who is then going to waste an hour of your time telling you how to properly "design" your software for the cloud (i.e. trying to sell you on even more of their services).

8. Ticket is closed having never gotten to the bottom of the problem, maybe get a survey.

I've never seen this go down differently. Filing more tickets isn't going to change this. You want to really change things?

STOP PAYING THEM!

If a few mid-sized customers stop paying them and make a big stink when they do it, then I guarantee you things will change! Until then, they have little incentive to improve and the big customers have a direct line to Amazon so they can circumvent all this crap. It's up to the small and mid-sized customers to push for change and the most effective way to do this would be to spend your money elsewhere.


To be honest, I've always found their support to be really good. Sometimes it can be a little slow to start, but I regularly experience technicians that go way above what I would expect to assist me & deliver a great outcome. If other companies in Australia were as responsive as them (e.g. telcos), I'd be a very happy man. EDIT: I'm on Business Support, so maybe that's your issue?


I'm also in Australia and have nothing but good things to say about AWS support. Our issues are usually solved by the first responder (not necessarily on the first response). The technical skill has generally been pretty good.

But it's not specific to us down under - the support contacts come from all over the globe. We dropped from Business to Developer support when the $A tanked in order to save a buck, and it just takes a little longer is all - no real drop in quality. I wish other large companies had their level of support quality.


I'm on business support too and generally am talking to a rep in minutes. They aren't always able to find the problem before I do, but I always get follow up details later on the how / why that they did determine.


I wish our experience was like this. We used to have business level but we dropped it because we weren't getting value for it. Our experience was slightly better when we had it but we still ended up either fixing most problems on our own or waiting them out.


How much do you pay per month for AWS? That might be a difference.


> Sometimes it can be a little slow to start

Which is unacceptable for one of the largest infrastructure providers. So many times we were sitting around twiddling our thumbs waiting for our expensive amazon support to get back to us when things were broken.


Same experience here. But: I've had luck complaining with a few well-chosen hashtags and mentions on Twitter, getting the attention of a tech lead related to a particular AWS service.

One example: redshift. Had an expensive temporary cluster that couldn't be deleted, for days. Was stuck "pending" or "rebuilding". Assigned account rep would take forever to respond, and just didn't understand, would forward directions to using AWS console. Yeah, DOESN'T WORK. After a week decided to try getting attention on twitter, got it fixed in about 12 hours.


>2. Wait. Then wait some more. Even if you pay big money for a support contract, they take a long time to respond (often > 1 hour).

My experiences don't reflect this; perhaps we are familiar with different levels of support contracts. I use AWS for work only, so I can only speak to one level of their support.

>3. Get a response from a first level rep who has no access to anything, has little dev experience, and asks some inane questions which I'm convinced is a purposeful stalling tactic. 4. Play the dumb question/obvious response dance, waiting an hour or more for a response each time.

I can't agree with this either. I almost always use their chat option and a rep is usually available within 15m unless there is an AWS outage.

I do however completely agree with 5 and 6, but I don't let it bother me. They can't expose too much info about their infrastructure. I'm usually just looking for confirmation of whether or not an issue is on their side, which they have always been willing to provide.

If you're using AWS for business and are unhappy with their current level of support, maybe you should talk with their sales folks to find out about higher tier support plans.


I think a lot of folks feel that it's a useless endeavor, so they don't bother. Amazon's been operating this way for years, and they're quite a large company; it seems unlikely to me that fundamental change can happen inspired by customer tickets, even if you're paying for support.

Basically, if Netflix isn't the source of the complaint, they're not going to give two fucks.

/me suspects that netflix engineers get outage notifications through some other avenue than the status page.


You are pretty spot on about that.

This was the post I was googling for

http://techblog.netflix.com/2015/10/flux-new-approach-to-sys... (prepare to have your mind blown)

And what I found in the Google results

From 2015-04 http://techblog.netflix.com/2015/04/introducing-vector-netfl...

And 2014-01 http://techblog.netflix.com/2014/01/improving-netflixs-opera...

That is some crazy fast innovation there


I've filed tickets about their status page before, especially on the stupid green-checkmark-with-i.

Hasn't worked yet.


Day to day I mostly write software, but I also help manage our infrastructure (we're a small company - 9 people total, 4 engineers, I'm one of the 2 that understands managing servers well enough to support it). We were on Linode up until about a year and change ago and switched to AWS/Opsworks to both decrease our infrastructure bill and increase our ability to scale horizontally quickly (for unfortunately long definitions of quickly - "running setup...")

Both Linode and Amazon suck at their status pages (though linode was quite informative about their DDoS outages that started on Christmas). Every amazon issue we've had, the status page only changed once they'd more or less fixed it. As far as I'm concerned their status page is basically useless unless it's an extended outage, at which point it's still basically useless...


> Amazon could learn a thing or two from Github in terms of understanding customer expectations.

Do you mean that "the cloud provider that is bigger than the next 14 combined and whose jargon has spread through the community" doesn't understand what customers are interested in, or how to deliver on it?


Gonna speak up to defend OP here: I've worn the devops hat for products across multiple "Large Companies" (Amazon and larger scale) and found that for small products where it was me and a few other devs keeping the lights on, we would have outage alerts on status pages/twitter typically _before_ public users even realized something was wrong, since we were all very high touch on the project.

The bigger a project gets, the less prioritized something like a status page often seems to get. Larger entities certainly _have_ them but I often see more things interfering as scale grows (this isn't only a MS thing, let me make clear) whether it be domain switches between engineering and social management (status is often via twitter), feeding the status page via a long telemetry/monitoring platform that has some lag, or a high threshold for what "outage" means to avoid flappy notices (at the cost of some false negatives).

I'm not even going to make a value judgement on the tradeoff of these costs at this point, (I certainly wouldn't dismiss it offhand as a net negative although equally it's not all roses) but at the very least I'd observe that something like a status page _CAN_ be serviced very well from an up and comer (for as much as Github is that any more) and it's far from a true statement that bigCOs can't take learnings from improving customer happiness from newer entities. (In fact, I wish that was a more common practice!)
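
To make the "high threshold to avoid flappy notices" tradeoff above concrete, here's a rough sketch of how a status page could debounce probe results. To be clear, this is purely illustrative and not how Amazon or Github actually do it; the class name, the thresholds, and the idea of a single boolean probe are all my own assumptions.

```python
from dataclasses import dataclass


@dataclass
class StatusDebouncer:
    """Flip the public status only after enough consecutive probes disagree with it.

    Hypothetical sketch: thresholds and the "green"/"red" labels are assumptions.
    """
    fail_threshold: int = 3     # consecutive failed probes needed to go red
    recover_threshold: int = 2  # consecutive good probes needed to go green
    status: str = "green"
    _streak: int = 0            # consecutive probes contradicting `status`

    def observe(self, probe_ok: bool) -> str:
        # A probe contradicts the status when it fails while green
        # or succeeds while red.
        contradicts = probe_ok if self.status == "red" else not probe_ok
        if not contradicts:
            self._streak = 0
            return self.status
        self._streak += 1
        needed = (self.recover_threshold if self.status == "red"
                  else self.fail_threshold)
        if self._streak >= needed:
            self.status = "green" if self.status == "red" else "red"
            self._streak = 0
        return self.status


def replay(probes, **kwargs):
    """Feed a sequence of probe results through a fresh debouncer."""
    debouncer = StatusDebouncer(**kwargs)
    return [debouncer.observe(p) for p in probes]
```

With the default thresholds, `replay([True, False, True, False, False, False, True, True])` stays green through the single failed probe (the false negative) and only goes red on the third consecutive failure. Setting `fail_threshold=1` reacts immediately but would flap on every blip, which is the tradeoff I mean.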


Do you mean that just because a company has huge market share it must be doing every single thing better than its competitors?



