
At a certain scale, it's better to just take on that responsibility yourself than to rely on someone else to be able to handle that burden. When something fails, the worst thing is having no control over it and being unable to do anything about it.

Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process and had to explain that there was nothing I could do... for five days.

It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.



One of the things I stress when talking to customers is that when one is moving to cloud, it's not just the business model that changes, but the architecture model.

That may sound like cold comfort on the face of it, but it's key to getting the most out of the cloud and exceeding what's possible with an on-prem architecture. Rule #1 is: everything fails. The key advantage of a good cloud provider (and there are many) is not that it can deliver a guarantee against failure (as boulos stated correctly) but that it lets you design for failure. The problems start when the architecture in the cloud simply mirrors the one that was on-premise. There are still some advantages, but they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.

The key to a good cloud deployment is effectively using the features that eliminate single points of failure, so the same storage controller failure that might knock you out on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but it pays off enormously when it comes together.
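As a toy sketch of the idea (the AMI ID, instance type, and zones are placeholders, not a drop-in recipe): spread the same workload across availability zones so that a single zone's storage or host failure can't take the whole service out.

    # Launch identical instances in two availability zones so a failure in one
    # zone (storage included) doesn't take the service down.
    # The AMI ID, instance type, and zones below are placeholder assumptions.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for az in ["us-east-1a", "us-east-1b"]:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder AMI
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )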

Disclosure: I work for AWS.


Most people just want to deal with one provider, though, meaning there has to be a middleman between you and the customer.


That is, quite honestly, an amazing turnaround. Most people aren't in a position to get a replacement delivered from a vendor that quickly, but if you are, awesome. Again though, there's a big gap between simple local storage (we could probably provide a tighter SLO/SLA on Local SSD, for example) and networked storage as a service.

As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.


> Most people aren't in a position to get a replacement delivered from a vendor that quickly

Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.

If you're running a serious business, support contracts are a must.


Physically delivered, no matter customer location? Sign me up ;).


I'm not sure why you're so surprised; this is standard logistics management for warranty replacement services.

The manufacturer specifically asks the customer to keep it informed of where the equipment is physically located, then pre-positions spares at the appropriate depots in order to meet the contractual requirement established when the customer paid them (often a large amount of money) for 4-hour Same Day (24x7x365) coverage for that device.

This isn't how hyperscale folks operate, for the same reason Fortune 100s rarely take out anything more than third-party coverage when their employees rent vehicles: it becomes an actuarial decision. You weigh the number of 'Advance Replacement' warranty contracts and the dollars involved against buying a percentage of spares, keeping them in the datacenter, and RMA'ing the defective component for refund/replacement (on a 3-5 week turnaround).

tl;dr - Operating 100-500 servers, you should likely pay Dell and they'll ship you the spare and a technician to install it; operating >500 servers, you should do the sums and make the best decision for your business; operating >5000 servers, you probably want to just 'cold spare' and replace parts yourself.
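A back-of-the-envelope version of those sums, in Python (every number below is an assumed placeholder, not a quote; plug in your own figures):

    # Rough comparison of per-server 4-hour support contracts vs. self-sparing.
    # All figures are made-up assumptions for illustration only.
    servers = 600
    contract_per_server_per_year = 400   # assumed cost of 4-hour advance-replacement coverage
    server_cost = 6000                   # assumed purchase price per server
    spare_ratio = 0.05                   # assumed: keep 5% of the fleet as cold spares
    amortization_years = 4               # assumed spare hardware lifetime

    contract_per_year = servers * contract_per_server_per_year
    self_spare_per_year = servers * spare_ratio * server_cost / amortization_years

    print(f"support contracts: ${contract_per_year:,.0f}/yr")
    print(f"cold spares:       ${self_spare_per_year:,.0f}/yr")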


We had such a contract with Dell when I worked in HPC (at a large university). Since our churn was so high (top 500 cluster) we had common spare parts on site (RAM, drives) but when we needed a new motherboard it was there within 4 hours.

Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.

As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.


As nixgeek says, this is standard -- for a price.

That said, I suspect most enterprises pay too much; their money might be better spent on triple mirroring across a JBOD than on a platinum service contract for a fancy all-in-one, high-end RAID-y machine.


Back when I worked in a big corp with an actual DC onsite we had similar contracts. I assume they worked anywhere in the US as we had multiple locations. IIRC they were with HP.


At the scale GitLab is talking about, it is usually better to source from a vendor that doesn't provide ultra-fast turnaround and just keep a few spare servers and switches on hand, so you have 0-hour turnaround for hardware failures.

Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.
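For a flavor of what that "replacement in minutes" workflow can look like, here's a minimal Python sketch (the pxelinux path, profile names, and playbook are assumptions for illustration, not any particular shop's layout):

    # Point a replacement node at a PXE installer profile, then configure it.
    # TFTP path, profile names, and the playbook are assumed placeholders.
    import subprocess
    from pathlib import Path

    TFTP_CFG = Path("/var/lib/tftpboot/pxelinux.cfg")   # assumed pxelinux config dir

    def pxe_config_name(mac: str) -> str:
        # pxelinux looks for a file named "01-" plus the MAC, lowercased, dash-separated
        return "01-" + mac.lower().replace(":", "-")

    def provision(mac: str, hostname: str, profile: str = "compute") -> None:
        cfg = TFTP_CFG / pxe_config_name(mac)
        cfg.write_text(
            f"DEFAULT {profile}\n"
            f"LABEL {profile}\n"
            f"  KERNEL installers/{profile}/vmlinuz\n"
            f"  APPEND initrd=installers/{profile}/initrd.img hostname={hostname}\n"
        )
        # Once the OS install finishes, run configuration management against this host only.
        subprocess.run(["ansible-playbook", "site.yml", "--limit", hostname], check=True)

    # provision("aa:bb:cc:dd:ee:ff", "storage-07")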

If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.


I kinda disagree; see my other comment on logistical models. This isn't 5000+ system scale or even 500+ system scale.

There's non-trivial cost involved simply in being staffed to accommodate the model you propose. All of the ODMs have some "sharp edges", and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling, dealing with firmware glitches, etc.

Remember that 500-rack (per buy) scale is table stakes for "those cloud providers"; it is their business, whereas GitLab is a software company. Play to your strengths.


In my experience, you'll be dealing with firmware glitches even with mainstream OEMs. You can avoid a lot of the RFP/RFQ and scale issues by going to an ODM/OEM hybrid like Edge-core or Quanta, or Penguin Computing or Supermicro. If you already have a relationship with Dell or HP, you probably won't get quite as good pricing, but they're still options.

I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.

The scale GitLab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation makes sense. I think 500 racks is a huge overestimate; the cross-over point is closer to 5 racks.


I think you're missing the point a bit -- the claim is not that the business model prevents these guarantees, but that they're inherently difficult for any party to provide.

> had to explain there was nothing I could do... For five days

The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.


Wikipedia, Stack Overflow, and GitHub all do just fine hosting their own physical infrastructure.

It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.

Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see whether it's comparatively cost-effective.


Sure, and GitHub also has literally an order of magnitude more infrastructure than is being discussed in these proposals, and it retains Amazon Web Services and Direct Connect for a bunch of really good reasons.

Transparency from GitLab is excellent, but you shouldn't really generalise statements about cloud suitability without the full picture, or without "walking a mile in their shoes".


Yep, and we at GitLab have Direct Connect to AWS as a requirement for picking out a new colo.


If you guys use AWS, have you taken a look at EBS Provisioned IOPS? It comes to mind here because it lets you specify the desired IOPS for a volume and provides an SLA.

> Providers don't provide a minimum IOPS, so they can just drop you.

The reason I ask is that the blog post generalizes a lot about what cloud providers offer and what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like Provisioned IOPS with EBS, Dedicated Instances with EC2, provisioned throughput with DynamoDB, and so on.
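For what it's worth, provisioning a volume with guaranteed IOPS is only a few lines with boto3 (the size, IOPS rate, zone, and instance ID below are placeholders):

    # Create an EBS volume with provisioned IOPS and attach it to an instance.
    # Size, IOPS rate, availability zone, and instance ID are placeholder assumptions.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=1000,          # GiB
        VolumeType="io1",   # provisioned-IOPS SSD
        Iops=20000,         # the rate AWS commits to deliver
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",   # placeholder instance
                      Device="/dev/sdf")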


EBS goes up to 16TB; we need an order of magnitude more.

You're right that the blog post was generalizing.

BTW, we looked into AWS but didn't want to use an AWS-only solution because of the maximum size, the cost, and the reusability of the solution.


Per disk. You can attach multiple disks to a single EC2 node.

So, RAID multiple EBS volumes and you have a larger disk.
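Roughly like this, assuming Linux md RAID-0 over a handful of provisioned-IOPS volumes (counts, sizes, IOPS, device names, and IDs are all placeholders):

    # Attach several large EBS volumes, then stripe them into one block device.
    # Volume count/size/IOPS, instance ID, and device names are assumed placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"                       # placeholder
    devices = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

    for dev in devices:
        vol = ec2.create_volume(AvailabilityZone="us-east-1a",
                                Size=16000, VolumeType="io1", Iops=20000)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
        ec2.attach_volume(VolumeId=vol["VolumeId"],
                          InstanceId=instance_id, Device=dev)

    # On the instance, assemble the stripe (device names may surface as /dev/xvd*):
    #   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    #         /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    #   mkfs.xfs /dev/md0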


That's not a strawman. It's just a false statement.


Thanks for the correction.


As a point of comparison, we had storage problems that went on for months or years. We had a vendor that was responsive to SLAs, but the problem was bigger than just random hardware failures: the system was fundamentally unsuited to what we were trying to do with it. That's the risk you take when you try to build your own.


In this case, it sounds like the cloud was fundamentally unsuited for what GitLab was trying to do with it. So definitely still a risk!



