
At a certain scale, it's better to just take on that responsibility yourself than to rely on someone else to be able to handle that burden. When something fails, the worst thing is having no control over it and being unable to do anything about it.

Last time I had a failed storage controller, HP delivered a replacement in four hours. Last time a service provider went down, I had no visibility into the repair process and had to explain that there was nothing I could do... for five days.

It seems like you've outright admitted you can't guarantee what they need, yet you still urge them not to leave your business model. Maybe come back when your business model can meet their needs.



One of the things I stress when talking to customers is that when one is moving to cloud, it's not just the business model that changes, but the architecture model.

That may sound like cold comfort on the face of it, but it's key to getting the most out of the cloud and exceeding what's possible with an on-prem architecture. Rule #1 is: everything fails. The key advantage of a good cloud provider (and there are many) is not that it can deliver a guarantee against failure (as boulos stated correctly) but that it lets you design for failure. The problems start when the architecture in the cloud simply mirrors the one that was on-premise. There are still some advantages, but they're markedly fewer, and as you said, there's nothing you can do to prioritize your fix.

The key to a good cloud deployment is effectively using the features that eliminate single points of failure, so the same storage controller failure that might knock you out on-prem can't knock you out in the cloud, even though the repair time for the latter might be longer. That brings its own challenges, but it pays off enormously when it comes together.
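As a toy sketch of the idea (the AMI ID, instance type, and zones are placeholders, not a drop-in recipe): spread the same workload across availability zones so that a single zone's storage or host failure can't take the whole service out.

    # Launch identical instances in two availability zones so a failure in one
    # zone (storage included) doesn't take the service down.
    # The AMI ID, instance type, and zones below are placeholder assumptions.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    for az in ["us-east-1a", "us-east-1b"]:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder AMI
            InstanceType="m5.large",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )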

Disclosure: I work for AWS.


Most people just want to deal with one provider, though, meaning there has to be a middleman between you and the customer.


That is, quite honestly, an amazing turnaround. Most people aren't in a position to get a replacement delivered from a vendor that quickly, but if you are, awesome. Again though, there's a big gap between simple local storage (we could probably provide a tighter SLO/SLA on Local SSD, for example) and networked storage as a service.

As someone else alluded to downthread: anyone claiming they can provide guaranteed throughput and latency on networked block devices at arbitrary percentiles in the face of hardware failure is misleading you. I don't disagree that you might feel better and have more visibility into what's going on when it's your own hardware, but it's an apples and oranges comparison.


> Most people aren't in a position to get a replacement delivered from a vendor that quickly

Back when I worked in enterprise services, it was a requirement that we had good support contracts — 1-, 4- or 8-hour turnarounds were standard.

If you're running a serious business, support contracts are a must.


Physically delivered, no matter customer location? Sign me up ;).


I'm not sure why you're so surprised; this is standard logistics management for warranty replacement services.

The manufacturer specifically asks the customer to keep it informed of where the equipment is physically located, then pre-positions spares at the appropriate depots in order to meet the contractual requirement established when the customer paid them (often a large amount of money) for 4-hour Same Day (24x7x365) coverage for that device.

This isn't how hyperscale folks operate, for the same reason Fortune 100s rarely take out anything more than third-party coverage when their employees rent vehicles: it becomes an actuarial decision. You weigh the number of 'Advance Replacement' warranty contracts and the dollars involved against buying a percentage of spares, keeping them in the datacenter, and RMA'ing the defective component for refund/replacement (on a 3-5 week turnaround).

tl;dr - Operating 100-500 servers, you should likely pay Dell and they'll ship you the spare and a technician to install it; operating >500 servers, you should do the sums and make the best decision for your business; operating >5000 servers, you probably want to just 'cold spare' and replace parts yourself.
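A back-of-the-envelope version of those sums, in Python (every number below is an assumed placeholder, not a quote; plug in your own figures):

    # Rough comparison of per-server 4-hour support contracts vs. self-sparing.
    # All figures are made-up assumptions for illustration only.
    servers = 600
    contract_per_server_per_year = 400   # assumed cost of 4-hour advance-replacement coverage
    server_cost = 6000                   # assumed purchase price per server
    spare_ratio = 0.05                   # assumed: keep 5% of the fleet as cold spares
    amortization_years = 4               # assumed spare hardware lifetime

    contract_per_year = servers * contract_per_server_per_year
    self_spare_per_year = servers * spare_ratio * server_cost / amortization_years

    print(f"support contracts: ${contract_per_year:,.0f}/yr")
    print(f"cold spares:       ${self_spare_per_year:,.0f}/yr")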


We had such a contract with Dell when I worked in HPC (at a large university). Since our churn was so high (top 500 cluster) we had common spare parts on site (RAM, drives) but when we needed a new motherboard it was there within 4 hours.

Now, these contracts aren't like your average SaaS offering. We had a sales rep and the terms of the support contract were personalized to our needs. I imagine some locations were offered better terms for 4 hour service than others.

As I learned long ago, never say no to a customer request. Just provide a quote high enough to make it worthwhile.


As nixgeek says, this is standard -- for a price.

That said, I suspect most enterprises pay too much; their money might be better spent on triple mirroring across a JBOD than on a platinum service contract for a fancy all-in-one, high-end RAID-y machine.


Back when I worked in a big corp with an actual DC onsite we had similar contracts. I assume they worked anywhere in the US as we had multiple locations. IIRC they were with HP.


At the scale GitLab is talking about, it is usually better to source from a vendor that doesn't provide ultra-fast turnaround and just keep a few spare servers and switches on hand, so you have 0-hour turnaround for hardware failures.

Get servers from Quanta, Wistron or Supermicro, and switches from Quanta, Edge-core or Supermicro. The $$$ you save vs a name-brand OEM more than pays for the spares. Use PXE/ONIE and an automation tool like Ansible or Puppet to manage the servers and switches, and you can get a replacement server or switch up and in service in minutes, not hours.
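For a flavor of what that "replacement in minutes" workflow can look like, here's a minimal Python sketch (the pxelinux path, profile names, and playbook are assumptions for illustration, not any particular shop's layout):

    # Point a replacement node at a PXE installer profile, then configure it.
    # TFTP path, profile names, and the playbook are assumed placeholders.
    import subprocess
    from pathlib import Path

    TFTP_CFG = Path("/var/lib/tftpboot/pxelinux.cfg")   # assumed pxelinux config dir

    def pxe_config_name(mac: str) -> str:
        # pxelinux looks for a file named "01-" plus the MAC, lowercased, dash-separated
        return "01-" + mac.lower().replace(":", "-")

    def provision(mac: str, hostname: str, profile: str = "compute") -> None:
        cfg = TFTP_CFG / pxe_config_name(mac)
        cfg.write_text(
            f"DEFAULT {profile}\n"
            f"LABEL {profile}\n"
            f"  KERNEL installers/{profile}/vmlinuz\n"
            f"  APPEND initrd=installers/{profile}/initrd.img hostname={hostname}\n"
        )
        # Once the OS install finishes, run configuration management against this host only.
        subprocess.run(["ansible-playbook", "site.yml", "--limit", hostname], check=True)

    # provision("aa:bb:cc:dd:ee:ff", "storage-07")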

If you're moving out of the cloud to your own infrastructure, it makes sense to build and run similarly to how those cloud providers do.


I kinda disagree; see my other comment on logistical models. This isn't 5000+ system scale or even 500+ system scale.

There's non-trivial cost involved simply in being staffed to accommodate the model you propose. All of the ODMs have some "sharp edges", and you need to be prepared to invest engineering effort into RFP/RFQ processes, tooling, dealing with firmware glitches, etc.

Remember that 500-rack (per buy) scale is table stakes for "those cloud providers"; it is their business, whereas GitLab is a software company. Play to your strengths.


In my experience, you'll be dealing with firmware glitches even with mainstream OEMs. You can avoid a lot of the RFP/RFQ and scale issues by going to an ODM/OEM hybrid like Edge-core or Quanta, or Penguin Computing or Supermicro. If you already have a relationship with Dell or HP, you probably won't get quite as good pricing, but they're still options.

I am shockingly biased (I co-founded Cumulus Networks) but working with a software vendor who can help you with the entire solution is very helpful.

The scale GitLab has talked about in this thread is firmly in the range where self-sparing/ODM/disaggregation makes sense. I think 500 racks is a huge overestimate; the cross-over point is closer to 5 racks.


I think you're missing the point a bit -- the claim is not that the business model prevents these guarantees, but that they're inherently difficult for any party to provide.

> had to explain there was nothing I could do... For five days

The good alternative would be that you controlled the infrastructure and it didn't break. The bad alternative is you directly control infrastructure, it breaks, and then you get fired.


Wikipedia, Stack Overflow, and GitHub all do just fine hosting their own physical infrastructure.

It's untrue that "the cloud" is always the right way. When you're on a multi-tenant system, you will never be the top priority. When you build your own, you are.

Google and AWS have vested interests (~30% margins) in getting you into the cloud. Always do the math to see whether it's comparatively cost-effective.


Sure, and GitHub also has literally an order of magnitude more infrastructure than is being discussed in these proposals, and it retains Amazon Web Services and Direct Connect for a bunch of really good reasons.

Transparency from GitLab is excellent, but you shouldn't really generalise statements about cloud suitability without the full picture, or without "walking a mile in their shoes".


Yep, and we at GitLab have Direct Connect to AWS as a requirement for picking out a new colo.


If you guys use AWS, have you taken a look at EBS Provisioned IOPS? It comes to mind here because it lets you specify the desired IOPS for a volume and provides an SLA.

> Providers don't provide a minimum IOPS, so they can just drop you.

The reason I ask is that the blog post generalizes a lot about what cloud providers offer and what the cloud is capable of, but doesn't explore some of the options available to address those concerns, like Provisioned IOPS with EBS, Dedicated Instances with EC2, provisioned throughput with DynamoDB, and so on.
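For what it's worth, provisioning a volume with guaranteed IOPS is only a few lines with boto3 (the size, IOPS rate, zone, and instance ID below are placeholders):

    # Create an EBS volume with provisioned IOPS and attach it to an instance.
    # Size, IOPS rate, availability zone, and instance ID are placeholder assumptions.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    vol = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=1000,          # GiB
        VolumeType="io1",   # provisioned-IOPS SSD
        Iops=20000,         # the rate AWS commits to deliver
    )
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",   # placeholder instance
                      Device="/dev/sdf")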


EBS goes up to 16TB; we need an order of magnitude more.

You're right that the blog post was generalizing.

BTW, we looked into AWS but didn't want to use an AWS-only solution because of the maximum size, the cost, and the reusability of the solution.


Per disk. You can attach multiple disks to a single EC2 node.

So, RAID multiple EBS volumes and you have a larger disk.
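Roughly like this, assuming Linux md RAID-0 over a handful of provisioned-IOPS volumes (counts, sizes, IOPS, device names, and IDs are all placeholders):

    # Attach several large EBS volumes, then stripe them into one block device.
    # Volume count/size/IOPS, instance ID, and device names are assumed placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    instance_id = "i-0123456789abcdef0"                       # placeholder
    devices = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]

    for dev in devices:
        vol = ec2.create_volume(AvailabilityZone="us-east-1a",
                                Size=16000, VolumeType="io1", Iops=20000)
        ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
        ec2.attach_volume(VolumeId=vol["VolumeId"],
                          InstanceId=instance_id, Device=dev)

    # On the instance, assemble the stripe (device names may surface as /dev/xvd*):
    #   mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    #         /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi
    #   mkfs.xfs /dev/md0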


That's not a strawman. It's just a false statement.


Thanks for the correction.


As a point of comparison, we had storage problems that went on for months or years. We had a vendor that was responsive to SLAs, but the problem was bigger than just random hardware failures: the system was fundamentally unsuited to what we were trying to do with it. That's the risk you take when you try to build your own.


In this case, it sounds like the cloud was fundamentally unsuited for what GitLab was trying to do with it. So definitely still a risk!



