
> Sadly, storage isn't a first class service in most clouds

What would constitute a "first-class service" for you? (Or what does "storage" mean to you?)

I think of both Persistent Disk for block and GCS for object storage as fairly reasonable. I agree that the world of "Give me multiple-writer networked block devices, preferably speaking NFS" is still pretty bad.

Usual Disclosure: I work on Google Cloud (and for a long time, specifically Compute Engine).



Storage as a first-class service means that storage is provided to the fabric directly, just as routing is provided to the fabric directly by switches and routers, which is what makes 'networking' a first-class service. In some networks, time is provided by hardware dedicated to that purpose, so time is a "first-class" service there.

"Storage" as a definition is an addressable persistent store of data. But that isn't as useful as one might hope for discussions, so I tend to think of it in terms of the collection of storage components and compute resources that enable those components to be accessed across the "network."

So at Google, a GFS cluster provides "storage," but if the same machines are also running compute jobs, web server back ends, etc., then storage isn't the "only" task of the infrastructure, and that is the definition of "second class" or "not first class." Back in the day, Urs would argue that storage takes so few compute cycles that it made no sense to dedicate an index to serving up blocks of disk. But that also constrained how much storage per index you could service. And that is why, from a TCO perspective, "storage as a service" is cheap when you need the CPUs for other things anyway, but very expensive when you just want storage. I wrote a white paper for Bart comparing GFS cost per gigabyte to NetApp cost per gigabyte, and NetApp was way cheaper because the data wasn't replicated nine times (as mission-critical data was on GFS), and one index (aka one filer head) could talk to 1,000 drives.
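To make the shape of that TCO argument concrete, here is a minimal back-of-the-envelope sketch. All of the numbers (drive sizes, prices, per-machine costs) are made-up assumptions for illustration, not figures from the original whitepaper; the point is just that replication factor and the cost of the serving machines dominate cost per usable gigabyte:

  # Illustrative only: toy $/GB comparison of a replicated cluster vs.
  # a single filer head driving many drives. All numbers are assumptions.

  def cost_per_usable_gb(gb_per_drive, drives, cost_per_drive,
                         head_cost, replication_factor):
      """Hardware cost divided by post-replication usable capacity."""
      usable_gb = (gb_per_drive * drives) / replication_factor
      total_cost = drives * cost_per_drive + head_cost
      return total_cost / usable_gb

  # Hypothetical GFS-style cell: mission-critical data replicated 9x,
  # with each chunkserver being a full machine (amortized into head_cost).
  gfs = cost_per_usable_gb(gb_per_drive=1000, drives=1000,
                           cost_per_drive=100, head_cost=1000 * 2000,
                           replication_factor=9)

  # Hypothetical filer: one head talking to the same 1,000 drives,
  # unreplicated (RAID overhead ignored for simplicity).
  filer = cost_per_usable_gb(gb_per_drive=1000, drives=1000,
                             cost_per_drive=100, head_cost=50000,
                             replication_factor=1)

  print(f"cluster: ${gfs:.2f}/GB, filer: ${filer:.2f}/GB")

With these toy numbers, the replicated cluster comes out over a hundred times more expensive per usable gigabyte, which is the "very expensive when you just want storage" case.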

That same sort of effect happens in cloud services: if you want 100 TB of storage, you end up having to pay for 10 high-storage instances, even if your data-movement needs could be handled by a single server instance with, say, 32 GB of memory. The startup DriveScale is targeting this imbalance for things like Hadoop clusters.
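A rough sketch of that coupling effect, with hypothetical instance shapes and prices (none of these figures come from any real cloud price list):

  # Illustrative only: when storage is coupled to instances, capacity
  # needs drive the instance count, not the compute needs.
  import math

  STORAGE_PER_INSTANCE_TB = 10    # hypothetical "high-storage" shape
  COST_PER_INSTANCE_MONTH = 1500  # hypothetical $/month per instance

  needed_tb = 100
  instances = math.ceil(needed_tb / STORAGE_PER_INSTANCE_TB)  # -> 10
  coupled_cost = instances * COST_PER_INSTANCE_MONTH          # $15,000/mo

  # What the workload actually needs: one modest 32 GB server plus
  # storage billed on its own, at a hypothetical $40/TB-month rate.
  decoupled_cost = 300 + needed_tb * 40                       # $4,300/mo

  print(f"coupled: ${coupled_cost}/mo vs decoupled: ${decoupled_cost}/mo")

The gap between the two figures is pure over-provisioning: nine instances' worth of CPU and memory bought only to get at their disks.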


Plenty of Xooglers want "just give me direct access to Colossus". But priority-wise, the market seems to want:

1. "Give me a block device I can boot off of, mount, and run anything I want to from a single VM" (PD, which is just built on Colossus)

2. "Give me NFS".

3. "Give me massive I/O" (GCS)

I think we're doing fine-ish in 1 and 3. The main competition is a dense pile of drives in a box for Hadoop, but we lean on GCS for that via our HDFS connector (https://cloud.google.com/hadoop/google-cloud-storage-connect...). It's our recommended setup, the default for Dataproc, and honestly better in many ways than running in a single Colossus cell (you get failover in case of a zonal outage, and by the same token you can have lots of users simultaneously running Hadoop or other processing jobs in different zones).
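For a sense of what "leaning on GCS via the HDFS connector" looks like in practice, here is a minimal PySpark sketch. With the connector installed (it comes preinstalled on Dataproc), gs:// paths work anywhere an HDFS path would; the bucket and path names below are hypothetical:

  # Minimal sketch: a word count reading from and writing to GCS.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("gcs-wordcount").getOrCreate()

  # gs:// paths resolve through the GCS connector, no HDFS cluster needed.
  lines = spark.read.text("gs://my-bucket/input/*.txt")
  counts = (lines.rdd
            .flatMap(lambda row: row.value.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

  counts.toDF(["word", "count"]).write.parquet("gs://my-bucket/output/")
  spark.stop()

Because the data lives in GCS rather than in a single cluster's HDFS, the same bucket can be read by jobs running in different zones, which is the failover and multi-user property mentioned above.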

PS - I'm going to go searching for your whitepaper. (I find, when arguing with folks, that the network is the bottleneck for something like a NetApp box, not the CPUs.)

[Edit: Newlines between lists... always forgetting the newlines]


Of course you, the cloud vendor, are doing fine; the question is when it becomes too expensive for your customer. And my thesis is that because storage is not a first-class service, you can't precisely optimize storage spend (or your storage offering) for your customer, and that forces them out of the cloud and into their own managed infrastructure.

You could also improve your operational efficiency, but that isn't a priority yet at the big G. I expect over time it will become one and you'll figure it out, but in the meantime your customer has to over-provision the crap out of their resources to meet their performance needs.

If Bart is still around, it was shared with him and the rest of the 'cost of storage' team back in 2009.



