
Given all the good work ZFS does locally, it makes you wonder what it would take to extend the concepts of ARC caching and RAID redundancy to a distributed system: one where all the nodes are joined by RDMA rather than Ethernet, and where reliability can be taken for granted (short of a rat chewing through cables).

It would make for one heck of a FreeBSD development project grant, considering how superb its ZFS support and its networking stack already are separately.

P.S. Glad someone pointed this out tactfully. A lot of people would have pounced on the chance to mock the poor commenter who just didn't know what he didn't know. The culture around software development falsely equates being opinionated with being knowledgeable, so hopefully more people work to reduce the stigma of not knowing and of saying "I don't know".



This is a hobby project I’ve been thinking about for quite a while. It’s way larger than a hobby project, though.

I think the key to making it horizontally scalable is to allow each writable dataset to be managed by a single node at a time. Writes would go to blocks reserved for use by a particular node, but at least some of those blocks would be on remote drives via NVMe-oF or similar. All writes would be treated as sync writes so another node could take over losslessly via ZIL replay.
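
A minimal sketch of that idea, with every name invented for illustration (nothing here is real ZFS code): a single node owns all writes to a dataset, each write lands in blocks reserved for that node and is logged synchronously (ZIL-style) on storage other nodes can reach over NVMe-oF, and a surviving node takes over by replaying that log.

    from dataclasses import dataclass, field


    @dataclass
    class IntentLogRecord:
        txg: int        # transaction group the write belongs to
        block: int      # block address drawn from the owner's reserved range
        payload: bytes


    @dataclass
    class DatasetOwner:
        dataset: str
        node: str
        reserved_blocks: range                      # only this node may allocate here
        log: list[IntentLogRecord] = field(default_factory=list)
        state: dict[int, bytes] = field(default_factory=dict)

        def sync_write(self, txg: int, block: int, payload: bytes) -> None:
            if block not in self.reserved_blocks:
                raise ValueError("block not reserved for this node")
            # Treat every write as a sync write: the record must be durable on
            # the shared devices before the caller is acknowledged.
            self.log.append(IntentLogRecord(txg, block, payload))
            self.state[block] = payload

        def takeover(self, new_node: str) -> "DatasetOwner":
            # Lossless failover: the new owner replays the durable intent log
            # before accepting any new writes for this dataset.
            successor = DatasetOwner(self.dataset, new_node, self.reserved_blocks)
            for record in self.log:
                successor.state[record.block] = record.payload
            successor.log = list(self.log)
            return successor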

Read-only datasets (via property or snapshot, including clone origins) could be read directly from any node. Repair of blocks would be handled by the specific node responsible for that dataset.
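
Roughly, the routing rule could look like the sketch below; route_request and owner_of are made-up names, just to show that immutable reads stay local while repairs and writes go through the single responsible node.

    def route_request(op: str, dataset: str, *, read_only: bool,
                      local_node: str, owner_of: dict[str, str]) -> str:
        """Return the node that should handle this request."""
        if op == "read" and read_only:
            return local_node          # any node may read immutable data directly
        if op == "repair":
            return owner_of[dataset]   # only the responsible node rewrites blocks
        return owner_of[dataset]       # writes always go through the owner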

A primary node would be responsible for managing the association between nodes and datasets, including balancing load and handling failover. It would probably also be responsible for metadata changes (datasets, properties, nodes, devices, etc., not POSIX filesystem metadata) and for the coordination those changes require across nodes.
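
As a sketch of that primary-node role (class and method names invented; real coordination would also need quorum or consensus, which is omitted here): it tracks which node owns each writable dataset, assigns new datasets to the least-loaded node, and reassigns ownership when a node fails, after which the new owner would do the ZIL replay described above.

    from collections import defaultdict


    class Coordinator:
        def __init__(self, nodes: list[str]):
            self.nodes = set(nodes)
            self.owner: dict[str, str] = {}                 # dataset -> owning node
            self.load: dict[str, int] = defaultdict(int)    # node -> datasets owned

        def assign(self, dataset: str) -> str:
            # Balance load by handing the dataset to the least-loaded node.
            node = min(self.nodes, key=lambda n: self.load[n])
            self.owner[dataset] = node
            self.load[node] += 1
            return node

        def handle_node_failure(self, failed: str) -> None:
            self.nodes.discard(failed)
            self.load.pop(failed, None)
            # Reassign every dataset the failed node owned; each new owner
            # replays the intent log before accepting writes.
            for dataset, node in list(self.owner.items()):
                if node == failed:
                    self.assign(dataset)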

I don’t feel like I have a good handle on how TXG syncs would happen, but I don’t think that is insurmountable.


Even if you were to build a ZFS mega-machine with an exabyte of storage and RDMA (the latencies of "normal" datacenter Ethernet would probably not be good enough), wouldn't you still have the problem that ZFS is fundamentally designed to be managed by and accessed on one machine? All data in and out of it would have to flow through that machine, which would be quite the bottleneck.


Because RDMA latency is still far lower than disk access latency, it depends more on whether the control logic can be generalized to distributed scale with some straightforward refactoring and a few calls to remote shared memory, or whether a full rewrite would be less time-consuming. I don't know, and I don't pretend to know.

All I know is that the semantics of RDMA (absent any experience writing code that uses it) tempt me into thinking there's some chance I could try it and not end up regretting the time spent on a proof of concept.


If your entire system is connected via RDMA networks (rather common in HPC), I would not worry about latency at all. If you are buying NICs and switches capable of 100 Gb/s or better, there's a reasonable chance they support RoCE.



