
Given all the good work ZFS does locally, it makes you wonder what it would take to extend the concepts of ARC caching and RAID redundancy to a distributed system: one where all the nodes are joined by RDMA rather than Ethernet, and where reliability can be taken for granted (short of a rat chewing through cables).

It would make for one heck of a FreeBSD development project grant, considering how superb its ZFS support and its networking stack already are separately.

P.S. Glad someone pointed this out tactfully. A lot of people would have pounced on the chance to mock the poor commenter who just didn't know what he didn't know. The culture around software development falsely equates being opinionated with being knowledgeable, so hopefully more people work to reduce the stigma of not knowing and of saying "I don't know".



This is a hobby project I’ve been thinking about for quite a while. It’s way larger than a hobby project, though.

I think the key to making it horizontally scalable is to allow each writable dataset to be managed by a single node at a time. Writes would go to blocks reserved for use by a particular node, but at least some of those blocks would be on remote drives via NVMe-oF or similar. All writes would be treated as sync writes so another node could take over losslessly via ZIL replay.
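
A minimal sketch of that idea, with every name invented for illustration (nothing here is real ZFS code): a single node owns all writes to a dataset, each write lands in blocks reserved for that node and is logged synchronously (ZIL-style) on storage other nodes can reach over NVMe-oF, and a surviving node takes over by replaying that log.

    from dataclasses import dataclass, field


    @dataclass
    class IntentLogRecord:
        txg: int        # transaction group the write belongs to
        block: int      # block address drawn from the owner's reserved range
        payload: bytes


    @dataclass
    class DatasetOwner:
        dataset: str
        node: str
        reserved_blocks: range                      # only this node may allocate here
        log: list[IntentLogRecord] = field(default_factory=list)
        state: dict[int, bytes] = field(default_factory=dict)

        def sync_write(self, txg: int, block: int, payload: bytes) -> None:
            if block not in self.reserved_blocks:
                raise ValueError("block not reserved for this node")
            # Treat every write as a sync write: the record must be durable on
            # the shared devices before the caller is acknowledged.
            self.log.append(IntentLogRecord(txg, block, payload))
            self.state[block] = payload

        def takeover(self, new_node: str) -> "DatasetOwner":
            # Lossless failover: the new owner replays the durable intent log
            # before accepting any new writes for this dataset.
            successor = DatasetOwner(self.dataset, new_node, self.reserved_blocks)
            for record in self.log:
                successor.state[record.block] = record.payload
            successor.log = list(self.log)
            return successor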

Read-only datasets (via property or snapshot, including clone origins) could be read directly from any node. Repair of blocks would be handled by the specific node responsible for that dataset.
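
Roughly, the routing rule could look like the sketch below; route_request and owner_of are made-up names, just to show that immutable reads stay local while repairs and writes go through the single responsible node.

    def route_request(op: str, dataset: str, *, read_only: bool,
                      local_node: str, owner_of: dict[str, str]) -> str:
        """Return the node that should handle this request."""
        if op == "read" and read_only:
            return local_node          # any node may read immutable data directly
        if op == "repair":
            return owner_of[dataset]   # only the responsible node rewrites blocks
        return owner_of[dataset]       # writes always go through the owner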

A primary node would be responsible for managing the association between nodes and datasets, including balancing load and handling failover. It would probably also be responsible for metadata changes (datasets, properties, nodes, devices, etc., not POSIX filesystem metadata) and for the coordination those changes require across nodes.
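
As a sketch of that primary-node role (class and method names invented; real coordination would also need quorum or consensus, which is omitted here): it tracks which node owns each writable dataset, assigns new datasets to the least-loaded node, and reassigns ownership when a node fails, after which the new owner would do the ZIL replay described above.

    from collections import defaultdict


    class Coordinator:
        def __init__(self, nodes: list[str]):
            self.nodes = set(nodes)
            self.owner: dict[str, str] = {}                 # dataset -> owning node
            self.load: dict[str, int] = defaultdict(int)    # node -> datasets owned

        def assign(self, dataset: str) -> str:
            # Balance load by handing the dataset to the least-loaded node.
            node = min(self.nodes, key=lambda n: self.load[n])
            self.owner[dataset] = node
            self.load[node] += 1
            return node

        def handle_node_failure(self, failed: str) -> None:
            self.nodes.discard(failed)
            self.load.pop(failed, None)
            # Reassign every dataset the failed node owned; each new owner
            # replays the intent log before accepting writes.
            for dataset, node in list(self.owner.items()):
                if node == failed:
                    self.assign(dataset)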

I don’t feel like I have a good handle on how TXG syncs would happen, but I don’t think that is insurmountable.


Even if you were to build a ZFS mega-machine with an exabyte of storage and RDMA (the latencies of "normal" datacenter Ethernet would probably not be good enough), wouldn't you still have the problem that ZFS is fundamentally designed to be managed by and accessed on one machine? All data in and out of it would have to flow through that machine, which would be quite the bottleneck.


Because RDMA latency is still far lower than disk access latency, it depends more on whether the control logic can be generalized to distributed scale with some straightforward refactoring and a few calls to remote shared memory, or whether a full rewrite would be less time-consuming. I don't know, and I don't pretend to know.

All I know is that the semantics of RDMA (absent any experience writing code that uses it) tempt me into thinking there's some chance I could try it and not end up regretting the time spent on a proof of concept.


If your entire system is connected via RDMA networks (rather common in HPC), I would not worry about latency at all. If you are buying NICs and switches capable of 100 Gb/s or better, there's a reasonable chance they support RoCE.



