I'll also admit to having difficulty understanding how is all this distinct from...

dboreham · on Feb 8, 2022

> I'll also admit to having difficulty understanding how is all this distinct from non-CRDT replication mechanisms.

This is because "CRDT" is not about a new or different approach to replication, although for some reason this has become a widely held perception. CRDT is about a new approach to the _analysis_ of replication mechanisms, using order theory. If you read the original CRDT paper(s) you'll find old-school mechanisms like Lamport Clocks.

So when someone says "we're using a CRDT" this can be translated as: "we're using an eventually consistent replication mechanism proven to converge using techniques from the CRDT paper".

lxpz · on Feb 8, 2022

Thanks!

I'll give a bit more details about CRDTs, then.

The thing is, if you don't have CRDT, the only way to replicate things over nodes in such a way that they end up in consistent states, is to have a way of ordering operations so that all nodes apply them in the same order, which is costly.

Let me give you a small example. Suppose we have a very simple key-value storage, and that two clients are writing different values at the same time on the same key. The first one will invoke write(k, v1), and the second one will invoke write(k, v2), where v1 and v2 are different values.

If all nodes receive the two write operation but don't have a way to know which one came first, some will receive v1 before v2, and end up with value v2 as the last written values, and other nodes will receive v2 before v1 meaning the will keep v1 as the definitive value. The system is now in an inconsistent state.

There are several ways to avoid this.

The first one is Raft consensus: all write operations will go through a specific node of the system, the leader, which is responsible for putting them in order and informing everyone of the order it selected for the operations. This adds a cost of talking to the leader at each operation, as well as a cost of simply selecting which node is the leader node.

CRDT are another way to ensure that we have a consistent result after applying the two writes, not by having a leader that puts everything in a certain order, but by embedding certain metadata with the write operation itself, which is enough to disambiguate between the two writes in a consistent fashion.

In our example, now, the node that does write(k, v1) will for instance generate a timestamp ts1 that corresponded to the (approximate) time at which v1 was written, and it will also generate a UUID id1. Similarly, the nodes that does write(k, v2) will generate ts2 and id2. Now when they send their new values to other nodes in the network, they will send along their values of ts1, id1, ts2 and id2. Nodes now know enough to always deterministcally select either v1 or v2, consistently everywhere: if ts1 > ts2 or ts1 = ts2 and id1 > id2, then v1 is selected, otherwise v2 is selected (we suppose that id1 = id2 has a negligible probability of happening). In terms of message round-trips between nodes, we see that with this new method, nodes simply communicate once with all other nodes in the network, which is much faster than having to pass through a leader.

Here the example uses timestamps as a way to disambiguate between v1 and v2, but CRDTs are more general and other ways of handling concurrent updates can be devised. The core property is that concurrent operations are combined a-posteriori using a deterministic rule once they reach the storage nodes, and not a-priori by putting them in a certain order.

eternalban · on Feb 8, 2022

Got it, thanks. This reminded me of e-Paxos (e for Egalitarian) and a bit of digging there are now 2 additional contenders (Atlas and Gryff, both 2020) in this space per below:

http://rystsov.info/2020/06/27/pacified-consensus.html

lxpz · on Feb 8, 2022

Very interesting, thanks for sharing.

rubiquity · on Feb 9, 2022

This sounds like ABD but avoiding the read round trip as long as the register can be modeled with a CRDT. It's interesting to see this applied to a metadata portion of an object storage system. I'm sure this has implications on how rich of storage APIs you can offer. Can you do conditional writes? Can old versions of a register be compacted to reduce metadata space? How does garbage collecting dead blobs work?

It would be helpful to see specific goals of the project laid out, especially since this bucks the trend of being an enterprise/production grade object storage service (and that's totally ok).

lxpz · on Feb 9, 2022

> Can you do conditional writes?

If by that you mean something like compare-and-swap or test-and-set, then no, that would require consensus number > 1 [0]

> Can old versions of a register be compacted to reduce metadata space?

Yes, when using last-write-wins registers, old versions are deleted as soon as they are overseded by a newer version. Almost all metadata stored in Garage uses this mechanism.

> How does garbage collecting dead blobs work?

Nodes have a shared table of block references, that identifies for each data block the set of objects (or rather, versions of objects) that use this block. Block references have a "deleted" flag that is set to true when the version in question is deleted from the store. Once set to true, it cannot go back, but a new version can always be created that references the same block. Nodes keep a local count of the number of non-deleted block references for each block, so that when it reaches zero, they queue up the block for deletion (after a small delay). There is also a metadata GC system so that block references with the deleted flag are dropped from the table after a certain time.

[0] https://en.wikipedia.org/wiki/Consensus_(computer_science)#C...