Hacker News

At my company, we're actually doing that as well, but we took it further: no tiers at all.

We do load-balancing at the client: every server has a public IP address, and we use DNS round-robin only to distribute the login phase; after that, server selection is location-based, with re-connect logic in the app.

We don't store the entire database on every node, but we do the Cassandra-style equivalent: each datacenter has all of the data. We can store all of the data in a single datacenter.

We never have to worry if the "database" is down, because the app server is also down whenever it is.

We even have global visibility on all machines in a Cassandra-like design -- something that took Google's Spanner team custom GPS and atomic timekeeping hardware to achieve. We're doing it with a lightweight consensus algorithm that achieves a reliable commit timestamp globally, and NTP. We can take a global snapshot, or return a globally-consistent query across all data at any time.

I think we'll see a lot of architectures built like this in the (near) future. Collapsing the tiers is great for maintainability and performance.

----

FWIW I found http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html to be...lacking. He ignores all problems related to multi-update completely. Nathan's solution does nothing to solve consistency problems except in the trivial situation where every piece of data can be "updated" (really appended) completely in isolation from any other update.

You actually do need multi-data consistency somewhere to write virtually all non-trivial apps, and both Spanner and the architecture we use deal with that in a sane way that developers can actually use.

Nathan is definitely on the path towards how to implement such a beast; hopefully his book will have a fuller implementation of where he's heading.



We do this at instantdomainsearch.com too.

Our domain database is relatively small and changes slowly, so we do daily updates and distribute snapshots to all the servers. We have copies in 4 different AWS regions, and all the load-balancing and failover is done in client-side JavaScript.

One server in each region is plenty for our current traffic levels, but if we need to, scaling horizontally is trivial. More important for us is that the in-browser load-balancing is based on measured latency, which ensures that each user gets the snappiest interface that we can provide.
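The measured-latency selection described above is actually done in in-browser JavaScript, but the logic can be sketched in Python; the server list and `/ping` probe path here are invented for illustration:

```python
import time
import urllib.request

# Hypothetical server list and probe path -- not taken from the comment.
SERVERS = [
    "https://us-east-1.example.com",
    "https://eu-west-1.example.com",
]

def measure_latency(base_url, timeout=2.0):
    """Round-trip time of a tiny probe request. Unreachable servers get
    infinite latency, so failover falls out of the sort order."""
    start = time.monotonic()
    try:
        urllib.request.urlopen(base_url + "/ping", timeout=timeout)
    except OSError:
        return float("inf")
    return time.monotonic() - start

def pick_server(servers=SERVERS, latency=measure_latency):
    """Route requests to whichever server currently answers fastest."""
    return min(servers, key=latency)
```

Because a dead server measures as infinitely slow, "pick the fastest" and "fail over from a down server" are the same operation.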

All in all, it works really well.


At Room Key we're using Amazon's DNS solution for load balancing (lowest latency).


When Amazon came out with DNS-based load-balancing, we thought about switching, but it'd be a downgrade from what we've got already. We direct traffic to the server that's showing the lowest latency right now and switch to another server if the measured latency changes. Amazon's latency data will be much slower to adapt to network conditions, and with DNS caching, it wouldn't provide fail-over if the lowest-latency server happened to go down or become unreachable.

Amazon's scheme is pretty nifty though, and we'd definitely use it if we didn't already have a more dynamic solution in place.


I don't think your solution is comparable to Spanner, because you said somewhere you have to write custom merge functions for data. Spanner doesn't impose that on application developers.

Your system sounds really interesting though, thanks for writing it up. For most people, the hardware requirement of a Spanner-like solution does seem impractical, so maybe that's a good tradeoff.

I'm curious what industry this is in? Or what company? 17ms is really low! I wish more sites would put that much effort into their latency.


Would you mind taking a little time to explain the architecture again? This is clearly a new cut at the distributed problem, and I am not clear how it fits together.

afaik you have 3+ data centers, and need to ensure consistency between them. They agree on an exact time (hmmm how - NTP offsets where each data center acts as stratum two for the others, eventually you see each other's drift???) anyway, exact time agreed, timestamps on every transaction, labelled UUIDs, and then send them as updates to the other centers. Is that roughly it? Or am I way off

how do you deal with collisions - if I sell the same room twice in two data centers who loses, how is the business process handled?

Thx


Sure.

afaik you have 3+ data centers, and need to ensure consistency between them. They agree on an exact time (hmmm how - NTP offsets where each data center acts as stratum two for the others, eventually you see each other's drift???) anyway, exact time agreed, timestamps on every transaction, labelled UUIDs, and then send them as updates to the other centers. Is that roughly it? Or am I way off

The servers do not agree on exact time, but they do agree on the timestamp for a particular update, using a lightweight consensus protocol.

So, here are the properties that actually matter:

1. A client is connected to a node in a single datacenter.

2. Multiple clients are connected to multiple nodes across datacenters.

3. At timestamp T, all clients should be able to read the same data from every datacenter if they can read from any datacenter (this is what is meant by "global consistency").

4. It should always be possible to read as of time T even when the actual time is some later T'. (This is what's needed to support snapshots.)

We label an arbitrary collection of records as an "entity". All records for an entity can be stored on a single machine, and are updated atomically (transactionally). On average, an entity should be < 1MB in size, but we currently allow them to get as large as 64MB if needed. (We serialize updates to the same entity in the same datacenter; in practice, we could use Paxos here (like Google), but our particular application doesn't need that complexity.)

Entities are identified by a UUID, and a particular commit id refers to a particular version of an entity. (We do append updates.) The commit id is a git-style hash of the contents and the parent commit(s). This facilitates both auditing and merging (see below).
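A minimal sketch of such a content-addressed commit id, using SHA-1 over a canonical JSON encoding (the real serialization is not described here, so the encoding is an assumption):

```python
import hashlib
import json

def commit_id(records, parent_ids):
    """Git-style content hash over the records and parent commit ids.
    No timestamp is hashed in, so the same update made independently at
    two datacenters yields the same commit id."""
    payload = json.dumps(
        {"records": records, "parents": sorted(parent_ids)},
        sort_keys=True,  # canonical encoding: same content, same bytes
    ).encode()
    return hashlib.sha1(payload).hexdigest()
```

Hashing parents along with contents gives the audit trail: any commit id pins down the entire history that led to it, exactly as in git.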

I'll cover how that commit gets a timestamp and thus becomes globally visible as part of the next question.

how do you deal with collisions - if I sell the same room twice in two data centers who loses, how is the business process handled?

The same entity can be updated at multiple datacenters containing it at the "same" time. If these updates are identical, they result in the same commit id (remember, it's a hash); otherwise, the commits will be different. (Unlike git, we DO NOT hash a timestamp with the commit.)

When committed, initially there is no timestamp. The commit is "in the system", but it has not yet become visible, except (conditionally) at the datacenter where it was just written. It's not globally visible at this point, and we don't let any other client see it.

Next, the datacenter "votes" for the commit by associating it with a timestamp, and also forwards the commit/vote/timestamp to another datacenter. That datacenter receives the commit and, if it has no conflicting commit, votes with its own timestamp, then forwards it on to the next datacenter, recursively, until it has reached every datacenter.

A commit in our system becomes globally visible precisely when all datacenters have voted. The commit is made globally visible at the latest vote/timestamp.

At this point, we are guaranteed two things:

1. All datacenters have the commit.

2. They have all voted that there is no conflict, and each has provided a timestamp.
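The all-votes visibility rule is small enough to sketch directly (names are illustrative, not from the actual system):

```python
def visibility_timestamp(votes, datacenters):
    """votes maps datacenter -> vote timestamp. A commit becomes globally
    visible only once every datacenter has voted, and its visibility
    timestamp is the latest vote received."""
    if set(votes) != set(datacenters):
        return None  # still pending: some datacenter has not voted yet
    return max(votes.values())
```

Using the latest vote as the visibility timestamp is what makes the "read at time T from any datacenter" guarantee work: no datacenter voted later than that, so all of them can serve the commit at T.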

So what about the case of a conflict? When that occurs, by definition, one of the datacenters will not vote. Instead, what happens is our version of "read repair", which is really just a merge between the two commits (think: git). The merged commit and its parent commits are sent on to the next datacenter, and also given a vote/timestamp. (Our merge procedure is a pure function, and can be customized per entity.)
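A sketch of what such a merge might look like, assuming dict-shaped records; `union_merge` is a deliberately simple stand-in for a real per-entity merge function:

```python
def union_merge(records_a, records_b):
    """Illustrative pure merge policy: union of records with a
    deterministic tie-break on conflicts. A real entity would register
    a domain-specific merge function instead."""
    merged = dict(records_a)
    for key, value in records_b.items():
        merged[key] = value if key not in merged else max(merged[key], value)
    return merged

def merge_commits(commit_a, commit_b, merge_fn=union_merge):
    """Build a merge commit with two parents, like git. Because merge_fn
    is pure, every datacenter derives the identical merged state, and
    therefore the identical content-hash commit id."""
    return {
        "records": merge_fn(commit_a["records"], commit_b["records"]),
        "parents": sorted([commit_a["id"], commit_b["id"]]),
    }
```

Purity matters here: each datacenter can perform the merge independently, in any order, and still converge on the same result without further coordination.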

The only remaining details are as follows:

1. How does each datacenter know when it has received all votes?

2. How do we handle partitions, which by definition prevent voting?

We handle 1. with a global datacenter topology that is itself distributed as an entity, so it, too, becomes globally visible atomically.

We handle 2. by arranging our datacenters in a tree topology. When a partition occurs, we disable branches lower in the tree, and we empower the parent node to "vote" for the disabled node.

In practice, thanks to this topology, we don't actually need to send around all of the votes; we can actually just forward the timestamp of the parent node to its parent, recursively.
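The parent-votes-for-a-disabled-child rule could look something like this (the tree shape and names are hypothetical):

```python
def effective_voter(dc, parent_of, partitioned):
    """parent_of maps child datacenter -> parent datacenter. If dc has
    been partitioned away, walk up the tree until a reachable ancestor
    is found; that ancestor is empowered to cast dc's vote."""
    while dc in partitioned:
        dc = parent_of[dc]  # a partitioned root is not handled in this sketch
    return dc
```

Because the topology is a tree, "who votes for whom" is always well defined: every disabled branch has exactly one nearest reachable ancestor.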

NTP works for all of this because the datacenters all agree on the timestamp, since it's part of the vote. Each datacenter maintains the invariant that vote timestamps are monotonically non-decreasing up the datacenter tree, so we'll never have a situation where an "old" commit becomes visible after a new commit. A datacenter can only vote (for a particular dataset) with a timestamp greater than its previous vote.
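The monotonic-vote invariant can be sketched in a few lines; `DatasetClock` is an invented name for illustration:

```python
class DatasetClock:
    """Per-dataset vote clock. Each new vote timestamp must be at least
    the previous one, so an "old" commit can never become visible after
    a newer commit even when NTP lets wall clocks drift slightly."""

    def __init__(self):
        self.last_vote = 0.0

    def next_vote(self, wall_clock_now):
        # Monotonic invariant: never vote earlier than we already voted.
        self.last_vote = max(wall_clock_now, self.last_vote)
        return self.last_vote
```

If the wall clock slips backward between votes, the vote timestamp simply holds steady instead of regressing, which is all the ordering guarantee needs.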

At the business process level, we basically don't react to a commit until it's globally visible. That's when consensus is achieved and it's safe to move forward with a workflow, such as updating other entities.

Anyway, hope that helps!


Erich

That's a fantastic response, thank you. Sorry for the delay - ironically, timezones are a problem!

I have been wanting to think a bit more about the time / global consistency architecture that is creeping up, so if you do not mind could I ask a couple more questions

1. Your approach seems to conflate system time and real-world time. That is, do you actually store the real-world time an event took place, as well as the time the datacenters all agreed? (I am assuming vote/timestamp is an MVCC-style identifier.)

2. If I added time.sleep(5) to datacenter X on each vote, it seems that there is a ratchet effect - since all datacenters take the latest timestamp, and no one can vote with an earlier timestamp than agreed, you will always start pushing out to the future - how can you recover? Or am I off?

3. Just how much traffic is there between your data centers - is it worth it?

In general I like the distributed datacenter / global consistency approach based on a form of MVCC and I like your voting approach. However 3. is my biggest question - sharding does seem such a simpler fix. What is the problem you are solving with everything in each datacenter?


What is the problem you are solving with everything in each datacenter?

We have two simultaneous goals overall:

1. Erlang-like uptime (preferably, as close to 100% as possible). Our product requires this, because downtime is extremely costly for our customers. Our product is correspondingly expensive, although not overly so.

2. Extremely low latency. This is why we replicate a dataset to multiple datacenters in the first place, and why all of them are always "online", in the sense that failover can be instantaneous. We want each device that connects (especially mobile devices) to be near the dataset.

We actually don't use the highest-level datacenters (topologically speaking) to service clients most of the time, although we can if a particular client (usually, a mobile device) has better latency to one than to other datacenters. The physical datacenter cost goes up considerably as you proceed up the tree, since we increase the replication factor at each level (maximum of 7). We really, really don't want downtime.

So that's why we do this: crazy high uptime coupled with very low-latency. We also put a datacenter with a single dataset on a customer's premises, which gives us incredibly low latency on their local LAN for reads, which of course dominate our system. Our target is a 99.9% mean < 17ms measured at the client (so, round-trip). An edge datacenter is still part of the whole datacenter network, so if it goes down, the client connections will switch automatically to the next fastest datacenter containing their dataset (generally, the next one up in the tree).

Finally, our maximum datacenter size is currently 48 nodes. There is no one datacenter in our system that holds every dataset for every customer. We assign new datasets to existing datacenters using a scale-free (randomized) algorithm, which also helps limit downtime from random failures.

Timestamps are actually per dataset, so drift isn't a datacenter-wide issue. Nevertheless, persistent, large drift would cause timestamps to move ahead of clients, which would be weird. Currently, that never happens.

A parent rejects obviously bad data from a child datacenter, such as a vote timestamp 5 seconds in the future. That would simply (temporarily) mark the datacenter as down. Timestamps are not set by external clients, so this isn't really a concern in practice; it's more of a bug-detection mechanism.

The traffic between datacenters is actually extremely minimal, akin to what git does in a push. We send only de-duplicated data between datacenters, not an entire entity. So, for example, if you had an entity with 100 records, and you updated one of them, we'd send the new commit and the updated record (only); the other 99 records would not be sent. The data is also compressed.
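A sketch of that git-push-style deduplication, assuming records are small dicts and SHA-1 content hashes (the encoding details are invented for illustration):

```python
import hashlib

def record_hash(record):
    """Content hash of a single record (illustrative encoding)."""
    return hashlib.sha1(repr(sorted(record.items())).encode()).hexdigest()

def push_delta(local_entity, remote_hashes):
    """git-push-style transfer: send only the records whose content the
    receiving datacenter does not already have."""
    return {name: rec for name, rec in local_entity.items()
            if record_hash(rec) not in remote_hashes}
```

With a 100-record entity where one record changed, the delta carries exactly one record, matching the "send the new commit and the updated record only" behavior described above.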

We do not replicate secondary indexes. This is one of the problems (IMO) with Cassandra's native multi-datacenter functionality. We expand a de-duplicated commit at each datacenter into a coalesced commit (everything in the entity is now physically close together on-disk), and we fully expand it out into any secondary indexes, which can be 100x the data of the coalesced entity.

Notably, secondary indexes are also globally consistent in our system, not just the primary entity. I don't know if this is true in Spanner or not -- if anyone knows, do share! :)


I had not fully understood the parent-child datacenter relationship. So clients by preference update child nodes, and then all nodes vote, but if all siblings agree then all parents will too (assuming no clients connected further up the tree). Then everything else is replication.

Seems topology plays a fairly big role, larger than I expected. Thank you. I am sorry if it seems like I was trying to steal your innermost secrets, but I find that unless I know enough that I (think I) can rebuild it, I really have not understood.

What I think is exciting now is that there is a rush of new distributed solutions from many different directions - the RDBMS has held sway since the late 70s, and now hierarchical and network designs are coming back into vogue - and almost anyone can play (not that you are just anyone, but you don't need to be Oracle or Sybase).

Part of me misses the old certainties of RDBMSes and part of me is excited - I have to know the tech and choose the best option for the domain.

One final question - more business-oriented. You grew a solution to meet specific business needs - how long was it from first idea to live client connection, and was there business support for the time it would take (i.e., were you constantly fighting for time, or were you given space and time)?

hard to answer I suspect but interested anyway

cheers


The system has been in development since 2007, with many iterations to arrive at the current architecture, so it'd be hard to give a particular date.


Thank you for kindly letting me / us all into your system.

Good luck




