Replication Teams

One of the big issues I’ve dealt with in large distributed systems is the proliferation of failure modes as the system gets larger. Once upon a time we had a custom in-house storage system where every server had two other servers, which we referred to as a triplicate, where each server in the triplicate was a mirror of the other two.

One obvious problem with this setup is that it doesn’t auto-heal. When a server goes down, there is now a SPOF in the datacenter and Tier 0 has to bring the server back up soon or risk temporary unavailability. The upside is that it is super easy for Tier 0 to understand which machines can be downed and when. If M1234A is down for repair, M1234B and M1234C need to stay up. It is also easy to minimize the chances of each of the finite number of failure modes A+B, B+C, A+C and A+B+C. The number of failure modes grows linearly as you add more triplicates.

When we migrated some of our storage to HDFS, we faced the old issue where a given block could be on (almost) any three nodes in the datacenter. The upshot is that if a server goes down, auto-healing can begin immediately. The downside is that the number of failure modes grows factorially with the number of servers in the cluster.

One way of getting the best of both worlds is if administrators can configure “teams” of servers, where the replication of given block (or “range” in the CRDB world) is constrained to a single team. So the proliferation of failure modes is factorially related to the maximum size of a team, but only linearly related to the number of replication teams.

I’ve already verified in my scale tests that CRDB is excellent at surviving entire data centers failing. It is an amazing experience to watch the load generator just keep chugging along and 10,000 wps after killing a whole DC. Even Cassandra experiences perf issues with hint build-up but CRDB just starts healing and eventually its as if nothing ever happened.

Anecdotally however, it seems pretty easy to cause unavailable ranges by removing any combination of two servers from different data centers at the same time. Not an issue when your DCs have 3 or 4 CRDB nodes, but I imagine it will be a very big issue in production if we were to adopt CRDB as our global multi-tenant realtime storage solution.

So here are some questions. Are there configurations for “teams” that I’m not seeing in the docs? Is this on a roadmap? Also, when the inevitable loss of a few ranges does occur, either due to failing machines or due to the failings of the humans charged with their care, what are the steps to abandon those ranges, check for the violation of configured invariants (like foreign keys), clean up if necessary, and move on?

Hi @CodyYancey,

Thanks for your question. We don’t yet support “teams” of servers, but you may be able to manually approximate this functionality using replication zones: Support for “teams” is not currently on the roadmap, but it’s certainly a feature we’ll consider for the future. I’ve created an issue to track this feature request here: Please feel free to add comments or follow it for future updates.

Regarding the loss of ranges, we currently have no way to recover from such losses. If your cluster is large enough that it is likely to experience the loss of two nodes faster than the system can recover from the loss of the first one, we recommend that you increase your replication factor (but we recognize that this is far from ideal). We are looking at ways to recover from lost ranges in our next release (v2.1).

I hope this helps! Please don’t hesitate to ask if you have other questions.

– Becca

@becca I gave the github issue a look over and the ChainSets concept seems exactly like what I mean but did not have the academic vocabulary to articulate. It was a great read. The “teams” wording I actually got reading architecture docs from a certain other distributed ACID datastore that was recently open-sourced (one that unfortunately lacks a good SQL interface).

It’s awesome you guys are looking at recovery from lost ranges! One thing that sprang to mind as an easy win was allowing an operator to just copy an existing range where possible. The loss of 2 ranges causes a locked table right now, but there’s still 1 range that is probably still okay :slight_smile:

As for approximating such constructs with replication zones, I’ve put some thought into how I would do it and I keep coming up a little short on how exactly to make it work using the primitives CRDB provides. The best I can come up with is constraining an entire table to a single “team” which limits per-table scalability. If the per-replica constraint semantics had some extra primitive like “The attribute value for replica 1 must equal the attribute value for replica 2 for any given range” then the problem would actually be 100% solved, but right now you have to provide a hard coded value.

If you have any better ideas, or could at least guide my thinking on the matter, I’d love to hear!

Hi @CodyYancey,

As part of the work on improving disaster recovery, we are planning to allow an offline export from all (remaining) nodes and de-dupe to make something that looks like a backup. There is currently no plan to allow online copy to up-replicate, although things could change.

I don’t think there is currently a way to automate the behavior you want. You could manually do it with our enterprise partitioning feature by applying localities or store attributes to enable the placement of specific ranges on specific stores. But then we wouldn’t be able to automatically up-replicate when a node dies – this would need to be done manually.

– Becca

What was the other db that recently got open sourced ? (fdb?)