One of the big issues I’ve dealt with in large distributed systems is the proliferation of failure modes as the system gets larger. Once upon a time we ran a custom in-house storage system in which every server was grouped with two others into what we called a triplicate, each server mirroring the other two.
One obvious problem with this setup is that it doesn’t auto-heal. When a server goes down, there is now a SPOF in the datacenter, and Tier 0 has to bring the server back up soon or risk temporary unavailability. The upside is that it is super easy for Tier 0 to understand which machines can be downed and when: if M1234A is down for repair, M1234B and M1234C need to stay up. It is also easy to minimize the chances of each of the four multi-failure modes: A+B, B+C, A+C, and A+B+C. The number of failure modes grows linearly as you add more triplicates.
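To put a number on that: with three members per triplicate, the combinations of two or more simultaneous failures are fixed at

$$\binom{3}{2} + \binom{3}{3} = 3 + 1 = 4,$$

so a fleet of $T$ triplicates has only $4T$ dangerous combinations to reason about.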
When we migrated some of our storage to HDFS, we ran into the old issue that a given block could be on (almost) any three nodes in the datacenter. The upside is that when a server goes down, auto-healing can begin immediately. The downside is that the number of failure modes grows combinatorially, roughly cubically, with the number of servers in the cluster.
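Concretely, with free block placement any three of the cluster’s $n$ nodes can jointly hold a block, so the count of loss-inducing combinations is

$$\binom{n}{3} = \frac{n(n-1)(n-2)}{6} \approx \frac{n^3}{6},$$

which at $n = 1000$ is already about $1.7 \times 10^8$ distinct three-node sets, (almost) any of which wipes out some block if they fail together.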
One way of getting the best of both worlds is to let administrators configure “teams” of servers, where the replication of a given block (or “range” in the CRDB world) is constrained to a single team. Then the proliferation of failure modes is combinatorial only in the maximum size of a team, and linear in the number of replication teams.
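This is essentially copyset-style placement. The closest thing I can find in the docs today is combining locality tiers with replication constraints: tag every node with a team label at startup and pin each table (or partition) to one team. It’s only a partial substitute, since you have to assign data to teams by hand rather than having placement chosen automatically per range, but it does bound which node combinations can host a given range. A minimal sketch, assuming recent CONFIGURE ZONE syntax and a hypothetical `team` locality tier with a hypothetical `events` table:

```sql
-- Nodes are started with a team tier in their locality flags, e.g.:
--   cockroach start --locality=datacenter=dc1,team=alpha ...
--   cockroach start --locality=datacenter=dc2,team=alpha ...

-- Pin every replica of this table's ranges to nodes in team alpha,
-- so the set of possible failure combinations is bounded by the
-- size of that team rather than the size of the cluster.
ALTER TABLE events CONFIGURE ZONE USING
    num_replicas = 3,
    constraints = '[+team=alpha]';
```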
I’ve already verified in my scale tests that CRDB is excellent at surviving the failure of an entire data center. It is an amazing experience to watch the load generator just keep chugging along at 10,000 wps after killing a whole DC. Even Cassandra suffers perf issues from hint build-up in that scenario, but CRDB just starts healing, and eventually it’s as if nothing ever happened.
Anecdotally, however, it seems pretty easy to cause unavailable ranges by taking down two servers from different data centers at the same time, which makes sense: two such nodes can hold two of a range’s three replicas, costing that range its quorum. That’s manageable when your DCs have 3 or 4 CRDB nodes each, but I imagine it would be a very big issue in production if we adopted CRDB as our global multi-tenant realtime storage solution.
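For what it’s worth, when I reproduce this, the replication reports are how I spot the damage from SQL rather than trawling the admin UI. This assumes a version where the `system.replication_stats` report exists (v19.1 or later, per the docs):

```sql
-- Zones that currently have unavailable or under-replicated ranges.
SELECT zone_id, total_ranges, unavailable_ranges, under_replicated_ranges
FROM system.replication_stats
WHERE unavailable_ranges > 0 OR under_replicated_ranges > 0;
```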
So here are some questions. Are there configurations for “teams” that I’m not seeing in the docs? If not, is this on the roadmap? And when the inevitable loss of a few ranges does occur, whether due to failing machines or to the failings of the humans charged with their care, what are the steps to abandon those ranges, check for violations of configured invariants (like foreign keys), clean up if necessary, and move on?
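To make the invariant-checking step concrete, the kind of query I have in mind after abandoning ranges is a plain orphan scan, sketched here with hypothetical `orders`/`customers` tables standing in for any parent/child foreign-key pair:

```sql
-- Find child rows whose parent row disappeared with the lost ranges,
-- i.e. foreign-key violations that need cleanup before moving on.
SELECT o.id, o.customer_id
FROM orders AS o
LEFT JOIN customers AS c ON c.id = o.customer_id
WHERE c.id IS NULL;
```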