Data Center Failure

Hi All -

I’m testing out CockroachDB as the production datastore for some of our applications. These applications aren’t in the serving path for user queries and are used for offline processing, so we’d really prefer consistency and resiliency over performance. My setup might need some tweaking, but it’s essentially this:

US-East

5 Nodes
Replication Factor of 3 (default)
Locality=datacenter=us-east

US-West

5 Nodes
Replication Factor of 3 (default)
Locality=datacenter=us-west

When I start up the US-West cluster and bootstrap it, everything works perfectly. I even ran an HAProxy and YCSB load test similar to the Auto Cloud Migration documentation. When I start up US-East, I see everything work according to the documentation, and the replicas per node even out across the two DCs. Even the performance numbers seemed acceptable to me.

However, my problem was this: when I shut off US-East (to simulate a network and/or DC failure), the remaining US-West cluster became unresponsive. The CockroachDB UI also became pretty unresponsive, and I wasn’t able to access any of the data that should have been in US-West.

Is there something wrong with my setup? Thanks for taking a look at this inquiry — I’m hoping it’s just something simple I’ve overlooked, and I’d appreciate someone with more experience giving my setup a once-over.

Thanks!

Your problem here is that you only have two availability zones, which is not enough to survive a full data center outage, since Cockroach requires a quorum (strictly greater than half) of the replicas for a range to be available. If you want to be able to survive a full DC outage, you’ll need to be replicated into a third AZ.

To make this more concrete, let’s look at a single range, A.

A must be replicated across 3 nodes, and 2 replicas (a quorum) must be available for the range to be available. Cockroach will do its best to evenly distribute your replicas across the localities provided, but with only two data centers, it needs to double up somewhere (pigeonhole principle).

So there are two options for how A is distributed: either two replicas are in US-East and one is in US-West, or the opposite, one replica in US-East and two in US-West. Either way, there exists some availability zone where 2 of A's 3 replicas are stored, and removing that AZ will result in loss of quorum/availability, so a data center outage has roughly a 50/50 shot of making A unavailable.
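The arithmetic above can be sketched in a few lines of illustrative Python (this is just the quorum math, not anything CockroachDB-specific — the DC names and replica counts mirror the setup in the question):

```python
RF = 3
QUORUM = RF // 2 + 1  # strict majority: 2 of 3 replicas must survive

# The two balanced ways Cockroach can split 3 replicas across two DCs.
splits = [{"us-east": 2, "us-west": 1},
          {"us-east": 1, "us-west": 2}]

results = {}
for split in splits:
    # Replicas left after a full us-east outage are whatever lives in us-west.
    surviving = split["us-west"]
    results[split["us-east"]] = surviving >= QUORUM
    status = "available" if surviving >= QUORUM else "UNAVAILABLE"
    print(f"{split['us-east']} in us-east, {split['us-west']} in us-west: "
          f"{status} after us-east fails")
```

One of the two placements loses quorum, the other keeps it — hence the roughly 50/50 odds per range.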

In practice, you’ll have multiple ranges, each one having a roughly 50/50 chance of becoming unavailable in your setup. So when you shut off US-East, you should expect to see about half of your ranges become unavailable, which is what you’re experiencing.

If you were to add a third AZ with a different locality (US-South, for instance), Cockroach would (space permitting) put one replica in each AZ. This would allow you to suffer a complete data center outage without losing quorum for any range, maintaining availability.
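The same back-of-the-envelope check with a third locality (again just illustrative Python; US-South is the hypothetical DC named above) shows that no single-DC outage can break quorum once each range has one replica per DC:

```python
RF = 3
QUORUM = RF // 2 + 1  # 2 of 3

# With three localities, Cockroach can place one replica of a range per DC.
placement = {"us-east": 1, "us-west": 1, "us-south": 1}

survivors = {}
for failed in placement:
    # Count replicas outside the failed DC.
    survivors[failed] = sum(n for dc, n in placement.items() if dc != failed)
    held = "held" if survivors[failed] >= QUORUM else "lost"
    print(f"{failed} outage: {survivors[failed]}/{RF} replicas remain; quorum {held}")
```

Whichever DC fails, 2 of 3 replicas survive, so every range stays available.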


Thanks so much for your response! This is helpful, let me give that a whirl. Thanks again!

That totally worked as advertised! Thank you for your help!