Question about replication node failure behaviour

Hi. As I’m new to CockroachDB, I just read the Production Checklist. My datacenter offers only two availability zones (AZs), so I would have to place two nodes in AZ1 and the third node in AZ2. What exactly happens if AZ1 goes offline for 30 minutes (e.g. complete network loss)? Does the remaining node in AZ2 keep working, or will it go offline too?

Hello. If two out of three nodes go offline, the third node won’t go offline itself, but the cluster as a whole will be unavailable, since CockroachDB requires a quorum (a majority of replicas) to serve data.

Thanks Amruta. But how do I get tolerance against the complete failure of one AZ then, if I only have two? If I use 5 nodes, I face the same issue. What about two or four nodes (some sort of cross-master)?

As I understand it, 5 nodes won’t have the same problem since the default replication factor is 3. So if one AZ with two nodes goes down, the nodes in the other AZ will maintain quorum and your cluster will remain available. But to be honest, I am not 100% sure I got this right. Let me check with our team on Monday and confirm it for you if that’s okay?

In the meantime, this might help: Fault Tolerance & Recovery | CockroachDB Docs

Thank you Amruta.

As I understand it, 5 nodes won’t have the same problem since the default replication factor is 3.

But then one of the two AZs does not have quorum on its own (the one with two nodes). So if the AZ with three nodes goes down, the other AZ becomes unavailable too (no quorum). Or did I get this wrong?

I am looking for a way to get redundancy with two AZs. In the past we ran two MySQL servers in cross-master replication. No matter which AZ went down, the other stayed fully functional. I need something similar with CockroachDB.

…check with our team on Monday and confirm it for you if that’s okay?

Yes, of course.

@Amruta any news here?

Yes, I discussed it with the team and will let them respond. As I understood it, you could set up a three-node cluster in each of the two AZs, but I’m not sure that will solve the issue, since CRDB does require three failure domains. I’ll let the team respond, since they will be able to explain it better and also answer any additional questions you might have.
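
The quorum arithmetic discussed in this thread can be sketched in a few lines of Python. This is only a simplified model, not CockroachDB code: it treats one group of replicas as spread over named zones and checks whether a strict majority remains after a single zone is lost. The function names and the layouts are hypothetical, chosen for illustration, and the sketch ignores how CockroachDB actually places replicas.

```python
def survives_zone_loss(replicas_per_zone, lost_zone):
    """Return True if a strict majority of replicas remains after one zone fails."""
    total = sum(replicas_per_zone.values())
    quorum = total // 2 + 1  # strict majority of all replicas
    remaining = total - replicas_per_zone[lost_zone]
    return remaining >= quorum


def tolerates_any_single_zone_loss(replicas_per_zone):
    """Return True only if quorum survives the loss of every zone, one at a time."""
    return all(
        survives_zone_loss(replicas_per_zone, zone) for zone in replicas_per_zone
    )


# Hypothetical layouts: replicas of one replica group per availability zone.
layouts = {
    "3 replicas, 2 AZs (2+1)": {"AZ1": 2, "AZ2": 1},
    "5 replicas, 2 AZs (3+2)": {"AZ1": 3, "AZ2": 2},
    "3 replicas, 3 AZs (1+1+1)": {"AZ1": 1, "AZ2": 1, "AZ3": 1},
}

for name, layout in layouts.items():
    print(name, "->", tolerates_any_single_zone_loss(layout))
```

Under this model, any split across only two zones leaves one zone holding at least half of the replicas, so losing that zone always breaks the majority. Only the layout with three zones survives the loss of any single zone, which matches the point about needing three failure domains.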