How much node loss for replication factor less than total nodes


(Devangana Tarafdar) #1

Hello,

I have a 5-node cluster running CockroachDB 2.0.6. My replication factor is 3. I am experimenting with taking nodes down, and I am a bit confused about how many nodes I need to keep up for quorum. The docs say I can lose (replication factor - 1) / 2 nodes, which is 1 in my cluster, but I took down 3 nodes before I saw replicas being under-replicated, and my cluster is still up and running. If I take down one more, the cluster will be unavailable. So does the (replication factor - 1) / 2 formula only apply to clusters where the total number of nodes equals the replication factor? Or am I missing something here?

Thanks !

Devangana


(Ron Arévalo) #2

Hey Devangana,

The replication factor does not equal the number of nodes. The replication factor is set by using the following command from the Cockroach SQL client:

cockroach sql --execute="ALTER RANGE default CONFIGURE ZONE USING num_replicas=5;" --insecure --host=<host>

We have more documentation on replication zone configuration in our docs.

Since your replication factor is set to 3, losing two nodes in a cluster of only 3 nodes would bring the cluster down, and your data would be unavailable. However, because you have 5 nodes, the 3 replicas of a given range may all sit on nodes that have not gone down, which is why your system is still up. This is not something to rely on, though: we cannot tell which nodes will go down, and those nodes may or may not hold the replicas. If 2 nodes went down at once, it is likely that some data would lose 2 of its 3 replicas, rendering the cluster unavailable.
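
To put a rough number on that, here is a small Python sketch. It is only an illustration under simplifying assumptions, not how CockroachDB actually places replicas: it assumes every possible set of 3 nodes is an equally likely home for a range's replicas, and the function name is made up for this example.

```python
from itertools import combinations


def fraction_losing_quorum(num_nodes, replication_factor, failed_nodes):
    """Fraction of possible replica placements that lose their majority.

    Assumption: each set of `replication_factor` distinct nodes is an
    equally likely placement for a range (a toy model, not the real allocator).
    """
    quorum = replication_factor // 2 + 1
    failed = set(failed_nodes)
    placements = list(combinations(range(num_nodes), replication_factor))
    # A placement loses quorum when fewer than `quorum` replicas are on live nodes.
    lost = sum(1 for p in placements if len(set(p) - failed) < quorum)
    return lost / len(placements)


# 5 nodes, replication factor 3
print(fraction_losing_quorum(5, 3, {0}))     # one node down at once
print(fraction_losing_quorum(5, 3, {0, 1}))  # two nodes down at once
```

In this toy model, a single failure never costs a range its majority, while two simultaneous failures break 3 of the 10 possible placements. With more than a handful of ranges, it becomes very likely that at least one of them is affected.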

Thanks,

Ron


(Devangana Tarafdar) #3

Hi Ron,

Thanks for the reply.

So I had another question.

So if the replicas lost in a 5 node 3 rep factor system are gone for good with the node loss, how does the process of auto rebalancing work ?

In my cluster it seems that extra replicas come up on other nodes. Which is what the docs say as well.

Thanks for the help,

Devangana


(Ron Arévalo) #4

Hey Devangana,

To clarify, the formula (replication factor - 1) / 2 gives the number of nodes you can lose and still keep the cluster up and running. Basically, you need a majority to still be up. In the 5-node, replication-factor-3 scenario, you can lose 1 node.
Does that make sense?
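
As a quick sanity check of the arithmetic, here is a minimal Python sketch. The function names are just for illustration, and the division is rounded down.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor):
    """(replication factor - 1) / 2, rounded down."""
    return (replication_factor - 1) // 2


for rf in (3, 5, 7):
    print(f"replication factor {rf}: quorum {quorum(rf)}, "
          f"tolerates {tolerable_failures(rf)} simultaneous failure(s)")
```

Note that the number of nodes in the cluster does not appear anywhere in the calculation; only the replication factor does.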

Thanks,

Ron


(Devangana Tarafdar) #5

Thanks ! Yes, I think I see my confusion now. That formula applies to replicas not nodes.

Also, in my test I was bringing down one node at a time, so the cluster was auto-repairing itself after each failure, all the way down to 2 nodes.
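
That one-node-at-a-time behaviour can be reproduced with a toy simulation in Python. This is only a sketch under stated assumptions, not CockroachDB's real rebalancing logic: it assumes repair always finishes before the next node fails, that a replacement replica goes to a random live node that does not already hold one, and the range count and function name are made up.

```python
import random


def simulate_sequential_failures(num_nodes=5, replication_factor=3,
                                 num_ranges=50, seed=0):
    """Kill one node at a time, repairing fully between failures (toy model).

    Returns a list of (live_nodes, all_ranges_available, under_replicated_ranges).
    """
    rng = random.Random(seed)
    quorum = replication_factor // 2 + 1
    live = set(range(num_nodes))
    # Each range starts with replicas on distinct, randomly chosen nodes.
    ranges = [set(rng.sample(sorted(live), replication_factor))
              for _ in range(num_ranges)]
    history = []
    while len(live) > 1:
        victim = rng.choice(sorted(live))
        live.remove(victim)
        # Availability is judged right after the failure, before any repair.
        available = all(len(r & live) >= quorum for r in ranges)
        for r in ranges:
            # Repair needs a surviving majority and a spare live node.
            if victim in r and len(r & live) >= quorum:
                spare = sorted(live - r)
                if spare:
                    r.remove(victim)
                    r.add(rng.choice(spare))
        under = sum(1 for r in ranges if len(r & live) < replication_factor)
        history.append((len(live), available, under))
        if not available:
            break
    return history


for live_nodes, available, under in simulate_sequential_failures():
    print(f"{live_nodes} live nodes: available={available}, "
          f"under-replicated ranges={under}")
```

In this model the ranges stay fully replicated at 4 and 3 live nodes, become under-replicated but still available at 2 live nodes (no spare node is left for a third replica), and lose their majority at 1 live node, which matches what I saw in my test.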

As you pointed out, in scenarios where multiple nodes fail at once, I could not predict which ranges would become unavailable.

https://www.cockroachlabs.com/docs/stable/training/fault-tolerance-and-automated-repair.html


(Ron Arévalo) #6

Correct, it’s based on replicas. Just to clarify one more thing: with a replication factor of 3, 1 is the number of nodes you can lose at the same time without losing access to data. If you have a 5-node cluster and two nodes go down together, the data that had two of its replicas on those nodes will be unavailable, and the cluster will be down. With a replication factor of 3, 1 node failure at a time is what can be tolerated. If the failed nodes cannot be brought back up because some failure has made them unrecoverable, then that data would be lost.


(Devangana Tarafdar) #7

Got it. Thanks for the help !