What should I do after 2 nodes crash at the same time?

I am doing some high-availability tests for crdb (v2.0.2).

My test is

  1. build a cluster of 6 nodes
  2. keep it under load with sysbench (my own oltp test case)
  3. run pkill -9 cockroach on node4 and node5 at the same time
  4. after step 3, clients can no longer connect to the cluster, and it never recovers on its own

When I rejoin node4 or node5, the cluster recovers.

But if node4 and node5 are permanently down, how can I recover
the cluster?

This sounds like you ran with the default replication factor of 3. In that case you can survive one node crash. Once the cluster has re-replicated the lost replicas and is balanced again, you can safely crash another node and the cluster stays available.
If you want to be able to survive two concurrent failures, you need a replication factor of 5.
For any given range, more than half of its replicas must be available for the range to stay available.
So in your current situation: restart one of the nodes again.
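
To make the majority rule above concrete, here is a small sketch of the arithmetic. The function names are my own, not anything from CockroachDB; it only encodes "more than half of the replicas must be up".

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that is more than half of them."""
    return replication_factor // 2 + 1


def tolerable_failures(replication_factor: int) -> int:
    """Replicas a range can lose at once and still keep a majority."""
    return replication_factor - quorum(replication_factor)


for rf in (3, 5):
    print(f"replication factor {rf}: quorum {quorum(rf)}, "
          f"survives {tolerable_failures(rf)} simultaneous failure(s)")
```

With a replication factor of 3 a range tolerates 1 lost replica, and with 5 it tolerates 2, which matches the advice above.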

Understood, I cannot let 2 nodes crash at the same time.

Another question: if I kill 2 nodes (node4 and node5) one by one, how long should I
wait before killing node5?

In my test, after killing the first node, I must wait for the under-replicated ranges count to reach 0; then
I can safely kill the second node.

With a replication factor of 3 and a 3 node cluster, you need to wait for all ranges to fully replicate before killing the second node. Otherwise you will make any under-replicated ranges unavailable.

With a replication factor of 3 and 5 nodes, there’s still a risk of making data unavailable. Say you bring down n5 and you have 10 under-replicated ranges afterward. If you kill n4 without waiting, and any of those ranges had a replica on n4, then that data would become unavailable.
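
The scenario above can be played through with a toy model. The replica placement below is made up for illustration (real placement is chosen by the cluster), and the helper name is hypothetical.

```python
REPLICATION_FACTOR = 3

# Hypothetical placement: range id -> nodes that hold a replica of it.
placement = {
    "r1": {"n1", "n2", "n3"},
    "r2": {"n1", "n4", "n5"},
    "r3": {"n2", "n3", "n5"},
    "r4": {"n3", "n4", "n5"},
}


def unavailable_ranges(placement, dead_nodes,
                       replication_factor=REPLICATION_FACTOR):
    """Return the ranges whose live replicas no longer form a majority."""
    quorum = replication_factor // 2 + 1
    return sorted(
        range_id
        for range_id, nodes in placement.items()
        if len(nodes - dead_nodes) < quorum
    )


# Only n5 is down: r2, r3 and r4 are under-replicated but still available.
print(unavailable_ranges(placement, {"n5"}))        # []
# n4 is killed before those ranges are repaired: r2 and r4 lose their majority.
print(unavailable_ranges(placement, {"n4", "n5"}))  # ['r2', 'r4']
```

Only the ranges that had replicas on both dead nodes become unavailable, which is why the risk depends on where the under-replicated ranges live.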

So yes, to be safe you need to wait for ranges under-replicated to be zero before killing a second node, even with 5 nodes running.
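
If you want to script the wait, one possible sketch is below. It assumes each node serves Prometheus-style metrics over HTTP at a path like /_status/vars and that the under-replicated count is exported as ranges_underreplicated; please verify both against your version. The URLs are placeholders, and you would pass one per live node, since each node only reports on its own stores.

```python
import time
import urllib.request

METRIC = "ranges_underreplicated"  # assumed metric name, check your version


def under_replicated_total(metrics_text: str) -> float:
    """Sum every sample of METRIC in a Prometheus text-format payload."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name.split("{", 1)[0] == METRIC:
            total += float(value)
    return total


def wait_until_fully_replicated(urls, poll_seconds=10.0, timeout_seconds=600.0):
    """Poll all given metrics URLs until the summed count is 0.

    Returns True once no under-replicated ranges are reported,
    False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        total = 0.0
        for url in urls:
            with urllib.request.urlopen(url) as resp:
                total += under_replicated_total(resp.read().decode("utf-8"))
        if total == 0:
            return True
        time.sleep(poll_seconds)
    return False


# Example call (placeholder hosts and port):
# wait_until_fully_replicated(["http://node1:8080/_status/vars",
#                              "http://node2:8080/_status/vars"])
```

Only kill the second node after this returns True. A timeout is included so the script does not hang forever if the cluster cannot repair itself.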