Why "node info not available in gossip"

I ran some scalability tests on CockroachDB. During the test I killed, decommissioned, and added nodes to the cluster. In the end everything seemed OK:

  • in the console, all 6 nodes are live, there are no under-replicated ranges and no unavailable ranges
  • for table ‘sbtest’, running sysbench with an OLTP-like test case works fine

But when I run select count(*) from sbtest, it reports:

root@172.21.250.88:8774/sbtest> select count(*) from sbtest;
pq: key range id:775 is unavailable; missing nodes: [10 11 12]. Original error: node info not available in gossip

Nodes 10/11/12 were added in the middle of the test; they have since been killed and decommissioned.

Why did this error happen?

What was your replication factor?
How long did you wait between two kills?
If your replication factor is 3 (the default), you can only safely kill 1 node at a time. You should wait to kill the next one until your under-replicated ranges count is 0. To be able to safely kill 2 nodes at a time, you need a replication factor of 5.
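If you do want to survive two simultaneous node failures, the replication factor can be raised through a zone configuration. A minimal sketch, assuming a recent CockroachDB release that supports the CONFIGURE ZONE syntax (older releases set zone configs through the cockroach zone CLI instead):

-- Raise the default replication factor to 5 so that any two nodes can be
-- down at the same time without a range losing quorum.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;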

@haomiao It sounds like you’ve accidentally killed the majority of replicas for that range. As Ronald notes, you need to wait for all ranges to be fully replicated before you can safely kill a node in a 3-node cluster.
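For reference, one way to check that all ranges are fully replicated before killing the next node is to query crdb_internal. A minimal sketch, assuming the crdb_internal.ranges table and its replicas column are available in your version (they may differ between releases; the Admin UI replication dashboard shows the same information):

-- Count ranges that currently have fewer replicas than the configured
-- replication factor (assumed to be 3 here). Only kill the next node once
-- this returns 0.
SELECT count(*) AS under_replicated
  FROM crdb_internal.ranges
 WHERE array_length(replicas, 1) < 3;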

My replication factor is 3 (the default).

I should have killed nodes only after the under-replicated ranges count reached 0; otherwise the cluster cannot stay available.

And I probably added a new node at some point while the under-replicated ranges count was > 0. Could that be the reason for the error above?

Now how can I fix this error? All data on the killed/decommissioned nodes has already been cleared.

Unfortunately, if the nodes have been wiped and the ranges are no longer on a store, then nothing can be done to recover without rebuilding the cluster. If the ranges are still available on a node, then rejoining that node to the cluster would eliminate the error and make the data available.

When a node goes down, we assume that it will be added back to the cluster. So in the worst case, if we lost 2 of 3 nodes simultaneously, the cluster would be unavailable (as we can no longer make consistency guarantees). But as soon as the second node rejoins, the data will become available, and as soon as the third node rejoins, the data will no longer be under-replicated.

It sounds like 2 of the 3 replicas of the data were deleted; this is unrecoverable in the current version.