Removing and rebalancing lost (unclean) nodes

Hi,

When testing how CockroachDB survives a single node failing, I did the following:

  • start a 3-node cluster (roughly as in the sketch after this list), which then has 93 ranges, all fully replicated
  • shut down one node uncleanly and destroy its disk
  • bring the machine back up
  • now I have 4 nodes, 1 of them down, 74 ranges, and 34 of them under-replicated
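
Roughly, the cluster was started like this (a simplified sketch of a local, insecure setup; the addresses, ports, and store paths are just placeholders for my actual machines):

cockroach start --insecure --store=node1 --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259 --background
cockroach start --insecure --store=node2 --listen-addr=localhost:26258 --http-addr=localhost:8081 --join=localhost:26257,localhost:26258,localhost:26259 --background
cockroach start --insecure --store=node3 --listen-addr=localhost:26259 --http-addr=localhost:8082 --join=localhost:26257,localhost:26258,localhost:26259 --background
cockroach init --insecure --host=localhost:26257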

The lost node is never removed from the node list.

In the metrics I can see that all 74 ranges have resynced to the machine that crashed, so that part works fine, but after waiting several days the dead node is still there.

When you're ready to say goodbye to a node, you're supposed to run cockroach node decommission against it. See the docs.
I believe that after that, in versions 20.2 and earlier, the node would show up in the Recently Decommissioned Nodes section of the UI and stay there… forever? But in the upcoming 21.1 release things have changed, and the node should completely go away after a while… I'm a bit fuzzy on the details; we'll try to improve the docs.
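
Roughly, the workflow looks like this (a sketch that assumes an insecure cluster reachable at localhost:26257 and that the dead node has ID 4; substitute your own connection flags and node ID):

cockroach node status --insecure --host=localhost:26257
cockroach node decommission 4 --insecure --host=localhost:26257

The first command lists node IDs and their liveness so you can find the dead one; the second marks it as decommissioning and waits for its replicas to be re-replicated elsewhere.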

Or, more correctly, this only supports planned outages, not unplanned outages caused by failures, etc.

you’re supposed to run cockroach node decommission against it

Yes, I did see this as the standard response to all similar forum questions, but I'm confused, because this would mean CockroachDB does not actually support recovering from outages. What is the point of clustering then? Just load distribution?

We most certainly support “recovering from outages”, depending on what you mean by it. The point is not just load distribution; the point is also high availability.
You're probably asking about the 34 ranges reported as under-replicated? That's probably an artifact of the fact that the default replication factor of system ranges (as opposed to user ranges) auto-magically goes from 3 to 5 if the cluster is ever detected to have more than 3 nodes. And in your case it was probably detected to have 4 nodes for a while, depending on various timings. So now the system wants to replicate these system ranges 5x, but it can only do 3x. There are various things to improve about this experience, but you can manually set the desired replication factor with something like

ALTER RANGE system CONFIGURE ZONE USING num_replicas = 3

See around here
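
If you want to see what the cluster currently wants, something like this works (a sketch; the exact set of named internal zones varies a bit across versions):

SHOW ALL ZONE CONFIGURATIONS;

That lists the replication factor configured for the default zone, the named system ranges (meta, liveness, system, timeseries), and any per-database or per-table overrides, so you can spot which ones got bumped to 5 and adjust them with the same kind of ALTER ... CONFIGURE ZONE statement as above.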

Oh, THIS is what it means, thank you!
I rebuilt the cluster in the meantime to run other tests, but your link is helpful.

What should I be looking at instead as a rough indicator of the cluster's health? The under-replicated ranges count seemed to be the most prominent indicator, so I assumed that was it.

I don't have a great answer for you; there are many things that go into a cluster's health, and there are also subjective components to it (what's healthy on one type of hardware is not on another). I can pedantically say that you should be monitoring your applications, because that's where you can make a clear determination about whether things are fine or not.
On the CRDB side, you should look at CPU usage (on the "Hardware" tab in Metrics), the median SQL query latency (under "SQL"), and indeed the under-replicated ranges counter.
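
If you want something scriptable rather than eyeballing the UI, a rough sketch (assuming the default HTTP port 8080 and an insecure cluster; adjust for your setup):

curl 'http://localhost:8080/health?ready=1'
curl -s http://localhost:8080/_status/vars | grep ranges_underreplicated
cockroach node status --ranges --insecure --host=localhost:26257

The first endpoint reports whether the node is live and accepting SQL connections, the second scrapes the Prometheus metrics (which include the under-replicated ranges counter), and the last shows per-node range and replica counts from the command line.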