"Bring out" the dead cattle (node)

Playing around with killing nodes and (re)adding them as new nodes.

How are dead nodes removed from the list of available nodes, so that they are no longer shown in the console?

Dead nodes are detected by the rest of the cluster using a heartbeat protocol and a timeout. After some delay you should see them disappear from the console. This delay is longer than a few minutes, though (5-30 minutes), so give it some time.
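To make the heartbeat-plus-timeout idea concrete, here is a minimal sketch in Python. It is not CockroachDB's actual implementation; the names `NodeRecord`, `is_dead`, and `split_by_liveness` are hypothetical, and the timeout value is an arbitrary illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """Hypothetical bookkeeping entry for one node."""
    node_id: int
    last_heartbeat: float  # time of the last heartbeat received, in seconds


def is_dead(record, now, timeout):
    """A node counts as dead once no heartbeat has arrived within `timeout`."""
    return now - record.last_heartbeat > timeout


def split_by_liveness(records, now, timeout):
    """Return (live_ids, dead_ids) for the given records."""
    live = [r.node_id for r in records if not is_dead(r, now, timeout)]
    dead = [r.node_id for r in records if is_dead(r, now, timeout)]
    return live, dead


if __name__ == "__main__":
    records = [NodeRecord(1, last_heartbeat=100.0),
               NodeRecord(2, last_heartbeat=40.0)]
    # Node 2 last reported 70 seconds ago, which exceeds the 30 second timeout.
    print(split_by_liveness(records, now=110.0, timeout=30.0))
```

The point of the sketch is only that "dead" is a judgment made by the other nodes after a period of silence, which is why there is always some delay before a node is treated as gone.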

I am not seeing this, but maybe I am doing it a bit “wrong”. I only have 8 small machines, and I am simulating a crash by taking a node down, removing its old store, and starting it fresh. So there are now two nodes with the same IP. I never see the “old” one disappear. It has been like this for more than a day.

I am assuming that you are talking about the Admin UI’s node list; you are correct that dead nodes never disappear from this list. The list is not currently intended to show only the available nodes; it shows all nodes along with their availability status. Similarly, the “Total Nodes” statistic in the cluster summary also counts nodes which are down.

We are aware that our capabilities here are currently limited; this is a known issue, though not so much a bug as something that we haven’t implemented yet. A few relevant issues:

  • I have logged #13895 to more correctly display cluster statistics in the Admin UI when some of the contributing nodes are down. For example, the “Total Nodes” statistic in the cluster summary should display separate counts for healthy and down nodes.
  • #6198 is tracking the ability to “retire” a node, which would allow an administrator to permanently remove a node. Currently, CockroachDB has no such mechanism, and thus it is always assumed that a node that has gone down might later come back.
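As a rough illustration of the difference between a single “Total Nodes” figure and separate healthy/down counts, here is a small Python sketch. It does not use any CockroachDB API; the function `summarize_nodes`, the status strings, and the node IDs are all hypothetical. The example data mirrors the situation described above: 8 machines, one of which was wiped and rejoined as a new node, leaving its old entry behind.

```python
from collections import Counter


def summarize_nodes(statuses):
    """Summarize a mapping of node ID -> status ("healthy" or "down")."""
    counts = Counter(statuses.values())
    return {
        "total": len(statuses),        # what a single "Total Nodes" figure reports
        "healthy": counts["healthy"],  # nodes currently reachable
        "down": counts["down"],        # nodes that stopped reporting but stay listed
    }


if __name__ == "__main__":
    # Nine entries for eight machines: the old incarnation of the restarted
    # node (ID 3 here, chosen arbitrarily) stays in the list as "down".
    statuses = {node_id: "healthy" for node_id in range(1, 10)}
    statuses[3] = "down"
    print(summarize_nodes(statuses))
```

With only the total shown, the cluster appears to have nine nodes; splitting the count makes it clear that eight are healthy and one is a leftover entry.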