Hi Roachers,

I am in the process of automating cluster churn, i.e., replacing all the nodes in the cluster one at a time. As per my previous posts, I am currently running CockroachDB on AWS EC2. The ASG health check relies on the ELB health check, which uses the following endpoint:
http://<IP>:8080/health?ready=1

As per the cockroach node decommission docs, we run:
cockroach node decommission <NODE ID> --wait=live ...

After that, the ASG detects the unhealthy node, terminates it, and spins up a new node to replace it. The new node joins the cluster by issuing AWS API calls to get the list of live instances in the ASG.

When doing this manually, we can look at the Admin UI dashboards, but the docs don't tell us when it is safe to remove more nodes. I am guessing the number of replicas is something to watch. What I would like to know is which CLI or API commands we can run to make sure that the cluster has caught up before removing the next node. Are there things we need to validate in the various endpoints, e.g.:
> curl -ks "<IP>/_status/nodes" | jq -r '.nodes[].storeStatuses[].metrics.replicas'
1336
1336
1336
1336
1336
1336
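
To make the question concrete, here is a rough sketch of the kind of check I have in mind, written in Python against the same /_status/nodes endpoint as the curl command above. Only the replicas metric is confirmed by that output. The ranges.underreplicated and ranges.unavailable metric names, the idea that both being zero means the cluster has caught up, and the function names are all my assumptions and would need confirming for the version in use. It also assumes the endpoint is reachable without authentication, as in the curl call.

```python
import json
import ssl
import time
import urllib.request

# ASSUMPTION: these two metric names are not confirmed by the output above;
# only "replicas" is. Verify them against /_status/nodes on your cluster.
UNDER_REPLICATED = "ranges.underreplicated"
UNAVAILABLE = "ranges.unavailable"


def summarize(status):
    """Sum selected store metrics over every store in a /_status/nodes payload."""
    totals = {"stores": 0, "replicas": 0.0, UNDER_REPLICATED: 0.0, UNAVAILABLE: 0.0}
    for node in status.get("nodes", []):
        for store in node.get("storeStatuses", []):
            metrics = store.get("metrics", {})
            totals["stores"] += 1
            for name in ("replicas", UNDER_REPLICATED, UNAVAILABLE):
                if name not in metrics:
                    # Fail loudly rather than treat a wrong metric name as zero.
                    raise KeyError(f"metric {name!r} not found in store metrics")
                totals[name] += float(metrics[name])
    return totals


def is_settled(status):
    """True if at least one store reports and no range is under-replicated or unavailable."""
    totals = summarize(status)
    return (
        totals["stores"] > 0
        and totals[UNDER_REPLICATED] == 0
        and totals[UNAVAILABLE] == 0
    )


def fetch_status(base_url):
    """GET <base_url>/_status/nodes and return the parsed JSON."""
    # Mirrors `curl -k`: certificate verification is skipped. Prefer loading
    # the cluster CA into the context outside of a trusted network.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(f"{base_url}/_status/nodes", context=ctx, timeout=10) as resp:
        return json.load(resp)


def wait_until_settled(base_url, interval=30.0, timeout=3600.0):
    """Poll until is_settled() holds; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if is_settled(fetch_status(base_url)):
                return True
        except OSError:
            # The node behind base_url may be the one being replaced; retry.
            pass
        time.sleep(interval)
    return False
```

The automation would then call something like wait_until_settled("https://<IP>:8080") after each replacement and only decommission the next node once it returns True. Matching replica counts like the ones above do not by themselves seem to prove that re-replication has finished, which is why the sketch looks at under-replicated ranges instead, if those metrics are indeed the right signal.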
I would also like to know if there are future plans to have a dead node turn into a decommissioned node and disappear forever. If so, that would be awesome. Perhaps a cluster setting could be applied, similar to server.time_until_store_dead.