CLI or API to query before removing more nodes

Hi Roachers :slight_smile: I am in the process of automating cluster churn, i.e., replacing all the nodes in the cluster, one at a time. As per my previous posts, I am currently running CockroachDB on AWS EC2. The ASG health check relies on the ELB health check, which uses the http://<IP>:8080/health?ready=1 endpoint. As per the cockroach node decommission docs, we run cockroach node decommission <NODE ID> --wait=live ..., after which the ASG detects the unhealthy node, terminates it, and spins up a new node to replace it. The new node is able to join the cluster by issuing AWS API calls to get the list of live instances in the ASG.

When doing this manually, we can look at the Admin UI dashboards, but the docs don’t tell us when it is safe to remove more nodes. I am guessing the number of replicas is something to watch. What I would like to know is which CLI or API commands we can run to make sure that the cluster has caught up before removing the next node. Are there things we need to validate in the various endpoints, e.g.:

> curl -ks "<IP>/_status/nodes" | jq -r '.nodes[].storeStatuses[].metrics.replicas'
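One way to turn the query above into a yes/no check is sketched below. It is only a sketch under assumptions: it assumes the per-store metrics map returned by /_status/nodes contains a "ranges.underreplicated" entry, and the function names (all_zero, underreplicated_counts, cluster_caught_up) are made up for illustration. The base URL is passed in by the caller, as in the command above.

```shell
# Succeeds only if stdin has at least one line and every line is a zero count.
# Written as a subshell function so its variables stay local.
all_zero() (
  seen=0
  while read -r n; do
    seen=1
    case "$n" in
      0|0.0) ;;     # metric values may be rendered as integers or floats
      *) exit 1 ;;  # non-zero, null, or unparseable: treat as not caught up
    esac
  done
  [ "$seen" -eq 1 ]
)

# Print one under-replicated range count per store.
# ASSUMPTION: "ranges.underreplicated" is present in each store's metrics map.
underreplicated_counts() {
  curl -ks "$1/_status/nodes" |
    jq -r '.nodes[].storeStatuses[].metrics["ranges.underreplicated"]'
}

# Succeeds when no store reports under-replicated ranges.
cluster_caught_up() {
  underreplicated_counts "$1" | all_zero
}
```

A caller could poll with something like `until cluster_caught_up "<base URL>"; do sleep 10; done` before touching the next node. An empty or failed response counts as "not caught up", so the check errs on the side of waiting.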

I would also like to know if there are future plans to have a dead node turn into a decommissioned node and disappear forever. If so, that would be awesome. Perhaps a cluster setting could be applied, similar to server.time_until_store_dead.

Hi Fatima,

It’s always safe to initiate a decommissioning process (please make sure you’re on the latest stable release, where decommissioning has seen improvements that actually make this statement true). What matters is when the node is terminated. If you decommission a node with --wait=all, the command won’t return until the replicas have safely moved elsewhere – if there’s no “elsewhere” for them, the process will simply hang (but the node will remain operational).

So the process should be:

decommission --wait=all 1
<wait until that returns>
replace node 1
decommission --wait=all 2
<wait until that returns>
replace node 2
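The steps above could be scripted roughly as follows. This is a hedged sketch, not a tested automation: replace_node is a hypothetical hook (here it only prints what it would do), and COCKROACH and COCKROACH_FLAGS are illustrative variables for the binary and whatever connection flags your cluster needs.

```shell
# Binary and connection flags are overridable so the loop can be dry-run.
COCKROACH="${COCKROACH:-cockroach}"
COCKROACH_FLAGS="${COCKROACH_FLAGS:-}"

# HYPOTHETICAL hook: terminate the instance behind node ID $1 and wait for
# its replacement to join. Replace the body with your own ASG logic.
replace_node() {
  echo "replace node $1"
}

# Decommission and replace the given node IDs strictly one at a time.
rolling_replace() {
  for id in "$@"; do
    # Blocks until decommission returns; stop the whole run if it fails.
    $COCKROACH node decommission "$id" --wait=all $COCKROACH_FLAGS || return 1
    replace_node "$id" || return 1
  done
}
```

Running it with `COCKROACH=echo rolling_replace 1 2` prints the commands in order without contacting a cluster, which is a cheap way to check the sequencing first.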

You mention that you see the ELB health check go bad, which suggests that perhaps you’re using an older stable version? I think the node should remain healthy, even if it is decommissioned, until you terminate it. If that is not the case on v2.0.5, please let me know so we can investigate.

We are running v2.0.4 and using --wait=live so that decommissioning does not hang on dead nodes. I cannot seem to find the --wait option in the decommission docs since v2.0.5 got released.

Hi @fat0. Yes, we try to keep the docs for each version in sync with the latest patch release, which is 2.0.5 in this case.

You’ll have to use github to see that file’s history. Here’s the change for 2.0.5:

And here’s the file prior to the change:

Hope that helps.

FYI, in 2.0.4, the response code for the http://$IP:8080/health?ready=1 endpoint changes from 200 to 503 immediately after cockroach node decommission has been issued. This immediate change in response code takes place regardless of which --wait setting is used (i.e. all, live, or none):

> curl -s -o /dev/null -w "%{http_code}"  "http://$IP:8080/health?ready=1"

> curl -s -o /dev/null -w "%{http_code}"  "http://$IP:8080/health"
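For scripting, the two probes above can be combined into a single state, as in the sketch below. The labels and function names are illustrative, and the "not-ready" case assumes that plain /health keeps returning 200 while the process is up; that is an assumption, not something confirmed in this thread.

```shell
# Print the HTTP status code for a URL (curl prints 000 if it cannot connect).
http_code() {
  curl -s -o /dev/null -w "%{http_code}" "$1"
}

# Classify a node from two status codes:
#   $1 = code from /health?ready=1, $2 = code from /health
classify_node() {
  case "$1:$2" in
    200:200) echo "ready" ;;
    503:200) echo "not-ready" ;;  # ASSUMED: process up, e.g. decommissioning
    *)       echo "unknown" ;;    # unreachable or unexpected codes
  esac
}

# Probe both endpoints on host $1 and print the combined state.
node_state() {
  classify_node \
    "$(http_code "http://$1:8080/health?ready=1")" \
    "$(http_code "http://$1:8080/health")"
}
```

Calling `node_state "$IP"` right after issuing the decommission would then be expected to print not-ready rather than ready, given the 503 described above.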

Can someone confirm whether, in 2.0.5, http://$IP:8080/health?ready=1 remains healthy until the decommission process is done?

The behavior in 2.0.5 should remain the same, @fat0, since, as soon as the decommissioning starts, you don’t want new SQL connections to the node. @a-robinson or @tschottdorf, can you confirm?

I just built a v2.0.5 cluster and the behavior is the same. That is fine for us right now. Thanks.

@jesse it’s intentional that the /health?ready=1 endpoint returns 503, but while confirming this I discovered that we were still draining immediately upon decommission, which I had planned to change (but failed to backport). A backport is now open in (it won’t change the 503 behavior, though, so no need to worry, @fat0).