Decommissioning dead 1.0.6 nodes is hanging

Hey y’all. Congrats on the 1.1.0 release!

I’ve updated my cluster and am trying to use the decommission feature to clean up some permanently removed 1.0.6 nodes. I let the whole cluster finish updating to 1.1.0 before attempting the decommission, but the decommission command runs indefinitely (even with --wait=live).

Version of the node I’m running the decommission command through:

root@cockroach.foo.bar.com:26257/> SELECT version();
+--------------------------------------------------------------------------+
|                                version()                                 |
+--------------------------------------------------------------------------+
| CockroachDB CCL v1.1.0 (linux amd64, built 2017/10/12 14:50:18, go1.8.3) |
+--------------------------------------------------------------------------+
(1 row)

CLI version:

twrobel@Taylors-MacBook-Pro-2 ~/c/g/s/g/c/cockroach> cockroach version
Build Tag:    v1.1.0
Build Time:   2017/10/12 14:48:26
Distribution: CCL
Platform:     darwin amd64
Go Version:   go1.8.3
C Compiler:   4.2.1 Compatible Clang 3.8.0 (tags/RELEASE_380/final)
Build SHA-1:  8b865035e21aa4fa526ee017ba9dc685d7af649c
Build Type:   release

Decommission command and output:

twrobel@Taylors-MacBook-Pro-2 ~/c/g/s/g/c/cockroach> cockroach node decommission 1 2 3 4 5 --wait=live --insecure --host 'cockroach.foo.bar.com'
+----+---------+-------------------+--------------------+-------------+
| id | is_live | gossiped_replicas | is_decommissioning | is_draining |
+----+---------+-------------------+--------------------+-------------+
|  1 | false   |               747 | false              | true        |
|  2 | false   |               742 | false              | true        |
|  3 | false   |               751 | false              | true        |
|  4 | false   |               758 | false              | true        |
|  5 | false   |               754 | false              | true        |
+----+---------+-------------------+--------------------+-------------+
(5 rows)
+----+---------+-------------------+--------------------+-------------+
| id | is_live | gossiped_replicas | is_decommissioning | is_draining |
+----+---------+-------------------+--------------------+-------------+
|  1 | false   |               747 | false              | true        |
|  2 | false   |               742 | false              | true        |
|  3 | false   |               751 | false              | true        |
|  4 | false   |               758 | false              | true        |
|  5 | false   |               754 | false              | true        |
+----+---------+-------------------+--------------------+-------------+
(5 rows)
+----+---------+-------------------+--------------------+-------------+
| id | is_live | gossiped_replicas | is_decommissioning | is_draining |
+----+---------+-------------------+--------------------+-------------+
|  1 | false   |               747 | false              | true        |
|  2 | false   |               742 | false              | true        |
|  3 | false   |               751 | false              | true        |
|  4 | false   |               758 | false              | true        |
|  5 | false   |               754 | false              | true        |
+----+---------+-------------------+--------------------+-------------+
(5 rows)
... (etc, indefinitely)

After running for nearly an hour, the nodes are still not decommissioned and the output hasn’t changed. The nodes are permanently dead and cannot be revived. Any ideas on how to force the cluster to decommission the dead nodes?

This sounds suspiciously like https://github.com/cockroachdb/cockroach/issues/18219#issuecomment-332956993.

To verify this, could you run the following against each node:

./cockroach debug gossip-values --insecure | grep liveness

and check what the liveness:i lines say about the decommissioning status for i=1..5.
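
If it helps, here’s a minimal sketch of running that check against every node in one pass. The hostnames are placeholders for your actual node addresses, and I’m assuming the standard --host client flag applies to debug gossip-values; adjust to however you normally reach your nodes.

# Hypothetical hostnames; substitute your own node addresses.
for host in node1.foo.bar.com node2.foo.bar.com node3.foo.bar.com; do
  echo "== $host =="
  ./cockroach debug gossip-values --insecure --host "$host" | grep liveness
done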

Also I assume that your cluster is otherwise healthy.

(What I think you’ll see is that the nodes are in fact marked as decommissioning, but the liveness system didn’t correctly pick this up. If so, restart one of the nodes, run the decommission command again against that node, and see if it terminates.)
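
For the restart-and-retry step, something like the following rough sketch, assuming you start nodes directly with cockroach start; the hostname, store path, and join address are placeholders for your deployment, and if you use a service manager you’d just restart the service instead.

# Placeholder hostname, store path, and join address; adjust to your setup.
cockroach start --insecure \
  --host cockroach-live-1.foo.bar.com \
  --store /mnt/data/cockroach \
  --join cockroach.foo.bar.com:26257 \
  --background

# Then rerun the decommission, pointing at the restarted node.
cockroach node decommission 1 2 3 4 5 --wait=live --insecure \
  --host cockroach-live-1.foo.bar.com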

Sorry, I should have included that info; it is.

Here’s the relevant output of the debug command (output is the same when run against all nodes):

"liveness:1": {NodeID:1 Epoch:25 Expiration:1507834382.747324297,0 Draining:true Decommissioning:true}
"liveness:2": {NodeID:2 Epoch:26 Expiration:1507831372.161097549,0 Draining:true Decommissioning:true}
"liveness:3": {NodeID:3 Epoch:30 Expiration:1507837037.605221000,0 Draining:true Decommissioning:true}
"liveness:4": {NodeID:4 Epoch:26 Expiration:1507835646.476976303,0 Draining:true Decommissioning:true}
"liveness:5": {NodeID:5 Epoch:26 Expiration:1507833353.724060936,0 Draining:true Decommissioning:true}

So it looks like you’re right: the liveness records for the nodes have been updated to the Decommissioning state, but that isn’t reflected by the UI or the node decommission command.

One slight snag there… When I terminate a node, its stores are permanently removed (not ideal, I know, but I’m running this cluster as a proof of concept on ephemeral storage with increased replication). I’m fine with killing one of the currently running nodes and bringing up a new one, but I suspect that will just kick this problem down the road, since I’ll then need to decommission the node I terminate.

I can give that a shot anyway if you’d like me to confirm that it resolves the issue with the current nodes.

No need to restart; what you’ve shown me so far is proof enough that this is the same issue. I’ll make sure we prioritize it accordingly. Thanks for testing along and for reporting it!
