Upgrade from v2.0.4 to v2.0.5 breaks nodes in various ways


(Fatima M) #1

I took a six-node v2.0.4 cluster and decommissioned one node, which was then replaced by one v2.0.5 node in the same cluster. Now the nodes in this cluster seem broken in various ways, and I am not sure whether the cluster is healthy enough to continue the rolling upgrade:

  1. The nodes report different node statuses, and the first two fail to report at all (see the comparison sketch after this list):
> for IP in 10.207.50.244 10.207.45.68 10.187.108.20 10.187.105.7 10.187.106.120 10.207.34.108 ; do echo "==== $IP ====" ; ssh -t $USER@$IP "sudo cockroach node status --certs-dir=/cockroach/certs --host=$IP --port=11400" ; done
==== 10.207.50.244 ====
Error: rpc error: code = Unknown desc = unable to get liveness for 7: node not in the liveness table
Failed running "node"
==== 10.207.45.68 ====
Error: rpc error: code = Unknown desc = unable to get liveness for 7: node not in the liveness table
Failed running "node"
==== 10.187.108.20 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:18 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:19 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.187.105.7 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:19 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.187.106.120 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:29 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.207.34.108 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | false   |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | false   |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | false   |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | false   |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:29 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
  2. The nodes are also reporting differently in the admin UI, which is showing UNAVAILABLE RANGES.
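
To pinpoint which nodes actually disagree, the loop above can be reduced to just each node's liveness view and diffed against a reference node. This is only a sketch built on the same command as above: the IP list, certs dir, and port are copied from my setup, and the awk field positions assume the v2.0 status table layout shown in the output.

#!/usr/bin/env bash
# Sketch: reduce each node's `cockroach node status` output to
# "node_id is_live" pairs and diff them against the first reachable
# node's view, so the constantly changing updated_at column does not
# drown out real divergence. Add -t to ssh if sudo needs a TTY.
NODES=(10.207.50.244 10.207.45.68 10.187.108.20 10.187.105.7 10.187.106.120 10.207.34.108)
CERTS=/cockroach/certs
PORT=11400

liveness_view() {
  ssh "$USER@$1" \
      "sudo cockroach node status --certs-dir=$CERTS --host=$1 --port=$PORT" \
      2>/dev/null |
    awk -F'|' '$2 ~ /^ *[0-9]+ *$/ { gsub(/ /, "", $2); gsub(/ /, "", $7); print $2, $7 }'
}

ref="" ref_ip=""
for ip in "${NODES[@]}"; do
  view=$(liveness_view "$ip")
  if [ -z "$view" ]; then
    echo "==== $ip: no status output (liveness error?)"
  elif [ -z "$ref" ]; then
    ref=$view ref_ip=$ip
    echo "==== $ip: taken as reference view"
  elif [ "$view" = "$ref" ]; then
    echo "==== $ip: matches $ref_ip"
  else
    echo "==== $ip: DIVERGES from $ref_ip"
    diff <(echo "$ref") <(echo "$view")
  fi
done

For the UNAVAILABLE RANGES warning, the admin UI's problem-ranges report (under the debug pages, at /#/reports/problemranges in v2.0-era builds, if I remember the path correctly) shows which ranges are affected and on which nodes.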

PS: I was told to tag @asubiotto


(Alfonso Subiotto Marqués) #2

Hi @fat0,

I created a GitHub issue for this: https://github.com/cockroachdb/cockroach/issues/30142. Let’s continue the discussion there if that sounds good to you.


(Alfonso Subiotto Marqués) #3

@fat0, could you provide us with some more information on the GitHub issue?


(Fatima M) #4

We did more testing on our end: upgrading in place without decommissioning hosts works, but upgrading by replacing hosts fails. I posted this in the GitHub issue as well.
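
For anyone comparing the two flows, the replace-based one looked roughly like this. This is a sketch with a placeholder node id and an abbreviated join list; the flags follow the v2.0 docs as I understand them, not our literal runbook:

# Decommission the outgoing v2.0.4 node (presumably id 1 here, the id
# missing from the status tables above) and wait for its replicas to
# be moved off. The --host is any live node to connect through.
cockroach node decommission 1 \
    --certs-dir=/cockroach/certs --host=10.187.108.20 --port=11400 --wait=all

# On the replacement host, start the new v2.0.5 binary and have it
# join the surviving nodes.
cockroach start \
    --certs-dir=/cockroach/certs \
    --host=10.207.34.108 --port=11400 \
    --join=10.207.45.68:11400,10.187.108.20:11400

The in-place flow that works is the usual rolling restart: stop the v2.0.4 process on one node, swap the binary for v2.0.5, restart it, wait for it to rejoin and stabilize, then move on to the next node.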