Upgrade from v2.0.4 to v2.0.5 breaks nodes in various ways

I took a 6-node v2.0.4 cluster and decommissioned one node, which was then replaced by one v2.0.5 node in the same cluster (the replacement steps are sketched after the observations below). Now the nodes in this cluster seem broken in various ways, and I am not sure whether the cluster is healthy enough to continue the rolling upgrade:

  1. The nodes report different node statuses, and the first two fail to report at all:
> for IP in 10.207.50.244 10.207.45.68 10.187.108.20 10.187.105.7 10.187.106.120 10.207.34.108 ; do echo "==== $IP ====" ; ssh -t $USER@$IP "sudo cockroach node status --certs-dir=/cockroach/certs --host=$IP --port=11400" ; done
==== 10.207.50.244 ====
Error: rpc error: code = Unknown desc = unable to get liveness for 7: node not in the liveness table
Failed running "node"
==== 10.207.45.68 ====
Error: rpc error: code = Unknown desc = unable to get liveness for 7: node not in the liveness table
Failed running "node"
==== 10.187.108.20 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:18 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:19 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.187.105.7 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:19 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.187.106.120 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | true    |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | true    |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | true    |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | true    |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:29 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
==== 10.207.34.108 ====
+----+----------------------+--------+---------------------+---------------------+---------+
| id |       address        | build  |     updated_at      |     started_at      | is_live |
+----+----------------------+--------+---------------------+---------------------+---------+
|  2 | 10.207.50.244:11400  | v2.0.4 | 2018-09-11 21:05:22 | 2018-09-11 17:21:02 | false   |
|  3 | 10.207.45.68:11400   | v2.0.4 | 2018-09-11 21:05:23 | 2018-09-11 17:21:03 | false   |
|  4 | 10.187.108.20:11400  | v2.0.4 | 2018-09-11 21:05:24 | 2018-09-11 17:24:23 | false   |
|  5 | 10.187.105.7:11400   | v2.0.4 | 2018-09-11 21:05:25 | 2018-09-11 17:24:25 | false   |
|  6 | 10.187.106.120:11400 | v2.0.4 | 2018-09-11 21:05:28 | 2018-09-11 17:24:28 | false   |
|  7 | 10.207.34.108:11400  | v2.0.5 | 2018-09-11 21:05:29 | 2018-09-11 20:18:08 | true    |
+----+----------------------+--------+---------------------+---------------------+---------+
(6 rows)
  2. The nodes report differently in the UI and show UNAVAILABLE RANGES; a way to cross-check this outside the UI is sketched below.
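For reference, the UNAVAILABLE RANGES warning can be cross-checked outside the UI via the Prometheus metrics each node serves at /_status/vars. This is only a sketch: the HTTP port (8080) is an assumption, since the nodes above use 11400 as the RPC port and the HTTP port may be configured differently in our setup.

# Grep the per-store range metrics on every node (HTTP port 8080 is an
# assumption; -k skips cert verification against our self-signed CA).
> for IP in 10.207.50.244 10.207.45.68 10.187.108.20 10.187.105.7 10.187.106.120 10.207.34.108 ; do echo "==== $IP ====" ; curl -sk "https://$IP:8080/_status/vars" | grep -E '^ranges_(unavailable|underreplicated)' ; done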

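For context, the replacement was done roughly as follows. This is a sketch rather than the exact commands: the decommissioned node id (1) is inferred from its absence in the status tables above, and the --join list is illustrative.

# Mark the old node for decommissioning from any live node and wait for
# its replicas to be transferred away (node id 1 is inferred, not confirmed).
> cockroach node decommission 1 --certs-dir=/cockroach/certs --host=10.187.108.20 --port=11400

# Start the replacement v2.0.5 node on the new host, joining the same
# cluster (the --join list here is illustrative).
> cockroach start --certs-dir=/cockroach/certs --host=10.207.34.108 --port=11400 --join=10.187.108.20:11400,10.187.105.7:11400 --background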
PS: Was told to tag @asubiotto

Hi @fat0,

I created a GitHub issue for this: https://github.com/cockroachdb/cockroach/issues/30142. Let’s continue the discussion there if that sounds good to you.

@fat0, could you provide us with some more information on the GitHub issue?

We did more testing on our end: upgrading in place without decommissioning hosts works, but upgrading by replacing hosts fails. Posted this in the GitHub issue as well. The in-place variant that worked is sketched below.
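This is a per-host sketch, applied one host at a time; the binary path and the --join list are assumptions from our environment, not the exact commands we ran.

# Drain and stop the node gracefully.
> cockroach quit --certs-dir=/cockroach/certs --host=10.187.108.20 --port=11400

# Swap in the v2.0.5 binary (path is an assumption).
> sudo cp cockroach-v2.0.5 /usr/local/bin/cockroach

# Restart the node with the new binary, keeping the same store and flags
# as before (the --join list here is illustrative).
> cockroach start --certs-dir=/cockroach/certs --host=10.187.108.20 --port=11400 --join=10.187.105.7:11400,10.187.106.120:11400 --background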