Upgrade from v2.0.6 to v2.1.6: node status returns nothing

Cluster information:
1. 9 nodes, all running the v2.0.6 community edition (secure mode).
2. Replication zone settings: default.
3. Each node holds about 1.7 TB of data.

Background:
This cluster normally handles a heavy query load. One day I ran "show databases;" and it took 2 hours to return the result (there are 245 databases). A statement like "grant select on db_name.* to user_name" took 45 minutes (the database has 110 tables). So I thought an upgrade might make things better.
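
For reference, these are the two slow statements (db_name and user_name stand in for the real identifiers):

  show databases;                          -- took about 2 hours for 245 databases
  grant select on db_name.* to user_name;  -- took about 45 minutes for 110 tables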

Upgrade details:
I upgraded the cluster by following the doc below:
https://www.cockroachlabs.com/docs/v2.1/upgrade-cockroach-version.html#main-content

1. Kill the process with "pkill cockroach".

2. Run "ps -ef | grep cockroach" repeatedly until the cockroach process is gone.

3. Replace the cockroach binary with the v2.1.6 version.

4. Start the node with a command of this form (values omitted; see the filled-in example after step 5):
cockroach start
--certs-dir=certs
--advertise-addr=
--port=
--join=

5. Check the result until ranges_unavailable and ranges_underreplicated are zero on all nodes:
cockroach --certs-dir=certs/ --port= --host= node status --all
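
For example, with hypothetical values (the 10.0.0.x addresses and port 26257 are placeholders, not my real settings), the commands in steps 4 and 5 would look roughly like this:

  # placeholder addresses/port; substitute your own
  cockroach start --certs-dir=certs --advertise-addr=10.0.0.3 --port=26257 --join=10.0.0.1:26257,10.0.0.2:26257,10.0.0.3:26257

  cockroach node status --all --certs-dir=certs --host=10.0.0.3 --port=26257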

Problem:
Here is the thing: upgrading node 1 and node 2 succeeded, but after I did the same on node 3, the check command returned nothing, no matter how long I waited. At the same time, on node 1 or node 2, checking a single node's status like this:

cockroach --certs-dir=certs/ --port= --host= node status 1

still works. However, when I log in to the cluster, I can't run any query, not even "select * from system.users;".

Downgrade:
After the 5 steps above, the upgrade seemed to have failed, so I downgraded node 1, node 2, and node 3 and then restarted all the nodes one by one. Currently, the cluster still can't return node status and still can't run queries.

IO and Net:
I checked with iotop and iostat; it looks like the nodes are still synchronizing data.
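
I watched the disks with something like the following (the exact flags here are illustrative, not copied from my shell history):

  iostat -x 2   # extended per-device stats every 2 seconds
  iotop -o      # only show processes currently doing I/O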

Log:
W190430 03:04:11.070695 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:11.130266 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:12.130361 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:13.130462 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:14.130565 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.130659 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.570810 2386 storage/node_liveness.go:504 [n8,hb] slow heartbeat took 4.5s
W190430 03:04:15.570844 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:16.130760 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:17.133088 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:18.133185 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:19.133301 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
I190430 03:04:19.134121 2348 server/status/runtime.go:219 [n8] runtime stats: 7.9 GiB RSS, 1263 goroutines, 1.2 GiB/1.4 GiB/3.0 GiB GO alloc/idle/total, 3.2 GiB/4.0 GiB CGO alloc/total, 30022.09cgo/sec, 5.90/0.55 %(u/s)time, 0.00 %gc (4x)

Does anyone know how to fix this?

Hey @edgar,

It would be useful if you could provide a debug zip so we can take a look at all the logs. We have instructions on how to generate that here, and you can upload the debug zip here.
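
Roughly, the command looks like this (the host and output filename are placeholders; use your own connection settings):

  cockroach debug zip ./cockroach-debug.zip --certs-dir=certs --host=10.0.0.1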

Hey @edgar,

To follow up, could you provide us with a trace for the grant select on db_name.* to user_name statement?

From the built-in SQL client, you'd run the following:

  1. set tracing = on
  2. grant select on db_name.* to user_name
  3. set tracing = off
  4. show trace for session

This would print out a long trace that would be useful for debugging.
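
If it's easier, you can also run the whole thing non-interactively and redirect the output to a file (the host and filename below are placeholders):

  cockroach sql --certs-dir=certs --host=10.0.0.1 \
    -e "set tracing = on; grant select on db_name.* to user_name; set tracing = off; show trace for session;" \
    > grant_trace.txt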

Thanks,

Ron

I can't upload a txt or zip file. Is a pdf file OK for you to analyze? Or please give me your email and I will send it to you.

After rebooting and rejoining all the nodes, I executed the command again: "grant select on table db_name.* to user_name;". 112 tables took 7 minutes. I will upload the tracing log later.

Thinking:
My situation is this: I want a user to be able to access all the data (many databases and many tables). What should I do without the admin role, given that the admin role can only be used in the enterprise version?
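
For now, the only approach I can think of is to script the grant over every database from the shell, along these lines (the connection values and user name are placeholders, and this is only a sketch):

  for db in $(cockroach sql --certs-dir=certs --host=10.0.0.1 --format=tsv -e "show databases;" | tail -n +2); do
    # skip the internal system database
    [ "$db" = "system" ] && continue
    cockroach sql --certs-dir=certs --host=10.0.0.1 -e "grant select on ${db}.* to user_name;"
  done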

Thanks. I have solved it. Very simple operation: just shut down all the nodes (make sure they are all down at the same time), then restart them one by one with --join='all_ip:port…'. Wait until all the nodes have rejoined the cluster. Two minutes later, check with "node status --all". Bingo, it works.