Upgrade from v2.0.6 to v2.1.6: node status returns nothing

(Edgar) #1

Cluster information:
1. 9 nodes, all on v2.0.6 community edition (secure mode).
2. Replication zones setting: default.
3. Each node holds 1.7 TB of data.

Background:
Normally a lot of queries run on this cluster. One day I ran "show databases;" and it took 2 hours to get the result (there are 245 databases). Then I executed something like "grant select on db_name.* to user_name", and it took 45 minutes (110 tables in the database). So I thought upgrading might make things better.

Upgrade detail:
I upgraded the cluster following the steps in the link below (a consolidated per-node sketch follows the numbered steps):
https://www.cockroachlabs.com/docs/v2.1/upgrade-cockroach-version.html#main-content

1. Kill the process with "pkill cockroach".

2. Run "ps -ef | grep cockroach" until the cockroach process is gone.

3. Replace the cockroach binary with the v2.1.6 version.

4. Start the node with a command like this:
cockroach start
--certs-dir=certs
--advertise-addr=
--port=
--join=

5. Check the result until ranges_unavailable and ranges_underreplicated are zero on all nodes:
cockroach --certs-dir=certs/ --port= --host= node status --all
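Put together, each node got roughly the following treatment (the <addr>, <port>, <join_list> placeholders and the binary path are illustrative, not my exact values):

    # Per-node upgrade sketch; placeholders stand in for my real values.
    pkill cockroach                                          # stop the node
    while pgrep -x cockroach > /dev/null; do sleep 1; done   # wait for the process to exit
    cp /path/to/cockroach-v2.1.6 /usr/local/bin/cockroach    # swap in the new binary (path is an assumption)
    cockroach start --certs-dir=certs --advertise-addr=<addr> --port=<port> --join=<join_list> --background
    # from any node, wait until ranges_unavailable and ranges_underreplicated are zero:
    cockroach node status --all --certs-dir=certs/ --port=<port> --host=<addr>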

Problem:
Here is the thing: upgrading node 1 and node 2 succeeded, but when I did the same thing on node 3, I couldn't get any result from the check, no matter how long I waited. At the same time, on node 1 or node 2, checking a single node's status like this:

cockroach --certs-dir=certs/ --port= --host= node status 1

It works. But when I log in to the cluster, I can't query anything, not even "select * from system.users;".

Downgrade:
After the 5 steps above, the upgrade seemed to have failed, so I downgraded node 1, node 2, and node 3. After that I restarted all the nodes one by one. Currently the cluster still can't get the node status and still can't run queries.

IO and Net:
I checked with iotop and iostat; it looks like the nodes are still synchronizing data.
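For reference, the kind of check I mean (these exact flags are an assumption, not a transcript of my session):

    iostat -x 5        # extended per-device statistics every 5 seconds; watch %util and await
    iotop -o -P -d 5   # only processes actually doing I/O, refreshed every 5 seconds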

Log:
W190430 03:04:11.070695 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:11.130266 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:12.130361 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:13.130462 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:14.130565 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.130659 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.570810 2386 storage/node_liveness.go:504 [n8,hb] slow heartbeat took 4.5s
W190430 03:04:15.570844 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:16.130760 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:17.133088 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:18.133185 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:19.133301 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
I190430 03:04:19.134121 2348 server/status/runtime.go:219 [n8] runtime stats: 7.9 GiB RSS, 1263 goroutines, 1.2 GiB/1.4 GiB/3.0 GiB GO alloc/idle/total, 3.2 GiB/4.0 GiB CGO alloc/total, 30022.09cgo/sec, 5.90/0.55 %(u/s)time, 0.00 %gc (4x)

Does anyone know how to fix this?

(Ron Arévalo) #2

Hey @edgar,

It would be useful if you could provide a debug zip so we can take a look at all the logs. We have instructions on how to generate it here, and you can upload the debug zip here.
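For reference, generating one looks roughly like this (the host and output path are placeholders, not values from your cluster):

    cockroach debug zip ./cockroach-debug.zip --certs-dir=certs --host=<any_node_addr>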

(Ron Arévalo) #3

Hey @edgar,

To follow up, could you provide us with a trace for the grant select on db_name.* to user_name statement?

From the built-in SQL client, you’d run the following:

  1. set tracing = on;
  2. grant select on db_name.* to user_name;
  3. set tracing = off;
  4. show trace for session;

This will print out a long trace that would be useful for debugging.
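If it's easier, the same steps can be run non-interactively in one session with something like this (the connection flags and output file are placeholders for your cluster, not something you must use):

    cockroach sql --certs-dir=certs --host=<addr> --port=<port> -e "
    SET tracing = on;
    GRANT SELECT ON db_name.* TO user_name;
    SET tracing = off;
    SHOW TRACE FOR SESSION;" > grant_trace.txt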

Thanks,

Ron

(Edgar) #4

I can’t upload a txt or zip file. Is a PDF file OK for you to analyze? Or please give me your email and I will send it to you.

(Edgar) #5

After rebooting and rejoining all the nodes, I executed the command again: "grant select on table db_name.* to user_name;". 112 tables took 7 minutes. I will upload the tracing log later.

Thinking:
In a situation like mine, if I want one user to be able to access all the data (so many databases and so many tables), what should I do without the admin role? The admin role can only be used in the enterprise version.
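One idea I'm considering is just looping over the databases, something like this (the connection flags and the header-skipping here are assumptions about my setup, not something I have verified):

    # Grant SELECT on every database's tables to one user, one GRANT per database.
    # You would likely want to skip the system databases.
    for db in $(cockroach sql --certs-dir=certs --host=<addr> --port=<port> \
                 --format=tsv -e "SHOW DATABASES;" | tail -n +2); do
      cockroach sql --certs-dir=certs --host=<addr> --port=<port> \
        -e "GRANT SELECT ON ${db}.* TO user_name;"
    done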

(Edgar) #6

Thanks. I have solved it. It was a very simple operation: just reboot all the nodes (making sure all the nodes are down at the same time), then restart them one by one with --join='all_ip:port…'. Wait until all the nodes have rejoined the cluster. Two minutes later, check with …node status --all. Bingo, it works.
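Roughly what that looked like on my side (the node addresses and port below are placeholders, not my real join list):

    # 1. Stop every node so the whole cluster is down at the same time.
    pkill cockroach                                   # run on each node
    # 2. Bring the nodes back one by one, each with the full join list.
    cockroach start --certs-dir=certs --advertise-addr=<this_node_addr> --port=<port> \
      --join=node1:<port>,node2:<port>,...            # list every node of the cluster
    # 3. A couple of minutes later, check from any node.
    cockroach node status --all --certs-dir=certs --port=<port> --host=<any_node_addr>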