Cluster information:
1. 9 nodes, all running v2.0.6 community edition (secure mode).
2. Replication zone settings: default.
3. Each node holds about 1.7 TB of data.
Background:
Normally this cluster serves a heavy query load. One day I ran "show databases;" and it took 2 hours to return the result (there are 245 databases). Running "grant select on db_name.* to user_name" took about 45 minutes (the database has 110 tables). So I thought an upgrade might improve things.
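For reference, these are the two statements that were so slow, run through the built-in SQL shell (the host, port, db_name, and user_name are placeholders; the real values are redacted):
# ~2 hours with 245 databases:
cockroach sql --certs-dir=certs --host=<node-addr> --port=<port> -e "SHOW DATABASES;"
# ~45 minutes on a database with 110 tables:
cockroach sql --certs-dir=certs --host=<node-addr> --port=<port> -e "GRANT SELECT ON db_name.* TO user_name;"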
Upgrade details:
I upgraded the cluster following the guide below; a consolidated sketch of what I ran on each node follows the steps:
https://www.cockroachlabs.com/docs/v2.1/upgrade-cockroach-version.html#main-content
1. Kill the process with "pkill cockroach".
2. Run "ps -ef | grep cockroach" and wait until the cockroach process is gone.
3. Replace the cockroach binary with the v2.1.6 version.
4. Start the node with a command like this:
cockroach start --certs-dir=certs --advertise-addr= --port= --join=
5. Check the result, waiting until ranges_unavailable and ranges_underreplicated are zero on all nodes:
cockroach --certs-dir=certs/ --port= --host= node status --all
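Put together, the sequence I ran on each node looked roughly like this (the binary path, addresses, and ports are placeholders for the redacted values):
# stop the node and wait for the process to exit
pkill cockroach
ps -ef | grep cockroach
# swap in the new binary (placeholder paths)
cp /path/to/cockroach-v2.1.6 /usr/local/bin/cockroach
# restart the node
cockroach start --certs-dir=certs --advertise-addr=<node-addr> --port=<port> --join=<node1,node2,node3>
# wait for ranges_unavailable and ranges_underreplicated to reach 0
cockroach --certs-dir=certs/ --port=<port> --host=<node-addr> node status --all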
Problem:
And here is the thing: upgrading node 1 and node 2 succeeded, but after I did the same on node 3, I could not get any check result, no matter how long I waited. At the same time, on node 1 or node 2, checking a single node's status like this:
cockroach --certs-dir=certs/ --port= --host= node status 1
still works. But when I log in to the cluster, I cannot query anything, not even "select * from system.users;".
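Concretely, from node 1 or node 2 (addresses and ports are placeholders):
# this still returns a result:
cockroach --certs-dir=certs/ --port=<port> --host=<node-addr> node status 1
# but any SQL, even against a system table, just hangs:
cockroach sql --certs-dir=certs/ --port=<port> --host=<node-addr> -e "SELECT * FROM system.users;"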
Downgrade:
After the five steps above it seemed the upgrade had failed, so I downgraded node 1, node 2, and node 3, and then restarted all the nodes one by one. Currently the cluster still cannot return node status and still cannot serve queries.
IO and Net:
I checked with iotop and iostat (commands below); it looks like the nodes are still synchronizing data.
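These are roughly the commands I used to watch the disks (the 5-second interval is arbitrary):
# per-device utilization and throughput, refreshed every 5 seconds
iostat -x 5
# only show processes currently doing I/O
iotop -o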
Log:
W190430 03:04:11.070695 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:11.130266 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:12.130361 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:13.130462 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:14.130565 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.130659 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:15.570810 2386 storage/node_liveness.go:504 [n8,hb] slow heartbeat took 4.5s
W190430 03:04:15.570844 2386 storage/node_liveness.go:441 [n8,hb] failed node liveness heartbeat: context deadline exceeded
W190430 03:04:16.130760 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:17.133088 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:18.133185 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W190430 03:04:19.133301 2387 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
I190430 03:04:19.134121 2348 server/status/runtime.go:219 [n8] runtime stats: 7.9 GiB RSS, 1263 goroutines, 1.2 GiB/1.4 GiB/3.0 GiB GO alloc/idle/total, 3.2 GiB/4.0 GiB CGO alloc/total, 30022.09cgo/sec, 5.90/0.55 %(u/s)time, 0.00 %gc (4x)
Does anyone know how to fix this?