Upgrade to v2.0-alpha.20180122

I have a 5-node idle cluster running v2.0-alpha.20171218. I shut down all nodes and did an upgrade to v2.0-alpha.20180122. The upgrade seemed to be successful; however, the Consistency Checker Queue keeps growing.
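In case it helps anyone watch the same thing, a rough sketch like the one below can poll the consistency checker queue length from each node's /_status/vars endpoint. The host names, the 8080 port, and the plain-HTTP assumption are placeholders for my setup, and queue_consistency_pending is my best guess at the Prometheus spelling of the queue gauge; adjust as needed.

```python
# Minimal sketch: scrape each node's Prometheus endpoint and print the
# consistency checker queue gauge. Hosts, port, scheme, and the exact
# metric name are assumptions; adapt them to your cluster.
import urllib.request

NODES = ["node1:8080", "node2:8080", "node3:8080", "node4:8080", "node5:8080"]
METRIC = "queue_consistency_pending"

for node in NODES:
    # /_status/vars serves metrics in Prometheus text format, one per line.
    body = urllib.request.urlopen(f"http://{node}/_status/vars", timeout=5).read().decode()
    for line in body.splitlines():
        if line.startswith(METRIC):
            print(node, line)
```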

I then sometimes see nodes shown as dead on the overview page; they come back to normal after a few minutes, but go dead again later.

When I check the logs on the servers, I see warnings like the following on all nodes:

W180126 16:52:57.404974 734198 kv/dist_sender.go:1261 [ts-poll,n1] have been waiting 1m0s sending RPC to r9005 (currently pending: [(n5,s5):3]) for batch: Merge [/System/tsd/cr.store.tscache.skl.write.rotations/1/10s/2018-01-26T16:00:00Z,/Min)
W180126 16:52:57.405259 228 kv/dist_sender.go:1261 [ts-poll,n1] have been waiting 1m0s sending RPC to r21 (currently pending: [(n3,s3):8]) for batch: Merge [/System/tsd/cr.store.valcount/1/10s/2018-01-26T16:00:00Z,/Min)
W180126 16:52:57.405306 734199 kv/dist_sender.go:1261 [ts-poll,n1] have been waiting 1m0s sending RPC to r3434 (currently pending: [(n1,s1):4]) for batch: Merge [/System/tsd/cr.store.valbytes/1/10s/2018-01-26T16:00:00Z,/Min)
W180126 16:53:00.329842 249 storage/node_liveness.go:426 [n1,hb] slow heartbeat took 4.5s
W180126 16:53:00.329860 249 storage/node_liveness.go:365 [n1,hb] failed node liveness heartbeat: context deadline exceeded
W180126 16:53:00.812591 735175 storage/node_liveness.go:426 [n1,s1,r1028/7:/System/tsd/cr.node.sql.…] slow heartbeat took 6.2s
W180126 16:53:00.812967 735177 storage/node_liveness.go:426 [n1,s1,r344/1:/System/tsd/cr.node.sql.mem.a…] slow heartbeat took 6.2s
W180126 16:53:00.813158 735148 storage/node_liveness.go:426 [n1,s1,r9218/6:/System/tsd/cr.node.sql.mem.a…] slow heartbeat took 6.2s
W180126 16:53:00.813452 735179 storage/node_liveness.go:426 [n1,s1,r72/1:/System/tsd/cr.node.sql.exec.…] slow heartbeat took 6.2s
W180126 16:53:00.813757 735234 storage/node_liveness.go:426 [n1,s1,r1544/6:/System/tsd/cr.node.distsende…] slow heartbeat took 6.2s
W180126 16:53:00.813939 735194 storage/node_liveness.go:426 [n1,s1,r9260/4:/System/tsd/cr.node.sql.dists…] slow heartbeat took 6.2s
W180126 16:53:00.814265 735225 storage/node_liveness.go:426 [n1,s1,r9303/1:/System/tsd/cr.node.sql.exec.…] slow heartbeat took 6.2s
W180126 16:53:00.814507 735238 storage/node_liveness.go:426 [n1,s1,r9025/1:/System/tsd/cr.node.sql.exec.…] slow heartbeat took 6.2s
W180126 16:53:00.814718 735227 storage/node_liveness.go:426 [n1,s1,r9457/4:/System/tsd/cr.node.{di…-ex…}] slow heartbeat took 6.2s
W180126 16:53:00.815050 735100 storage/node_liveness.go:426 [n1,s1,r231/7:/System/tsd/cr.node.sql.mem.a…] slow heartbeat took 6.2s
W180126 16:53:00.815257 735230 storage/node_liveness.go:426 [n1,s1,r3369/1:/System/tsd/cr.node.gossip.…] slow heartbeat took 6.2s
W180126 16:53:00.815521 735243 storage/node_liveness.go:426 [n1,s1,r206/1:/System/tsd/cr.node.gossip.co…] slow heartbeat took 6.2s
W180126 16:53:00.815807 735245 storage/node_liveness.go:426 [n1,s1,r9038/1:/System/tsd/cr.node.round-tri…] slow heartbeat took 6.2s
W180126 16:53:00.816024 735251 storage/node_liveness.go:426 [n1,s1,r3486/1:/System/tsd/cr.node.round-tri…] slow heartbeat took 6.2s
W180126 16:53:00.816294 735104 storage/node_liveness.go:426 [n1,s1,r6147/1:/System/tsd/cr.node.gossip.in…] slow heartbeat took 6.2s
W180126 16:53:00.816585 735253 storage/node_liveness.go:426 [n1,s1,r7446/1:/System/tsd/cr.node.sql.mem.c…] slow heartbeat took 6.2s
W180126 16:53:00.816798 735267 storage/node_liveness.go:426 [n1,s1,r9096/1:/System/tsd/cr.node.sql.dists…] slow heartbeat took 6.2s
W180126 16:53:00.817008 735269 storage/node_liveness.go:426 [n1,s1,r9092/1:/System/tsd/cr.node.sql.mem.c…] slow heartbeat took 6.2s
W180126 16:53:00.817234 735271 storage/node_liveness.go:426 [n1,s1,r204/1:/System/tsd/cr.node.sql.mem.c…] slow heartbeat took 6.2s
I180126 16:53:00.817613 174 storage/replica_proposal.go:196 [n1,s1,r1028/7:/System/tsd/cr.node.sql.…] new range lease repl=(n1,s1):7 seq=0 start=1516977068.248659932,5 epo=126 pro=1516985574.658469514,0 following repl=(n1,s1):7 seq=0 start=1516977068.248659932,5 epo=126 pro=1516984961.125563733,0
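To get a feel for how widespread these warnings are, a quick script along these lines can tally them per node from copies of the logs. The node1.log … node5.log file names are just placeholders for wherever you copy each node's cockroach.log.

```python
# Minimal triage sketch: count the two warning patterns seen above in each
# node's log copy. File names are hypothetical placeholders.
import re
from collections import Counter

PATTERNS = {
    "slow heartbeat": re.compile(r"slow heartbeat took"),
    "stuck RPC": re.compile(r"have been waiting .* sending RPC"),
}

for path in ["node1.log", "node2.log", "node3.log", "node4.log", "node5.log"]:
    counts = Counter()
    with open(path) as f:
        for line in f:
            for name, pat in PATTERNS.items():
                if pat.search(line):
                    counts[name] += 1
    print(path, dict(counts))
```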
From dstat, all nodes show constant disk read/write activity, and it looks like one CPU thread stays busy.

[dstat screenshot]
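As a rough cross-check of the dstat numbers, something like the sketch below (assuming psutil is installed) samples per-CPU busy percentage and disk throughput once a second; stop it with Ctrl-C.

```python
# Sample the busiest CPU and disk throughput roughly once per second.
# Assumes the third-party psutil package is available.
import psutil

prev = psutil.disk_io_counters()
while True:
    per_cpu = psutil.cpu_percent(interval=1.0, percpu=True)  # blocks ~1s
    cur = psutil.disk_io_counters()
    read_mb = (cur.read_bytes - prev.read_bytes) / 1e6
    write_mb = (cur.write_bytes - prev.write_bytes) / 1e6
    prev = cur
    print(f"busiest CPU: {max(per_cpu):.0f}%  read: {read_mb:.1f} MB/s  write: {write_mb:.1f} MB/s")
```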

The cluster became very unstable. I then downgraded to v2.0-alpha.20171218; the Consistency Checker Queue eventually cleared and the cluster looks to be back to normal.

Leo

Hi Leo,

Thanks for reaching out! This looks like an issue that we will want to investigate. Would you mind filing a new issue in our GitHub issue tracker? We’ll be able to better assign and track the investigation that way.

We added some additional functionality to the consistency checker between those two versions, but I don’t have a specific root cause for you right now.

Regards,
Matt T

I’ve actually gone ahead and logged the issue myself: https://github.com/cockroachdb/cockroach/issues/21824

Thank you very much, Matt. I will follow up on that ticket.