All nodes are down and fail to restart

Hi all,
I deployed CockroachDB in secure mode on Rancher (Kubernetes). Everything went well, and the 3 nodes ran fine for months. Yesterday, one of the Rancher worker nodes failed. All of the microservices running on that worker node went down, while those running on other worker nodes kept running. However, all of the CockroachDB nodes went down, even the ones not running on the failed worker node. I tried to restart all the nodes, but every one of them fails to start. Here is the log:

E200819 04:32:26.949080 103925 server/admin.go:1339 node not in the liveness table

8/19/2020 11:32:27 AM W200819 04:32:27.297475 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:28 AM W200819 04:32:28.238318 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:28 AM I200819 04:32:28.520045 103850 kv/txn.go:739 [n1] async rollback failed: aborted in distSender: context deadline exceeded

8/19/2020 11:32:28 AM W200819 04:32:28.763871 181 server/node.go:749 [n1] [n1,s1]: unable to compute metrics: [n1,s1]: system config not yet available

8/19/2020 11:32:28 AM W200819 04:32:28.824876 179 kv/kvserver/store_rebalancer.go:223 [n1,s1,store-rebalancer] StorePool missing descriptor for local store

8/19/2020 11:32:29 AM I200819 04:32:29.099034 187 server/status/runtime.go:498 [n1] runtime stats: 169 MiB RSS, 154 goroutines, 59 MiB/55 MiB/121 MiB GO alloc/idle/total, 14 MiB/24 MiB CGO alloc/total, 9.8 CGO/sec, 3.2/0.2 %(u/s)time, 0.0 %gc (1x), 3.9 KiB/9.9 KiB (r/w)net

8/19/2020 11:32:29 AM W200819 04:32:29.376519 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:30 AM W200819 04:32:30.373194 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:31 AM W200819 04:32:31.236821 103999 kv/kvserver/replica_range_lease.go:554 can’t determine lease status due to node liveness error: node not in the liveness table

8/19/2020 11:32:31 AM github.com/cockroachdb/cockroach/pkg/kv/kvserver.init

8/19/2020 11:32:31 AM /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/node_liveness.go:44

8/19/2020 11:32:31 AM runtime.doInit

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/proc.go:5222

8/19/2020 11:32:31 AM runtime.doInit

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/proc.go:5217

8/19/2020 11:32:31 AM runtime.doInit

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/proc.go:5217

8/19/2020 11:32:31 AM runtime.doInit

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/proc.go:5217

8/19/2020 11:32:31 AM runtime.main

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/proc.go:190

8/19/2020 11:32:31 AM runtime.goexit

8/19/2020 11:32:31 AM /usr/local/go/src/runtime/asm_amd64.s:1357

8/19/2020 11:32:31 AM W200819 04:32:31.471022 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:31 AM E200819 04:32:31.952367 103975 server/admin.go:1339 node not in the liveness table

8/19/2020 11:32:32 AM W200819 04:32:32.396640 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:33 AM W200819 04:32:33.258134 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:34 AM W200819 04:32:34.263981 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:35 AM W200819 04:32:35.283587 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:36 AM W200819 04:32:36.185930 173 kv/kvserver/store.go:1631 [n1,s1,r6/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r6: replica (n1,s1):1 not lease holder; lease holder unknown

8/19/2020 11:32:36 AM W200819 04:32:36.238119 104057 kv/kvserver/replica_range_lease.go:554 can’t determine lease status due to node liveness error: node not in the liveness table

8/19/2020 11:32:36 AM github.com/cockroachdb/cockroach/pkg/kv/kvserver.init

8/19/2020 11:32:36 AM /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/node_liveness.go:44

I wonder if this is running into https://github.com/cockroachdb/cockroach/issues/37906. This issue is severe but we should be addressing it for the upcoming release.

I traced the problem again and found that it is not CockroachDB: the cluster agent and CoreDNS failed to recover after the worker node went down. As a result, none of the CockroachDB workloads could connect to the outside network, so the nodes could not gossip with each other.
I restarted the cluster agent and CoreDNS in the Rancher system, and now everything is working well.
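
For anyone hitting the same symptoms, here is a rough sketch of a check that can be run from inside a pod to see whether cluster DNS is resolving the peer addresses at all. It is only an illustration, not something from the CockroachDB or Rancher tooling: the function name `check_dns` and the example hostnames are made up, and 26257 is assumed to be the port the nodes listen on (the CockroachDB default).

```python
import socket
from typing import Callable, Dict, Iterable, List, Optional


def check_dns(
    hosts: Iterable[str],
    port: int = 26257,  # assumption: CockroachDB default listen port
    resolver: Callable = socket.getaddrinfo,
) -> Dict[str, Optional[List[str]]]:
    """Resolve each host name and report the addresses found.

    Returns a mapping of host -> sorted list of unique IP addresses,
    or None when the name could not be resolved.
    """
    results: Dict[str, Optional[List[str]]] = {}
    for host in hosts:
        try:
            infos = resolver(host, port, proto=socket.IPPROTO_TCP)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address for both IPv4 and IPv6.
            results[host] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Name resolution failed, e.g. because the DNS service is down.
            results[host] = None
    return results
```

Calling something like `check_dns(["cockroachdb-0.cockroachdb", "cockroachdb-1.cockroachdb"])` (hypothetical names, substitute the ones from your own join list) and getting `None` for every entry would point at DNS rather than at CockroachDB itself.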

Thanks for your reply.