CockroachDB down. Is there a problem in my setup / settings?

Hi!

CockroachDB went down after I deleted 6 of the 9 crdb nodes.

Our own settings may be causing some problems, and our setup could also be incorrect in general.
I still wanted to post this; maybe someone can pinpoint where the problems are.
All 9 nodes are in the is_available=false state:

  id |                address                |  build  |            started_at            |            updated_at            | is_available | is_live 
+----+---------------------------------------+---------+----------------------------------+----------------------------------+--------------+---------+
   1 | <laptopIP>:26257                   | v19.1.5 | 2019-10-18 08:30:59.752066+00:00 | 2019-10-18 09:58:13.289858+00:00 | false        | true    
   2 | <laptopIP>:26258                   | v19.1.5 | 2019-10-18 08:31:00.280316+00:00 | 2019-10-18 09:58:13.849628+00:00 | false        | true    
   3 | <laptopIP>:26259                   | v19.1.5 | 2019-10-18 08:31:00.776517+00:00 | 2019-10-18 09:58:14.317735+00:00 | false        | true    
   4 | <lab0025vipDNS>:31852 | v19.1.5 | 2019-10-18 09:58:37.939029+00:00 | 2019-10-18 09:58:11.5407+00:00   | false        | true    
   5 | <lab0025vipDNS>:31850 | v19.1.5 | 2019-10-18 09:58:42.767674+00:00 | 2019-10-18 09:58:10.933123+00:00 | false        | false   
   6 | <lab0025vipDNS>:31851 | v19.1.5 | 2019-10-18 09:58:28.921486+00:00 | 2019-10-18 09:58:11.068483+00:00 | false        | true    
   7 | <lab0025vipDNS>:31860 | v19.1.5 | 2019-10-18 09:58:08.148036+00:00 | 2019-10-18 09:58:12.71007+00:00  | false        | true    
   8 | <lab0025vipDNS>:31861 | v19.1.5 | 2019-10-18 09:58:23.328977+00:00 | 2019-10-18 09:58:08.250499+00:00 | false        | true    
   9 | <lab0025vipDNS>:31862 | v19.1.5 | 2019-10-18 09:58:11.963717+00:00 | 2019-10-18 09:57:56.132863+00:00 | false        | true 

I had deleted (== restarted) the pods of 2 DCs in one go:

kubectl delete pod crdboslo-cockroachdb-0 crdboslo1-cockroachdb-0 crdboslo2-cockroachdb-0 -noslo
and immediately after that also:
kubectl delete pod crdbtampere-cockroachdb-0 crdbtampere1-cockroachdb-0 crdbtampere2-cockroachdb-0 -ntampere

After that, the cluster did not come up anymore.

Logs from a pod in the “tampere” locality:

W191018 10:15:36.443436 293 storage/node_liveness.go:523  [n5,hb] slow heartbeat took 4.5s
W191018 10:15:36.443486 293 storage/node_liveness.go:463  [n5,hb] failed node liveness heartbeat: aborted in distSender: context deadline exceeded
I191018 10:15:36.634165 7656 gossip/client.go:128  [n5] started gossip client to <laptopIP>:26259
I191018 10:15:36.636931 7656 gossip/client.go:133  [n5] closing client to n3 (<laptopIP>:26259): received forward from n3 to 1 (<laptopIP>:26257)
I191018 10:15:36.637156 7588 gossip/client.go:128  [n5] started gossip client to <laptopIP>:26257
I191018 10:15:36.664776 7588 gossip/client.go:133  [n5] closing client to n1 (<laptopIP>:26257): received forward from n1 to 7 (<lab0025vipDNS>:31860)
W191018 10:15:36.665051 7456 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31860 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31860: connect: connection refused". Reconnecting...
W191018 10:15:37.040663 7660 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31861 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31861: connect: connection refused". Reconnecting...
W191018 10:15:37.040668 7593 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31851 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31851: connect: connection refused". Reconnecting...
W191018 10:15:37.040668 7670 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31852 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31852: connect: connection refused". Reconnecting...
W191018 10:15:37.040675 7611 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31862 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31862: connect: connection refused". Reconnecting...

Logs from a pod in the “oslo” locality:

W191018 10:18:15.559013 75714 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial <lab0025vipDNS>:31861: context canceled; please retry.
I191018 10:18:15.560408 75759 gossip/client.go:133  [n7] closing client to n3 (<laptopIP>:26259): received forward from n3 to n1 (<laptopIP>:26257); already have active connection, skipping
W191018 10:18:15.608416 75795 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31861 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31861: connect: connection refused". Reconnecting...
W191018 10:18:15.799085 75711 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31850 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191018 10:18:15.799134 75516 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31852 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W191018 10:18:15.799164 75516 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial <lab0025vipDNS>:31852: grpc: the connection is closing; please retry.
W191018 10:18:15.799140 75711 vendor/google.golang.org/grpc/clientconn.go:953  Failed to dial <lab0025vipDNS>:31850: context canceled; please retry.
W191018 10:18:16.008828 75736 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31850 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31850: connect: connection refused". Reconnecting...
W191018 10:18:16.008919 75669 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31862 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31862: connect: connection refused". Reconnecting...
W191018 10:18:16.009003 75813 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {<lab0025vipDNS>:31852 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp <IP_of_lab0025vip>:31852: connect: connection refused". Reconnecting...

3 localities:
One locality is running on a laptop (in docker-compose).
Two are running in one lab, in k8s, in two different namespaces.
There are 3 crdb nodes in each locality, so 9 crdb nodes in total.
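
For background, each lab node is assumed to be started roughly like this (a sketch only; the exact flags are assumptions, but the advertised addresses match the node status above, i.e. the Ingress tcp-service ports rather than the pod ports):

# Rough sketch of how one lab node is started (flags are assumptions).
# The advertise address is the external Ingress tcp-service port on
# <lab0025vipDNS>, matching the addresses shown in the node status above.
cockroach start --insecure \
  --locality=dc=tampere \
  --listen-addr=0.0.0.0:26257 \
  --advertise-addr=<lab0025vipDNS>:31850 \
  --join=<laptopIP>:26257,<laptopIP>:26258,<laptopIP>:26259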

In the lab:
ConfigMap --> tcp-services in the Ingress controller.
tcp-service in the Ingress --> pod-specific svc --> a certain pod

Each pod in the lab is installed with a separate statefulset, i.e. its own separate “helm install” execution.
(A workaround to get forward with testing.)
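
Roughly, the port mapping looks like this (a sketch only: the ingress-nginx namespace, the service names, and which external port maps to which pod are assumptions; the ports themselves are the ones from the node status above):

# Sketch of the nginx-ingress tcp-services ConfigMap.
# Entry format: "<external port>": "<namespace>/<service>:<service port>".
# Service names and the ingress-nginx namespace are assumptions.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "31850": "tampere/crdbtampere-cockroachdb-public:26257"
  "31851": "tampere/crdbtampere1-cockroachdb-public:26257"
  "31852": "tampere/crdbtampere2-cockroachdb-public:26257"
  "31860": "oslo/crdboslo-cockroachdb-public:26257"
  "31861": "oslo/crdboslo1-cockroachdb-public:26257"
  "31862": "oslo/crdboslo2-cockroachdb-public:26257"
EOF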

The system has recovered when I have deleted (restarted) pods from only one DC / locality, i.e. 3 pods:

kubectl delete pod crdboslo-cockroachdb-0 crdboslo1-cockroachdb-0 crdboslo2-cockroachdb-0 -noslo

This time I deleted 6 pods in a row.

Hi @roachman

When removing servers you must decommission them one by one, waiting until the replicas rebalance.

If you suddenly delete them without the replicas rebalancing, then you’ll suffer data loss and the cluster won’t be able to recover.

Most likely, when you deleted just one regional cluster, the other two regional clusters were able to recover because more than 50% of the replicas were available.
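
For example, roughly like this (a sketch only, insecure mode assumed; the node ID and the addresses are just examples taken from your status output):

# Decommission one node at a time and wait until its replicas have been
# moved away before taking down the next node (IDs/addresses are examples).
./cockroach node decommission 4 --insecure --host=<laptopIP>:26257 --wait=all

# Watch the progress / remaining replica counts per node:
./cockroach node status --decommission --insecure --host=<laptopIP>:26257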

Hi!
Thanks for the comment.
I am not trying to decommission (or remove) crdb nodes.
“kubectl delete” only causes the pod to be restarted; the node does not get removed, it is only restarted.

Could connectivity in k8s also get broken / stuck somewhere?
For example in the tcp-service in the Ingress (ConfigMap), or in the ClusterIP service?

CockroachDB mode: insecure

tcpdump shows that TCP traffic reaches the tcp-service in the Ingress.

But TCP traffic does not go from the Ingress/svc to the pod; that part is silent.
How can I check whether CockroachDB is responding behind port 31850 in the pod?
With “./cockroach node status …” or “./cockroach sql …” I only get e.g. an i/o timeout.
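
One thing I can try from inside the pod, bypassing the Ingress path (a sketch; the pod/namespace names are examples from the delete commands above, and it assumes the cockroach binary, and curl for the second command, exist in the container image):

# Ask the local crdb process directly, bypassing Ingress / tcp-services:
kubectl exec -it crdbtampere-cockroachdb-0 -ntampere -- \
  ./cockroach node status --insecure --host=localhost:26257

# Or hit the HTTP health endpoint (assumes curl exists in the image):
kubectl exec -it crdbtampere-cockroachdb-0 -ntampere -- \
  curl -s "http://localhost:8080/health?ready=1"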

Auto-heal of crdb:

  • Is the auto-heal node-wide?
    That is, are all ranges removed from the crdb node, and is the whole crdb node recreated?
    Are the healthy ranges on that crdb node also removed and recreated?
  • Or is the auto-heal range-specific? (One way I try to inspect this is sketched below.)
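
A sketch of what I use to look at the range-level state (insecure mode, connecting to one of the laptop nodes); the per-node counters such as ranges_underreplicated and ranges_unavailable should show whether the repair happens per range:

# Range-level view of the replication/repair state per node:
./cockroach node status --ranges --insecure --host=<laptopIP>:26257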

I am also trying with the following setup:

All 9 crdb nodes running on the laptop, with docker-compose, v19.1.4.

The problem has not been reproduced with that setup yet; a few test rounds have been executed.

CockroachDB also seems to work OK if e.g. 8 crdb nodes are installed in one data center in k8s, in one locality; restarts then work fine.


The 3-DC case: one locality on the laptop, and 2 localities in k8s in the lab.
So, in the 3-DC case,
there may be something wrong with the k8s communication / connections (and not in CockroachDB?).
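
One check (a sketch; pod/namespace names are examples from the delete commands above) is to exercise the full advertised-address path from inside another pod, which should show whether the problem is in the k8s/Ingress path rather than in CockroachDB itself:

# From an oslo pod, dial a tampere node through its advertised address
# (Ingress tcp-service -> svc -> pod). An i/o timeout here, while
# localhost:26257 answers inside the target pod, points at the k8s path.
kubectl exec -it crdboslo-cockroachdb-0 -noslo -- \
  ./cockroach node status --insecure --host=<lab0025vipDNS>:31850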