Volume node affinity conflict on Kubernetes (GCP)

I have deployed CockroachDB on Kubernetes with 3 nodes on GCP. I used an n1-highcpu-2 node pool, with the nodes spread across the asia-south1-a, asia-south1-b, and asia-south1-c availability zones, and set the pod replicas to 3. I used a pd-ssd volume when I deployed CockroachDB on Kubernetes. It was working fine for a long time, but suddenly I observed an error on a CockroachDB pod. The cockroachdb-0 pod shows as unschedulable, keeps restarting, and throws this error: "0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) had volume node affinity conflict."

Also, the node on which the cockroachdb-0 pod runs is reporting a MemoryPressure condition.
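For context on the scheduling error: a pd-ssd PersistentVolume on GCE is zonal, so a "volume node affinity conflict" typically means the schedulable nodes are in a different zone than the one the disk was provisioned in. Commands along these lines (the PV and node names are placeholders, and the zone label key varies by Kubernetes version) show where the disk is pinned, which zone each node is in, and the MemoryPressure condition:

kubectl describe pod cockroachdb-0
kubectl get pv <PV-NAME> -o yaml   # spec.nodeAffinity shows the zone the disk is pinned to
kubectl get nodes -L failure-domain.beta.kubernetes.io/zone
kubectl describe node <NODE-NAME>   # the Conditions section includes MemoryPressure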

Kindly guide me to resolve this issue, and let me know if any other information is required.

Hey @vishal,

Could you send us your debug zip? You can upload it here.

Also, as an aside, n1-highcpu-2 machines are under-provisioned for running CockroachDB with any sort of complex workload. The ideal configuration is nodes with 4-16 vCPUs and 8-64 GB of memory (2-4 GB of memory per vCPU).
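For example, on GKE a higher-spec node pool can be added next to the existing one with something roughly like this (the pool name, cluster name, and sizes below are placeholders, not your actual values):

gcloud container node-pools create crdb-pool --cluster=<YOUR-CLUSTER> --machine-type=n1-standard-4 --num-nodes=1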

Thanks!

@ronarev I am unable to generate the debug file; it says there is no such directory.

❯ kubectl exec -it cockroachdb-client-secure -- ./cockroach debug zip ./cockroach-data/logs/debug.zip --certs-dir=/cockroach-certs --host=cockroachdb-public
Error: open ./cockroach-data/logs/debug.zip: no such file or directory
Failed running "debug zip"
command terminated with exit code 1

Also, please find the Stackdriver logs below in case they help.

rpc/nodedialer/nodedialer.go:143 [n3] unable to connect to n1: failed to connect to n1 at cockroachdb-0.cockroachdb.default.svc.cluster.local:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb.default.svc.cluster.local: no such host"

W190912 12:39:05.623207 311821 vendor/google.golang.org/grpc/clientconn.go:1304 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb.default.svc.cluster.local:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

W190912 12:39:05.623398 311821 vendor/google.golang.org/grpc/clientconn.go:953 Failed to dial cockroachdb-0.cockroachdb.default.svc.cluster.local:26257: context canceled; please retry.

Hey @vishal,

Could you exec into the pod and run the debug zip command?

You should just be able to run the command ./cockroach debug zip debug --certs-dir=certs and this would generate a new zip file called debug.
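Once it is generated, you can copy the file out of the pod to your local machine with kubectl cp, for example (the pod name and paths here are illustrative):

kubectl cp cockroachdb-0:/cockroach/debug ./debug.zip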

Thanks!

@ronarev I tried to create the debug file after logging into the pod, but I'm getting an error loading the certs.

❯ k exec -it cockroachdb-0 -- bash
root@cockroachdb-0:/cockroach# ls -l                                        
total 121556
-rwxr-xr-x 1 root root 124460696 Aug 13 19:19 cockroach
drwxrwxrwx 2 root root      4096 Sep 16 04:22 cockroach-certs
drwxr-xr-x 6 root root      4096 Sep 16 04:26 cockroach-data
-rwxr-xr-x 1 root root       120 Aug 13 19:18 cockroach.sh
root@cockroachdb-0:/cockroach# ./cockroach debug zip debug --certs-dir=cockroach-certs
Error: cannot load certificates.
Check your certificate settings, set --certs-dir, or use --insecure for insecure clusters.

problem with client cert for user root: not found
Failed running "debug zip"
root@cockroachdb-0:/cockroach# 


@ronarev Is there a proper way to create a debug file for CockroachDB deployed on Kubernetes?

Hey @vishal,

Is your certs dir called certs? You should run the debug command with the --certs-dir=<THE NAME OF YOUR DIR WHERE CERTS ARE SAVED> flag.

This command should run without any issues as long as you have the right certs dir. If you are running an insecure cluster, you'd instead need to pass the --insecure flag to the debug zip command.
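If in doubt, listing the directory you pass to --certs-dir should show a client certificate for the root user; with the standard CockroachDB file names that looks roughly like:

ls /cockroach-certs
ca.crt  client.root.crt  client.root.key

The "problem with client cert for user root: not found" error above usually means the directory only contains the node certificates, which is why running debug zip from the cockroachdb-client-secure pod (where the root client cert is mounted) tends to work.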

Thanks

@ronarev My CockroachDB cert directory is /cockroach-certs, and I created the debug file using the command below from my local machine.

kubectl exec -it cockroachdb-client-secure -- ./cockroach debug zip debug.zip --certs-dir=/cockroach-certs --host=cockroachdb-public

I have also uploaded it to this link: https://cockroachlabs.sendsafely.com/u/ron. Please check it and let me know if something is wrong or needs to change.

Thanks!

Hey @vishal,

Based on the logs, the nodes are restarting very often. I would suggest moving to higher-provisioned machines on GCP to see if that alleviates the issue. I also noticed that you didn't pass the --max-sql-memory flag, so it's using the default setting of 128 MiB. You may want to try setting that flag to 25% in your start command.
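For reference, in the StatefulSet the flag goes on the cockroach start command itself; a minimal sketch, assuming the certs directory and join list you already use, would be something like:

./cockroach start --certs-dir=/cockroach-certs --max-sql-memory=25% --join=<YOUR EXISTING JOIN LIST> ...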

I would also check out our production checklist for a few more deployment best practices.

Thanks!

@ronarev Thanks for the reply. We were using --max-sql-memory 25% before the issue, but I removed it to check whether the issue was coming from memory. We are also using the same database cluster setup in different environments with the same GCP node configuration, and all of them are working fine except this one.

Hey @vishal,

Was this happening when there was load on the cluster, or did it not matter? Were you running any jobs like schema changes or index creation?

Also, could you send over the Kubernetes logs? They may show some more information about what is causing the restarts.

Thanks!

Hey @ronarev, no, we didn't run any jobs like schema changes or index creation during that time. You asked whether it happened under load, but there was no load on the cluster when the issue appeared, and if it had been due to load, the issue should have resolved once the load came down. Because of the issue, I had to scale down the cluster.

Aside from that, we created a new Kubernetes cluster and imported the databases into it, and it works fine.

But we still need to investigate the issue with the old cluster, so I have attached that cluster's logs below. Kindly look into them and let me know if you find anything.

https://cockroachlabs.sendsafely.com/receive/?thread=FU96-JMSM&packageCode=Qqj0haG0hpM4Pl8b6PVAkuuDqIe8BM0b2iB1jErZkmE#keyCode=XtKp79hdcQPvLFlPkb2tRglDmJEURnfTNLPdHHdCRCQ

@ronarev Did you find any issue in the error logs? I just want to know from the logs what caused the pod restarts, so that we know what to do if the issue comes up again in the future.

Hey Vishal,

The logs you sent over didn’t tell us what caused the node to restart. Are you still having this issue? Have you increased the vCPU and Memory of your instances?

@ronarev No, we are not facing the issue currently, as we deleted that cluster and created a new one with the same configuration. We also did not increase the vCPU or memory of the instances.

@ronarev Increasing resources is not an ideal solution when there is no load on the cluster. I didn't see any load on the Kubernetes nodes, and the same databases are running fine on the new cluster. Can you please identify the root cause of the issue? I can provide logs and a debug zip if you want.

Hey @vishal,

Sure, send over the debug zip using the same link I provided previously.

Thanks

@ronarev I have uploaded the debug zip to the link you mentioned. Please check:

https://cockroachlabs.sendsafely.com/receive/?thread=4EE3-0FSK&packageCode=NgmN8U0Yql4h8SOP3zrQj1f0BPJKtaTThhE52ReDch0#keyCode=ic6ePTq0WuvgFMeJ-Ms1qYsdLeCYjcNDHDFdl4HfT7E

Hey @vishal,

The logs show that all three nodes received a signal to terminate:

1/logs/cockroach.cockroachdb-0.root.2019-09-09T01_19_55Z.000001.log:I190909 02:06:25.520228 1 cli/start.go:765 received signal 'terminated'

This most likely came from the Kubernetes pod. If you could send over the pod logs, they might tell us whether a SIGTERM was sent. The command should be kubectl logs <MY POD NAME HERE>.
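If the container has already restarted, the previous instance's logs and the pod's recorded events are usually more telling, e.g. (pod name as a placeholder):

kubectl logs cockroachdb-0 --previous
kubectl describe pod cockroachdb-0   # Last State and Events show why the previous container exited (e.g. OOMKilled)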

Node 1 has by far the most restarts, but I’d like to see the logs for all the pods.

Thanks

@ronarev Sorry for the late reply. I have attached the pod logs at the link below. Kindly check them and let me know if you find any issue.

https://cockroachlabs.sendsafely.com/receive/?thread=ZZFF-XMT2&packageCode=jToxP00XftGIJXNIG0Zp37b0Y0hv9b0Chi65NX3TtYw#keyCode=Nh926nimhZlWxUtCaoCY6TD7mHgto6FAQlxWf0wmx2Q