Kubernetes pod failed and won't restart

I have a small managed Kubernetes cluster at DigitalOcean running a secure CockroachDB cluster deployed with Helm.

I was looking at unrelated things when I noticed one of my CockroachDB pods was in the CrashLoopBackOff state.

Further investigation shows this:

      Warning  Unhealthy  28m (x599 over 2d20h)    kubelet, pubserv-btui  Readiness probe failed: Get https://10.244.0.82:8080/health?ready=1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      Warning  Unhealthy  17m (x560 over 2d12h)    kubelet, pubserv-btui  Liveness probe failed: Get https://10.244.0.82:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
      Warning  Unhealthy  8m52s (x775 over 2d20h)  kubelet, pubserv-btui  Liveness probe failed: Get https://10.244.0.82:8080/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
      Warning  BackOff    3m47s (x632 over 28h)    kubelet, pubserv-btui  Back-off restarting failed container

Here is the pod log output: http://paste.ubuntu.com/p/6PxxK8D9RY/

And the full output of kubectl describe pod: http://paste.ubuntu.com/p/RDwnH9hK86/
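
For reference, this is roughly how the output above was collected (a sketch; substitute your own pod name, and `data-lake` is just the namespace my cluster uses, as you can see in the logs further down):

    # Events, probe failures and container state for the affected pod.
    kubectl describe pod <pod-name> -n data-lake

    # Current container log, plus the log from the previous (crashed) container.
    kubectl logs <pod-name> -n data-lake
    kubectl logs <pod-name> -n data-lake --previous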

Thanks!

-brian

Turns out to have been resource exhaustion. There wasn’t enough memory left in the cluster. Really misleading error message though.
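
In case anyone else hits this: a quick way to check for memory exhaustion is to look at the nodes themselves (a sketch; `kubectl top` needs metrics-server installed):

    # Live memory usage per node (requires metrics-server).
    kubectl top nodes

    # Requests/limits already allocated on a node, and whether the kubelet
    # is reporting MemoryPressure.
    kubectl describe node <node-name> | grep -A 10 "Allocated resources"
    kubectl describe node <node-name> | grep -i memorypressure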

Hey @wonko

The error message you are seeing in reference to the CrashLoopBackOff state of the pod is the error coming from k8s. Since the default restart policy for a configured k8s pod is “Always” (check out the k8s docs here), this means that k8s always tries to restart a pod when it goes down. The CrashLoopBackOff state is an indicator that it has tried to do that a high number of times.
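
As a quick illustration, both the restart policy and the restart counter are visible on the pod object itself (standard pod fields, nothing chart-specific; `<pod-name>` is a placeholder):

    # Pod-level restart policy (pods managed by a StatefulSet always use "Always").
    kubectl get pod <pod-name> -n data-lake -o jsonpath='{.spec.restartPolicy}'

    # How many times the kubelet has already restarted the container.
    kubectl get pod <pod-name> -n data-lake \
      -o jsonpath='{.status.containerStatuses[0].restartCount}'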

The reason reported in the cockroach logs looks like the server was unable to start correctly due to an authentication error. This would cause the pod to be restarted, as k8s would try it over and over (the sketch after the log excerpt shows one way to confirm how the container is terminating).

    E190919 17:29:28.450594 486 server.go:2977  http: TLS handshake error from 10.244.0.1:37658: EOF
    I190919 17:29:29.545154 409 cli/start.go:840  14 running tasks
    W190919 17:29:30.582104 485 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
    W190919 17:29:30.635677 493 vendor/google.golang.org/grpc/server.go:666  grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
    W190919 17:29:30.735050 114 storage/store.go:3704  [n3,s3,r38/3:/Table/7{1-2}] handle raft ready: 12.6s [processed=0]
    E190919 17:29:30.798208 495 server.go:2977  http: TLS handshake error from 10.244.0.1:37752: EOF
    E190919 17:29:30.860518 492 server.go:2977  http: TLS handshake error from 10.244.0.1:37692: EOF
    W190919 17:29:30.444072 113 storage/engine/rocksdb.go:2040  batch [1/51/0] commit took 504.036489ms (>= warning threshold 500ms)
    I190919 17:29:31.907425 539 gossip/client.go:128  [n3] started gossip client to cockroachdb-shared-cockroachdb-2.cockroachdb-shared-cockroachdb.data-lake.svc.cluster.local:26257
    W190919 17:29:35.351312 174 storage/node_liveness.go:523  [n3,hb] slow heartbeat took 5.7s
    W190919 17:29:35.363183 174 storage/node_liveness.go:463  [n3,hb] failed node liveness heartbeat: operation "node liveness heartbeat" timed out after 4.5s
    I190919 17:29:35.375590 409 cli/start.go:840  19 running tasks
    W190919 17:29:35.546463 61 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {cockroachdb-shared-cockroachdb-1.cockroachdb-shared-cockroachdb.data-lake.svc.cluster.local:26257 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: io: read/write on closed pipe". Reconnecting...
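
If it helps, one way to tell whether the container is exiting on its own or being killed from the outside is to look at its last termination state (a sketch using standard pod status fields; `<pod-name>` is a placeholder):

    # Exit code, reason (e.g. Error vs. OOMKilled) and signal of the last terminated container.
    kubectl get pod <pod-name> -n data-lake \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

    # The same information in human-readable form.
    kubectl describe pod <pod-name> -n data-lake | grep -A 8 "Last State"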

The CrashLoopBackOff state is usually just an indicator that some more digging needs to be done :smile: Let me know if there are any other questions.

Cheers,
Ricardo

So at the time the issue ended up being a lack of resources in the cluster. Don’t ask me why it manifested itself as the error it did. So weird. :slight_smile:

Anyway, I’m having issues again. This time there are definitely enough cluster resources, but the pod is still failing. Here is the cockroachdb container log: https://gist.github.com/bhechinger/254ac42394fb9235563b7f66f1c5258b

Hey @wonko

In the logs we can see that the process is receiving a termination signal from the OS, so it goes ahead and performs a graceful shutdown.

    I191010 20:27:09.736100 1 cli/start.go:765  received signal 'terminated'
    E191010 20:27:09.685520 585 server.go:2977  http: TLS handshake error from 10.244.7.121:54550: remote error: tls: bad certificate
    I191010 20:27:09.754269 1 cli/start.go:830  initiating graceful shutdown of server
    initiating graceful shutdown of server
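
To track down what is sending that signal, the Kubernetes events are usually the first place to look (a rough sketch; `data-lake` is the namespace from your logs):

    # Recent events in the namespace, newest last; look for Killing, Evicted,
    # Preempted or OOM-related reasons against the pod.
    kubectl get events -n data-lake --sort-by=.lastTimestamp

    # Node-level events (system OOM kills and evictions are often reported here).
    kubectl get events --all-namespaces --field-selector involvedObject.kind=Node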

I am also seeing some bad certificate messages from other IP addresses, so maybe the certs aren’t playing nice with one another? The node’s certs need to have any domain names or IP addresses in the Subject Alternative Name field of the node.cert file, and the CN=node field needs to appear as mentioned. I would dig into the k8s-specific logs (as sketched above) to see what is sending the termination signal to our process, and also double-check the certs configuration.
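
For the cert check, something like this shows what the node is actually presenting (a sketch; the paths below are the defaults I would expect from the chart and image, so check your pod spec if they differ):

    # List the certs in the node's cert directory, with their usages and principals.
    kubectl exec <pod-name> -n data-lake -- \
      /cockroach/cockroach cert list --certs-dir=/cockroach/cockroach-certs

    # Pipe the node cert out of the pod and inspect it locally with openssl:
    # the subject should contain CN=node, and the SAN list should cover the
    # node's DNS names and IPs.
    kubectl exec <pod-name> -n data-lake -- cat /cockroach/cockroach-certs/node.crt \
      | openssl x509 -noout -text | grep -E "Subject:|DNS:|IP Address:"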

Let me know what you find.

Cheers,
Ricardo

The certs are all handled by the Helm chart. Let me dig around in the k8s logs, though.
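
For the record, this is roughly how I’m checking what the chart-managed certs actually contain (the secret name and data key below are placeholders; I’m listing the secrets first to find the real names):

    # Find the TLS secrets the chart created.
    kubectl get secrets -n data-lake | grep -i cockroach

    # Dump the node certificate from the secret and check its subject and SANs
    # (<node-cert-secret> and the node.crt key are placeholders; check the
    # secret's data keys first with `kubectl describe secret`).
    kubectl get secret <node-cert-secret> -n data-lake -o jsonpath='{.data.node\.crt}' \
      | base64 -d | openssl x509 -noout -text | grep -E "Subject:|DNS:|IP Address:"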