HTTP probe failed with statuscode 503


(JABIL) #1

Hello,

After rebooting the Kubernetes nodes there is an issue:
kubectl describe pod cockroach-cockroachdb-0

Events:
Type     Reason     Age                  From                    Message
----     ------     ----                 ----                    -------
Warning  Unhealthy  49s (x2638 over 3h)  kubelet, lab-arp-node2  Readiness probe failed: HTTP probe failed with statuscode: 503

I can exec into pod cockroach-cockroachdb-0, but ‘cockroach sql --insecure’ doesn’t work.
‘./cockroach init --insecure’ leads to:
Error: rpc error: code = AlreadyExists desc = cluster has already been initialized with ID …

Is there any way to debug it? Why does it happen?

P.S. There are 3 CockroachDB nodes. One Kubernetes node was in the NotReady state, and the database was working unstably. The Kubernetes nodes were restarted; after that we have the 503 issue.


(Alex Robinson) #2

Hi @Serg,

I’m sorry to hear things aren’t working as expected! Would you mind providing a little more info? Specifically, it’d be great if you could provide:

  1. The output of cockroach sql --insecure on pod cockroach-cockroachdb-0.
  2. Does cockroach sql --insecure work on the other pods?
  3. Could you share the logs from all the pods with us/me? kubectl logs <pod-name> should grab them (see the loop sketched below for grabbing all three at once). If you don’t want to share them publicly, feel free to email them to alex@cockroachlabs.com.
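
For convenience, a loop along these lines should produce one log file per pod (a sketch, assuming the usual StatefulSet naming for the other two pods):

$ for p in cockroach-cockroachdb-0 cockroach-cockroachdb-1 cockroach-cockroachdb-2; do
    kubectl logs "$p" > "$p.log"   # one log file per pod
  done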

(JABIL) #3

# ./cockroach sql --insecure
Only the welcome output appears:
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.

Then the program hangs.

$ kubectl get pods

NAME                        READY   STATUS    RESTARTS   AGE
mycockroach-cockroachdb-0   0/1     Running   0          22h
mycockroach-cockroachdb-1   0/1     Running   0          12m
mycockroach-cockroachdb-2   0/1     Running   0          22h

As the status is ‘0/1’ for all nodes, there is no output on the pods.
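
For reference, the readiness probe can be reproduced by hand from inside a pod (a sketch, assuming the default HTTP port 8080, the /health?ready=1 endpoint used by the standard CockroachDB manifests, and curl being available in the image):

$ kubectl exec mycockroach-cockroachdb-0 -- \
    curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/health?ready=1'
503   # the same failure the kubelet reports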


(Alex Robinson) #4

From the logs you sent me, the disks that had been used by mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2 before the nodes restarted got wiped. mycockroach-cockroachdb-1 still has its data, but because two out of the three nodes lost their data, it’s unable to form a quorum and thus unable to do much of anything.

Can you confirm what happened to the data/disks when you restarted your nodes? What environment are you running in?
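
One quick way to check is to look at the data directories from inside the pods (a sketch, assuming the data path used by the standard CockroachDB manifests, /cockroach/cockroach-data):

$ for p in mycockroach-cockroachdb-0 mycockroach-cockroachdb-1 mycockroach-cockroachdb-2; do
    echo "== $p =="
    kubectl exec "$p" -- ls -la /cockroach/cockroach-data   # a nearly empty directory suggests the store was wiped
  done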


(JABIL) #5

We are using VMware ESXi, with Ubuntu running kernel 4.15.0-30-generic.

One node, the one running mycockroach-cockroachdb-2, was in the NotReady state. CockroachDB was working unstably (as I remember, one pod was in the 1/1 state and two pods were in the 0/1 state).

I restarted that node, and the other 2 nodes went into the NotReady state. I restarted those as well. After that, all 3 nodes were in the Ready state, and CockroachDB returned the 503.


(JABIL) #6

From the logs you sent me, the disks that had been used by mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2 before the nodes restarted got wiped.
What does that mean? Was the DB data lost? Where can I see this in the logs?


(Alex Robinson) #7

Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out “no stores bootstrapped” during startup and prefix all their log lines with “n?”, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster, except that mycockroach-cockroachdb-1 has the node ID n3, indicating that it was previously part of a happy, healthy 3-node cluster.
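
To see this for yourself, both indicators can be pulled straight out of the logs (a sketch, assuming the standard CockroachDB log format):

$ kubectl logs mycockroach-cockroachdb-0 | grep 'no stores bootstrapped'   # printed when a node starts with an empty store
$ kubectl logs mycockroach-cockroachdb-1 | head   # its line prefixes show n3 rather than n?, i.e. an assigned node ID
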

Given that there used to be a three node cluster but now only mycockroach-cockroachdb-1 has its data from before the restart, I’d strongly suggest investigating what may have happened to the disks previously being used by mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2. It may be something that you can reproduce, but it’s hard for me to say without knowledge of your environment and infrastructure.