Kubernetes demo readiness -> (readinessProbe) -> unhealthy

(Ronald) #1

When following the Kubernetes demo in orchestration-with-kubernetes, I get the following after the cluster initialisation:

    $ kubectl describe pod cockroachdb-0
    Name:               cockroachdb-0
    Namespace:          default
    Node:               docker-for-desktop/192.168.65.3
    Start Time:         Wed, 15 May 2019 15:55:15 +0200
    Labels:             app=cockroachdb
                        controller-revision-hash=cockroachdb-c4978bc
                        statefulset.kubernetes.io/pod-name=cockroachdb-0
    Annotations:        <none>
    Status:             Running
    IP:                 10.1.0.32
    Controlled By:      StatefulSet/cockroachdb
    Init Containers:
      init-certs:
        Container ID:  docker://701df6c8ccd2c9f6f611eeb4dd334bcc2d1ce939054a77a4f6150486373b6179
        Image:         cockroachdb/cockroach-k8s-request-cert:0.4
        Image ID:      docker-pullable://cockroachdb/cockroach-k8s-request-cert@sha256:d512bc05c482a1c164544e68299ff7616d4a26325ac9aa2c2ddce89bc241c792
        Port:          <none>
        Host Port:     <none>
        Command:
          /bin/ash
          -ecx
          /request-cert -namespace=${POD_NAMESPACE} -certs-dir=/cockroach-certs -type=node -addresses=localhost,127.0.0.1,$(hostname -f),$(hostname -f|cut -f 1-2 -d '.'),cockroachdb-public,cockroachdb-public.$(hostname -f|cut -f 3- -d '.') -symlink-ca-from=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        State:          Terminated
          Reason:       Completed
          Exit Code:    0
          Started:      Wed, 15 May 2019 15:55:16 +0200
          Finished:     Wed, 15 May 2019 15:55:43 +0200
        Ready:          True
        Restart Count:  0
        Environment:
          POD_NAMESPACE:  default (v1:metadata.namespace)
        Mounts:
          /cockroach-certs from certs (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from cockroachdb-token-xhvdh (ro)
    Containers:
      cockroachdb:
        Container ID:  docker://b594b092b29e000b123e52593f5e15057c1b8659b1d5f6b3675613f07c9111c3
        Image:         cockroachdb/cockroach:v19.1.0
        Image ID:      docker-pullable://cockroachdb/cockroach@sha256:bcf114228b981b4c2a958407164415089c71f60ba745442a46fdf764b55e137d
        Ports:         26257/TCP, 8080/TCP
        Host Ports:    0/TCP, 0/TCP
        Command:
          /bin/bash
          -ecx
          exec /cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --advertise-host $(hostname -f) --http-addr 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
        State:          Running
          Started:      Wed, 15 May 2019 15:55:44 +0200
        Ready:          True
        Restart Count:  0
        Liveness:       http-get https://:http/health delay=30s timeout=1s period=5s #success=1 #failure=3
        Readiness:      http-get https://:http/health%3Fready=1 delay=10s timeout=1s period=5s #success=1 #failure=2
        Environment:
          COCKROACH_CHANNEL:  kubernetes-secure
        Mounts:
          /cockroach/cockroach-certs from certs (rw)
          /cockroach/cockroach-data from datadir (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from cockroachdb-token-xhvdh (ro)
    Conditions:
      Type           Status
      Initialized    True
      Ready          True
      PodScheduled   True
    Volumes:
      datadir:
        Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
        ClaimName:  datadir-cockroachdb-0
        ReadOnly:   false
      certs:
        Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:
        SizeLimit:
      cockroachdb-token-xhvdh:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  cockroachdb-token-xhvdh
        Optional:    false
    QoS Class:       BestEffort
    Node-Selectors:  <none>
    Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                     node.kubernetes.io/unreachable:NoExecute for 300s
    Events:
      Type     Reason                 Age                 From                         Message
      ----     ------                 ----                ----                         -------
      Warning  FailedScheduling       19m (x4 over 19m)   default-scheduler            pod has unbound PersistentVolumeClaims
      Normal   Scheduled              19m                 default-scheduler            Successfully assigned cockroachdb-0 to docker-for-desktop
      Normal   SuccessfulMountVolume  19m                 kubelet, docker-for-desktop  MountVolume.SetUp succeeded for volume "pvc-0cd06031-7719-11e9-a193-025000000001"
      Normal   SuccessfulMountVolume  19m                 kubelet, docker-for-desktop  MountVolume.SetUp succeeded for volume "certs"
      Normal   SuccessfulMountVolume  19m                 kubelet, docker-for-desktop  MountVolume.SetUp succeeded for volume "cockroachdb-token-xhvdh"
      Normal   Pulled                 19m                 kubelet, docker-for-desktop  Container image "cockroachdb/cockroach-k8s-request-cert:0.4" already present on machine
      Normal   Created                19m                 kubelet, docker-for-desktop  Created container
      Normal   Started                19m                 kubelet, docker-for-desktop  Started container
      Normal   Pulled                 18m                 kubelet, docker-for-desktop  Container image "cockroachdb/cockroach:v19.1.0" already present on machine
      Normal   Created                18m                 kubelet, docker-for-desktop  Created container
      Normal   Started                18m                 kubelet, docker-for-desktop  Started container
      Warning  Unhealthy              17m (x16 over 18m)  kubelet, docker-for-desktop  Readiness probe failed: HTTP probe failed with statuscode: 503

When checking with curl, I get valid output (on a forwarded port):
    $ kubectl port-forward cockroachdb-0 8080

    $ curl http://localhost:8080/health?ready=1
    {
      "nodeId": 1,
      "address": {
        "networkField": "tcp",
        "addressField": "cockroachdb-0.cockroachdb.default.svc.cluster.local:26257"
      },
      "buildInfo": {
        "goVersion": "go1.11.6",
        "tag": "v19.1.0",
        "time": "2019/04/29 18:36:40",
        "revision": "25dd36f0139bf65b80758deeeccf35ee17ebd622",
        "cgoCompiler": "gcc 6.3.0",
        "cgoTargetTriple": "x86_64-unknown-linux-gnu",
        "platform": "linux amd64",
        "distribution": "CCL",
        "type": "release",
        "channel": "official-binary",
        "envChannel": "kubernetes-secure",
        "dependencies": null
      },
      "systemInfo": {
        "systemInfo": "Linux cockroachdb-0 4.9.125-linuxkit #1 SMP Fri Sep 7 08:20:28 UTC 2018 x86_64 GNU/Linux",
        "kernelInfo": "4.9.125-linuxkit"
      }
    }
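Note that the probe the kubelet runs is slightly different from this check: it is an HTTPS GET, and only the status code matters. Assuming the same port-forward is still active, the code the kubelet would see can be checked with curl (-k skips certificate verification, since the node certificate is not issued for localhost); it should print 200 when the node reports ready and 503 otherwise:

    $ curl -sk -o /dev/null -w '%{http_code}\n' 'https://localhost:8080/health?ready=1'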

In cockroachdb-statefulset-secure.yaml, the probes are defined as:

    livenessProbe:
      httpGet:
        path: "/health"
        port: http
        scheme: HTTPS
      initialDelaySeconds: 30
      periodSeconds: 5
    readinessProbe:
      httpGet:
        path: "/health?ready=1"
        port: http
        scheme: HTTPS

It all looks good to me. Why are the pods flagged as unhealthy?
(The GUI shows nothing but happiness, and the database is functional.)

thanks,
Ronald.

(Ricardo Rocha) #2

Hey @ik_zelf,

I hope I can clarify the event you are seeing in the kubectl output when initializing the CRDB cluster. I performed the same test in house and got the exact same event message during my initialization:

    Warning  Unhealthy  4m7s (x17 over 5m27s)  kubelet, minikube  Readiness probe failed: HTTP probe failed with statuscode: 503

I proceeded to review the entirety of the logs by sh'ing into my first pod and saw that, initially, the cluster is initialized and the first node successfully starts gossip to itself. However, the initial gossip connections to node 2 and node 3 fail, as those pods are not yet able to communicate with each other. Approximately 12 seconds after this initial failure, gossip is flowing between all three nodes and everything is happy. This is most likely what triggers the event you are seeing: it simply takes a bit of time for the network connections between pods to be established in k8s. Please feel free to have a look at my log snippets below:

    I190515 18:36:40.218687 1 util/log/clog.go:1199  [config] arguments: [/cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-addr 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%]
    I190515 18:36:40.416729 76 gossip/client.go:128  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
    W190515 18:36:41.407182 97 vendor/google.golang.org/grpc/clientconn.go:1304  grpc: addrConn.createTransport failed to connect to {cockroachdb-1.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.17.0.6:26257: connect: connection refused". Reconnecting...
    I190515 18:36:42.410084 119 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n?] circuitbreaker: gossip [::]:26257->cockroachdb-2.cockroachdb:26257 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.17.0.7:26257: connect: connection refused"
    I190515 18:36:53.432210 283 gossip/client.go:128  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
    I190515 18:36:54.442588 294 gossip/client.go:128  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
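If you would like to check this on your side without shelling into the pod, the gossip and grpc lines can be pulled straight from kubectl (the container name cockroachdb comes from your describe output above):

    $ kubectl logs cockroachdb-0 -c cockroachdb | grep -E 'gossip|grpc'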

In your specific case, do you also see this same delay before the communication works itself out?
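If the startup warnings bother you, a purely cosmetic option is to give the readiness probe a bit more slack in the StatefulSet. A minimal sketch (the values here are illustrative; the demo defaults of delay=10s and failureThreshold=2 are fine once the cluster settles):

    readinessProbe:
      httpGet:
        path: "/health?ready=1"
        port: http
        scheme: HTTPS
      initialDelaySeconds: 30   # demo default is 10s; wait longer before the first probe
      periodSeconds: 5
      failureThreshold: 5       # demo default is 2; tolerate more failures before the Unhealthy event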

Let me know if you have any additional questions or concerns.

Cheers,
Ricardo