Secure init fails

Hi there, I am trying to set up a demo cockroachdb cluster in k8s using the command:

-f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/cockroachdb-statefulset-secure.yaml

serviceaccount/cockroachdb created
role.rbac.authorization.k8s.io/cockroachdb created
clusterrole.rbac.authorization.k8s.io/cockroachdb created
rolebinding.rbac.authorization.k8s.io/cockroachdb created
clusterrolebinding.rbac.authorization.k8s.io/cockroachdb created
service/cockroachdb-public created
service/cockroachdb created
poddisruptionbudget.policy/cockroachdb-budget created
statefulset.apps/cockroachdb created
Sridhars-MacBook-Pro:cockroachdb sridhar$ kubectl -n default get pods
NAME                      READY   STATUS    RESTARTS   AGE
busybox-c8f8564d4-4k8ml   1/1     Running   0          2d10h
cockroachdb-0             0/1     Running   0          32s
cockroachdb-1             0/1     Running   0          32s
cockroachdb-2             0/1     Running   0          32s
Sridhars-MacBook-Pro:cockroachdb sridhar$ kubectl get csr
No resources found.

When I describe one of the cockroachdb pods to see if init-cert has succeeded, I don't see anything that says that init-cert has failed:

kubectl -n default describe pod/cockroachdb-0
Name:               cockroachdb-0
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               ip-192-168-248-201.us-west-2.compute.internal/192.168.248.201
Start Time:         Sat, 02 Nov 2019 17:37:47 -0700
Labels:             app=cockroachdb
                    controller-revision-hash=cockroachdb-7c4899574d
                    statefulset.kubernetes.io/pod-name=cockroachdb-0
Annotations:        kubernetes.io/psp: eks.privileged
Status:             Running
IP:                 192.168.202.143
Controlled By:      StatefulSet/cockroachdb
Init Containers:
  init-certs:
    Container ID:  docker://daa22bc16a83780a014335963a09fe75b16ae86cae2417c835c3376d49cbcf21
    Image:         cockroachdb/cockroach-k8s-request-cert:0.4
    Image ID:      docker-pullable://cockroachdb/cockroach-k8s-request-cert@sha256:d512bc05c482a1c164544e68299ff7616d4a26325ac9aa2c2ddce89bc241c792
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/ash
      -ecx
      /request-cert -namespace=${POD_NAMESPACE} -certs-dir=/cockroach-certs -type=node -addresses=localhost,127.0.0.1,$(hostname -f),$(hostname -f|cut -f 1-2 -d '.'),cockroachdb-public,cockroachdb-public.$(hostname -f|cut -f 3- -d '.') -symlink-ca-from=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 02 Nov 2019 17:37:57 -0700
      Finished:     Sat, 02 Nov 2019 17:37:57 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAMESPACE:  default (v1:metadata.namespace)
    Mounts:
      /cockroach-certs from certs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cockroachdb-token-fzpg8 (ro)
Containers:
  cockroachdb:
    Container ID:  docker://72033eaeb6a95a64f8e237c9fd3d75ba4dec9bb4d80a8ac46632321aaad781a4
    Image:         cockroachdb/cockroach:v19.1.5
    Image ID:      docker-pullable://cockroachdb/cockroach@sha256:44249e8133bd5c02165703854a86d84089fa741a018071cfe41b5ce4cda7ac39
    Ports:         26257/TCP, 8080/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/bash
      -ecx
      exec /cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --advertise-host $(hostname -f) --http-addr 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
    State:          Running
      Started:      Sat, 02 Nov 2019 17:38:39 -0700
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 02 Nov 2019 17:37:58 -0700
      Finished:     Sat, 02 Nov 2019 17:38:38 -0700
    Ready:          False
    Restart Count:  1
    Liveness:       http-get https://:http/health delay=30s timeout=1s period=5s #success=1 #failure=3
    Readiness:      http-get https://:http/health%3Fready=1 delay=10s timeout=1s period=5s #success=1 #failure=2
    Environment:
      COCKROACH_CHANNEL:  kubernetes-secure
    Mounts:
      /cockroach/cockroach-certs from certs (rw)
      /cockroach/cockroach-data from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cockroachdb-token-fzpg8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-cockroachdb-0
    ReadOnly:   false
  certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:    
    SizeLimit:  <unset>
  cockroachdb-token-fzpg8:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cockroachdb-token-fzpg8
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                  Age                From                                                    Message
  ----     ------                  ----               ----                                                    -------
  Normal   Scheduled               81s                default-scheduler                                       Successfully assigned default/cockroachdb-0 to ip-192-168-248-201.us-west-2.compute.internal
  Normal   SuccessfulAttachVolume  79s                attachdetach-controller                                 AttachVolume.Attach succeeded for volume "pvc-4d97e0be-fb93-11e9-9f25-02dfe640c3cc"
  Normal   Pulled                  71s                kubelet, ip-192-168-248-201.us-west-2.compute.internal  Container image "cockroachdb/cockroach-k8s-request-cert:0.4" already present on machine
  Normal   Created                 71s                kubelet, ip-192-168-248-201.us-west-2.compute.internal  Created container init-certs
  Normal   Started                 71s                kubelet, ip-192-168-248-201.us-west-2.compute.internal  Started container init-certs
  Normal   Pulled                  30s (x2 over 70s)  kubelet, ip-192-168-248-201.us-west-2.compute.internal  Container image "cockroachdb/cockroach:v19.1.5" already present on machine
  Normal   Created                 30s (x2 over 70s)  kubelet, ip-192-168-248-201.us-west-2.compute.internal  Created container cockroachdb
  Warning  Unhealthy               30s (x3 over 40s)  kubelet, ip-192-168-248-201.us-west-2.compute.internal  Liveness probe failed: HTTP probe failed with statuscode: 503
  Normal   Killing                 30s                kubelet, ip-192-168-248-201.us-west-2.compute.internal  Container cockroachdb failed liveness probe, will be restarted
  Normal   Started                 29s (x2 over 70s)  kubelet, ip-192-168-248-201.us-west-2.compute.internal  Started container cockroachdb
  Warning  Unhealthy               11s (x4 over 56s)  kubelet, ip-192-168-248-201.us-west-2.compute.internal  Readiness probe failed: Get https://192.168.202.143:8080/health?ready=1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy               2s (x6 over 47s)   kubelet, ip-192-168-248-201.us-west-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 503

Hi @sridhar

I am trying to determine what the issue is, but the way the output is formatted makes it difficult to see the error, if any. If I understand correctly, the cockroach nodes are not becoming ready, judging from the output of the kubectl -n default get pods command. What error are you seeing when they start?

Cheers,
Ricardo

Hi Ricardo, thanks for getting back to me. I am following the instructions in https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html#manual to bring up a secure cluster. After I run
kubectl -n default apply -f cockroachdb-statefulset-secure.yaml

The instructions say that I will find three CSRs (one from each of the pods) pending. I don’t see any when I run
kubectl -n default get csr

Hope this makes sense.
–Sridhar

Hey @sridhar

When you run the kubectl apply, are you getting an error? If the kubectl get csr command doesn’t show any signing requests, it would indicate that the pods may not be making the requests to begin with. The next step would be to determine why the pods aren’t making those requests. Were there any changes done to the YAML file that we provide from our docs page?

Hi there,

I bumped into this problem when I installed cockroachdb on a cluster for the second time. The root cause for me was that cockroachdb will not generate csr objects if underlying secrets with the same names already exist.

I had to run something like this to delete the old secrets (which get created after you approve a csr), because they were left behind by a previous installation in the cluster:
kubectl delete secret default.client.root
kubectl delete secret default.node.database-cockroachdb-0
kubectl delete secret default.node.database-cockroachdb-1
kubectl delete secret default.node.database-cockroachdb-2
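
Before reinstalling, it can help to check whether any of these secrets are still around. The following is a minimal sketch, not part of the chart or the docs: the function names are hypothetical, and the name pattern (<namespace>.client.root and <namespace>.node.<statefulset>-<ordinal>) is inferred only from the secret names listed above.

```shell
#!/bin/sh
# Sketch: list the secret names a previous install may have left behind.
# Assumed pattern, taken from the names in this thread:
#   <namespace>.client.root
#   <namespace>.node.<statefulset>-<ordinal>
leftover_secret_names() {
  ns="$1"; sts="$2"; replicas="$3"
  printf '%s\n' "${ns}.client.root"
  i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s\n' "${ns}.node.${sts}-${i}"
    i=$((i + 1))
  done
}

# Report which of those secrets still exist in the namespace.
# Needs kubectl and access to the cluster; not called by default.
check_leftover_secrets() {
  for name in $(leftover_secret_names "$1" "$2" "$3"); do
    if kubectl -n "$1" get secret "$name" >/dev/null 2>&1; then
      echo "leftover secret: $name"
    fi
  done
}

# Print the candidate names only (no cluster access needed).
leftover_secret_names default database-cockroachdb 3
```

With cluster access, check_leftover_secrets default database-cockroachdb 3 would print one line per secret that still exists. The statefulset name here is the one from my Helm release; with the plain YAML from the docs it would presumably be cockroachdb.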

To be safe, when I uninstall now, I run these commands:
helm delete database
helm del database --purge
kubectl delete pvc datadir-database-cockroachdb-0
kubectl delete pvc datadir-database-cockroachdb-1
kubectl delete pvc datadir-database-cockroachdb-2
kubectl delete csr default.client.root
kubectl delete csr default.node.database-cockroachdb-0
kubectl delete csr default.node.database-cockroachdb-1
kubectl delete csr default.node.database-cockroachdb-2
kubectl delete secret default.client.root
kubectl delete secret default.node.database-cockroachdb-0
kubectl delete secret default.node.database-cockroachdb-1
kubectl delete secret default.node.database-cockroachdb-2
kubectl delete serviceaccount cockroachdb

I use this chart to install:
https://hub.kubeapps.com/charts/stable/cockroachdb

With this YML:
Secure:
  Enabled: true
  ServiceAccount:
    Name: cockroachdb
CacheSize: 20%
MaxSQLMemory: 20%
StorageClass: do-block-storage

Good luck,

Kyle

Hi Kyle, thanks for the important hint! That certainly was the issue. Once I deleted the four secrets from the namespace, I was able to install the three cockroach pods and approve their certs.

When running cluster-init-secure, I ran into another error though. (I am running everything in a namespace called sridhar2)

kubectl -n sridhar2 logs pod/cluster-init-secure-z5vr7

  • ERROR: SSL authentication error while connecting.

  • initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = “transport: authentication handshake failed: x509: certificate is valid for node, not cockroachdb-0.cockroachdb”

E191121 02:35:44.969383 1 cli/error.go:229 SSL authentication error while connecting.

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = “transport: authentication handshake failed: x509: certificate is valid for node, not cockroachdb-0.cockroachdb”

Error: SSL authentication error while connecting.

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = “transport: authentication handshake failed: x509: certificate is valid for node, not cockroachdb-0.cockroachdb”

Failed running “init”

Hey @sridhar

That error message “transport: authentication handshake failed: x509: certificate is valid for node, not cockroachdb-0.cockroachdb” indicates you may be using the wrong cert to try to connect to the node. Are you using a client..crt as we mention in our documentation? You can find that here.
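
One way to narrow this down is to compare the name in the error with the names the node certificate was requested for. The sketch below only rebuilds the -addresses list from the init-certs command shown in the pod description earlier in the thread; the function name is hypothetical, and the cluster domain (svc.cluster.local) in the example hostname is an assumption that may not match your cluster.

```shell
#!/bin/sh
# Sketch: rebuild the -addresses list that the init-certs container passes to
# /request-cert, given a pod's fully qualified hostname. It mirrors the
# `hostname -f | cut ...` expressions in the pod description.
requested_addresses() {
  fqdn="$1"
  # First two labels, e.g. cockroachdb-0.cockroachdb
  short="$(printf '%s\n' "$fqdn" | cut -f 1-2 -d '.')"
  # Everything from the third label on, e.g. default.svc.cluster.local
  domain="$(printf '%s\n' "$fqdn" | cut -f 3- -d '.')"
  printf '%s\n' "localhost,127.0.0.1,${fqdn},${short},cockroachdb-public,cockroachdb-public.${domain}"
}

# Assumed FQDN for pod cockroachdb-0 in the default namespace.
# Prints one requested name per line.
requested_addresses "cockroachdb-0.cockroachdb.default.svc.cluster.local" | tr ',' '\n'
```

If cockroachdb-0.cockroachdb shows up in this list but the handshake still rejects it, the certificate the node is serving may not carry the names that were requested. Running openssl x509 -noout -text on a copy of the node certificate and reading the Subject Alternative Name section would show what it actually contains.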

Let me know if you have any other questions.

Cheers,
Ricardo

Hi Ricardo, I am using the pre-provided k8s yaml files following directions in https://www.cockroachlabs.com/docs/v19.2/orchestrate-cockroachdb-with-kubernetes.html#manual

We are on k8s 1.14.3 running on AWS managed k8s.

I am creating the cockroach nodes using the configs option.

kubectl create \
-f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/cockroachdb-statefulset-secure.yaml

works fine.

I approve the CSRs that come up. I then run:

kubectl create \
-f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/cluster-init-secure.yaml

and approve the CSR for it.

init then fails because the certs that are created in cluster-init-secure.yaml don’t allow for communication with cockroachdb-0.cockroachdb.

Even if I add “-addresses=cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb” to the init-cert command in cluster-init-secure.yaml I end up with the error:

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = “transport: Error while dialing dial tcp 192.168.186.30:26257: connect: connection refused”

Error: cannot dial server.

Is the server running?

If the server is running, check --host client-side and --advertise server-side.