Secure CockroachDB cluster on AWS EKS

deployment

(Andy Woods) #2

Hi Sashanka,

Could you share the exact commands and steps needed to reproduce this problem?

We are a bit surprised to see the error x509: certificate is valid for node, not cockroachdb-0.cockroachdb.


(Sashanka) #3

Hi Andy Woods,

On an AWS EKS cluster, I created a secure CockroachDB cluster using the default YAML files provided in the docs: https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html#aws-manual
It asks for CSR approval. After approving the CSRs, I checked the pods and found them in the CrashLoopBackOff state:

NAME                        READY   STATUS             RESTARTS   AGE
cluster-init-secure-vtjlg   0/1     CrashLoopBackOff   12         36m
cockroachdb-0               0/1     CrashLoopBackOff   29         1h
cockroachdb-1               0/1     Running            29         1h
cockroachdb-2               0/1     CrashLoopBackOff   29         1h

The pod logs show:
I180711 13:38:33.142903 1 cli/start.go:789 using local environment variables: COCKROACH_CHANNEL=kubernetes-secure
I180711 13:38:33.142941 1 cli/start.go:796 process identity: uid 0 euid 0 gid 0 egid 0
I180711 13:38:33.142978 1 cli/start.go:461 starting cockroach node
I180711 13:38:33.145061 10 storage/engine/rocksdb.go:552 opening rocksdb instance at "/cockroach/cockroach-data/cockroach-temp107022697"
I180711 13:38:33.168828 10 server/server.go:734 [n?] monitoring forward clock jumps based on server.clock.forward_jump_check_enabled
I180711 13:38:33.169476 10 storage/engine/rocksdb.go:552 opening rocksdb instance at "/cockroach/cockroach-data"
I180711 13:38:33.182480 10 server/config.go:538 [n?] 1 storage engine initialized
I180711 13:38:33.182526 10 server/config.go:541 [n?] RocksDB cache size: 1.8 GiB
I180711 13:38:33.182554 10 server/config.go:541 [n?] store 0: RocksDB, max size 0 B, max open file limit 60536
W180711 13:38:33.183018 10 gossip/gossip.go:1293 [n?] no incoming or outgoing connections
I180711 13:38:33.183098 10 server/server.go:1306 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W180711 13:38:33.191535 82 vendor/google.golang.org/grpc/server.go:563 grpc: Server.Serve failed to complete security handshake from "x.x.x.x:46044": remote error: tls: bad certificate
W180711 13:38:33.191698 76 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial cockroachdb-0.cockroachdb:26257: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for node, not cockroachdb-0.cockroachdb"; please retry.
W180711 13:38:33.191849 72 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
W180711 13:38:34.195112 62


(Alex Robinson) #4

What do you see when you run kubectl describe csr default.node.cockroachdb-0?


(Alex Robinson) #5

And what’s the output of kubectl version?


(Sashanka) #6

Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-28T20:03:09Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-28T20:13:43Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}


(Sashanka) #7

Initially, when I start the nodes using kubectl create -f https://raw.githubusercontent.com/cockroachdb/cockroach/master/cloud/kubernetes/cockroachdb-statefulset-secure.yaml
I am able to see the CSRs and approve them. After a while, kubectl get csr no longer shows any CSRs.

kubectl describe csr default.node.cockroachdb-0

Error from server (NotFound): certificatesigningrequests.certificates.k8s.io "default.node.cockroachdb-0" not found


(Alex Robinson) #8

Ugh, that’s frustrating. What I want to do is inspect the contents of the certificate, because it’s very wrong that the certificate is valid for “node” but not for “cockroachdb-0.cockroachdb”. We can get the same information out of the secret that now contains the certificate, but it’ll be a little more work than if the csr was still there.

  1. Run kubectl get secret default.node.cockroachdb-0 -o yaml
  2. You should see some key-value pairs in the resulting yaml. Grab the long string that appears next to the key cert
  3. Run that string through a base64 decoder (e.g. https://www.base64decode.org/)
  4. Run the resulting string through openssl, either using a site like https://www.sslshopper.com/certificate-decoder.html or a command like openssl x509 -in <your-cert-file> -text -noout
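
Steps 1 through 3 above can also be scripted. The following is a minimal Python sketch, not part of the original instructions. It assumes the secret was fetched with -o json rather than -o yaml, and that the certificate sits under the key cert as described above. The function name decode_secret_field and the placeholder secret contents are hypothetical.

```python
import base64
import json


def decode_secret_field(secret_json: str, key: str = "cert") -> str:
    """Return the base64-decoded text stored under `key` in a Secret's data map.

    `secret_json` is assumed to be the output of
    `kubectl get secret <name> -o json` (an assumption; the thread uses -o yaml).
    """
    secret = json.loads(secret_json)
    encoded = secret["data"][key]
    return base64.b64decode(encoded).decode("utf-8")


# Demonstration with a made-up secret; the PEM body is a placeholder,
# not a real certificate.
placeholder_pem = (
    "-----BEGIN CERTIFICATE-----\nplaceholder\n-----END CERTIFICATE-----\n"
)
fake_secret = json.dumps(
    {"data": {"cert": base64.b64encode(placeholder_pem.encode()).decode("ascii")}}
)
print(decode_secret_field(fake_secret))
```

The decoded PEM text can then be saved to a file and inspected with openssl x509 -in <your-cert-file> -text -noout, as in step 4.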

(Sashanka) #9

Certificate Information:
Common Name: node
Organization: Cockroach
Valid From: July 12, 2018
Valid To: July 12, 2019
Serial Number: xxxxxxxxxx


(Sashanka) #10

Is there any solution to this issue, please?


(Alex Robinson) #11

Sorry, @Sashanka, I thought I had responded to this yesterday. It’s really a bummer that the CSR is gone, since it would help confirm where the problem is coming from. The problem is that the certificate doesn’t have any Subject Alternative Names, which are what normally make the certificate valid at addresses like cockroachdb-0.cockroachdb.
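
To illustrate why the error message names node: the sketch below is a simplified Python model of hostname verification, in which the Common Name is consulted only when the certificate carries no DNS Subject Alternative Names. It is not CockroachDB's or Go's actual code. The function names are hypothetical, the SAN list in the second call is purely illustrative, and wildcard and IP address matching are left out.

```python
from typing import List


def names_certificate_is_valid_for(common_name: str, dns_sans: List[str]) -> List[str]:
    """Simplified model: DNS SANs take precedence; without any, use the Common Name."""
    return list(dns_sans) if dns_sans else [common_name]


def check_hostname(host: str, common_name: str, dns_sans: List[str]) -> str:
    """Return "ok" if `host` matches, otherwise an error string shaped like the log line."""
    valid = names_certificate_is_valid_for(common_name, dns_sans)
    if host.lower() in (name.lower() for name in valid):
        return "ok"
    return "x509: certificate is valid for " + ", ".join(valid) + ", not " + host


# Certificate as decoded in this thread: Common Name "node", no SANs.
print(check_hostname("cockroachdb-0.cockroachdb", "node", []))

# Same certificate with an illustrative SAN list that includes the pod's address.
print(check_hostname("cockroachdb-0.cockroachdb", "node",
                     ["cockroachdb-0.cockroachdb", "localhost"]))
```

In this model, the first call fails because the only name the certificate offers is its Common Name, which matches the failure seen in the pod logs.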

Given that you say you haven’t changed anything about the provided config files, the only probable cause I can imagine is that there’s something wrong with the way EKS’s certificate signer handles alternative names. Googling hasn’t brought up any details on the subject, though, which could just mean that there aren’t many people using EKS yet, but would otherwise be pretty surprising.

Just to confirm, could you try deleting all the cockroach-related resources in the Kubernetes cluster as outlined at https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html#step-11-stop-the-cluster and trying again? Grab the contents of the CSRs as well as the logs from both the main cockroach container (kubectl logs cockroachdb-0 cockroachdb) and the container that asks for the certificates (kubectl logs cockroachdb-0 init-certs).


(Sashanka) #12

Thanks Alex.
The problem is that EKS does not include Subject Alternative Names when it issues a certificate.
The following section is missing from the EKS-issued certificate:
X509v3 Subject Alternative Name:
DNS:my-svc.my-namespace.svc.cluster.local, DNS:my-pod.my-namespace.pod.cluster.local, IP Address:x.x.x.x, IP Address:x.x.x.x
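
One way to check for this section in the output of openssl x509 -text -noout is to look for the extension header and read the line that follows it. This is a minimal Python sketch, assuming the usual text layout in which the SAN entries are comma-separated on the line after the header. The helper name extract_sans is hypothetical.

```python
from typing import List

SAN_HEADER = "X509v3 Subject Alternative Name"


def extract_sans(openssl_text: str) -> List[str]:
    """Return the SAN entries found in `openssl x509 -text` output, or [] if absent."""
    lines = openssl_text.splitlines()
    for index, line in enumerate(lines):
        if SAN_HEADER in line:
            # Entries are expected on the next line, separated by commas.
            if index + 1 < len(lines):
                return [e.strip() for e in lines[index + 1].split(",") if e.strip()]
            return []
    return []


# Sample text copied from the section quoted above.
sample = (
    "X509v3 Subject Alternative Name:\n"
    "DNS:my-svc.my-namespace.svc.cluster.local, "
    "DNS:my-pod.my-namespace.pod.cluster.local, "
    "IP Address:x.x.x.x, IP Address:x.x.x.x\n"
)
print(extract_sans(sample))
```

An empty list from this helper corresponds to the situation described here, where the issued certificate has no SAN section at all.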


(Alex Robinson) #13

Are you aware of any issue trackers for EKS that have something about this? It’s pretty surprising, and I’d be interested in following up on it.


(Sashanka) #14

Alex,

I haven’t seen any issue raised in their forums. I will create one.

Thank you


(Alex Robinson) #15

Thanks! If you’d like guidance on setting up certificates manually in the meantime, just let me know. It’s easier than you might expect.


(Sashanka) #16

Sure Alex. Thank you.


(Sashanka) #17

Alex,
Could you please share the steps for creating the certificates manually?


(Alex Robinson) #18

Try out the config file I’ve just posted to https://github.com/cockroachdb/cockroach/pull/27921. Instructions are in the comment at the top of the file.


(Sashanka) #19

Thank you, Alex. It worked for me.


(Sashanka) #20

Hi Alex,

I got some feedback from the AWS EKS service team regarding CSR approval and signing. They say that EKS has a custom certificate signer that signs certificates only for the kubelet and ignores other internal names and SANs.


(Alex Robinson) #21

Thanks for the update, @Sashanka! We’ll get something like https://github.com/cockroachdb/cockroach/pull/27921 checked in and documented.