Certificate seems broken upon GKE node upscale

Hi guys,

I followed the manual on the website and had a secure cluster running fine on GKE.
But when I tried to scale up the StatefulSet, the new pod kept crashing. I looked into the logs and found the following:

    kubectl logs cockroachdb-2
    ++ hostname -f
    + exec /cockroach/cockroach start --logtostderr --certs-dir /cockroach/cockroach-certs --host cockroachdb-2.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
    I171220 23:12:00.767188 1 cli/start.go:785 CockroachDB CCL v1.1.3 (linux amd64, built 2017/11/27 13:59:10, go1.8.3)
    I171220 23:12:00.868037 1 server/config.go:312 available memory from cgroups (8.0 EiB) exceeds system memory 15 GiB, using system memory
    I171220 23:12:00.868086 1 server/config.go:425 system total memory: 15 GiB
    I171220 23:12:00.868161 1 server/config.go:427 server configuration:
    max offset 500000000
    cache size 3.7 GiB
    SQL memory pool size 3.7 GiB
    scan interval 10m0s
    scan max idle time 200ms
    metrics sample interval 10s
    event log enabled true
    linearizable false
    I171220 23:12:00.868300 25 cli/start.go:503 starting cockroach node
    E171220 23:12:00.896955 1 cli/error.go:68 failed to start server: problem using security settings, did you mean to use --insecure?: problem with CA certificate: not found
    Error: failed to start server: problem using security settings, did you mean to use --insecure?: problem with CA certificate: not found
    Failed running "start"

I am using GKE version 1.8.4-gke.1.
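
In case it helps, these are the kinds of checks I can run if you need more info (the cert directory comes from the start command above; the init container name is my guess from the secure config):

    # Does the cert directory inside the pod actually contain ca.crt?
    kubectl exec cockroachdb-2 -- ls -l /cockroach/cockroach-certs

    # Logs of the init container that requests the certificates
    # (container name assumed to be init-certs, per the secure config)
    kubectl logs cockroachdb-2 -c init-certs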

Hi! Is this the documentation you followed? https://github.com/cockroachdb/cockroach/tree/master/cloud/kubernetes#secure-mode-1

Hi Jiale,

Thanks for reporting your problem and including the logs! I’m a little confused by the fact that you say you saw this problem when scaling up the cluster (this step, right?), but included the logs from cockroachdb-2, which is one of the first three nodes of the cluster and should have been running from the start (unless you modified the config).

Anyway, I have a pretty good hunch about what could be going on. I’m actually not totally sure why this config has worked for me every time I’ve tried it (on both GKE and Minikube). Specifically, we recently started creating and using a cockroachdb ServiceAccount so that the pod could request certificates from the Kubernetes API, but only the default ServiceAccount is guaranteed to have the cluster CA certificate automatically loaded into it.
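
If you want to check that hunch on your cluster, something along these lines should show whether the cockroachdb ServiceAccount's token secret actually carries the CA (the ServiceAccount name comes from the config; the secret name placeholder is whatever the first command prints):

    # Which token secret is attached to the cockroachdb ServiceAccount?
    kubectl get serviceaccount cockroachdb -o jsonpath='{.secrets[*].name}'

    # Does that secret carry the cluster CA certificate? (non-empty output = yes)
    kubectl get secret <token-secret-name> -o jsonpath='{.data.ca\.crt}'

    # For comparison, the default ServiceAccount's token secret should always have it
    kubectl get serviceaccount default -o jsonpath='{.secrets[*].name}'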

I’ll check this out tomorrow. It’s really weird that, in my experience, it works most of the time, and that it let you spin up your cluster successfully but not scale it. I really hope we won’t have to do anything special with ConfigMaps, as the linked doc suggests.

Thanks for your reply, Alex.

You are right, I did follow that step. At first I found that the 4th pod (cockroachdb-3) was not able to spin up, so I tried scaling down and then back up to see whether it had anything to do with approving the CSR, since cockroachdb-2's certificate should already have been approved. It ended up giving me the same error as I saw in the 4th pod.
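
For reference, this is how I was listing and approving the CSRs (the CSR name format is what I remember from the guide for the default namespace, so treat it as an assumption):

    # List the certificate signing requests and their status
    kubectl get csr

    # Approve the request for the new pod (name format assumed: <namespace>.node.<pod>)
    kubectl certificate approve default.node.cockroachdb-3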

I did run into some gcloud & kube account related issues with my local dev machine. After some magic gcloud SDK & kubectl version upgrades, that seems resolved and I am able to see the GKE pods from my local dev machine.

Now I have:

    Google Cloud SDK 181.0.0
    bq 2.0.27
    core 2017.11.28
    gsutil 4.28

and:

    Client Version: version.Info{Major:"1", Minor:"5", GitVersion:"v1.5.1", GitCommit:"82450d03cb057bab0950214ef122b67c83fb11df", GitTreeState:"clean", BuildDate:"2016-12-14T00:57:05Z", GoVersion:"go1.7.4", Compiler:"gc", Platform:"darwin/amd64"}

Is there any diagnostic command I can run to find the root cause?

Thanks!

Thanks for the reply, Radu.
I am following the manual here:
https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes.html#step-9-scale-the-cluster
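
That is, the step where the guide has you run something like this (the replica count is just an example; the StatefulSet name assumes the default config):

    # Scale the CockroachDB StatefulSet from 3 to 4 pods
    kubectl scale statefulset cockroachdb --replicas=4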

So you aren’t seeing the problem anymore? Do you know which kubectl version you were using previously?

I’d strongly recommend you upgrade kubectl to at least the same minor version as your server (v1.8 in this case). If you’re using the one bundled with gcloud, you should be able to do so by running gcloud components update.
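
For example:

    # Check which client and server versions you're running
    kubectl version

    # If kubectl was installed through the Cloud SDK, this updates it
    gcloud components update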

Sorry, I did not make that clear. I am no longer seeing the gcloud & kube account related issue, which was what prevented me from seeing the GKE pods from my local machine.

I am still seeing the upscaling issues.

I figured it out.
I did not use the gcloud command to create the GKE cluster; I created it through the UI instead, and somehow the credentials in my local config got out of sync with the Kubernetes cluster during the gcloud setup & upgrade and the kubectl upgrade.

Now that I have created the cluster with the gcloud command, everything is working.
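
For anyone else who ends up here, these are the commands I used this time around (the cluster name and node count are just what I picked, not anything required):

    # Create the cluster from the CLI so the local credentials are set up consistently
    gcloud container clusters create cockroachdb --num-nodes=3

    # (Re)fetch credentials for kubectl; also useful if the local config drifts out of sync
    gcloud container clusters get-credentials cockroachdb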

Interesting. I’m not sure exactly how that caused the problem you encountered, but I’m glad you seem to have figured it out. I tried reproducing it to make sure, but couldn’t get the ca.crt file to not show up.