Reviving a whole cluster after "power off"?

Hello!

I have what I hope is a “silly question”.

I’ve installed Minikube (“none” driver) on an Amazon EC2 instance, and then set up a simple CockroachDB cluster using the instructions here: https://www.cockroachlabs.com/docs/stable/orchestrate-cockroachdb-with-kubernetes-insecure.html#aws-manual. Everything works as described: I’m able to create the example database, insert data, select it back again, view the admin UI, etc.
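(For the record, my smoke test was roughly the following – going from memory of the tutorial, so the exact names may be slightly off:

kubectl exec -it cockroachdb-0 -- ./cockroach sql --insecure
> CREATE DATABASE bank;
> CREATE TABLE bank.accounts (id INT PRIMARY KEY, balance DECIMAL);
> INSERT INTO bank.accounts VALUES (1, 1000.50);
> SELECT * FROM bank.accounts;
)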

Obviously in a production environment you would never have cause to power off all nodes, but I’d expect to do this all the time in development. I wasn’t sure what to expect, so I figured I’d just…

sudo shutdown -h now
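In hindsight, a gentler variant might have been to scale the StatefulSet down to zero first (assuming it’s named cockroachdb as in the tutorial’s config) before halting the machine:

kubectl scale statefulset cockroachdb --replicas=0
sudo shutdown -h now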

When I started the EC2 instance again, I was pleased to see that everything was running again (kc here is just an alias for kubectl)…

ubuntu@ip-172-31-3-180:~$ kc get pods
NAME                 READY     STATUS      RESTARTS   AGE
cluster-init-5jfsg   0/1       Completed   0          3h
cockroachdb-0        1/1       Running     0          4h
cockroachdb-1        1/1       Running     0          3h
cockroachdb-2        1/1       Running     0          3h

But the nodes don’t appear to have re-formed a cluster:

ubuntu@ip-172-31-3-180:~$ kc logs cockroachdb-0 | head -100
++ hostname -f
+ exec /cockroach/cockroach start --logtostderr --insecure --advertise-host cockroachdb-0.cockroachdb.default.svc.cluster.local --http-host 0.0.0.0 --join cockroachdb-0.cockroachdb,cockroachdb-1.cockroachdb,cockroachdb-2.cockroachdb --cache 25% --max-sql-memory 25%
W180418 10:01:33.013825 1 cli/start.go:904  RUNNING IN INSECURE MODE!

- Your cluster is open for any client that can access <all your IP addresses>.
- Any user, even root, can log in without providing a password.
- Any user, connecting as root, can read or write any data in your cluster.
- There is no network encryption nor authentication, and thus no confidentiality.

Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.0/secure-a-cluster.html
I180418 10:01:33.022723 1 cli/start.go:918  CockroachDB CCL v2.0.0 (x86_64-unknown-linux-gnu, built 2018/04/03 20:56:09, go1.10)
I180418 10:01:33.130412 1 server/config.go:330  available memory from cgroups (8.0 EiB) exceeds system memory 2.0 GiB, using system memory
I180418 10:01:33.130439 1 server/config.go:430  system total memory: 2.0 GiB
I180418 10:01:33.130496 1 server/config.go:432  server configuration:
max offset             500000000
cache size             500 MiB
SQL memory pool size   500 MiB
scan interval          10m0s
scan max idle time     200ms
event log enabled      true
I180418 10:01:33.130517 1 cli/start.go:784  using local environment variables: COCKROACH_CHANNEL=kubernetes-insecure
I180418 10:01:33.130532 1 cli/start.go:791  process identity: uid 0 euid 0 gid 0 egid 0
I180418 10:01:33.130543 1 cli/start.go:461  starting cockroach node
I180418 10:01:33.157790 14 storage/engine/rocksdb.go:552  opening rocksdb instance at "/cockroach/cockroach-data/cockroach-temp708086096"
I180418 10:01:33.257148 14 storage/engine/rocksdb.go:552  opening rocksdb instance at "/cockroach/cockroach-data"
I180418 10:01:33.284825 14 server/config.go:538  [n?] 1 storage engine initialized
I180418 10:01:33.284937 14 server/config.go:541  [n?] RocksDB cache size: 500 MiB
I180418 10:01:33.285042 14 server/config.go:541  [n?] store 0: RocksDB, max size 0 B, max open file limit 1043576
W180418 10:01:33.291088 14 gossip/gossip.go:1292  [n?] no incoming or outgoing connections
I180418 10:01:33.291226 14 server/server.go:1084  [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W180418 10:01:48.298533 57 gossip/gossip.go:1095  [n?] invalid bootstrap address: &{typ:tcp addr:cockroachdb-0.cockroachdb:26257}, lookup cockroachdb-0.cockroachdb on 10.96.0.10:53: read udp 172.17.0.5:53090->10.96.0.10:53: i/o timeout
I180418 10:01:48.304593 81 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:01:48.307872 81 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection
I180418 10:01:49.313895 101 gossip/client.go:129  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
I180418 10:01:49.315196 101 gossip/client.go:136  [n?] closing client to cockroachdb-2.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-2.cockroachdb:26257); loopback connection
I180418 10:01:50.306415 118 gossip/client.go:129  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
I180418 10:01:50.307254 145 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-0.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:50.307505 118 gossip/client.go:136  [n?] closing client to cockroachdb-0.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-0.cockroachdb:26257); loopback connection
I180418 10:01:50.307752 147 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-2.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:50.315407 155 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-1.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:51.305445 159 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:01:51.306857 159 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection
I180418 10:01:52.306919 165 gossip/client.go:129  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
I180418 10:01:52.307729 165 gossip/client.go:136  [n?] closing client to cockroachdb-2.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-2.cockroachdb:26257); loopback connection
I180418 10:01:53.306471 170 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-2.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:53.308265 176 gossip/client.go:129  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
I180418 10:01:53.308551 180 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-0.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:53.308771 176 gossip/client.go:136  [n?] closing client to cockroachdb-0.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-0.cockroachdb:26257); loopback connection
I180418 10:01:53.316237 183 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-1.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:54.309400 190 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:01:54.310301 190 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection
I180418 10:01:55.310110 196 gossip/client.go:129  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
I180418 10:01:55.310726 196 gossip/client.go:136  [n?] closing client to cockroachdb-2.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-2.cockroachdb:26257); loopback connection
I180418 10:01:56.309870 202 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-2.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:56.311076 206 gossip/client.go:129  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
I180418 10:01:56.311393 210 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-0.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:56.311664 206 gossip/client.go:136  [n?] closing client to cockroachdb-0.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-0.cockroachdb:26257); loopback connection
I180418 10:01:56.318929 213 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-1.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:57.312186 217 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:01:57.313124 217 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection
I180418 10:01:58.312807 223 gossip/client.go:129  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
I180418 10:01:58.313396 223 gossip/client.go:136  [n?] closing client to cockroachdb-2.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-2.cockroachdb:26257); loopback connection
I180418 10:01:59.312325 232 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-2.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:59.313499 236 gossip/client.go:129  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
I180418 10:01:59.313627 240 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-0.cockroachdb.default.svc.cluster.local:26257}
I180418 10:01:59.313743 236 gossip/client.go:136  [n?] closing client to cockroachdb-0.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-0.cockroachdb:26257); loopback connection
I180418 10:01:59.321991 243 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-1.cockroachdb.default.svc.cluster.local:26257}
I180418 10:02:00.314674 247 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:02:00.315947 247 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection
I180418 10:02:01.315521 253 gossip/client.go:129  [n?] started gossip client to cockroachdb-2.cockroachdb:26257
I180418 10:02:01.316106 253 gossip/client.go:136  [n?] closing client to cockroachdb-2.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-2.cockroachdb:26257); loopback connection
I180418 10:02:02.315340 259 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-2.cockroachdb.default.svc.cluster.local:26257}
I180418 10:02:02.316603 263 gossip/client.go:129  [n?] started gossip client to cockroachdb-0.cockroachdb:26257
I180418 10:02:02.316922 267 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-0.cockroachdb.default.svc.cluster.local:26257}
I180418 10:02:02.317151 263 gossip/client.go:136  [n?] closing client to cockroachdb-0.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-0.cockroachdb:26257); loopback connection
I180418 10:02:02.329255 270 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp cockroachdb-1.cockroachdb.default.svc.cluster.local:26257}
W180418 10:02:03.291774 271 server/server.go:1040  The server appears to be unable to contact the other nodes in the cluster. Please try

- starting the other nodes, if you haven't already
- double-checking that the '--join' and '--host' flags are set up correctly
- running the 'cockroach init' command if you are trying to initialize a new cluster

If problems persist, please see https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html.
I180418 10:02:03.317607 275 gossip/client.go:129  [n?] started gossip client to cockroachdb-1.cockroachdb:26257
I180418 10:02:03.318235 275 gossip/client.go:136  [n?] closing client to cockroachdb-1.cockroachdb:26257: stopping outgoing client to node 0 (cockroachdb-1.cockroachdb:26257); loopback connection

The link near the bottom seemed promising, but I couldn’t find anything obviously applicable. The nodes appear to be gossiping with each other, so I don’t imagine it’s a connectivity issue.

I’m unsure of what my next step should be, because I don’t know what I should expect in this scenario. So my questions are…

  • Is restarting a cluster in this way supported at all? Or if I kill all the nodes, do I need to restore from a backup? (And therefore, if I want to “pause” a development environment, do I have to use a VM to achieve that?)

  • If this sort of thing is supposed to be possible, have I just missed a step where I’m supposed to manually tell the nodes how to re-join each other? That is, do I need to nominate one node as a “master” and then tell the other two to join it? I assumed this would be handled naturally by the Kubernetes configuration – is that a bad assumption?

I’m totally new to both Kubernetes and CockroachDB, so my main issue here is not knowing what to expect – I’m hoping someone can help correct my mental model of all this. :slight_smile:

Many thanks for any pointers!

Restarting the cluster in that way is definitely possible and supported.

The errors suggest that network connectivity between the containers was not set up properly after the reboot. That’s the area to investigate further: can the containers still reach each other using these hostnames (cockroachdb-0.cockroachdb, etc.) after the reboot?
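For example, something along these lines from inside the pods (assuming nslookup and ping are available in the image):

kubectl exec -it cockroachdb-0 -- nslookup cockroachdb-1.cockroachdb
kubectl exec -it cockroachdb-0 -- ping -c 2 cockroachdb-1.cockroachdb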

Wonderful, thank you! :slight_smile: I will explore in this direction and report back.

It looks like the storage didn’t persist across the restart, since if it did all those log lines would say [n1] instead of [n?]. Minikube doesn’t make much in the way of guarantees across restarts, so that’s potentially what’s going on: https://kubernetes.io/docs/getting-started-guides/minikube/#persistent-volumes
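One way to check what happened to the volumes (the claim names assume the stock config’s datadir volumeClaimTemplate, so they’d be datadir-cockroachdb-0 and so on):

kubectl get pv,pvc

If the claims are still Bound to volumes whose backing storage survived the reboot, the data should still be there.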

If you want to confirm, try running kubectl exec -it cockroachdb-0 -- ./cockroach init --insecure. If it successfully reinitializes the cluster, then you definitely lost your data on the restart.

To be clear, full-cluster restarts are fully supported by CockroachDB (including when running on Kubernetes) – just not if the disks get wiped during the restart.
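If you want the data to survive minikube restarts, one option (a rough sketch, untested – the PV name and size here are made up, and you’d need one PV per pod) is to back the claims with hostPath volumes under a directory minikube persists, such as /data. If I remember right, the default hostpath provisioner keeps its data under /tmp, which tends to get cleared on reboot:

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolume
metadata:
  name: crdb-pv-0          # hypothetical name
spec:
  capacity:
    storage: 1Gi           # match the size your PVCs request
  accessModes: ["ReadWriteOnce"]
  hostPath:
    path: /data/crdb-pv-0  # /data is one of the directories minikube persists
EOF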

ubuntu@ip-172-31-3-180:~$ kc exec -it cockroachdb-0 bash
root@cockroachdb-0:/cockroach# ping cockroachdb-0.cockroachdb
PING cockroachdb-0.cockroachdb.default.svc.cluster.local (172.17.0.3): 56 data bytes
64 bytes from 172.17.0.3: icmp_seq=0 ttl=64 time=0.470 ms
64 bytes from 172.17.0.3: icmp_seq=1 ttl=64 time=0.042 ms
^C--- cockroachdb-0.cockroachdb.default.svc.cluster.local ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.042/0.256/0.470/0.214 ms
root@cockroachdb-0:/cockroach# ping cockroachdb-1.cockroachdb
PING cockroachdb-1.cockroachdb.default.svc.cluster.local (172.17.0.6): 56 data bytes
64 bytes from 172.17.0.6: icmp_seq=0 ttl=64 time=0.053 ms
64 bytes from 172.17.0.6: icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from 172.17.0.6: icmp_seq=2 ttl=64 time=0.054 ms
^C--- cockroachdb-1.cockroachdb.default.svc.cluster.local ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.052/0.053/0.054/0.000 ms

Connectivity seems fine. I believe it’s what Alex suggested above.

ubuntu@ip-172-31-3-180:~$ kubectl exec -it cockroachdb-0 -- ./cockroach init --insecure
Cluster successfully initialized

Ahhh, yep. That’ll do it. :slight_smile:
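For anyone landing here later: you can check what survived (in my case, nothing) with a quick query, e.g.

kubectl exec -it cockroachdb-0 -- ./cockroach sql --insecure -e "SHOW DATABASES;"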

Looks like it’s time for me to take a break from copying and pasting code from the internet, and go learn about Kubernetes and Minikube properly. :grin:

Thanks all for your help!
