Setting up a secure cluster


#1

I have set up an insecure cluster on three nodes in three different datacenters. This was incredibly simple.

But when I try to make it secure… It doesn’t work. I tried to follow the docs (for example https://www.cockroachlabs.com/docs/stable/training/security.html), but nothing works, mostly because of trouble with certificates.

The docs explain how to set up a cluster with three nodes on one server. Creating certificates for three separate servers is completely different (you cannot copy and paste the code).

Does anybody know a tutorial on how to create a secure cluster on three different servers?


(Marc) #2

You can follow the secure manual deployment steps for details on what needs to be done.

The important part about node certificates is that each node’s certificate must list all the IP addresses and DNS names used to connect to it. This could be localhost (if you ever issue local requests), the IP/DNS in a private network (e.g. 10.0.0.2, node1.internal.gcp.etc…), the external IP/DNS (if the external address is used), or even load balancer addresses and other DNS names if clients use them.

Each node will have its own set of addresses and its own node certificate.
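
To check which addresses a node certificate actually contains, the certificate can be inspected with openssl. This is a minimal sketch, assuming openssl is installed on the machine; the helper name cert_sans is hypothetical and not part of CockroachDB.

```shell
# Hypothetical helper: print the Subject Alternative Name entries
# (the IP addresses and DNS names) stored in the certificate file "$1".
cert_sans() {
  openssl x509 -in "$1" -noout -text | grep -A1 'Subject Alternative Name'
}
```

Calling `cert_sans certs/node.crt` on each node should show every IP address and DNS name that other nodes and clients use to reach that node.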


#3

I followed all steps. But I still get a warning:

WARNING: [n?] listen address "::" not in node certificate (IP=xx.xx.xx.xx,127.0.0.1; DNS=localhost,vps-amst,vps-frankfurt,vps-ny; CN=node)
* Secure node-node and SQL connections are likely to fail.
* Consider extending the node certificate or tweak --listen-addr/--advertise-addr.

<<xx.xx.xx.xx>> is the address of my first node. It is in the list…


(Marc) #4

That warning is a little bit overzealous. In this case, it indicates that neither --listen-addr nor --advertise-addr was specified, so the node will attempt to determine its address by itself. This is usually OK in a private network, as local hostnames are resolvable by DNS (though not always). In certain environments, and especially with separate networks (e.g. multi-region deployments), you absolutely should specify either --listen-addr or --advertise-addr (or both).

If in doubt, you can set --advertise-addr to the DNS or IP address you will use in the --join command. If all nodes are on the same network, this is likely to be the private network address.
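
As a rough sketch of that advice, a start command with an explicit advertise address could be wrapped like this. The helper name start_node is hypothetical, and the certificates are assumed to be in a local certs directory.

```shell
# Hypothetical helper: start one secure node.
#   $1 = address other nodes use to reach this node
#        (it must also be listed in this node's certificate)
#   $2 = comma-separated list of node addresses to join
start_node() {
  cockroach start \
    --certs-dir=certs \
    --advertise-addr="$1" \
    --join="$2"
}
```

On the first node this might be called as `start_node node1.example.com node1.example.com,node2.example.com,node3.example.com`, where the hostnames are placeholders. Each node passes its own address as the first argument and the same join list as the second.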


#5

I used both and after that I didn’t get the warning.

But…

Starting all nodes works fine now.
I can access the admin ui (xx.xx.xx.xx:8080) on all nodes.

But the nodes don’t form a cluster (they don’t join).

When I do the ‘init’ I get an error (again with certificates):

$ cockroach init --certs-dir=certs --host=xx.xx.xx.xxx
E181103 19:33:37.564005 1 cli/error.go:230  SSL authentication error while connecting.

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: authentication handshake failed: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Cockroach CA\")"
Error: SSL authentication error while connecting.

What to do?


(Marc) #6

It looks like the certificate authority differs between nodes. The ca.crt must be the same on all nodes, as it is used to verify the identity of each node.

Could you paste the commands you used to generate the certificates? It should be something along the lines of:

  1. Just once: cockroach cert create-ca ...
  2. For each node:
    1. cockroach cert create-node ...
    2. copy ca.crt, node.crt and node.key to the node
  3. For each client:
    1. cockroach cert create-client ...
    2. copy ca.crt, client.<username>.crt and client.<username>.key to the client

Throughout the entire process, ca.crt is the one generated in step 1. This is the CA used to sign and verify all certificates and must be known by all parties.
If you run step 1 again, you will need to regenerate all certificates; you can’t mix and match CAs.
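
One way to confirm that every machine holds the same CA, and that a node certificate was signed by it, is to compare fingerprints and run a verification with openssl. This is only a sketch, assuming openssl is installed; the helper names ca_fingerprint and verify_node_cert are hypothetical.

```shell
# Hypothetical helpers; "$1" is the certificate directory (for example certs).

# Print the SHA-256 fingerprint of ca.crt. Run this on every node and
# client: the output must be identical everywhere.
ca_fingerprint() {
  openssl x509 -in "$1/ca.crt" -noout -fingerprint -sha256
}

# Check that node.crt was signed by the ca.crt in the same directory.
verify_node_cert() {
  openssl verify -CAfile "$1/ca.crt" "$1/node.crt"
}
```

If `ca_fingerprint certs` prints different values on two machines, or `verify_node_cert certs` reports a failure instead of `certs/node.crt: OK`, the certificates come from different CAs and need to be regenerated from a single ca.key.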


#7

Thx. I am one step further ("Just once: cockroach cert create-ca ..."). I did that the wrong way.

But I still get an error with init…

For step 1, I use:

cockroach cert create-ca \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key

Step 2, the nodes:

cockroach cert create-node \
xx.76.36.137 \
10.7.96.3 \
vps-amst \
localhost \
127.0.0.1 \
--certs-dir=certs \
--ca-key=my-safe-directory/ca.key

And then copied them with (after directories were created):

scp certs/ca.crt \
certs/node.crt \
certs/node.key \
root@xx.76.36.137:~/certs

After that: delete node crt + key (keep ca.crt):

rm certs/node.crt certs/node.key

etc.etc.

Still the same: nodes are running. But not joined in a cluster.

But now I get a different error:

$ cockroach init --certs-dir=certs --host=xx.76.95.158
E181103 20:52:41.013794 1 cli/error.go:230  rpc error: code = AlreadyExists desc = cluster has already been initialized with ID 7f7eaa6e-42ca-4aeb-bab4-30b98e7a31d4
Error: rpc error: code = AlreadyExists desc = cluster has already been initialized with ID 7f7eaa6e-42ca-4aeb-bab4-30b98e7a31d4
Failed running "init"

(Marc) #8

The error indicates that the cluster has already been initialized. You only need to run cockroach init once, and only against a single node. Once other nodes talk to that first one, they will join the initialized cluster.

Running init multiple times is usually not an issue; it just shows you the error you received here. However, running init against multiple nodes that cannot talk to each other will create multiple clusters, and the nodes won’t be able to join each other.


#9

And that’s the problem I tried to fix for the last 10 hours…

I simply followed every step from the doc you provided:
(https://www.cockroachlabs.com/docs/stable/deploy-cockroachdb-on-premises.html)

Are you sure that tutorial is complete and without mistakes?

For example, your comment "you absolutely should specify either --listen-addr or --advertise-addr (or both)" is missing in the tutorial…

Are there any other parts missing? Or is it just me who makes mistakes…


(Marc) #10

It’s hard to tell what’s going wrong, but my guess would be one of the following:

  • still the wrong addresses in the node certificate
  • all nodes were initialized separately (if you ran init against each node when they were unable to talk to each other, you ended up with three separate clusters)

As for the docs, they do not attempt to address all deployment scenarios. The flag description around the start command links to more information about networking. This will explain what --listen-addr and --advertise-addr mean and when to use them.

If the networking setup is unclear, the best way would be to bring up the three nodes with --insecure, initialize the cluster, and make sure they can talk to each other. If that all works fine, then you can restart the nodes with the generated certificates.
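
A minimal sketch of that insecure check, assuming all three nodes were already started with --insecure and the same --join list. The helper name insecure_smoke_test is hypothetical.

```shell
# Hypothetical helper: initialize an insecure test cluster and list its nodes.
# Run it once, against a single node ("$1" is that node's address).
insecure_smoke_test() {
  cockroach init --insecure --host="$1"
  cockroach node status --insecure --host="$1"
}
```

If the node status output lists all three nodes, the networking setup is fine and any remaining problem is most likely in the certificates.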


#11

Thx for that tip!

  • I started from the beginning
  • installed and ran the cluster with --insecure
  • I checked that everything worked
  • I generated the certificates
  • And ran the cluster in secure mode

I did that, and now it’s working…

happy!