CA certificate and key

Can I use separate CA certs and keys for every node, or do I just generate one and then copy it manually to the other nodes?

Automatic installation of crdb via some script or config management tool would be difficult if there were just one CA cert and key for the whole cluster, because I would somehow have to teach the declarative config management tool (Chef, Puppet, Saltstack, etc.) to first create a “main node”, which would then generate the CA cert and key. Those would then have to be copied to the other hosts, which is a dangerous process.

The declarative approach would be to not have a “leader” node (i.e. a single point of failure), but to make all nodes equal and self-contained. Every node would then generate its own CA key and cert, but I somehow doubt that inter-node communication would work this way.

Basically, I want my nodes to be independent of each other (avoid an SPOF, enable horizontal scaling), but I also don’t want to generate a CA key and cert externally and then distribute them to all nodes over the network. Sorry for rambling, this is just meant as context for the question in the first sentence :slight_smile:

Ulrich

Ulrich,

When using self-signed certificates (which is what our docs show), the CA certificate must be known to all nodes and clients; that’s the only way they can authenticate each other.
The CA key doesn’t need to be (and shouldn’t be) on any node; it’s only needed to generate new node/client certificates.

We recommend generating the CA key and certificate on whatever machine you’re using to launch the cluster (e.g. your own computer), keeping the CA key safe. You then generate a new cert and key for every node and push those along with the CA certificate.
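A sketch of that workflow with the `cockroach cert` subcommands (directory names and the hostname `node1.example.com` are illustrative assumptions):

```shell
# Run on the trusted machine that launches the cluster.
# The CA key stays in my-safe-directory/ and is never pushed to any node.
mkdir -p certs my-safe-directory

# Generate the CA certificate and key.
cockroach cert create-ca \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Generate a cert/key pair for one node (repeat per node).
cockroach cert create-node node1.example.com localhost \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Generate a client cert for the root user.
cockroach cert create-client root \
  --certs-dir=certs --ca-key=my-safe-directory/ca.key

# Push the CA cert plus the node's cert/key to the node -- but not ca.key.
scp certs/ca.crt certs/node.crt certs/node.key node1.example.com:certs/
```

This is an operational fragment, not a runnable script: it assumes the `cockroach` binary is installed and that the target host is reachable over SSH.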

We’re looking into integration with existing PKI tools, but this work has not started yet.

Ok… I was actually hoping to generate the node keys and certificates on the nodes themselves, so I don’t have to copy them over the network. But then I would have to copy the CA key onto the nodes first… not good. Anyway, thanks for the explanation, looks like I have to think of something different.

Ulrich

Ok… I thought of something different. How about if I just delete the CA key after certificate generation? That would remove this attack vector against my cluster.

Obviously I would have to generate a new CA key and re-generate all certs every time I want to add another node to the cluster. This process can be automated, but my question is, am I going to incur downtime of the cluster when rolling over the CA cert? If yes, how much?

Ulrich

You could, and cockroach does provide a way to reload certificates without restarting the nodes. See https://github.com/cockroachdb/docs/pull/1331 for the draft documentation on this functionality.

However, regenerating the entire certificate chain for the nodes is problematic, as you will need to make sure all clients receive the new CA certificate.

In general, it’s recommended to have a separate process hold the CA key and generate all certificates. This is usually controlled by an operator, as you need someone to decide who to issue certificates to.
Even when we add easier integration with existing PKI infrastructure, you’ll still need someone to approve or trigger certificates for nodes and clients.

You are absolutely right that many businesses prefer to have operators on standby, expecting them to be trustworthy and not make any mistakes. I am rather in the other camp though :slight_smile:

Thus my installation is (hopefully) going to be automated and the nodes are going to be immutable, so there is no operator in the sense of someone who could manipulate a running node.

Of course there is someone who decides that the cluster must be re-installed from scratch, for example if it has been compromised or, in my case, when a new node is to be added to the cluster. I assume both to be rare events, so my thought is: why not do a CA rollover?

Ulrich

Rolling the CA still wouldn’t help: all client and node certs must share the same CA, so you need to sign them all with the same CA key. This means the CA key must live somewhere long enough to sign the node and client certificates while they are being generated and distributed.

Sure, you can delete the CA key once you’ve done that first round. However, when you decide to add a single node or client, you’ll need to generate a new CA, sign the new cert with it, then make sure you give the old CA + new CA to all nodes and clients, SIGHUP all nodes, and restart all clients (SQL libraries don’t usually support online certificate reload). A single client or node missing even one CA will have its connections rejected, which can be really annoying to debug. Do you control all the clients? Is there a chance you’ll miss one? What about the nodes: what happens if one of them is temporarily unavailable? When it restarts, it won’t have the full set of CA certificates.
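That rollover dance might look roughly like this (a hedged sketch: the directory names are assumptions, and it relies on the fact that a ca.crt file may contain several concatenated PEM certificates, all of which are trusted):

```shell
# On the signing machine: generate a fresh CA into a new directory.
cockroach cert create-ca --certs-dir=new-certs --ca-key=new-safe/ca.key

# Nodes and clients must trust BOTH CAs during the transition,
# so ship a combined bundle of old and new CA certs.
cat old-certs/ca.crt new-certs/ca.crt > combined/ca.crt

# Push combined/ca.crt (plus re-signed node certs/keys) to every node,
# then tell each cockroach process to reload its certs without restarting:
pkill -SIGHUP -x cockroach

# Clients generally must be restarted: most SQL drivers only read
# the CA bundle when the connection is established.
```

This fragment assumes the `cockroach` binary and running processes on each host; it is a sequence of operator commands rather than a standalone script.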

It feels like you’re trying to find a magic solution to certificate deployment. Unfortunately, there’s no such thing. Lots of systems exist to manage certificates for you (e.g. Keywhiz, Vault, Netflix Lemur), but they all need some type of secrets storage and a trusted central processor to receive certificate signing requests and sign them after operator approval. All of those have to be carefully set up and require some manual involvement.

Finally, it’s not like creating a cluster is secrets-free. You’ll usually need cloud credentials and probably private ssh keys, all of which need to be kept safe. Consider the CA key another one of those.

The plan is that I control all the clients and that they’re installed and updated automatically as well. I am not going to allow any users to connect to crdb directly, it’s all channeled through a bunch of microservices.

The idea is to create the CA key automatically on the first node and generate all certificates for the other nodes and clients, then delete the CA key. So there is no “trusted central processor”, it’s the first node that discovers it’s by itself and then knows it has to play CA for a while.
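That “play CA for a while, then forget the key” step can be sketched with plain openssl (all paths and names here are illustrative, not CockroachDB specifics):

```shell
# Sketch: generate a throwaway CA, sign one node cert, destroy the CA key.
mkdir -p certs

# 1. CA key + self-signed CA certificate.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout certs/ca.key -out certs/ca.crt -subj "/CN=throwaway-cluster-CA"

# 2. Node key + CSR, signed by the CA (repeat for every node and client).
openssl req -newkey rsa:2048 -nodes \
  -keyout certs/node.key -out certs/node.csr -subj "/CN=node1"
openssl x509 -req -in certs/node.csr -CA certs/ca.crt -CAkey certs/ca.key \
  -CAcreateserial -out certs/node.crt -days 365

# 3. Destroy the CA key: nothing can ever be signed by this CA again.
shred -u certs/ca.key 2>/dev/null || rm -f certs/ca.key
```

After step 3 the node certs remain verifiable against ca.crt, but no new certificate can be issued, which is exactly the trade-off being discussed.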

You think that automatic certificate deployment can’t be done? I believe that manual deployment is insecure, and that Let’s Encrypt proves automation is possible.

These days I am reluctant to use a system that needs some “magic touch” to get set up correctly. It’s really not a good idea to depend on manual steps.

Ulrich

If you control the clients perfectly then this is doable. I’m not sure why a node has to do this when the process that triggers certificate regeneration can just as easily be the one to generate everything and push it to nodes/clients.

One reason we don’t recommend this is that full regeneration is going to be a rare scenario. Node and client certificates will be needed far more frequently, and the cluster admins may not be in control of client apps, so they will not be able to arbitrarily restart them.

As for automation, Let’s Encrypt can do this through the multiple methods it has of verifying that the requester is indeed the owner of the resource listed in the certificate.
You’re technically doing this by having an authoritative list of nodes/clients you’re pushing the certificates to, plus the proper credentials to push certs/keys. A new node/client must first be added to the list of allowed certificate recipients; that’s the manual step. So your secret key becomes your machine credentials, and your verification step becomes your list of nodes and clients. It’s a slightly different way of doing it, but it boils down to the same requirements.

I do actually see a difference, although the jury is still out on its practical relevance. But please bear with me:

The first manual step in my case is that once a new VM is provisioned with the bare OS, I copy a script onto it, then manually log in and start the script, which changes the root password to something random and installs the software. Then I log out and can never log in again.

Meanwhile the script is on the VM and responsible for keeping it running. If there is a problem with the machine, all I can do is destroy it via the cloud provider’s API (the equivalent of pulling the power plug on a hardware server). This process of taking the VM down is the second manual step. With immutable infrastructure, changing a system between commissioning and decommissioning simply isn’t allowed.

So far, so theoretical. In practice I could of course keep the CA key locally, generate a new certificate once a new node is to be deployed, and then copy the certificate along with the installation script onto the new VM. This would increase the attack surface only marginally during installation. But if I’m raided by bad (or good) guys, they can use my CA key to eavesdrop on the running cluster. If I throw the CA key away, all they can do is create a new cluster (I do reckon that the cloud provider can be forced to relinquish my credentials to, say, the CIA).

As I said, I am still pondering the relevance, but here is the difference I promised you :slight_smile:

Finally, even from a purely practical point of view, I need to be able to roll over the CA: it might get compromised, or I might even lose the CA key. So if I can automate the rollover process, then why not use it for installation as well?

Ulrich