Unable to connect secure nodes

Hi,

When I try to connect a second secure node to the first I get the following messages in the log:

I180615 18:46:43.504247 100 gossip/client.go:129 [n?] started gossip client to cockroach1.mynetwork.local:26257
I180615 18:46:43.509325 29 storage/stores.go:331 [n?] read 0 node addresses from persistent storage
I180615 18:46:43.509424 29 storage/stores.go:350 [n?] wrote 1 node addresses to persistent storage
I180615 18:46:43.509441 29 server/node.go:653 [n?] connecting to gossip network to verify cluster ID...
I180615 18:46:43.509489 29 server/node.go:678 [n?] node connected via gossip and verified as part of cluster "c731ee5f-1f9d-42fc-a7de-74aca3570db7"
W180615 18:47:13.487444 330 server/server.go:1267 The server appears to be unable to contact the other nodes in the cluster. Please try

  • starting the other nodes, if you haven’t already
  • double-checking that the ‘--join’ and ‘--host’ flags are set up correctly
  • running the ‘cockroach init’ command if you are trying to initialize a new cluster

If problems persist, please see https://www.cockroachlabs.com/docs/v2.0/cluster-setup-troubleshooting.html.
I180615 18:47:19.432002 134 gossip/gossip.go:1306 [n?] node has connected to cluster via gossip
I180615 18:47:19.432102 134 storage/stores.go:350 [n?] wrote 1 node addresses to persistent storage

How do I troubleshoot this further?

I have two nodes that successfully connect in --insecure mode. When I clean the nodes and retry with the --certs-dir option, the first node comes up without a problem. The second, however, shows the log messages above and doesn’t show up in the web interface overview of the first node.

Could you tell us a bit about your deployment? If you haven’t seen them, we have step-by-step guides for spinning up a secure cluster, for both manual and orchestrated deployments; usually a missed step there is what’s causing the issue.

Once you’ve run through the docs, if they don’t help, could you also send over the commands you’re using to start both nodes on the cluster?

From what I gather from the manual step-by-step guide, I didn’t use the --join parameter for the first node, which resulted in the logs above on the second node when it tried to connect to the first.

When following the manual step by step I get an error in step 4 when initializing the cluster. I used the following commands:

server1:

cockroach start --host server1.mydomain.local --certs-dir=/etc/cockroach/certs --join=server1.mydomain.local:26257,server2.mydomain.local:26257

server2:

cockroach start --host server2.mydomain.local --certs-dir=/etc/cockroach/certs --join=server1.mydomain.local:26257,server2.mydomain.local:26257

admin client:

bin/cockroach init --certs-dir=certs --host=server2.mydomain.local:26257

E180615 20:59:04.349377 1 cli/error.go:109 unable to connect or connection lost.

Please check the address and credentials such as certificates (if attempting to
communicate with a secure cluster).

initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
Error: unable to connect or connection lost.

The log message on the client suggests there is something wrong with the certificates, but when I check the certificates using cockroach cert list everything seems to be in order:

admin client:

bin/cockroach cert list --certs-dir=/opt/cockroach/certs
Certificate directory: /opt/cockroach/certs
+-----------------------+------------------+-----------------+------------+--------------+-------+
| Usage                 | Certificate File | Key File        | Expires    | Notes        | Error |
+-----------------------+------------------+-----------------+------------+--------------+-------+
| Certificate Authority | ca.crt           |                 | 2042/08/19 | num certs: 3 |       |
| Client                | client.root.crt  | client.root.key | 2020/06/13 | user: root   |       |
+-----------------------+------------------+-----------------+------------+--------------+-------+

server1:

cockroach cert list --certs-dir=/etc/cockroach/certs
Certificate directory: /etc/cockroach/certs
+-----------------------+------------------+----------+------------+-----------------------------------+-------+
| Usage                 | Certificate File | Key File | Expires    | Notes                             | Error |
+-----------------------+------------------+----------+------------+-----------------------------------+-------+
| Certificate Authority | ca.crt           |          | 2042/08/19 | num certs: 3                      |       |
| Node                  | node.crt         | node.key | 2020/06/12 | addresses: server1.mydomain.local |       |
+-----------------------+------------------+----------+------------+-----------------------------------+-------+

Any suggestions?

What does cert list look like on server2? Provided that each node possesses a node certificate and key, and CA certificate issued by the same CA, there should be no issues connecting.
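One way to check the "same CA" condition on each node is to ask openssl whether the node certificate chains up to the CA file in the certs directory. This is only a sketch: issued_by_ca is a made-up helper name, and it assumes openssl is installed and that the files are named ca.crt and node.crt as in the cert list output above.

```shell
# Succeed only if openssl reports that the certificate verifies against the CA file.
# The check looks for the ": OK" line that "openssl verify" prints on success,
# rather than relying on the exit code alone.
issued_by_ca() {
  ca_file="$1"
  cert_file="$2"
  openssl verify -CAfile "$ca_file" "$cert_file" 2>/dev/null | grep -q ': OK$'
}
```

For example, on server1: issued_by_ca /etc/cockroach/certs/ca.crt /etc/cockroach/certs/node.crt && echo "node.crt is signed by this CA". Running the same check on server2 with its own node.crt should give the same result if both were issued by the same CA.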

That error (unable to connect or connection lost) indicates that the issue is likely connectivity rather than the certs. Did anything change between your insecure test and secure test?
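To rule connectivity in or out, a plain TCP check against the port from the start commands (26257) can be run from the admin client and from each node. A rough sketch, assuming bash and the coreutils timeout command are available; can_reach is a hypothetical helper name, not part of cockroach. It only shows that the port accepts connections and says nothing about whether the TLS handshake would succeed.

```shell
# Succeed if a TCP connection to host:port can be opened within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash must be installed.
can_reach() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example: can_reach server2.mydomain.local 26257 && echo "port is reachable". If this fails from the admin client but works on the node itself, a firewall or listen address is the more likely culprit than the certificates.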

If not, and you can verify all nodes can connect, then in order to help further, we’d need to see step by step what commands you’re running, including the cert creation and the current paths where you’re running the commands.

Appreciate your patience while we figure out the root cause here.

Hey @jorisdevrede, how are things looking? Were you able to get everything connected?

Hi,

I’m one step further. I have created the same cluster, only this time with cockroach self-signed certificates instead of the intended certificates. That works like a charm. Just start the nodes as instructed and run the init to get things going.

So the root cause lies somewhere with my certificates. The problem now is that cockroach just quietly fails without giving a solid hint as to why. Is it possible to make the logging more verbose?

Glad to hear you’re past the certificate issue!

The docs on logging configuration are here. By default, the debug logs are located at <nodeStore>/logs/, and will log all messages. It’d be really surprising if the process was failing without a message present in the logs to give a clue as to what happened.
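Since each log line starts with a severity letter (I, W, E or F) followed by the date, as in the excerpts earlier in this thread, a simple filter can cut the logs down to warnings and errors. This is a minimal sketch; log_problems is a hypothetical helper name, and the log file name in the usage example is an assumption about what sits under <nodeStore>/logs/.

```shell
# Print only warning (W), error (E) and fatal (F) entries from CockroachDB logs.
# Reads the files given as arguments, or standard input if none are given.
log_problems() {
  grep -E '^[WEF][0-9]{6} ' "$@"
}
```

For example: log_problems <nodeStore>/logs/cockroach.log. Continuation lines of multi-line messages (such as the bullet list in the "unable to contact the other nodes" warning) do not start with a severity letter, so they are dropped by this filter.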

Also take a look at our docs for generating certificates using openssl. This should give you guidance about what’s required that you can translate into your own CA/CSRs. The key requirements are in the blue “Note” boxes: Certain keyUsage values must be present, and the common name field must be node for node certificates and the user name for client certificates.
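To see whether an externally issued certificate meets the common name requirement, the subject can be read with openssl. The helpers below are a sketch under a few assumptions: extract_cn and cert_cn are made-up names, openssl is installed, and the subject contains no escaped commas.

```shell
# Read "openssl x509 -subject -nameopt RFC2253" output on stdin and print the CN value.
extract_cn() {
  sed -e 's/^subject= *//' | tr ',' '\n' | sed -n 's/^CN=//p'
}

# Print the common name of a PEM certificate file.
cert_cn() {
  openssl x509 -in "$1" -noout -subject -nameopt RFC2253 | extract_cn
}
```

For example, cert_cn /etc/cockroach/certs/node.crt should print node for a node certificate, and cert_cn on client.root.crt should print root. The keyUsage values can be inspected in the output of openssl x509 -in node.crt -noout -text.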

The common name requirement is the most common stumbling block for using external CAs; we’re working on making this more flexible.

Ahh, that explains a lot!

Unfortunately, this common name requirement is too limiting for our certificate policy. Please let me know when you have a configurable solution. I’d be happy to test it using our certificates.