Setting up a secure cluster using Docker Containers

Hi, I am trying to set up a secure test cluster on AWS using Docker containers. I was able to follow this guide to set up an insecure cluster. However, while trying to deploy a secure cluster, I’ve run into what seems to be a certificate issue, and I couldn’t find any documentation on deploying a secure cluster with Docker. Here is what I’ve tried so far:

All three EC2 instances on AWS sit inside the same VPC and share an open security group that allows all TCP connections on all ports.

I’ve generated the certificates for each instance following this guide:

cockroach cert create-node \
$(host_external_ip) \
$(host_internal_ip) \
localhost \
127.0.0.1 \
--certs-dir=$(CKDB_LOCAL_CERTS_DIR) \
--ca-key=$(CKDB_LOCAL_CA_KEY)

and built custom Docker images (based on the official cockroachdb image) with the generated certificates and keys inside the image:

FROM cockroachdb/cockroach:v2.0.2

COPY ./ckdb/certs/ca.crt certs/
COPY ./ckdb/certs/node.crt certs/
COPY ./ckdb/certs/node.key certs/

EXPOSE 26257
EXPOSE 8080

Then, on each EC2 host, I ran the following docker command to start the ckdb nodes:

sudo docker run \
--rm \
-d \
-v `pwd`/cockroach-data:/cockroach/cockroach-data \
--publish 26257:26257 \
--publish 8080:8080 \
--name=roach \
--network=host \
$(docker_image_name_built_above) \
start --certs-dir=certs/ --join=$(internal_ips_of_all_three_nodes)

and initialized the cluster by running:

sudo docker exec -it roach ./cockroach init --certs-dir=certs/

Then, if I check the node status, it shows only the local node as part of the cluster, which suggests that the nodes are not able to reach each other. The logs confirm that:

I180720 20:05:01.409697 77 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp ip-172-31-26-114:26257}
I180720 20:05:01.409843 29 gossip/client.go:136  [n?] closing client to 172.31.26.114:26257: stopping outgoing client to node 0 (172.31.26.114:26257); loopback connection
W180720 20:05:02.401708 93 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.23.24:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.23.24:26257: connect: connection refused". Reconnecting...
W180720 20:05:02.401785 78 gossip/client.go:123  [n?] failed to start gossip client to 172.31.23.24:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
W180720 20:05:03.401172 93 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.23.24:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W180720 20:05:03.401528 93 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.31.23.24:26257: context canceled; please retry.
W180720 20:05:03.401909 103 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.24.114:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.24.114:26257: connect: connection refused". Reconnecting...
W180720 20:05:03.401961 99 gossip/client.go:123  [n?] failed to start gossip client to 172.31.24.114:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure
I180720 20:05:04.401712 95 gossip/client.go:129  [n?] started gossip client to 172.31.26.114:26257
W180720 20:05:04.401893 103 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.24.114:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W180720 20:05:04.401967 103 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.31.24.114:26257: context canceled; please retry.
I180720 20:05:04.402055 107 gossip/server.go:219  [n?] received initial cluster-verification connection from {tcp ip-172-31-26-114:26257}
I180720 20:05:04.402361 95 gossip/client.go:136  [n?] closing client to 172.31.26.114:26257: stopping outgoing client to node 0 (172.31.26.114:26257); loopback connection
W180720 20:05:05.402598 117 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.23.24:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.23.24:26257: connect: connection refused". Reconnecting...
W180720 20:05:06.402101 117 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.23.24:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W180720 20:05:06.402397 117 vendor/google.golang.org/grpc/clientconn.go:830  Failed to dial 172.31.23.24:26257: context canceled; please retry.
W180720 20:05:06.402764 133 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.24.114:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.24.114:26257: connect: connection refused". Reconnecting...
W180720 20:05:07.402485 133 vendor/google.golang.org/grpc/clientconn.go:1158  grpc: addrConn.createTransport failed to connect to {172.31.24.114:26257 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...

Given that the insecure setup worked and that the security group for all EC2 instances is open to all traffic, I’m inclined to believe that I missed a step in configuring the certificates to work within the Docker containers. Any help would be much appreciated.


Hello.

I think you are missing the --advertise-host flag. (I’m not sure whether you need it when using the host network mode, but I think you do.)
You should set it to the name or IP of the host instance, because by default the node will advertise the container IP, which is not reachable from outside the host.

Also, you should use volume mapping instead of baking the certificates into the image through a custom Dockerfile. (That makes it easier to update between versions: just swap the image.)

Hey @Jeongp,

We do have a guide for orchestrating a secure CRDB cluster using Docker Swarm here. Comparing your steps and ours, I think what’s missing is Docker secrets.

Try running through the steps outlined in that doc and see if it resolves the issue - I suspect it should.

Thanks for the reply Kedare,

I’ve tried setting --advertise-host to the public IP of each host, and confirmed the setting by running cat on the cockroach.advertise-addr file in the cockroach-data folder. I’ve also used volume mapping to get the certificates into the container. However, I am still hitting the same TCP connection refused error as before.

I am trying to understand the difference in networking between the “secure” and “insecure” modes, because when I run the exact same commands with just --insecure appended, all the nodes connect correctly. As far as I can tell, the only difference is the encryption of traffic, so I still suspect that the issue is with the certs and keys, perhaps in the way I’m generating them.

Given that I’m running the containers with “host” networking, I don’t think I need to add any addresses when generating the certificate other than the ones below:

cockroach cert create-node \
$(EXTERNAL_IP) \
$(INTERNAL_IP) \
localhost \
127.0.0.1 \

Hi Tim, thanks for the reply.

For our use case, we will be supporting a multi-enterprise deployment in which each enterprise deploys its own node in a separate VPC, and the nodes connect to each other using their public IPs. As such, we did not think that Swarm was the best fit for this deployment. Please feel free to correct me if I’m wrong here.

Is there any existing documentation out there that goes through secure deployment with standalone docker?

Are those all the logs? It looks like the init command hadn’t done anything yet by that point.

Also, the errors don’t look like certificate errors. I thought certificate-related errors were usually more specific about the fact that they’re rejecting connections due to invalid certificates. This looks more like a network misconfiguration to me.

Given that you’re using --network=host, the fact that cockroach is running in docker should really be a non-factor here. The only effect docker usually has is making networking more complicated, but --network=host obviates that.
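A quick way to test that reading of the logs is to count which peer addresses are refusing connections. The sketch below assumes the log text is piped in on stdin; the function name refused_targets is made up for illustration, and the pattern is taken from the "connection refused" lines quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch: summarize which peers refuse TCP connections, based on the
# "dial tcp <ip>:<port>: connect: connection refused" lines in the log.
# refused_targets is a hypothetical helper name, not part of cockroach.
refused_targets() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\): connect: connection refused.*/\1/p' |
    sort | uniq -c
}

# Example usage (the log path is an assumption about where the node's
# log file ends up on the host):
#   refused_targets < cockroach-data/logs/cockroach.log
```

If every failure is a plain "connection refused" on port 26257, the peer was either not listening yet or not reachable at that address; a TLS problem would normally show up as a handshake or certificate error after the TCP connection succeeds.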

I think it’d help to more fully understand what you’re doing and what error you’re hitting. Ideally that would mean you provide:

  1. The exact commands you run to start cockroach and init the cluster, the output of those commands, and the log files produced when you run with --insecure
  2. The exact commands you run to start cockroach and init the cluster, the output of those commands, and the log files produced when you run using certs.

In particular, I’m most curious about:

  1. Whether you’re really doing everything the same when running secure vs insecure
  2. Whether the init command actually worked (since it looks from your logs like it didn’t), and if not what its output was.
  3. If the init command worked, what’s in the secure nodes’ logs after that point.

Hi Alex,
I’ve posted all the relevant commands, outputs, and logs below. Both the “secure” and “insecure” tests were done with two hosts, with one of the hosts initializing the cluster after both hosts had started their respective nodes.


Here are the commands and the output when run with --insecure:

FIRST HOST (one used to initialize the cluster) :

ubuntu@ip-172-31-26-114:~$ (first-dev-host): make run_dev_ckdb_node_first
sudo docker run \
--rm \
-d \
-v `pwd`/cockroach-data:/cockroach/cockroach-data \
-v `pwd`/certs:/cockroach/certs \
--publish 26257:26257 \
--publish 8080:8080 \
--name=roach \
--network=host \
cockroachdb/cockroach:v2.0.2 \
start --certs-dir=certs/ --join=18.144.56.88:26257,13.57.188.74:26257 --insecure
WARNING: Published ports are discarded when using host network mode
85de7860717befeb3b0b5d736f52b006d2a9a62f80cd5d370510370bcd26923e


ubuntu@ip-172-31-26-114:~$ (first-dev-host): make init_ckdb_cluster
sudo docker exec -it roach ./cockroach init --certs-dir=certs/ --insecure
Cluster successfully initialized


ubuntu@ip-172-31-26-114:~$ (first-dev-host): make show_ckdb_node_status
sudo docker exec -it roach ./cockroach node status --certs-dir=certs/ --insecure
+----+------------------------+--------+---------------------+---------------------+---------+
| id |        address         | build  |     updated_at      |     started_at      | is_live |
+----+------------------------+--------+---------------------+---------------------+---------+
|  1 | ip-172-31-26-114:26257 | v2.0.2 | 2018-07-24 20:15:26 | 2018-07-24 20:15:26 | true    |
|  2 | ip-172-31-23-24:26257  | v2.0.2 | 2018-07-24 20:15:27 | 2018-07-24 20:15:27 | true    |
+----+------------------------+--------+---------------------+---------------------+---------+
(2 rows)

SECOND HOST:

ubuntu@ip-172-31-23-24:~ (second-dev-host)$ make run_dev_ckdb_node_second
sudo docker run \
--rm \
-d \
-v `pwd`/cockroach-data:/cockroach/cockroach-data \
-v `pwd`/certs:/cockroach/certs \
--publish 26257:26257 \
--publish 8080:8080 \
--name=roach \
--network=host \
cockroachdb/cockroach:v2.0.2 \
start --certs-dir=certs/ --join=18.144.56.88:26257,13.57.188.74:26257 --insecure
WARNING: Published ports are discarded when using host network mode
1450e58d143614c62efd3d41a40a87dd6cc5c441ebc2b47323e8f6271bea56ce


ubuntu@ip-172-31-23-24:~ (second-dev-host)$ make show_ckdb_node_status
sudo docker exec -it roach ./cockroach node status --certs-dir=certs/ --insecure
+----+------------------------+--------+---------------------+---------------------+---------+
| id |        address         | build  |     updated_at      |     started_at      | is_live |
+----+------------------------+--------+---------------------+---------------------+---------+
|  1 | ip-172-31-26-114:26257 | v2.0.2 | 2018-07-24 20:15:26 | 2018-07-24 20:15:26 | true    |
|  2 | ip-172-31-23-24:26257  | v2.0.2 | 2018-07-24 20:15:27 | 2018-07-24 20:15:27 | true    |
+----+------------------------+--------+---------------------+---------------------+---------+
(2 rows)

Log: https://pastebin.com/zZMXVxTx


Here are the commands and output for the “secure” mode:

FIRST HOST(one used to initialize the cluster):

ubuntu@ip-172-31-26-114:~$ (first-dev-host): make run_dev_ckdb_node_first
sudo docker run \
--rm \
-d \
-v `pwd`/cockroach-data:/cockroach/cockroach-data \
-v `pwd`/certs:/cockroach/certs \
--publish 26257:26257 \
--publish 8080:8080 \
--name=roach \
--network=host \
cockroachdb/cockroach:v2.0.2 \
start --certs-dir=certs/ --join=18.144.56.88:26257,13.57.188.74:26257
WARNING: Published ports are discarded when using host network mode
ecca440289da70527c1cc5c057f96f91f1289b48c9995533ed165bcdaf966324


ubuntu@ip-172-31-26-114:~$ (first-dev-host): make init_ckdb_cluster
sudo docker exec -it roach ./cockroach init --certs-dir=certs/
Cluster successfully initialized


ubuntu@ip-172-31-26-114:~$ (first-dev-host): make show_ckdb_node_status
sudo docker exec -it roach ./cockroach node status --certs-dir=certs/
+----+------------------------+--------+---------------------+---------------------+---------+
| id |        address         | build  |     updated_at      |     started_at      | is_live |
+----+------------------------+--------+---------------------+---------------------+---------+
|  1 | ip-172-31-26-114:26257 | v2.0.2 | 2018-07-24 20:42:39 | 2018-07-24 20:42:39 | true    |
+----+------------------------+--------+---------------------+---------------------+---------+
(1 row)

SECOND HOST:

ubuntu@ip-172-31-23-24:~ (second-dev-host)$ sudo rm -rf cockroach-data/
ubuntu@ip-172-31-23-24:~ (second-dev-host)$ make run_dev_ckdb_node_second
sudo docker run \
--rm \
-d \
-v `pwd`/cockroach-data:/cockroach/cockroach-data \
-v `pwd`/certs:/cockroach/certs \
--publish 26257:26257 \
--publish 8080:8080 \
--name=roach \
--network=host \
cockroachdb/cockroach:v2.0.2 \
start --certs-dir=certs/ --join=18.144.56.88:26257,13.57.188.74:26257
WARNING: Published ports are discarded when using host network mode
49e8f199fc5ab40029f1d8801660c976e77d57b64f6ea301a20221dc3b8b327e


ubuntu@ip-172-31-23-24:~ (second-dev-host)$ make show_ckdb_node_status
sudo docker exec -it roach ./cockroach node status --certs-dir=certs/
Error: unable to connect or connection lost.

Please check the address and credentials such as certificates (if attempting to
communicate with a secure cluster).

rpc error: code = Unavailable desc = node waiting for init; /cockroach.server.serverpb.Status/Nodes not available
Failed running "node"
Makefile:1193: recipe for target 'show_ckdb_node_status' failed
make: *** [show_ckdb_node_status] Error 1
ubuntu@ip-172-31-23-24:~ (second-dev-host)$

As you can see in the output of the second host, it cannot connect to the first host. As a result, this node doesn’t belong to any initialized cluster, and the node status command errors out.

Logs: https://pastebin.com/rTF2Z8zs



Sorry for the large log dump; I couldn’t tell which parts would be relevant for debugging. Any guidance would be much appreciated, and please let me know if you need any more details.

After looking through the logs in more detail, I was able to figure out my issue, and it was a sloppy mistake on my part. I forgot to append the --overwrite option when running cockroach cert create-node, so the certificates for the first host were also being uploaded to the second host.

However, this wasn’t the only issue. When running the cockroach cert create-node command to generate the certificate, I passed in the internal IP of the host in the format XXX.XX.XX.XX, but apparently I also had to pass it in the format ip-XXX-XX-XX-XX. I found this very odd, and I’m still unsure whether it is an AWS host thing or part of how CKDB gets the address from the host.
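For anyone hitting the same thing, the two fixes can be sketched as one small helper that is called once per node, with the node's certificate files copied to that host before the next call. The helper name create_node_cert, the DRY_RUN switch, and the CERTS_DIR and CA_KEY defaults are assumptions for illustration; 203.0.113.10 is a placeholder external IP, while the internal IP and hostname are the ones shown for the first host above.

```shell
#!/bin/sh
# Sketch: build the per-node cert command with the hostname included
# and --overwrite set, so each run replaces the previous node's files.
# CERTS_DIR and CA_KEY defaults are assumptions; adjust to your layout.
CERTS_DIR="${CERTS_DIR:-certs}"
CA_KEY="${CA_KEY:-my-safe-directory/ca.key}"

# Usage: create_node_cert EXTERNAL_IP INTERNAL_IP HOSTNAME
# Prints the command by default; set DRY_RUN=0 to actually run it.
create_node_cert() {
  set -- cockroach cert create-node \
    "$1" "$2" "$3" \
    localhost 127.0.0.1 \
    --certs-dir="$CERTS_DIR" \
    --ca-key="$CA_KEY" \
    --overwrite
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# First host: placeholder external IP, then internal IP and hostname.
create_node_cert 203.0.113.10 172.31.26.114 ip-172-31-26-114
```

Because --overwrite replaces node.crt and node.key in the same directory, the files for one host need to be uploaded before generating the next host's pair.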

Thanks for the help everyone.

I’m glad you figured it out!

Using the --advertise-host flag would eliminate the need to include the ip-XXX-XX-XX-XX name in the certificates. What’s happening there is that by default, if no --host or --advertise-host flag is provided to the cockroach start command, the cockroach process will advertise its hostname to other nodes as its address. It looks like ip-XXX-XX-XX-XX is the hostname of the machine in question. You could set --advertise-host to the IP addresses to get around the need to include the hostname in the certificates.
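As a rough sketch of that suggestion, the start command from earlier in the thread could pass --advertise-host explicitly. The helper name start_node and the DRY_RUN switch are hypothetical; the image tag, mounts, and flags are the ones already used above, minus the --publish flags, which Docker reports as discarded in host network mode. Whatever address is advertised still has to appear in that node's certificate.

```shell
#!/bin/sh
# Sketch: start a node that advertises an explicit address instead of
# its hostname. start_node is a hypothetical helper name.
# Usage: start_node ADVERTISE_HOST JOIN_LIST
# Prints the command by default; set DRY_RUN=0 to actually run it.
start_node() {
  set -- sudo docker run --rm -d \
    -v "$PWD/cockroach-data:/cockroach/cockroach-data" \
    -v "$PWD/certs:/cockroach/certs" \
    --name=roach \
    --network=host \
    cockroachdb/cockroach:v2.0.2 \
    start --certs-dir=certs/ \
    --advertise-host="$1" \
    --join="$2"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Example: advertise the first host's internal IP from this thread and
# join both nodes by internal IP.
start_node 172.31.26.114 172.31.26.114:26257,172.31.23.24:26257
```

For the multi-VPC case described earlier, the same call would take the node's public IP as the first argument, with that IP listed when the certificate is created.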
