SQL client hangs in a Docker Swarm environment on CentOS 7

Hello

I’ve built a Docker Swarm environment for CockroachDB on CentOS 7. The output of “docker service ls” looks fine, but when I run the SQL client command, it hangs. iptables is down on all the nodes, so I have no idea why it hangs. Any hint?

Great thanks!

@daniel, could you provide more detail about how you set up your swarm? We have a tutorial here. Did you follow those steps?

Hi Jesse,

Thank you for your reply. I followed the steps on https://www.cockroachlabs.com/docs/orchestrate-cockroachdb-with-docker-swarm.html almost exactly (the only change is that I’m using the Docker image for beta-20170112 rather than beta-20170126).

The following are my steps:

  1. Disable iptables on all nodes and shut down SELinux.
  2. Set up SSH so that all nodes can connect to each other without a password.
  3. Synchronize the clocks so that all nodes keep the same time.
  4. Install Docker and pull the Docker image.
  5. Start a swarm and have the other nodes join using the join token.

[master]
docker swarm init --advertise-addr 192.168.214.59

[worker]
docker swarm join \
--token SWMTKN-1-3u55q5h5qv0zsb5o3ofi7vlpls61uby3jirhg9mwv9sk2cjb9j-7j6mvt1u3kdwy5quk9ivw9gxj \
192.168.214.59:2377

and “docker node ls” looks good

[root@cr1 ~]# docker node ls
ID                           HOSTNAME     STATUS  AVAILABILITY  MANAGER STATUS
6q223qzzjzilv0m40vv7fm3yj    cr3.tpl.com  Ready   Active
a2w8zrvdfkc9n3wm4fo7d68dj    cr5.tpl.com  Ready   Active
byr6tqak5atqubb0d3aqouwhc    cr2.tpl.com  Ready   Active
cut5cxvpjfc3z7r9zfay845p5 *  cr1.tpl.com  Ready   Active        Leader
eplbf7f76yf5d0ns6sadchq9b    cr4.tpl.com  Ready   Active

6. Start the first node:
docker service create --replicas 1 --name cockroachdb-0 --network cockroachdb \
--mount type=volume,source=cockroachdb-0,target=/cockroach/cockroach-data,volume-driver=local \
--stop-grace-period 60s \
cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-0 --logtostderr --insecure

After this step, the SQL CLI works fine on the node the Docker container lives on.

7. Create another two nodes:
docker service create --replicas 1 --name cockroachdb-1 --network cockroachdb \
--mount type=volume,source=cockroachdb-1,target=/cockroach/cockroach-data,volume-driver=local \
--stop-grace-period 60s \
cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-1 \
--join=cockroachdb-0:26257 --logtostderr --insecure

docker service create --replicas 1 --name cockroachdb-2 --network cockroachdb \
--mount type=volume,source=cockroachdb-2,target=/cockroach/cockroach-data,volume-driver=local \
--stop-grace-period 60s \
cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-2 \
--join=cockroachdb-0:26257 --logtostderr --insecure

After this, I found I cannot connect with the SQL CLI on the two joining nodes.

8. Remove cockroachdb-0 and let it rejoin:
docker service rm cockroachdb-0
docker service create --replicas 1 --name cockroachdb-0 --network cockroachdb \
--mount type=volume,source=cockroachdb-2,target=/cockroach/cockroach-data,volume-driver=local \
--stop-grace-period 60s \
cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-0 \
--join=cockroachdb-1:26257 --logtostderr --insecure

After this step, none of the nodes can be reached via the SQL CLI.

Please let me know if I missed any step or mistyped any command. Any idea or suggestion is welcome.

Great thanks!

Thanks for those extra details, @daniel.

@a-robinson, do you have any idea why sql clients wouldn’t connect in this case?

Hi Daniel,

I have a few questions to help drill down on what’s going on.

  1. Did you set up the overlay network (using docker network create --driver overlay cockroachdb)?
  2. What do you mean when you say that you “disable the iptables on all nodes”?
  3. When you deleted and recreated node 0, did it end up on the same machine as the first time or on a different one?
  4. Before removing cockroachdb-0, if you exec into one of the other two processes, can you open a SQL shell to cockroachdb-0? The process would basically be:
# SSH to the node that cockroachdb-1 or cockroachdb-2 is on and get its container ID by running 'docker ps', then run
docker exec -it <container-id> ./cockroach sql --host=cockroachdb-0

Hi Robinson,

Thank you for your email. The following are my answers to your questions.

  1. Yes, I’ve set up the overlay network according to the doc. (Sorry, I didn’t list it in my former email.)

     [root@cr1 ~]# docker network ls
     NETWORK ID          NAME                DRIVER              SCOPE
     2a386b83de7b        bridge              bridge              local               
     881888ycmvuv        cockroachdb         overlay             swarm               
     ffa750a80321        docker_gwbridge     bridge              local               
     1f5909a84189        host                host                local               
     b6mx9yi6wucy        ingress             overlay             swarm               
     c072a0e5eab7        none                null                local               
     99a02153c09a        roachnet            bridge              local               
    
  2. I’ve disabled the iptables service on all the nodes in case of any port issues.

    [root@cr1 ~]# service iptables status
    Redirecting to /bin/systemctl status iptables.service
    ● iptables.service
    Loaded: not-found (Reason: No such file or directory)
    Active: inactive (dead)

  3. After I recreated cockroachdb-0, it is now on the first node (cr1), while before it was on cr3.

  4. Before deleting, cockroachdb-0 and cockroachdb-1 were on cr3, and cockroachdb-2 was on cr5. At that time, I found I could connect to cockroachdb-0 and cockroachdb-1 via the SQL CLI on cr3, but not from cr5.
    After recreating cockroachdb-0, it is now on cr1, while cockroachdb-1 is on cr3 and cockroachdb-2 is on cr5, and none of them can be reached via the SQL CLI.
    It looks like a port issue, but I’ve put iptables down, so it really got me crazy~~hahaha. Any hint on it?

Great thanks!

Ok, I believe this all makes sense then. I’d re-enable iptables and try again. In normal configurations, Docker relies on being able to set up certain iptables rules to make container networking work right.
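As a sketch of what that might look like on CentOS 7 with firewalld (the port list here is an assumption based on Docker swarm mode's general networking requirements, not something from this thread): beyond CockroachDB's 26257 and 8080 and the swarm manager port 2377/tcp, overlay networking also needs 7946/tcp+udp for node discovery and 4789/udp for the VXLAN data path between all nodes.

```shell
# Sketch: open Docker Swarm + CockroachDB ports via firewalld on CentOS 7.
# 2377/tcp       - swarm cluster management
# 7946/tcp+udp   - container network discovery
# 4789/udp       - overlay network (VXLAN) data traffic
# 26257/tcp      - CockroachDB SQL / inter-node traffic
# 8080/tcp       - CockroachDB admin UI
for port in 2377/tcp 7946/tcp 7946/udp 4789/udp 26257/tcp 8080/tcp; do
  firewall-cmd --permanent --add-port="$port"
done
firewall-cmd --reload
```

If the overlay-network ports (7946, 4789) stay blocked, cross-node service discovery can fail even though the CockroachDB ports themselves are open.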

Hi Robinson,

I’ve re-enabled iptables and opened TCP ports 2377, 8080, and 26257 on the five nodes.

After I removed all the cockroach services and rebooted all the servers, I checked the iptables:

Chain IN_public_allow (1 references)
target     prot opt source               destination         
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:22 ctstate NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:11 ctstate NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:26257 ctstate NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:5911 ctstate NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:2377 ctstate NEW
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 ctstate NEW

and the following is the status for docker swarm & network status

[root@cr1 ~]# docker node ls
ID                           HOSTNAME     STATUS  AVAILABILITY  MANAGER STATUS
4ijtruqgh2ni52gbjhmak2jan    cr2.tpl.com  Ready   Active        
6q223qzzjzilv0m40vv7fm3yj    cr3.tpl.com  Ready   Active        
a2w8zrvdfkc9n3wm4fo7d68dj    cr5.tpl.com  Ready   Active        
cut5cxvpjfc3z7r9zfay845p5 *  cr1.tpl.com  Ready   Active        Leader
eplbf7f76yf5d0ns6sadchq9b    cr4.tpl.com  Ready   Active        

[root@cr1 ~]# docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
f8c9a970d18d        bridge              bridge              local               
881888ycmvuv        cockroachdb         overlay             swarm               
ffa750a80321        docker_gwbridge     bridge              local               
1f5909a84189        host                host                local               
b6mx9yi6wucy        ingress             overlay             swarm               
c072a0e5eab7        none                null                local               
99a02153c09a        roachnet            bridge              local   

Then I recreated the cockroachdb-0:

[root@cr1 ~]# docker service create --replicas 1 --name cockroachdb-0 --network cockroachdb \
> --mount type=volume,source=cockroachdb-0,target=/cockroach/cockroach-data,volume-driver=local \
> --stop-grace-period 60s \
> cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-0 --logtostderr --insecure
19b8s5boqlxvpx3rgvbg2pb60

[root@cr1 ~]# docker service ls
ID            NAME           REPLICAS  IMAGE                                COMMAND
19b8s5boqlxv  cockroachdb-0  1/1       cockroachdb/cockroach:beta-20170112  start --advertise-host=cockroachdb-0 --logtostderr --insecure
[root@cr1 ~]# docker service ps cockroachdb-0
ID                         NAME             IMAGE                                NODE         DESIRED STATE  CURRENT STATE           ERROR
5rmafvabchew0xtlv3uwnigs5  cockroachdb-0.1  cockroachdb/cockroach:beta-20170112  cr2.tpl.com  Running        Running 24 minutes ago  

After this step, I can connect via the SQL CLI from cr2:

[root@cr2 ~]# docker exec -it  $(docker ps | grep cockroachdb| awk {'print $1'}) ./cockroach sql
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
root@:26257> 

Then I recreated cockroachdb-1 with the following command:

[root@cr1 ~]# docker service create --replicas 1 --name cockroachdb-1 --network cockroachdb \
> --mount type=volume,source=cockroachdb-1,target=/cockroach/cockroach-data,volume-driver=local \
> --stop-grace-period 60s \
> cockroachdb/cockroach:beta-20170112 start --advertise-host=cockroachdb-1 \
> --join=cockroachdb-0:26257 --logtostderr --insecure
4x6ix0xyodsmvjzt2giibxx7o

[root@cr1 ~]# docker service ls
ID            NAME           REPLICAS  IMAGE                                COMMAND
19b8s5boqlxv  cockroachdb-0  1/1       cockroachdb/cockroach:beta-20170112  start --advertise-host=cockroachdb-0 --logtostderr --insecure
4x6ix0xyodsm  cockroachdb-1  1/1       cockroachdb/cockroach:beta-20170112  start --advertise-host=cockroachdb-1 --join=cockroachdb-0:26257 --logtostderr --insecure

[root@cr1 ~]# docker service ps cockroachdb-1
ID                         NAME             IMAGE                                NODE         DESIRED STATE  CURRENT STATE           ERROR
3davrg5vgrmeix7lefb00lu7s  cockroachdb-1.1  cockroachdb/cockroach:beta-20170112  cr5.tpl.com  Running        Running 18 minutes ago  

However, when I try to connect via the SQL CLI from cr5, it hangs:

[root@cr5 ~]# docker ps
CONTAINER ID        IMAGE                                 COMMAND                  CREATED             STATUS              PORTS                 NAMES
efea68b94ef8        cockroachdb/cockroach:beta-20170112   "/cockroach/cockroach"   51 seconds ago      Up 50 seconds       8080/tcp, 26257/tcp   cockroachdb-1.1.3davrg5vgrmeix7lefb00lu7s
[root@cr5 ~]# docker exec -it  $(docker ps | grep cockroachdb| awk {'print $1'}) ./cockroach sql

I’m not sure if the Docker logs on cr5 will help:

[root@cr5 ~]# docker logs efea68b94ef8
I170213 08:13:31.588971 1 cli/start.go:320  CockroachDB beta-20170112 (linux amd64, built 2017/01/12 18:27:36, go1.7.3)
I170213 08:13:31.697192 1 cli/start.go:336  starting cockroach node
W170213 08:13:31.770252 1 server/server.go:156  [n?] running in insecure mode, this is strongly discouraged. See --insecure.
W170213 08:13:31.770933 1 gossip/gossip.go:1130  [n?] no incoming or outgoing connections
I170213 08:13:31.790337 1 storage/engine/rocksdb.go:326  opening rocksdb instance at "cockroach-data"
I170213 08:13:31.831609 1 server/config.go:456  1 storage engine initialized
W170213 08:13:32.002123 44 gossip/gossip.go:939  [n?] invalid bootstrap address: &{typ:tcp addr:cockroachdb-0:26257}, lookup cockroachdb-0 on 127.0.0.11:53: server misbehaving
I170213 08:13:32.008059 1 server/node.go:426  [n?] store [n0,s0] not bootstrapped
I170213 08:13:32.008153 1 storage/stores.go:296  [n?] read 0 node addresses from persistent storage
I170213 08:13:32.008342 1 server/node.go:569  [n?] connecting to gossip network to verify cluster ID...
[root@cr5 ~]#

Is this the issue? I have no idea why it would “lookup cockroachdb-0 on 127.0.0.11:53”.

[quote="daniel, post:8, topic:449"]
W170213 08:13:32.002123 44 gossip/gossip.go:939  [n?] invalid bootstrap address: &{typ:tcp addr:cockroachdb-0:26257}, lookup cockroachdb-0 on 127.0.0.11:53: server misbehaving
[/quote]

That’s the IP address that Docker runs its embedded DNS server on, so it looks like there’s still an issue with the Docker networking setup.
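One quick way to confirm that (a sketch; substitute a real container ID) is to look at the resolver configuration inside a container attached to the overlay network:

```shell
# Containers attached to a user-defined Docker network use Docker's
# embedded DNS server, which listens on 127.0.0.11 inside the container.
docker exec -it <container-id> cat /etc/resolv.conf
# Expect a line like: nameserver 127.0.0.11
```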

It might be worth reproducing this with a simpler Docker image (e.g. running nginx on one node and trying to connect to it by name from a container on another node) and asking the Docker folks about it. I’m unfortunately not an expert on Docker network setup.
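A sketch of that reproduction (the service and network names here are placeholders, not from this thread):

```shell
# Minimal cross-node DNS check over an overlay network.
docker network create --driver overlay testnet
docker service create --name web --network testnet nginx
docker service create --name probe --network testnet alpine sleep 1d
# SSH to the node where the probe container landed (see: docker service ps probe),
# then try to reach nginx by its service name:
docker exec -it $(docker ps -q -f name=probe) wget -qO- http://web
```

If the `wget` hangs or fails to resolve `web`, the problem is in the Docker overlay networking itself rather than in CockroachDB.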

Also, it may be worth removing the docker volumes from your past attempts, since as noted in https://www.cockroachlabs.com/docs/orchestrate-cockroachdb-with-docker-swarm.html#step-10-stop-the-cluster, the volumes are left around by default which could potentially interfere with later attempts to start a cockroach cluster from scratch.
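A sketch of that cleanup, run on each node (volume names per the tutorial; `docker volume rm` will complain about volumes that don’t exist on a given node, which is harmless):

```shell
# List leftover volumes, then remove the ones from previous attempts.
docker volume ls
docker volume rm cockroachdb-0 cockroachdb-1 cockroachdb-2
```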