Cockroach UI showing dead node even though the node is running in the same cluster


(hrishikesh srivastava) #1

Hi,

I am using a cluster with 3 nodes, and below are the commands I used to start them:

Node 1:

root@172.16.23.29:/usr/rishi rwxr-xr-x # cockroach start --insecure --http-addr=172.16.23.29:7080

Node 2:
root@172.16.23.29:~ r-xr-x— # cockroach start --insecure --store=node2 --listen-addr=172.16.23.29:26258 --http-addr=172.16.23.29:7081 --join=172.16.23.29:26257

Node 3:
root@172.16.23.29:~ r-xr-x— # cockroach start --insecure --store=node3 --listen-addr=172.16.23.29:26259 --http-addr=172.16.23.29:7082 --join=172.16.23.29:26257

When I look at the admin UI on the first node, http://172.16.23.29:7080/#/overview/list , I can see only one node running,

whereas the UI for the other two nodes shows two live nodes and one DEAD node.

Because of this, my transaction data is present only in the first node's database and not in the other two nodes.

root@172.16.23.29:~ r-xr-x— # cockroach sql --insecure --host=172.16.23.29:26258

Welcome to the cockroach SQL interface.

All statements must be terminated by a semicolon.

To exit: CTRL + D.

Server version: CockroachDB CCL v2.1.4 (x86_64-unknown-linux-gnu, built 2019/01/16 16:05:40, go1.10.7) (same version as client)

Cluster ID: 8b7259c1-5331-4e97-ba53-7072a261ef5c

Enter ? for a brief introduction.

root@172.16.23.29:26258/defaultdb>
root@172.16.23.29:26258/defaultdb> select count(*) from issue.transaction;
  count
+-------+
      0
(1 row)

Time: 652.494µs

root@172.16.23.29:26258/defaultdb>

==============

root@172.16.23.29:~ r-xr-x— # cockroach sql --insecure --host=172.16.23.29:26257

Welcome to the cockroach SQL interface.

All statements must be terminated by a semicolon.

To exit: CTRL + D.

Server version: CockroachDB CCL v2.1.4 (x86_64-unknown-linux-gnu, built 2019/01/16 16:05:40, go1.10.7) (same version as client)

Cluster ID: d9a1130f-3b11-4e75-9fc8-3ea6525b0a0e

Enter ? for a brief introduction.

root@172.16.23.29:26257/defaultdb>
root@172.16.23.29:26257/defaultdb>
root@172.16.23.29:26257/defaultdb> select count(*) from issue.transaction;
  count
+-------+
   2809
(1 row)

Time: 2.420273ms

root@172.16.23.29:26257/defaultdb>

Please suggest how to fix this. I am using port 7080 for my database's admin UI.


(Tim O'Brien) #2

Hi @rishi2019,

It looks like your first node is on an independent cluster. (You can see this in your output above: the two SQL shells report different Cluster IDs.) It's also missing a store. You can try wiping the stores and restarting the cluster following these instructions. I suspect your second and third nodes are not joining the first because the first is missing a --listen-addr. If you kill the existing nodes, wipe all the stores, and restart the nodes in this order:

$ cockroach start --insecure --store=node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080
$ cockroach start --insecure --store=node2 --listen-addr=172.16.23.29:26258 --http-addr=172.16.23.29:7081 --join=172.16.23.29:26257
$ cockroach start --insecure --store=node3 --listen-addr=172.16.23.29:26259 --http-addr=172.16.23.29:7082 --join=172.16.23.29:26257

You’d likely be fine.
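
For reference, "wipe all the stores" here means deleting the store directories once the nodes are down (a rough sketch; since your node 1 had no --store flag, its data lives in the default cockroach-data directory under wherever it was started, so adjust paths to your layout):

$ rm -rf cockroach-data node2 node3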

If you just want to proceed with testing, you could spin up a third node and have it join the existing cluster:

$ cockroach start --insecure --http-addr=172.16.23.29:7080 --store=node1 --join=172.16.23.29:26257
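
Either way, once the nodes are up you can verify that they all joined one cluster with node status (a quick check; run it against any node):

$ cockroach node status --insecure --host=172.16.23.29:26257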

Hope that helps!


(hrishikesh srivastava) #3

Thanks @tim-o,

You're right. I got the expected result this way.


(Tim O'Brien) #4

Great! Glad that helped.


(hrishikesh srivastava) #5

Hi,

I was using the following commands, as instructed, to create a cluster with 4 nodes across two different servers:

Server 172.16.23.29:
Node 1 : cockroach start --insecure --store=node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080
Node 2 : cockroach start --insecure --store=node2 --listen-addr=172.16.23.29:26258 --http-addr=172.16.23.29:7081 --join=172.16.23.29:26257
Node 3 : cockroach start --insecure --store=node3 --listen-addr=172.16.23.29:26259 --http-addr=172.16.23.29:7082 --join=172.16.23.29:26257

Server 172.16.23.26:
Node 4 : cockroach start --insecure --store=node4 --listen-addr=172.16.23.26:26260 --http-addr=172.16.23.26:7083 --join=172.16.23.29:26257

Everything was working fine, and the UI was showing "4 Live Nodes".

Later, I opened a different console and checked the running PIDs for cockroach:

root@172.16.23.29:~ r-xr-x— # ps -ef | grep cockroach

and killed all running cockroach PIDs with the kill -9 PID command.
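
In hindsight, a graceful shutdown of each node would probably have been safer than kill -9, for example:

$ cockroach quit --insecure --host=172.16.23.29:26257

run once per node with that node's address.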

After that, I tried to start all the nodes again using the same commands as above. The UI shows 4 nodes, but the first node has a different cluster ID while the other 3 nodes share the same cluster ID.

Because of this, the first node shows as dead in the UI at http://172.16.23.29:7081/#/overview/list

I am not sure what is going wrong here or how to bring node1 back with the same cluster ID. Please help me sort out this issue.


(Tim O'Brien) #6

Node 1 in your example above has no --join flag. The best practice is to specify every node in the --join flag on all nodes. Try:

--join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.29:26259,172.16.23.26:26260

on the remaining node.


(hrishikesh srivastava) #7

I tried the following combinations, but node 1 still has a different cluster ID:

cockroach start --insecure --store=node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080 --join=172.16.23.29:26258,172.16.23.29:26259,172.16.23.26:26260

cockroach start --insecure --store=node2 --listen-addr=172.16.23.29:26258 --http-addr=172.16.23.29:7081 --join=172.16.23.29:26257,172.16.23.29:26259,172.16.23.26:26260

cockroach start --insecure --store=node3 --listen-addr=172.16.23.29:26259 --http-addr=172.16.23.29:7082 --join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.26:26260

cockroach start --insecure --store=node4 --listen-addr=172.16.23.26:26260 --http-addr=172.16.23.26:7083 --join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.29:26259

(hrishikesh srivastava) #8

[screenshots of the admin UI showing the dead node]

(Ron Arévalo) #9

Hey @rishi2019,

Thanks for the screenshots. By any chance, are you using a process manager to restart this cluster, or did you happen to remove the data directory when you shut down the cluster?

A single node joining a separate cluster with a different ID from the rest of your nodes would only happen in the event of:

  1. A loss of the data directory and starting without a --join flag (best practice is to always use the --join flag).
  2. Manually starting the cluster, and then restarting it with a process manager.

Aside from those two instances, a node would not be able to join a different cluster on restart.
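
If it helps, you can check which cluster ID each store belongs to by grepping that node's logs (a sketch, assuming the stores sit in your working directory and use the default logs subdirectory inside each store):

$ grep clusterID node1/logs/cockroach.log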

Thanks,

Ron


(hrishikesh srivastava) #10

Hi @ronarev,

I can see two locations for the node1 data directory on server 172.16.23.29:

  1. /root/node1 (cluster ID in log: 88e1e349-1f97-4ac8-83b0-32e38d3e3402)
  2. /usr/rishi/node1 (cluster ID in log: b128e523-7abc-4a29-8c20-4d119799f83d)

My cockroach installation is located in the /usr/rishi/ folder. I am not sure why two different node1 directories were created; the other two nodes, node2 and node3, are located under /root.

If I execute the cockroach start command to start node 1, it starts from directory #1, but I suppose it should start from directory #2, which has the same cluster ID as node2 and node3.

To start node 1, I used the following command, as suggested:

cockroach start --insecure --store=node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080 --join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.29:26259,172.16.23.26:26260

Is there a way to start node1 from the location /usr/rishi/node1?


(hrishikesh srivastava) #11

Well, now I am able to start the node from the location /usr/rishi/node1 by providing the node1 path in the --store flag:

root@172.16.23.29:~ r-xr-x— # cockroach start --insecure --store=/usr/rishi/node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080

I think it is better to provide more specific store paths rather than just node1, node2, node3, etc.
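
For example, absolute store paths make it unambiguous which directory each node uses (a sketch, assuming /usr/rishi is the intended data location on 172.16.23.29):

$ cockroach start --insecure --store=/usr/rishi/node1 --listen-addr=172.16.23.29:26257 --http-addr=172.16.23.29:7080 --join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.29:26259,172.16.23.26:26260
$ cockroach start --insecure --store=/usr/rishi/node2 --listen-addr=172.16.23.29:26258 --http-addr=172.16.23.29:7081 --join=172.16.23.29:26257,172.16.23.29:26258,172.16.23.29:26259,172.16.23.26:26260

and so on for the remaining nodes.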