admin UI showing dead nodes

Downloaded the binary and started 3 nodes as per the docs. ps -ax shows that they are all up and running. However, the admin UI at ip:8080 only shows one node, while ip:8081 shows 2 nodes and reports that one node is dead. What's up? Can someone explain whether this is correct? It does not sound like it is.

Hey @samsam

Did you happen to restart any of these nodes? The second screenshot shows that the uptime is only 17 minutes.

If you could take a look at the logs for those nodes and search for clusterID, they will probably show up as different from each other.
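For example, something along these lines (just a sketch; the store/log paths are placeholders, so adjust them to wherever each node actually writes its logs):

grep -i clusterID <node1-store-dir>/logs/cockroach.log
grep -i clusterID <node2-store-dir>/logs/cockroach.log
grep -i clusterID <node3-store-dir>/logs/cockroach.log

If the IDs differ, the nodes bootstrapped separate clusters instead of joining one.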

Also, could you provide the start command that you issued? Did you use the --join flag?

Thanks,

Ron

Ron,

I stopped the nodes, deleted all the node1/node2/node3 folders, and restarted the DB, and it seems to work now.

Start node 1:

cockroach start --insecure --listen-addr=172.31.35.124 &

Start node 2:

cockroach start --insecure --store=node2 --listen-addr=172.31.35.124:26258 --http-addr=172.31.35.124:8081 --join=172.31.35.124:26257 &

Start node 3:

cockroach start --insecure --store=node3 --listen-addr=172.31.35.124:26259 --http-addr=172.31.35.124:8082 --join=172.31.35.124:26257 &

Hey @samsam,

I would say it’s always a good idea to add the --join flag, even on the initial node.

Also, it looks like you may have restarted this node in a different directory. I was able to replicate this by doing the following:

  1. Stopped the first node using cockroach quit --insecure --host=localhost:26257
  2. Changed directories and ran cockroach start --insecure --listen-addr=localhost

This resulted in seeing one dead node on the original cluster, and one new cluster with just one node.
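In other words, roughly this sequence (the directory is just a placeholder):

cockroach quit --insecure --host=localhost:26257
cd /some/other/directory
cockroach start --insecure --listen-addr=localhost &

Because the second start runs from a different directory without --join, it creates a fresh default store there and bootstraps a brand-new single-node cluster, while the original cluster keeps reporting that node as dead.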

I would suggest stopping the entire cluster, restarting it all in the same directory, and adding the --join flag to the initial node as well.

Something like this:

cockroach start --insecure --store=node1 --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257
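And the other two nodes the same way you had them, just pointed at localhost if everything runs on one machine (a rough sketch that reuses the node2/node3 store names from your commands above):

cockroach start --insecure --store=node2 --listen-addr=localhost:26258 --http-addr=localhost:8081 --join=localhost:26257 &
cockroach start --insecure --store=node3 --listen-addr=localhost:26259 --http-addr=localhost:8082 --join=localhost:26257 &

Once they are up, cockroach node status --insecure --host=localhost:26257 should list all three nodes in the same cluster. One caveat: if the stores are brand new and every node starts with --join, you may also need to run cockroach init --insecure --host=localhost:26257 once to bootstrap the cluster.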

Thanks,

Ron

Will try it again. Thanks, Ron.

Wow… your option did not work, so I reverted back to the way node1 was originally called, and now I have issues: the node1 folder never gets created and I keep getting errors: * WARNING: The server appears to be unable to contact the other nodes in the cluster. Please try:
*
I cannot believe this DB is so buggy.

Here are partial logs, if it helps:

W190419 18:17:37.847982 84 storage/store.go:1654 [n1,s1,r7/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r7: replica (n1,s1):1 not lease holder; lease holder unknown
W190419 18:17:37.859134 117 jobs/registry.go:316 unable to get node liveness: node not in the liveness table
I190419 18:17:38.303183 177 gossip/server.go:232 [n1] received initial cluster-verification connection from {tcp 172.31.35.124:26259}
W190419 18:17:38.790792 84 storage/store.go:1654 [n1,s1,r7/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r7: replica (n1,s1):1 not lease holder; lease holder unknown
W190419 18:17:38.876075 103 storage/replica_range_lease.go:470 can’t determine lease status due to node liveness error: node not in the liveness table
W190419 18:17:39.307650 116 storage/node_liveness.go:558 [n1,hb] slow heartbeat took 4.5s
W190419 18:17:39.307668 116 storage/node_liveness.go:494 [n1,hb] failed node liveness heartbeat: context deadline exceeded
W190419 18:17:39.751614 84 storage/store.go:1654 [n1,s1,r7/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r7: replica (n1,s1):1 not lease holder; lease holder unknown
I190419 18:17:40.303330 157 gossip/server.go:232 [n1] received initial cluster-verification connection from {tcp 172.31.35.124:26258}
I190419 18:17:40.303417 177 gossip/server.go:232 [n1] received initial cluster-verification connection from {tcp 172.31.35.124:26259}
W190419 18:17:40.728003 84 storage/store.go:1654 [n1,s1,r7/1:/Table/{SystemCon…-11}] could not gossip system config: [NotLeaseHolderError] r7: replica (n1,s1):1 not lease holder; lease holder unknown