A few nodes rejoining very often in a secure CockroachDB cluster

replication
deployment

(Swaroop) #1

Hi, I’m running a 10-node secure CockroachDB cluster on AWS EC2 c5.large instances (2 vCPU, 4 GB RAM) through Apache Mesos and Docker. The instances run Debian Jessie. 4 nodes are very unstable and rejoin the cluster every 15-20 minutes. On these affected nodes the cockroach process is getting killed for a memory issue (“Memory cgroup out of memory, killing cockroach container”), and there are always 4 under-replicated ranges.
I tried adding 2 more nodes to the cluster and also upgrading 2 of the c5.large instances to c5.xlarge (4 vCPU, 8 GB RAM), but that didn’t resolve the issue.
From the dashboard all the nodes seem to be healthy, but when I run node status I see the is_live parameter set to false for a few nodes. I’m unable to figure out the root cause.

+----+------------------+--------+---------------------+---------------------+---------+
| id |     address      | build  |     updated_at      |     started_at      | is_live |
+----+------------------+--------+---------------------+---------------------+---------+
|  4 | 10.0.8.230:26257 | v2.0.4 | 2018-12-27 18:00:05 | 2018-12-27 17:58:14 | false   |
|  6 | 10.0.8.13:26257  | v2.0.4 | 2018-12-27 18:00:04 | 2018-12-21 21:38:17 | false   |
|  7 | 10.0.8.33:26257  | v2.0.4 | 2018-12-27 17:59:58 | 2018-12-26 18:21:04 | false   |
|  8 | 10.0.8.50:26257  | v2.0.4 | 2018-12-27 18:00:02 | 2018-12-21 19:43:51 | true    |
|  9 | 10.0.8.37:26257  | v2.0.4 | 2018-12-27 17:59:58 | 2018-12-21 18:54:28 | false   |
| 10 | 10.0.8.121:26257 | v2.0.4 | 2018-12-27 18:00:03 | 2018-12-27 13:54:10 | true    |
| 11 | 10.0.8.88:26257  | v2.0.4 | 2018-12-27 18:00:02 | 2018-12-27 17:56:41 | true    |
| 13 | 10.0.8.63:26257  | v2.0.4 | 2018-12-27 18:00:05 | 2018-12-27 17:49:33 | true    |
| 14 | 10.0.8.213:26257 | v2.0.4 | 2018-12-27 18:00:05 | 2018-12-27 13:54:12 | true    |
| 15 | 10.0.8.110:26257 | v2.0.4 | 2018-12-27 18:00:02 | 2018-12-27 17:58:40 | true    |
| 16 | 10.0.8.168:26257 | v2.0.4 | 2018-12-27 18:00:05 | 2018-12-27 13:54:11 | true    |
| 17 | 10.0.8.43:26257  | v2.0.4 | 2018-12-27 18:00:00 | 2018-12-21 21:08:50 | false   |
+----+------------------+--------+---------------------+---------------------+---------+
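
For reference, the output above is from running node status against one of the live nodes; the command is roughly the following (the certs directory path below is a placeholder, not my exact setup):

cockroach node status --certs-dir=certs --host=10.0.8.121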


(Ron Arévalo) #2

Hey Swaroop,

Can you send us over a debug zip so we can take a look at what might be happening on those nodes? From the screenshot it looks like the affected nodes are 1, 2, 3 and 5; is that correct? I’ve shared a Google Drive folder with your email, and you can upload the debug zip there.
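
If it helps, the zip can be generated with the debug zip command. A rough sketch for a secure cluster (the certs directory, host, and output path are placeholders; adjust them to your setup):

cockroach debug zip ./cockroach-debug.zip --certs-dir=certs --host=10.0.8.121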

Thanks,

Ron


(Swaroop) #3

Hi Ron,

Thanks for the response, and sorry for the delay. The issue with nodes rejoining often has been resolved. There was a timeseries table with a range size much higher than 64 MB that was causing the issue. The cluster seems to be stable now and there are no under-replicated ranges. But the node status command still shows is_live as false for a few nodes (ids 4, 6, 7, 9 and 17). The debug zip has been shared.
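
In case it helps anyone hitting the same thing: 64 MB is the default range_max_bytes for a zone. A rough sketch of how the zone settings for a table can be checked and raised on v2.0 with the zone get/set commands (the database and table name, certs directory, and host below are placeholders, not my actual schema):

cockroach zone get mydb.timeseries --certs-dir=certs --host=10.0.8.121
echo 'range_max_bytes: 134217728' | cockroach zone set mydb.timeseries --certs-dir=certs --host=10.0.8.121 -f -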


(Ron Arévalo) #4

Hi @swaroop,

Was the debug zip you shared taken before or after the cluster became stable? Are you still seeing is_live as false for those nodes?

Thanks,

Ron


(Swaroop) #5

Hi Ron,

The debug zip is from after the cluster was stable. I’m still seeing is_live as false for the nodes with ids 4, 6, 7, 9 and 17.