Under-Replicated Ranges

My cluster has 216 of 218 ranges under-replicated, and the dashboard shows constant queue processing failures. It also says 216 replicas are in purgatory. I don’t see anything about this in my log files, so it’s really hard to understand what is going wrong. I’m not really sure where to start solving this.

Thanks

Hi @ozzieisaacs, to start a diagnosis, we’re going to need to know a bit more about your cluster.

How many nodes?
Can they all talk to each other?
What version are you running?

Also, to provide a bit more color: replicas go into purgatory when they can’t upreplicate after a significant number of tries. The most likely cause is that there’s a network partition preventing some nodes from communicating with other nodes, hence Bram’s questions about whether or not they can talk to each other. The other questions about cluster setup are still relevant, so let us know if you have any questions.
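If it helps to rule out a partition quickly, here is a minimal sketch of a reachability check, assuming the nodes listen on the default port 26257. The helper names `can_connect` and `check_nodes` are made up for this example, and a successful TCP connection only shows that the port is reachable from the machine running the script, not that the nodes can replicate to each other.

```python
import socket
from typing import Dict, Iterable

# Assumption: nodes listen on CockroachDB's default port.
COCKROACH_PORT = 26257


def can_connect(host: str, port: int = COCKROACH_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def check_nodes(hosts: Iterable[str], port: int = COCKROACH_PORT,
                timeout: float = 2.0) -> Dict[str, bool]:
    """Map each host to whether it accepted a TCP connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Running `check_nodes([...])` with the addresses of the other nodes, once from each machine in the cluster, gives a rough picture of which pairs can reach each other.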

+----+---------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+
| id |       address       | build  |     updated_at      |     started_at      | is_live | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |
+----+---------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+
|  1 | 192.168.8.91:26257  | v2.0.2 | 2018-08-06 14:57:59 | 2018-08-03 21:11:25 |    true |              203 |                   203 |    218 |                  0 |                    216 |
|  2 | 192.168.8.90:26257  | v2.0.2 | 2018-08-06 14:58:04 | 2018-08-03 21:07:43 |    true |               15 |                    13 |    218 |                  0 |                      0 |
|  3 | 192.168.10.90:26257 | v2.0.2 | 2018-08-06 14:58:04 | 2018-08-03 21:47:12 |    true |                0 |                     0 |    218 |                  0 |                      0 |
+----+---------------------+--------+---------------------+---------------------+---------+------------------+-----------------------+--------+--------------------+------------------------+

I can run this command from each node, so they don’t seem to have a problem connecting to each other. I can also query each node, and they all return the same data. Is there anywhere I can find logs that would help me see the problem?

Hey @ozzieisaacs - the fastest way to collect logs across active nodes is to use cockroach debug zip. If you run that and send us the output, we can take a look at the logs.

Is there an email address or something I can send the output to?

Sure, tim@cockroachlabs.com should work.

For posterity’s sake:

The issue here ended up being the zone configuration: the default zone had num_replicas set to 5, while the cluster had only three nodes. Because a range keeps at most one replica per node, the target of 5 could never be met, so nearly all ranges were reported as under-replicated and the admin UI showed thousands of replication queue failures. Zone configs can be listed and adjusted using the instructions here: https://www.cockroachlabs.com/docs/v2.0/configure-replication-zones.html#list-the-pre-configured-replication-zones
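For anyone who lands here with the same symptoms, the mismatch can be sketched in a few lines. This is only an illustration, not how CockroachDB computes the metric internally: the function names are hypothetical, the `--insecure` flag assumes an insecure cluster, and the `cockroach zone set` form is the one described in the v2.0 docs linked above, so check it against your version before running anything.

```python
from typing import List


def reachable_replicas(node_count: int, num_replicas: int) -> int:
    """A range keeps at most one replica per node, so nodes cap the replica count."""
    return min(node_count, num_replicas)


def is_under_replicated(node_count: int, num_replicas: int) -> bool:
    """True when the zone asks for more replicas than the cluster can place."""
    return reachable_replicas(node_count, num_replicas) < num_replicas


def zone_yaml(num_replicas: int) -> str:
    """YAML fragment to feed to the zone command on stdin."""
    return f"num_replicas: {num_replicas}\n"


def zone_set_command(zone: str = ".default") -> List[str]:
    """Argument list for updating a zone from stdin.

    Assumption: insecure cluster; replace --insecure with your
    certificate flags on a secure cluster.
    """
    return ["cockroach", "zone", "set", zone, "--insecure", "-f", "-"]


# The situation in this thread: 3 nodes, default zone asking for 5 replicas.
print(is_under_replicated(node_count=3, num_replicas=5))
# After lowering the target to match the cluster size.
print(is_under_replicated(node_count=3, num_replicas=3))
```

Passing the output of `zone_yaml(3)` on stdin to the command from `zone_set_command()` (for example via `subprocess.run(..., input=..., text=True)`) would lower the default zone's target to match a three-node cluster. Adding two more nodes would resolve the mismatch as well.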