Understanding under-replicated ranges

I’m doing some basic evaluation of CockroachDB and trying to understand some of the replication behavior. I apologize if this is covered in the docs or other posts but I did not find an answer there.

I initially had a cluster with 5 nodes, n1 - n5, with the default replication factor of 3. I started a sysbench “insert” workload, just to have a small write load running. After a few minutes, I gracefully shut down nodes n4 and n5. Cockroach quickly marked 35 ranges as under-replicated (which makes sense, as two nodes were now “suspect”). Five minutes later, there were only 33 under-replicated ranges, and it never got lower than that.

I know the documentation says

The number of failures that can be tolerated is equal to (Replication factor - 1)/2. Thus CockroachDB requires (n-1)/2 nodes to achieve quorum. For example, with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on.

Since the replication factor was only 3, it makes sense that we can’t necessarily tolerate two failures. But before I realized this, I dug into one of the under-replicated ranges and was surprised by several things. I think it’ll be easiest to just explain what I saw and then ask the questions. First, here’s what I did:
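For what it’s worth, here’s how I understand the arithmetic in that quote (my own sketch, not from the docs):

```python
# Quorum math for Raft-style replication, as I understand it.
# With n replicas, quorum is a strict majority: floor(n/2) + 1.
# Failures tolerated is therefore n - quorum = floor((n-1)/2).

def quorum(n: int) -> int:
    return n // 2 + 1

def failures_tolerated(n: int) -> int:
    return (n - 1) // 2

for n in (3, 5):
    print(f"replication factor {n}: quorum={quorum(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
# replication factor 3: quorum=2, tolerates 1 failure(s)
# replication factor 5: quorum=3, tolerates 2 failure(s)
```

So with replication factor 3 spread across 5 nodes, any range that happened to have 2 of its 3 replicas on n4/n5 would lose quorum when I shut both down.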

Initial condition: have set up CockroachDB cluster with 5 nodes, n1 - n5.
17:27:15Z: begin sysbench "insert" workload
17:35:15Z: shut down nodes n4, n5 (gracefully)
17:36   Z: cockroach has marked 35 ranges as under-replicated
           (makes sense: two nodes are now "suspect")
17:41   Z: now only 33 under-replicated ranges
           (makes sense that it went down: it declared the two suspect
           nodes dead and moved some ranges)

Like I said, I was initially surprised that so many ranges were stuck under-replicated, so I dug into an arbitrary one. Here are screenshots from the “Advanced Debug” page for that range: parts 1, 2, 3, 4. There’s also a screenshot of the node status to show when these nodes went offline.

As I interpret the log for this range, it looks like this:

15:43 (long before all this): looks like the range is on n1, n2, and n3
17:51:26Z: begin adding n4 because of rebalance
17:51:26Z: begin removing n3 because of rebalance (that seems weird)
17:51:26Z: more entries that seem to be related to adding n4 (VOTER_INCOMING vs. LEARNER)
17:51:26Z: removed n3 ("abandoned learner replica")
18:01:24Z: begin adding n5 because range under-replicated
18:01:24Z: finish adding n5
18:01:24Z: begin adding n3 as a replica because range under-replicated
18:01:24Z: finish adding replica n3

The troubleshooting docs referred me to the simulated allocator log, which you can see in screenshot 3 above. It says:

error simulating allocator on replica [n3,s3,r3/6:/System/{NodeLive…-tsd}]: 0 of 3 live stores are able to take a new replica for the range (3 already have a replica); likely not enough nodes in cluster

So here’s what I don’t understand about all that:

  1. Why is this range considered “under-replicated” at all? As far as I can tell from the report, it has three replicas, one on each of the remaining available nodes. Relatedly, it seems contradictory for there to be no “live stores able to take a new replica” precisely because all of them already have a replica, given that the number of live stores equals the replication factor.
  2. n4 and n5 were “suspect” by 17:36Z and “dead” by 17:41Z. Why did CockroachDB decide at 17:51Z to rebalance ranges from n3 onto these dead nodes? Does it not take into account that a node is dead before rebalancing?
  3. How is it possible that the replication apparently succeeded for n5 when that node was offline?
  4. Why is it that the latest range descriptor in the log has all five nodes in it, but we only see three columns in the range report? Are there really five replicas, and we don’t see those columns because the other two nodes are down?
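Regarding the allocator message, here’s how I read its arithmetic (a sketch under my own assumptions, not CockroachDB’s actual code): a live store is only a valid target for a new replica if it doesn’t already hold one.

```python
# My guess (an assumption, not CockroachDB's implementation) at the
# "0 of 3 live stores are able to take a new replica" computation:
# filter the live stores down to those not already holding a replica.

def candidate_stores(live_stores, replica_stores):
    # A store can take a new replica only if it has none for this range.
    return [s for s in live_stores if s not in set(replica_stores)]

live = ["s1", "s2", "s3"]      # s4/s5 are on dead nodes, so not live
replicas = ["s1", "s2", "s3"]  # live replicas of the range in question
print(f"{len(candidate_stores(live, replicas))} of {len(live)} live stores "
      "can take a new replica")
# 0 of 3 live stores can take a new replica
```

Under that reading the message isn’t contradictory, just unhelpful: every live store is disqualified because it already holds a replica.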

My guess about #1 is based on this PR:

Note that this change interestingly means that a range can be considered both under-replicated and over-replicated at the same time - if there’s too many replicas, but sufficiently many of them are dead.

In other words, maybe this is under-replicated not because there aren’t 3 (the replication factor), but because there are five, but two of them are on dead nodes? If that’s true, is there operationally a way to distinguish between replicas that are under-replicated because they’re under the replication factor vs. under-replicated because there are some dead replicas? Relatedly, is there a way to know operationally how many under-replicated ranges are not making forward progress (e.g., because they require another node to be up)?
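To make my reading of that PR concrete, here’s a hypothetical classification (my own sketch, not CockroachDB’s actual logic): compare live replicas against the target to decide “under-replicated”, and total replicas against the target to decide “over-replicated”.

```python
# Hypothetical classification of a range's replication state, based on my
# reading of the PR quoted above. Not CockroachDB's actual code.

def classify(target: int, replica_nodes: list[str], dead_nodes: set[str]) -> set[str]:
    live = [n for n in replica_nodes if n not in dead_nodes]
    states = set()
    if len(live) < target:
        states.add("under-replicated")
    if len(replica_nodes) > target:
        states.add("over-replicated")
    return states or {"ok"}

# Simple case: 3 replicas, target 3, two on dead nodes:
print(sorted(classify(3, ["n1", "n2", "n3"], {"n2", "n3"})))
# ['under-replicated']

# The PR's interesting case: 5 replicas with a target of 3, three of them dead,
# which would count as both states at once:
print(sorted(classify(3, ["n1", "n2", "n3", "n4", "n5"], {"n3", "n4", "n5"})))
# ['over-replicated', 'under-replicated']
```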

I know this was a long post, and I’m sure I got a bunch of details wrong. I’d appreciate any help understanding this. Thanks in advance!

Sorry to reply to my own post. Taking a closer look at this, I found that the timestamps in the Range Log are all from the day before I did the testing that I described. That answers most of my questions. I also asked:

Why is it that the latest range descriptor in the log has all five nodes in it, but we only see three columns in the range report? Are there really five replicas, and we don’t see those columns because the other two nodes are down?

I don’t have more information about this, but I suspect my guess was right: the report just doesn’t show columns for nodes that are down. When I brought up n4 and n5 again, the columns showed up. When I temporarily brought n4 down again, its column disappeared; when I brought it back, the column came back.

In other words, maybe this is under-replicated not because there aren’t 3 (the replication factor), but because there are five, but two of them are on dead nodes? If that’s true, is there operationally a way to distinguish between replicas that are under-replicated because they’re under the replication factor vs. under-replicated because there are some dead replicas?

The premise for this question is no longer valid – the replication factor was 5 because this was a system range.

This does raise a few more questions for me:

  1. Does it make sense that there was nothing in the Range Log after we lost two nodes? Is that because there was nothing CockroachDB could do about it?
  2. What’s the best way to determine what zone a particular range is part of (so that I can determine the replication factor)? It doesn’t seem to be on that range report page.
  3. What’s the best way to determine what data is stored on a particular range? I see that if I already know what database/table it’s for (e.g., from show ranges for ...), then I can look at show zone configurations and figure this out, but is there a more direct way? The main use case would be to understand what’s impacted when a range is under-replicated or unavailable.

Thanks!