Under-replicated ranges after decommission

I’ve read through pretty much all the other “under-replicated” topics here and don’t see a resolution to the following issue, which I’ve seen on two clusters so far.

Cockroach v20.2.3

In both clusters I’m seeing ranges that report as existing twice on the same node, where one replica is a follower and the other is the leader and reports itself as under-replicated. Even after manually enqueueing the range and performing a rolling restart, the situation doesn’t change. See screenshots of an example range below.

In both cases it happened after a rolling restart, though the circumstances of that restart were different. One involved adding locality flags to the cluster; the other involved decommissioning a dead node and replacing it.

Any input on what to investigate in this kind of situation would be great.

After 5 days these ranges are still showing as under-replicated.

Hi, I never did hear back about this one, and I now have yet another cluster that is having problems decommissioning because of ranges that live on two stores of the same node (which I thought was impossible). In this new case I haven’t actually stopped any nodes yet; they are live and in a decommissioning state. But if I stop the node, the cluster shows under-replicated ranges and never actually re-replicates them, even though there are two replicas out there and one of them holds the lease.

When I try to force it to replicate through enqueueRange I get this:

Error: removing n1,s2 which is not in r571:/Table/54/1/<key1>-<key2} [(n1,s5):1, (n18,s51):2, (n7,s20):3, (n1,s2):4LEARNER, next=5, gen=130, sticky=1612382549.007632123,0]
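To make the confusing state concrete, here is a small illustrative model (plain Python with hypothetical structures, not CockroachDB’s actual internal types) of the range descriptor printed in that error: node n1 appears twice, once as voter (n1,s5):1 and once as learner (n1,s2):4, so this one range really does have two replicas on two stores of the same node.

```python
# Illustrative sketch only -- these structures are hypothetical and are not
# CockroachDB's internal types. They model the descriptor from the error
# above: r571 with voters on (n1,s5), (n18,s51), (n7,s20) and a leftover
# learner on (n1,s2).
from collections import namedtuple

Replica = namedtuple("Replica", ["node", "store", "replica_id", "type"])

r571 = [
    Replica("n1", "s5", 1, "VOTER"),
    Replica("n18", "s51", 2, "VOTER"),
    Replica("n7", "s20", 3, "VOTER"),
    Replica("n1", "s2", 4, "LEARNER"),  # same node as replica 1, different store
]

def replicas_on_node(desc, node):
    """Return all replicas of a range that live on the given node."""
    return [r for r in desc if r.node == node]

# Node n1 holds two replicas of the same range, on two different stores:
print([(r.store, r.type) for r in replicas_on_node(r571, "n1")])
```

The point of the sketch is just that the descriptor itself shows the state I thought was impossible: one range with replicas on two stores of a single node, which seems to be what trips up both decommissioning and the learner cleanup.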

2021-02-11 20:34:57 kv/kvserver/store.go:2761 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] running replicate.shouldQueue
2021-02-11 20:34:57 kv/kvserver/store.go:2763 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] shouldQueue=true, priority=12001.000000
2021-02-11 20:34:57 kv/kvserver/store.go:2769 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] running replicate.process
2021-02-11 20:34:57 kv/kvserver/replicate_queue.go:343 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] next replica action: remove learner
2021-02-11 20:34:57 kv/kvserver/replicate_queue.go:799 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] removing learner replica (n1,s2):4LEARNER from store
2021-02-11 20:34:57 kv/kvserver/store.go:2771 [n7,s20,r571/3:/Table/54/1/"was{eda.…-hing…}] processed: false

I also see “W210211 21:10:14.530435 1639 kv/kvserver/store_rebalancer.go:223 ⋮ [n5,s11,store-rebalancer] StorePool missing descriptor for local store” in the logs, if that matters.

Hi @dkinder,

Thanks for the report, and apologies for the delayed response. Let’s track this over in the GitHub issue “kvserver: rebalancing between stores on the same node fails” (cockroachdb/cockroach #60545). We were able to reproduce the issue in a unit test and have a fix that should resolve that state.