What to do about invalid leases on the problem ranges report?

I have just installed CockroachDB using the Helm chart and added about 27k rows to it, then I removed the Helm release (helm del --purge name) and started it again with helm install .... Now I see “Invalid leases” on the Problem Ranges report, but I don’t know what to do about it.
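
For reference, roughly the sequence I ran, in Helm 2 syntax (the release name, chart reference, and values file here are placeholders rather than my exact invocation):

# delete the old release and its resources
helm del --purge my-release
# reinstall from the same chart with the same custom values
helm install --name my-release stable/cockroachdb -f my-values.yaml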

The cluster has the default 3 nodes and all the default settings in the helm chart other than the storage class (hostpath) and storage size (10Gi).

As far as I can tell, all of the data is reachable. At least, I can run SELECT SUM(balance) FROM accounts and get what I believe is the correct value, with no errors (I’ll be able to verify this once the lease errors are gone). I can also insert data successfully.
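
For example, this is how I’ve been running that check (the pod name is whatever my release produced, and the chart defaults to an insecure cluster, hence --insecure; the binary path is my assumption about the image layout):

kubectl exec -it my-release-cockroachdb-0 -- /cockroach/cockroach sql --insecure -e "SELECT SUM(balance) FROM accounts;"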

Is this a normal situation that I can ignore? Can I assume that it will repair itself? If not, what steps do I need to follow to a) identify the exact nature of the problem and b) resolve the problem?

FWIW, I saw this in the docs:

https://www.cockroachlabs.com/docs/stable/architecture/replication-layer.html#leaseholder-rebalancing

but it’s been well over 10 minutes (33 minutes, in fact) and the leases are still expired. Should the leaseholders have worked this out amongst themselves by now?

I also saw this thread, but it looks like it died without resolution: Problem Ranges Report -- What can be done about it?

On the Replication Dashboard I see that I have 21 ranges available, 21 leases, 13 lease holders, 0 leaders-without-leases/unavailable/under-replicated/over-replicated. There are indeed 8 ranges with “Invalid leases”.

My wild, mostly uneducated guess is that this happened because the cluster nodes started in a different order the second time around. The events log shows that the initial “Node Joined” order was 1 2 3, while the subsequent “Node Rejoined” order was 1 3 2. Would that have caused some of the ranges to get new leaders? That is, when 1 and 3 were present but 2 was not yet there, would the ranges whose leader was node 2 have been moved to 1 or 3? And if so, would that have caused these invalid leases to exist and persist?

For what it is worth, only nodes 1 and 3 have problem ranges; node 2 (the last to start) has none. However, according to /_status/nodes, node 2 is not the leader of any ranges, so that makes some sense.
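
In case it matters, this is roughly how I looked at that (8080 is the default admin UI/HTTP port; the pod name is a placeholder):

kubectl port-forward my-release-cockroachdb-0 8080:8080
# then, in another terminal, inspect the per-node status JSON
curl -s http://localhost:8080/_status/nodes | less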

Build Tag:    v19.1.2
Build Time:   2019/06/07 17:32:15
Distribution: CCL
Platform:     linux amd64 (x86_64-unknown-linux-gnu)
Go Version:   go1.11.6
C Compiler:   gcc 6.3.0
Build SHA-1:  cbd571c7bf2ffad3514334a410fa7a728b1f5bf0
Build Type:   release

Here are screenshots from the Range Reports for R2 (valid lease) and R15 (invalid lease): https://imgur.com/a/lT1PoMN . The values you see here are consistent with the rest of the ranges – all of the “invalid leases” are associated with “expired”.

Hey @dpk,

Could you grep your logs for the following?

grep -R 'This range is likely unavailable.' */logs/*

Because the nodes restarted, I think you may be running into a known issue that should be fixed in our next release.

Thanks,

Ron

Hi @ronarev,

I’ve checked the logs (via kubectl logs, with grep -r while exec’d into the pods, and through the admin UI) and don’t see that message.
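
Specifically, these are the sorts of commands I ran (the pod name is a placeholder, and the log directory is my guess based on where the data volume is mounted in the pod):

# stdout/stderr captured by Kubernetes
kubectl logs my-release-cockroachdb-0 | grep 'This range is likely unavailable.'
# log files written inside the pod
kubectl exec my-release-cockroachdb-0 -- grep -r 'This range is likely unavailable.' /cockroach/cockroach-data/logs/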

FWIW, I have this running on my laptop. The computer was asleep for about 7 hours last night, and when it came back online I noticed that 3 of the leases are no longer invalid. I checked one of the ranges that was invalid before and it is indeed valid now (type epoch, lease epoch 2, lease state valid). I didn’t restart the nodes, but they probably didn’t experience continuous time while the lid was shut; the runtime stats logs suggest they were awake roughly every 90 minutes. Probably not relevant.

I just ran some more full-table selects and I didn’t see any errors. Should I have, if I’m hitting the same known issue? Also, could this issue affect data availability and/or accuracy?

I was just poking around some more, selecting from system tables, and found that another range has disappeared from the problem ranges list. Is it possible my queries triggered some sort of lease repair? If so, I guess this isn’t the sort of scenario you’d expect to last very long under production load.
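
For reference, this is the sort of thing I was poking at, e.g. the crdb_internal virtual tables (the column names are what I see in my version, so treat this as a sketch):

kubectl exec -it my-release-cockroachdb-0 -- /cockroach/cockroach sql --insecure -e "SELECT range_id, lease_holder, replicas FROM crdb_internal.ranges LIMIT 20;"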

Hey @dpk,

kubectl logs is not the same thing as the logs from the actual CockroachDB cluster. To get all of the logs from the cluster itself, you need to run the debug zip command. Once you have those logs, you can grep for any errors that say a range is unavailable; they should still be present in the older log files even if all of your invalid leases have since cleared up.
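
For example, something like this should do it (the pod name is just an example; if your cluster is secure, swap --insecure for your certificate flags):

# generate the zip inside the pod, then copy it out
kubectl exec -it my-release-cockroachdb-0 -- /cockroach/cockroach debug zip /tmp/debug.zip --insecure
kubectl cp my-release-cockroachdb-0:/tmp/debug.zip ./debug.zip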

Thanks,

Ron

@ronarev, thanks for that command. I’ve generated the debug.zip file for each of the nodes, copied them out of the pods, and grepped through the results, but I don’t see any “likely unavailable” lines. There are crdb_internal.kv_store_status.txt files that seem to indicate no ranges are currently unavailable ("ranges.unavailable": "0"), while there are still 3 ranges with invalid leases.
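
(For anyone following along, this is roughly what I ran against the extracted zips; the file/directory names are just how I unpacked them:)

# unpack one zip per node, then grep the extracted contents
unzip -o debug-node1.zip -d debug-node1
grep -R 'This range is likely unavailable.' debug-node1/
grep -R '"ranges.unavailable"' debug-node1/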

The more I read these logs and the docs, the more I think these last few ranges will just remain as-is until there’s some activity on them. I’m not sure whether there’s any data currently living in those ranges or, if there is, where it is.

Do you have the link to that issue you were referring to earlier?

Hey @dpk,

Glad that helped with getting the logs! Also, here is the Github issue I was referring to. But if you don’t see any “likely unavailable” errors in your logs, then you are most likely not hitting this issue.

Thanks,

Ron

Hey @dpk

Just wanted to share a quick update. Seeing “invalid lease” on the problem ranges page is relatively normal and shouldn’t be cause for concern. It usually happens after a node restarts and will eventually clear on its own, but it can take up to 24 hours on an idle cluster.

We have filed an issue on that here.

Let me know if you have any other questions.

Thanks,

Ron

Thanks @ronarev, I appreciate the follow-up.