Amigos, I’d like to consult on something that’s keeping me up: how to improve the
RangeDescriptorCache such that its entries are not clobbered by stale descriptors.
What’s the RangeDescriptorCache?
Each node maintains a cache of descriptors, used by the
DistSender for routing command batches. The cache contains non-overlapping descriptors in a tree with a LRU eviction policy. Whenever a key cannot be resolved with an entry in the cache, a meta2 range is scanned and the result is inserted in the cache. Whenever an entry from the cache is found to be stale (by a request to the respective range returning an “invalid range” error), the entry is evicted from the cache.
Reading of the meta2 range is done thorugh an inconsistent scan, grabbing potentially stale committed descriptors, potentially not-yet-committed or even aborted intents, and incoherent combinations thereof. These scans generally also try to prefetch descriptors covering ranges beyond the requested one.
This inconsistent scan is done (as far as I know) because:
- so that, during a split, instead of waiting for the split transaction to complete, results can be returned quickly to the caller. In theory, both the committed descriptor and the split intents will be returned; the committed one will be tried first and, if it doesn’t work, it is auto-magically replaced in the cache by the intent through a
- besides being an optimization, the inconsistent scan is actually necessary for the intent resolution process to work. When someone tries to cleanup a split intent, it needs to go to the txn record, which lives on a post-split range. So the intent descriptor version is necessary for resolving that.
Need for improvement
In #10751 I’ve observed a number of situations in which correct cached descriptors are clobbered by old, stale copies. This all stems from the inconsistent scan returning a bag of incoherent descriptors and the problem is exacerbated by a) the fact that our cache only contains non-overlapping descriptors, and so when we insert descriptors in this cache, we remove all overlapping ones and b) we try to prefetch ranges and so each time the cache is updated with large swaths of the key space, therefore potentially clobbering a lot. More details in the bug.
I think this is likely a real problem in production, although it corrects itself, and is a major pain in the ass for DistSQL tests, which want to set up a specific range topology. Once this topology is setup, various random async events clobber the cache at unpredictable times, creating chaos.
The instinct for fixing #10751 says “let’s just insert coherent sets of descriptors in the cache and not let old descriptors clobber newer ones”. The problem, though, seems to be a fundamental one - the inconsistent scan can generally return an arbitrary set of incoherent descriptors. Neither the committed descriptors, nor the intents, in isolation, are guaranteed to be coherent (easy to see when you think of intents being resolved in an arbitrary order). More, there can be arbitrarily many coherent sets between the committed and the intents (you can have multiple generations of intents). To top it off, “the truth” might not be returned by the scan at all, since (if we allow merges) the latest descriptor can be arbitrarily far from the beginning of the scan, and so a scan bound by the number of committed records it returns might not get to it.
So I don’t quite know what to do about all this. The only guarantee seems to be that, with a sufficiently large scan, the union of the returned committed+intents covers the scanned range. Another simple observation is that any intent or committed version overlapped by a newer committed one can be discarded.
It seems tempting to try to phrase the problem finding a coherent set of descriptors as a graph path or covering problem, but I got nowhere (particularly when keeping in mind the restriction that an intent representing a txn that has been committed has to be somehow included in the results to keep the intent resolution working).
One way forward is a maximal approach - turn the range descriptor cache into an interval tree allowing (any number of) overlapping descriptors, and incorporate all the (possibly-incoherent) information we have in it. Whenever a descriptor proves invalid, remove it. Whenever a descriptor proves valid**, remove overlapping ones. Whenever a new committed descriptor arrives, remove all older overlapping ones. Talking to Radu this seemed the more sane approach.
The problem with this is that it makes iterating through the cache weird, and clients (
DistSender) need to be prepared to handle overlapping descriptors when splitting a key span.
Another way might be keeping the cache non-overlapping, doing whatever heuristics to determine what to put in it, and having a different mechanism for resolving intents sitting in meta-ranges - we wouldn’t use the cache but instead do something different for the point lookups required to find the transaction record.
** not completely clear what mechanism we could use for declaring a descriptor valid, since “the routing just happened to work” for a request is not the same as the descriptor being true.