RangeDescriptorCache and uncertainty

Amigos, I’d like to consult on something that’s keeping me up: how to improve the RangeDescriptorCache such that its entries are not clobbered by stale descriptors.

What’s the RangeDescriptorCache?

Each node maintains a cache of descriptors, used by the DistSender for routing command batches. The cache contains non-overlapping descriptors in a tree with a LRU eviction policy. Whenever a key cannot be resolved with an entry in the cache, a meta2 range is scanned and the result is inserted in the cache. Whenever an entry from the cache is found to be stale (by a request to the respective range returning an “invalid range” error), the entry is evicted from the cache.
Reading of the meta2 range is done thorugh an inconsistent scan, grabbing potentially stale committed descriptors, potentially not-yet-committed or even aborted intents, and incoherent combinations thereof. These scans generally also try to prefetch descriptors covering ranges beyond the requested one.
This inconsistent scan is done (as far as I know) because:

  1. so that, during a split, instead of waiting for the split transaction to complete, results can be returned quickly to the caller. In theory, both the committed descriptor and the split intents will be returned; the committed one will be tried first and, if it doesn’t work, it is auto-magically replaced in the cache by the intent through a EvictionToken.
  2. besides being an optimization, the inconsistent scan is actually necessary for the intent resolution process to work. When someone tries to cleanup a split intent, it needs to go to the txn record, which lives on a post-split range. So the intent descriptor version is necessary for resolving that.

Need for improvement

In #10751 I’ve observed a number of situations in which correct cached descriptors are clobbered by old, stale copies. This all stems from the inconsistent scan returning a bag of incoherent descriptors and the problem is exacerbated by a) the fact that our cache only contains non-overlapping descriptors, and so when we insert descriptors in this cache, we remove all overlapping ones and b) we try to prefetch ranges and so each time the cache is updated with large swaths of the key space, therefore potentially clobbering a lot. More details in the bug.
I think this is likely a real problem in production, although it corrects itself, and is a major pain in the ass for DistSQL tests, which want to set up a specific range topology. Once this topology is setup, various random async events clobber the cache at unpredictable times, creating chaos.

The problem

The instinct for fixing #10751 says “let’s just insert coherent sets of descriptors in the cache and not let old descriptors clobber newer ones”. The problem, though, seems to be a fundamental one - the inconsistent scan can generally return an arbitrary set of incoherent descriptors. Neither the committed descriptors, nor the intents, in isolation, are guaranteed to be coherent (easy to see when you think of intents being resolved in an arbitrary order). More, there can be arbitrarily many coherent sets between the committed and the intents (you can have multiple generations of intents). To top it off, “the truth” might not be returned by the scan at all, since (if we allow merges) the latest descriptor can be arbitrarily far from the beginning of the scan, and so a scan bound by the number of committed records it returns might not get to it.

So I don’t quite know what to do about all this. The only guarantee seems to be that, with a sufficiently large scan, the union of the returned committed+intents covers the scanned range. Another simple observation is that any intent or committed version overlapped by a newer committed one can be discarded.
It seems tempting to try to phrase the problem finding a coherent set of descriptors as a graph path or covering problem, but I got nowhere (particularly when keeping in mind the restriction that an intent representing a txn that has been committed has to be somehow included in the results to keep the intent resolution working).

One way forward is a maximal approach - turn the range descriptor cache into an interval tree allowing (any number of) overlapping descriptors, and incorporate all the (possibly-incoherent) information we have in it. Whenever a descriptor proves invalid, remove it. Whenever a descriptor proves valid**, remove overlapping ones. Whenever a new committed descriptor arrives, remove all older overlapping ones. Talking to Radu this seemed the more sane approach.
The problem with this is that it makes iterating through the cache weird, and clients (DistSender) need to be prepared to handle overlapping descriptors when splitting a key span.

Another way might be keeping the cache non-overlapping, doing whatever heuristics to determine what to put in it, and having a different mechanism for resolving intents sitting in meta-ranges - we wouldn’t use the cache but instead do something different for the point lookups required to find the transaction record.

Does anybody have any thoughts? cc @radu @bdarnell @nathan

** not completely clear what mechanism we could use for declaring a descriptor valid, since “the routing just happened to work” for a request is not the same as the descriptor being true.

Until fairly recently (#9416), we would either prefetch additional ranges or consider intents, but not both, so there was less incoherent data entering the range cache. We got rid of that because in practice we were nearly always in “consider intents” mode (we would always use that mode after a cache entry had been evicted, which was the only time we ever did range lookups after startup), but maybe it would be worth bringing it back if we could more precisely distinguish cases when an intent would be likely to help.

As another historical note, we were originally very bad at resolving intents on meta ranges, so reading intents may have been more necessary then. The motivation for using inconsistent reads was entirely #2 (it was necessary to find transaction records after splits and replica changes); any performance benefit was secondary and I don’t think we discussed it at the time.

I’d prefer to move towards limiting our use of range descriptors from intents, instead of making the range cache more complicated to store them.

I thought the transaction record for the split operation was anchored to the left-hand side of the split for the exact reason that you could use either the pre-split or post-split range descriptor for the left-hand side in order to locate the split intent.

I agree with Ben that if possible, limiting our use of range descriptors from intents would be ideal. If we could handle the cases where inconsistent reads are required for correctness outside of the RangeDescriptorCache, then we could simply deal with coherent descriptors returned from consistent scans within the cache. A stale descriptor would result in evicting that descriptor and performing a new lookup to replace it with the correct one, without any of the current trial-and-error. It would also mean that prefetching would become more useful again. This would massively simplify the logic internal to the RangeDescriptorCache.

This seems to fit with your second option of resolving intents in meta-ranges using a different approach. Could you expand more on how you see this working? Would this process now need to perform its own series of inconsistent RangeLookups for intent resolution after every split?

Yeah, the transaction record for a split now lives on the LHS so I don’t think the inconsistent lookup is needed after a split. (But maybe there are issues if we are pushing multiple transactions at once and they appear on both the LHS and RHS of the split. I’m not sure whether that would sort itself out currently or not)

It’s still needed for replica changes, though (if a newly-added node becomes leader, you need to be able to find it). There may be alternatives for replica changes, though: we could perhaps return a range descriptor in NotLeaderError just like we do for RangeKeyMismatchError, or just trust the replica information int he NotLeaderError even if it’s from a replica not present in our version of the descriptor.

Yes, I certainly think that we should trust the information in NotLeaderError above the descriptor version we have, and it’s bad that we don’t do that now.
I was thinking more along the lines of using information from the intent to route the PushTxnRequest, instead of using the range cache desc for routing. But one thing that wasn’t apparent to me before is that you can’t have multiple overlapping descriptor intents in the meta ranges, since we don’t allow multiple intents on one key. So a 2nd descriptor change must resolve the previous change, meaning that descriptors in intents can’t be too off (say, in the list of replicas) from reality. So maybe relying more on information from NotLeaseHolderError is a good idea.

I’d like to understand better how this fact that the txn record lives on the LHS of a split is important. How does that make it ok to use the pre-split desc to find it? Or, how does it make it more ok than if it were living on the RHS? And also isn’t the problem symmetric with resolving the RHS intent on the meta range?
I can imagine how having the txn record living on the LHS might help with the fact that the new Raft group for the RHS is in a weird state at the time when we’re likely attempting to resolve that intent, though. But I don’t quite see the connection to routing.

So a 2nd descriptor change must resolve the previous change, meaning that descriptors in intents can’t be too off (say, in the list of replicas) from reality.

Yes, an inconsistent read of the range descriptor (without returning intents) will never be more than one change behind the true value.

I can imagine how having the txn record living on the LHS might help with the fact that the new Raft group for the RHS is in a weird state at the time when we’re likely attempting to resolve that intent, though. But I don’t quite see the connection to routing.

The connection to routing is that either the pre- or post-split range descriptor points to the range ID of the LHS range, so both descriptors will work (and there is no requirement to look at the intent value)

I think this sentence is a bit vague, as it doesn’t suggest the following possibility:

A                B     B1       C 

Meta range:
B: AB-post-split?
C: pre-split, BC-post-split?

This diagram shows the range [A-C) that was recently split at B, and the state of the meta range when the split txn commits (intents have a ?).
After the split at B, the pre-split descriptor present at meta© can be arbitrarily stale wrt range [A-B), right? I.e. the meta intent at B could have been resolved and the range [A-B) could have moved around 100 times and also split again and again, all while the committed desc at meta© is still the pre-split one.
So if someone does a RangeLookup for key B1 and they get the pre-split desc from C, that information can be totally useless.

I think the invariants that we have are:

  1. Intents in the meta ranges are not overlapping other intents.
  2. An intent corresponding to a committed transaction is not stale.
  3. A committed desc present at a meta key without any intents is not stale.
  4. A committed desc present at a meta key that also has an intent corresponding to an aborted or pending transaction is not stale.
  5. An intent that doesn’t overlap with any committed descriptors has to correspond to a committed value.

So I think this means that if you scan from a key k forward, either the first intent you encounter or the first committed value you encounter will be the true descriptor for k (assuming they both cover it).

This suggests that the following recursive algorithm for pushing a txn record is correct:
To resolve the range for a key, we do an inconsistent scan and resolve any intents that we find. Resolving the intents is done by recursively finding the range with the txn record and sending a PustTransactionRequest to that range. It doesn’t recurse more than one level because we arrive at an intent that covers its txn record - a case that we handle specially by trying to send the push to both the range described by the intent and the following committed record. One of them has to be the truth.

Something like:

resolve_key(k) {
  committed, intents = inconsistent_scan(meta(k))
  if committed[0].end_key < intents[0].end_key {
    // Easy case: committed with no intents on it.
    assert committed[0].covers(k)
    return committed[0]

  intent, committed_desc = nil
  if committed[0].covers(k) committed_desc = committed[0])
  if intents[0].covers(k) intent  = intents[0])

  if intent != nil {
    return resolve_meta_write_intent(intent, committed_desc)
  return committed

// Resolve an intent. committed is a descriptor whose range contains
// intent's range.
// Returns the true descriptor covering the intent's key.
resolve_meta_write_intent(intent, committed) {
  txn_rec_key = intent.txn.Key
  if !intent.covers(txn_rec_key) {
    // recursively resolve the transaction key. The recursion will stop because
    // the txn key will be covered by the first intent or descriptor after it 
    // in the meta range.
    txn_rec_range = resolve_key(txn_rec_key)
    status, err = push_meta_txn_with_desc(txn_rec_range, txn_rec_key)
  } else {
    // The intent descriptor covers the txn record.
    status, err = push_meta_txn_with_desc(intent, txn_rec_key)
    if err == RangeKeyMismatch {
      assert(committed != nil && committed.covers(txn_rec_key))
      status, err = push_meta_txn_with_desc(committed, txn_rec_key)
      if err != nil {
        return err
  if status == COMMITTED || status == ABORTED {
    // resolve the intent so we don't run into it in the future
  if status == COMMITTED {
    return intent
  return committed

push_meta_txn_with_desc(desc, txn_key) {
  pushReq = roachpb.PushTxnRequest{
    Span: roachpb.Span{
      Key: txn_key,
    PushType: PUSH_TOUCH,
  status, err = dist_sender.send_directly_to_range(desc, pushReq)
  return status, err

Does this seem sane? It’s kind of similar in spirit with what we aim to achieve now with the cache, holding committed and uncommitted versions of a descriptor. It seems like everybody wants to move away from the intents in the cache business, so this solution aims to transform the LookupRangeRequest into a consistent read, without actually using a consistent read.

Right, the one-intent-at-a-time rule for single descriptors doesn’t provide any real guarantees in the presence of splits and merges.

I don’t think that’s true. Consider a range [A, D). A split is attempted at B, but the transaction aborts, leaving intents at meta(B) and meta(D). The intent at meta(D) could be resolved without resolving the one at meta(B), and then the range could be split again at C (successfully). Now we have intents at meta(B), meta©, and meta(D), and a committed value at meta(D). The intent at meta(B) is stale and we must continue to the intent at meta©. The transaction records for both splits are located at the range containing A, so we cannot resolve the intent at meta(B) without using the intent at meta©.

The pseudocode looks reasonable for handling splits (although I’m not clear on where exactly it would fit into the current code), but for ChangeReplicas, it would only use the intent and never the committed value. I think the simplest way to handle this is that if the intent and committed values have the same range ID, we use the union of their replicas in the ReplicaSlice. If the range IDs differ, then we try it with each range ID (either catching the RangeKeyMismatchError as you’ve done here or just trying both in parallel - PushTxn is idempotent and it’s fine if both end up working)