Ambiguity in replica.Send() errors and how they affect lease transfers

@bdarnell @tschottdorf @peter

Friends, I’d like to brainstorm with the class. I want to figure out if what the crux of 9523 was doing is still needed. I think we have the following problem, with today’s code, that can lead to the promise of a lease not being used after it’s transferred being broken.

  • A lease transfer is initiated. This eventually calls pendingLeaseRequest.InitOrJoinRequest(), which constructs a batch with TransferLeaseRequest in it and calls _, pErr := replica.Send(ctx, ba)
  • We then block on that replica.Send(). During this time, the lease is in a “being transferred” mode (replicaPendingLease.TransferInProgress() returns true) and thus is not used.
  • Once the replica.Send call returns, we consider this “transfer” stage done, regardless of whether we got an error or not. Whatever lease the replica has now, it’s usable. In the case of Send() returning success, the lease has not been replaced with the transferred one, so that’s cool. But what about the error case? The current behaviour assumes that an error implies that the transfer will not apply. But is that true, or is the error ambiguous? One case I’m thinking of is if the ctx that we use for Send() has a timeout - it expires, the client gets an error, but the wheels are already in motion and the command will apply. And even besides the context timeout, one would guess that there’s inherent ambiguity in other errors.

Assuming the ambiguity is real, the risk is that we’ll start using (again) a lease that is still in progress of being transferred, and thus breaking the promise the lease holder has made when initiating the transfer that it’s not gonna use the lease again.
And so the question is what to do about it? I guess the options are:

  1. Introduce yet another bit of state to keep the lease in a mode where we refuse to use it, besides the existing state: the “transfer in progress” maintained by pendingLeaseRequest and the bit being introduced in #10211 which protects from using an old lease after a restart.
  2. Keep the approach that was being attempted in #9523: instead of blocking on the replica.Send call and considering the transfer done when that returns, we block on a channel pinged by the trigger of a lease command. So a transfer is done whenever a new lease is applied (the one requested by the Send() call, or a completely different one).

I think I prefer 2) because I find it conceptually cleaner.

A related question that came up a while ago when looking at some code with @tamird is whether a call to replica.Send() is guaranteed to return a result at all in all cases. Can someone confirm that this is supposed to be the case?

This is referring to #9523, right?

I think timeouts are the only ambiguous case here - the replica will loop internally and retry until it gets a definitive response. Replica.Send is guaranteed to return a result if and only if there is a deadline; it may retry for a very long time otherwise (the main cause of indefinite retries would be a replica that has been removed; this will loop until the replica GC destroys it)

Right, I’m referring to 9523 (I’ve initially wrote this as a comment on that PR, then pasted it here and messed up the context switch).

So what do you make of keeping the mechanism introduced in that PR - make the end of a transfer be driven by new lease applications, not by results of replica.Send?

You can also get an error if your command is reproposed and there’s an error during reproposal (but the original proposal could still apply). This shouldn’t happen in practice (and you could say it’s also a kind of timeout error), but in result I would say that we should not try to make it so that an error from Replica.Send guarantees that the command isn’t ever going to apply; there are too many ways in which this property could go away if we even managed to establish it.

Regarding your options:

  1. I just wrote some comments in #10211; perhaps what you’re trying to do here fits in there. TL;DR is “just wait out any uncertainty before you use the lease”.
  2. This suffers from the problem that you don’t want to signal that channel on every lease but only specific ones (#10211) and so you have to delocalize the logic, which I think is good to avoid.

What about

  1. When a transfer returns an error, retry it until you get exactly the error you expect, namely a LeaseRejectedError. Not a great option because we’ll certainly end up busy-looping on some unexpected errors, for example when the Replica is GC’ed (even if we can bound the time we busy loop for, absolutely want to avoid it).
  2. Introduce a AmbiguousProposalError (or just use the one @spencer is introducing, assuming that the semantics are compatible). Looks like we don’t have too many locations in Replica.Send that would need to return it, and the semantics seem easy to uphold (i.e. we won’t break it accidentally). OTOH, that raises the question of what you need to do when you do get one of these errors, and then it’s back to options 1-3, so this is looks really more like an optimization (though for the far common case).

I did some reflecting and talked to Tobi. Seems one of the ideas he had in #10211 would work for unifying the two cases in which we want to not use leases that are otherwise valid: “don’t use lease that’s being transferred” and “don’t use lease that hasn’t been refreshed after a restart".
The idea is that a replica would keep track of a minimum necessarily-covered ts - meaning that the current lease can only be used for reads and proposing writes if it covers this timestamp. This mechanism would let us enforce the property that only new leases that have been proposed from some moment onwards can be used (once applied).
On startup, this timestamp would be initialized to hlc.Now() + LeaseDuration. This means(*) that leases proposed by the old process and applied by the current one fail to qualify, but leases proposed by the current one would.
On initiating a transfer, we again set this timestamp to hlc.Now() + LeaseDuration. This means that extensions that have been proposed before initiating this transfer fail to qualify, extensions proposed after do.

This proposal seems to not have much downside. There’s no extra code required in the lease application path (so goodbye #9523). The ambiguity of whether a transfer has failed remains from the point of view of the pendingLeaseRequest, but c’est la vie; redirectOnOrAcquireLease() no longer replies on pendingLeaseRequest.TransferInProgress() for correctness. So pendingLeaseRequest can continue to do what it does today - blocks on replica.Send() when proposing a new lease and, in case of (ambiguous) error, considers that there’s no longer a request in progress, allowing others to request lease extensions or retrying a transfer.

@spencer please see if this sounds copacetic with the epoch-based leases. I think the mechanism can be ported forward.

(*) keeping in mind that (since recently) we wait on startup to guarantee that hlc.Now() is not in the past of the previous process’ dying `hlc.Now()

I’ve talked to Spencer and Carl. Seems like we need a small tweak for the future world of epoch-based leases. Since range leases are not going to have an expiration time any more, we can’t use the mechanism proposed above for ensuring that leases proposed in the past are not going to be used (hlc.Now() + LeaseDuration will be covered by any lease).
But what we can do is directly implement the check we want: add a proposedTimestamp to the Lease proto and maintain a per-replica minProposedTime. On both transfer and restart, we set the replica(s) minProposedTime = hlc.Now().
This would just work with epoch-based leases, no changes necessary.