Friends, I’d like to brainstorm with the class. I want to figure out if what the crux of 9523 was doing is still needed. I think we have the following problem, with today’s code, that can lead to the promise of a lease not being used after it’s transferred being broken.
- A lease transfer is initiated. This eventually calls
pendingLeaseRequest.InitOrJoinRequest(), which constructs a batch with
TransferLeaseRequestin it and calls
_, pErr := replica.Send(ctx, ba)
- We then block on that
replica.Send(). During this time, the lease is in a “being transferred” mode (
true) and thus is not used.
- Once the
replica.Sendcall returns, we consider this “transfer” stage done, regardless of whether we got an error or not. Whatever
leasethe replica has now, it’s usable. In the case of
Send()returning success, the lease has not been replaced with the transferred one, so that’s cool. But what about the error case? The current behaviour assumes that an error implies that the transfer will not apply. But is that true, or is the error ambiguous? One case I’m thinking of is if the
ctxthat we use for
Send()has a timeout - it expires, the client gets an error, but the wheels are already in motion and the command will apply. And even besides the context timeout, one would guess that there’s inherent ambiguity in other errors.
Assuming the ambiguity is real, the risk is that we’ll start using (again) a lease that is still in progress of being transferred, and thus breaking the promise the lease holder has made when initiating the transfer that it’s not gonna use the lease again.
And so the question is what to do about it? I guess the options are:
- Introduce yet another bit of state to keep the lease in a mode where we refuse to use it, besides the existing state: the “transfer in progress” maintained by
pendingLeaseRequestand the bit being introduced in #10211 which protects from using an old lease after a restart.
- Keep the approach that was being attempted in #9523: instead of blocking on the
replica.Sendcall and considering the transfer done when that returns, we block on a channel pinged by the trigger of a lease command. So a transfer is done whenever a new lease is applied (the one requested by the
Send()call, or a completely different one).
I think I prefer 2) because I find it conceptually cleaner.
A related question that came up a while ago when looking at some code with @tamird is whether a call to
replica.Send() is guaranteed to return a result at all in all cases. Can someone confirm that this is supposed to be the case?