Howdy,
I’m wondering if anyone has any insight on the following question: why do we keep read commands in the command queue while they’re in flight and insert them in the timestamp cache only after they’re done, as opposed to inserting them in the timestamp cache at the beginning?
As I understand it, the point of the command queue is to synchronize reads with writes, ensuring that a read does not miss the effects of an in-flight write. This is accomplished by having the read block on writes present in the command queue (I’m not suggesting any changes to that). The other side of the coin - ensuring that a write does not write values “under” a read - is mostly handled by the timestamp cache. Right now, however, we only insert a read in the ts cache upon its successful completion. So, while the read is in flight, we keep it in the cmd queue and block writes on it.
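To make the current ordering concrete, here is a toy single-key model in Go. It is only a sketch of the behavior described above, not the real replica code: the `replica` type, its fields, and the `beginRead` / `finishRead` / `tryWrite` names are all made up for illustration, timestamps are plain integers, and "pushing" a write is simplified to `tsCache + 1`.

```go
package main

import "fmt"

// replica is a hypothetical, single-key stand-in for the two structures
// discussed here. None of these names come from the actual code base.
type replica struct {
	tsCache       int64          // highest timestamp at which a read has completed
	inFlightReads map[int64]bool // stands in for reads sitting in the command queue
}

func newReplica() *replica {
	return &replica{inFlightReads: map[int64]bool{}}
}

// beginRead models today's behavior: the read is only registered in the
// command queue when it starts executing.
func (r *replica) beginRead(ts int64) {
	r.inFlightReads[ts] = true
}

// finishRead removes the read from the command queue and, only if it
// succeeded, records its timestamp in the timestamp cache.
func (r *replica) finishRead(ts int64, succeeded bool) {
	delete(r.inFlightReads, ts)
	if succeeded && ts > r.tsCache {
		r.tsCache = ts
	}
}

// tryWrite reports whether the write has to wait. It waits on any in-flight
// read regardless of timestamps; otherwise it is pushed above the timestamp
// cache if it would write "under" a completed read.
func (r *replica) tryWrite(ts int64) (effectiveTS int64, blocked bool) {
	if len(r.inFlightReads) > 0 {
		return 0, true
	}
	if ts <= r.tsCache {
		ts = r.tsCache + 1
	}
	return ts, false
}

func main() {
	r := newReplica()
	r.beginRead(99)
	_, blocked := r.tryWrite(100)
	fmt.Println("write@100 blocked while read@99 is in flight:", blocked)

	r.finishRead(99, true)
	ts, blocked := r.tryWrite(100)
	fmt.Println("after the read completes:", ts, blocked)

	ts, _ = r.tryWrite(50)
	fmt.Println("write@50 is pushed to:", ts)
}
```

The point of the model is the first call to `tryWrite`: the write at ts 100 waits even though it could never invalidate a read at ts 99.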
I’d like to know if there’s a good reason for not putting the read in the ts cache before starting its execution, and not keeping it in the cmd queue at all. This would have several benefits:
- I would find it clearer
- Writes at ts 100 running concurrently with reads at ts 99 would not have to wait.
- Range lease transfers could simply look at the timestamp cache to figure out the highest timestamp at which a read has been served, or will be served, by the replica. This high-water mark could then act as the low-water mark of the transfer recipient’s ts cache.
- Spencer had some relatively simple ideas a while ago about how non-leaseholder replicas could serve some reads with minimal coordination with the leaseholder (basically by asking the leaseholder to do cmd queue validation and to return a raft index). The fact that reads update some replica structures at the beginning and others at the end makes something like this harder to implement.
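For comparison, here is the same kind of toy model with the proposed ordering, where a read goes into the timestamp cache before it executes and never enters the command queue. Again, this is a sketch under my own assumptions: `proposedReplica`, `startRead`, `write` and `highWater` are hypothetical names, and the real timestamp cache is keyed by span rather than being a single integer.

```go
package main

import "fmt"

// proposedReplica is a hypothetical single-key model of the proposal:
// reads are tracked only in the timestamp cache.
type proposedReplica struct {
	tsCache int64 // highest timestamp at which a read has been or will be served
}

// startRead records the read in the timestamp cache before it executes.
func (r *proposedReplica) startRead(ts int64) {
	if ts > r.tsCache {
		r.tsCache = ts
	}
}

// write never waits on reads. It returns the timestamp the write ends up at,
// pushed above the timestamp cache when it would otherwise land at or below
// a recorded read.
func (r *proposedReplica) write(ts int64) int64 {
	if ts <= r.tsCache {
		return r.tsCache + 1
	}
	return ts
}

// highWater is what a lease transfer could hand to the recipient as the
// low-water mark of its own timestamp cache.
func (r *proposedReplica) highWater() int64 {
	return r.tsCache
}

func main() {
	r := &proposedReplica{}
	r.startRead(99) // the read is still in flight from here on

	fmt.Println("write@100 proceeds at:", r.write(100))
	fmt.Println("write@50 is pushed to:", r.write(50))
	fmt.Println("low-water mark for a lease transfer:", r.highWater())
}
```

In this model the write at ts 100 goes through immediately, and the lease transfer only needs to consult one structure.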
One downside that I see of the proposed change is that we’d be adding a read to the timestamp cache even if it fails. Also, in the case of retried transactions, we’d be adding it at the initial timestamp instead of the higher final timestamp (and we like timestamps in the cache to be as high as possible). These don’t seem like a big deal to me…
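The failed-read downside can be illustrated with a small sketch. The two helpers below, `tsCacheAfterRead` and `pushedWriteTS`, are hypothetical and use the same simplifications as before (integer timestamps, a single key, pushes modeled as `tsCache + 1`).

```go
package main

import "fmt"

// tsCacheAfterRead returns the timestamp cache high-water mark after a read
// at readTS. With insertAtStart=false (today's behavior) only successful
// reads are recorded; with insertAtStart=true (the proposal) every read is.
func tsCacheAfterRead(tsCache, readTS int64, succeeded, insertAtStart bool) int64 {
	if !insertAtStart && !succeeded {
		return tsCache
	}
	if readTS > tsCache {
		return readTS
	}
	return tsCache
}

// pushedWriteTS returns the timestamp a write at writeTS ends up with,
// given the current timestamp cache high-water mark.
func pushedWriteTS(writeTS, tsCache int64) int64 {
	if writeTS <= tsCache {
		return tsCache + 1
	}
	return writeTS
}

func main() {
	// A read at ts 99 fails; a write at ts 90 arrives afterwards.
	current := tsCacheAfterRead(0, 99, false, false)
	proposed := tsCacheAfterRead(0, 99, false, true)

	fmt.Println("current ordering, write@90 lands at:", pushedWriteTS(90, current))
	fmt.Println("proposed ordering, write@90 lands at:", pushedWriteTS(90, proposed))
}
```

So the cost is an occasional unnecessary push of a later write, which is a performance concern rather than a correctness one.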
I’ve posed this conundrum to @spencer, who banged out a small POC diff which seems to pass tests (not actually for review at this time):
https://github.com/andreimatei/cockroach/pull/3