Why do we keep read commands in the command queue?


I’m wondering if anyone has any insight on the following question: why do we keep read commands in the command queue while they’re in flight, and insert them in the timestamp cache after they’re done, as opposed to inserting them in the timestamp cache in the beginning?

As I understand it, the point of the command queue is to synchronize reads with writes, ensuring that a read does not miss the effects of an in-flight write. This is accomplished by having the read block on writes present in the command queue (I’m not suggesting any changes to that). The other side of the coin - ensuring that a write does not write values “under” a read - is mostly handled by the timestamp cache. Right now, however, we only insert a read in the ts cache upon its successful completion. So, while the read is in flight, we keep it in the cmd queue and block writes on it.
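To make the division of labor concrete, here’s a toy model in Go of the write-side half (the names and shapes are mine, not the actual CockroachDB types): a write consults the timestamp cache and gets pushed above any completed read, while an in-flight read today has to be represented elsewhere (in the command queue).

```go
package main

import "fmt"

// tsCache is a toy timestamp cache: per key, the highest timestamp at
// which a read has completed. Writes consult it so they never write
// values "under" a read. (Hypothetical type, not the real one.)
type tsCache map[string]int64

// GetMax returns the high-water mark for a key (0 if never read).
func (c tsCache) GetMax(key string) int64 { return c[key] }

// Add ratchets the cached timestamp for a key upward.
func (c tsCache) Add(key string, ts int64) {
	if ts > c[key] {
		c[key] = ts
	}
}

func main() {
	cache := tsCache{}

	// A read at ts=99 completes and is recorded in the cache.
	cache.Add("a", 99)

	// A later write at ts=50 must be pushed above the cached read
	// timestamp so it cannot rewrite history under that read.
	writeTS := int64(50)
	if max := cache.GetMax("a"); writeTS <= max {
		writeTS = max + 1
	}
	fmt.Println(writeTS) // pushed to 100
}
```

Note that this only protects against writes once the read has *completed* - which is exactly why, today, the in-flight read has to sit in the cmd queue too.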

I’d like to know if there’s a good reason for not putting the read in the ts cache before starting its execution, and not keeping it in the cmd queue at all. This would have several benefits:

  1. I would find it clearer :slight_smile:
  2. Writes at ts 100 running concurrently with reads at ts 99 would not have to wait.
  3. Range lease transfers could simply look at the timestamp cache to figure out the highest timestamp at which a read has been served, or will be served, by the replica. This high-water mark could then act as the low-water mark of the transfer recipient’s ts cache.
  4. Spencer had some relatively simple ideas a while ago about how non-leaseholder replicas could serve some reads with minimal coordination with the leaseholder (basically by asking the leaseholder to do cmd queue validation and to return a raft index). The fact that reads update some replica structures at the beginning and others at the end makes something like this harder to implement.
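To illustrate benefit 2 with a toy sketch (hypothetical functions, not real code): under the current scheme, an overlapping in-flight read in the cmd queue blocks a write outright, regardless of timestamps; with the read inserted in the ts cache up front, a write at a higher timestamp could proceed immediately, since it cannot invalidate the read anyway.

```go
package main

import "fmt"

// Current scheme: any overlapping read held in the command queue
// blocks the write until the read finishes, timestamps notwithstanding.
func writeMustWaitCurrent(readInCmdQueue bool) bool {
	return readInCmdQueue
}

// Proposed scheme: the read is in the ts cache before it executes; a
// write only needs to wait (i.e. be pushed) if its timestamp is at or
// below the cached read timestamp.
func writeMustWaitProposed(writeTS, cachedReadTS int64) bool {
	return writeTS <= cachedReadTS
}

func main() {
	// A read at ts=99 is in flight; a write arrives at ts=100.
	fmt.Println(writeMustWaitCurrent(true))     // true: blocks today
	fmt.Println(writeMustWaitProposed(100, 99)) // false: could proceed
}
```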

One downside that I see of the proposed change is that we’d be adding a read to the timestamp cache even if it fails. Also, in the case of retried transactions, we’d be adding it at the initial timestamp instead of the higher final timestamp (and we like timestamps in the cache to be as high as possible). Neither of these seems like a big deal to me…

I’ve posed this conundrum to @spencer, who banged out a small POC diff which seems to pass tests (not actually for review at this time):

cc @spencer @bdarnell @tschottdorf @peter @asubiotto

One reason is that we’re not currently disciplined about performing reads with a single rocksdb snapshot, so we use the command queue to avoid any conflicting writes while reads are going on. But if we created a snapshot and used it for everything I don’t see any reason why we couldn’t unblock the command queue before starting the actual reads.
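Roughly the ordering that would make this safe, as a toy Go sketch (a map copy stands in for a real rocksdb snapshot; all names are mine): pin the snapshot first, then release the command queue, and serve every read from the pinned view so concurrent writes cannot be observed midway through the read.

```go
package main

import "fmt"

// engine is a stand-in for the storage engine.
type engine struct{ data map[string]string }

// snapshot copies the current state. Reads from the copy are immune
// to later writes - the property a real rocksdb snapshot provides
// without the copying cost.
func (e *engine) snapshot() map[string]string {
	snap := make(map[string]string, len(e.data))
	for k, v := range e.data {
		snap[k] = v
	}
	return snap
}

func main() {
	e := &engine{data: map[string]string{"a": "1"}}

	snap := e.snapshot() // 1. pin a consistent view
	// 2. (release the command queue here; writes may now proceed)
	e.data["a"] = "2" // a concurrent write lands

	// 3. the read still sees the pinned value, not the new write
	fmt.Println(snap["a"])
}
```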

But, snapshot or not, if we inserted the read in the timestamp cache before starting it, wouldn’t that take care of “avoiding the conflicting writes” - such writes would be pushed to a higher timestamp?

Writes would be pushed ahead of our read timestamp, but not far enough ahead to avoid causing ReadWithinUncertaintyIntervalErrors. We’d still need the snapshot to ensure that we don’t start picking up those errors midway through the execution of our read.

I believe you’re right that this unnecessary read uncertainty could happen, but could we use the txn.ObservedTimestamp mechanism to prevent it? Currently, Stores.Send() adjusts txn.MaxTimestamp to the value previously observed on that node, if any, before running a batch, and Store.Send() records the observed value at the end of running the batch. It seems to me that we could record and use the observed timestamp before running the batch, and that way there’d be no uncertainty with writes that are pushed just after a read.
If this is correct, I would find recording this observed timestamp early clearer than doing it after a batch has run.

Seems reasonable, although I don’t remember all the subtleties well enough to say with confidence that such a change would work.