End-to-end raft proposal flow and its implication on transaction


I am looking at the CockroachDB code on how the end-to-end flow of a write request, especially the raft replica level. It seems the tryExecuteWriteBatch handles the core logic, which proposes the request to raft, and wait until the request is applied to the state machine. It also returns whether this write request is retryable or non-retryable. If that is correct, my questions are:

  1. How do you handle the case when the write is committed in raft (with quorum consensus), but failed to apply to the state machine? In that case, the write request may be abandoned, but the log is actually committed which will be applied by raft eventually. If this is a transaction write, then does that mean the transaction is aborted, while actually it’s committed in raft? Just wondering if this is an issue to you, or you simply relies on the idempotency of state machine.

  2. Related to that, I wonder if it’s possible that tryExecuteWriteBatch returns as soon as the write request is committed in raft, and don’t wait for the apply to happen. And for a transaction, we mark it as committed the same way.

  1. Most of the complicated logic happens in the “evaluate” step, before the command is proposed to raft. The “apply to the state machine” step is pretty simple: it performs some validation of the request (which will either succeed or fail in the same way on each replica), then sends a precomputed batch of writes to rocksdb. The only way this can act differently on different replicas is if there is some sort of failure at the rocksdb level (such as running out of disk space). When this happens we mark the replica as corrupt and stop using it.
  2. We have to make sure that when a write or commit is followed by a read, the read will see the results of the previous transaction. It’s difficult to ensure this if we return to the client before the command has been applied (and the application of the command is relatively quick, compared to the time it took to commit it). Some optimizations are possible here (for example, this PR changes things so we don’t have to wait for the sync to disk when applying the command, only when committing it), but there are a lot of subtleties and the gains we’ve seen aren’t that large.

Hi Ben,

Thanks a lot for the detailed explanation.

  1. Yeah, agree that the “unexpected” behavior only happens when there is fatal failure at the rocksdb level, in which case the original proposal will fail, but may be applied in other healthy replicas if the proposal has been committed by raft. I guess this case is unavoidable and client may need to deal with such “unknown” state.

  2. Good to know that you have investigated this option. It’s true that there will be additional complexities to track the “committed but not yet applied” requests, in order for a read to see the results of previous transactions.