How do range leases avoid stale reads?

Hi,

I’ve been interested in how the read/write path works in CockroachDB and how strong consistency is guaranteed (e.g. no stale reads). I understand that you use a “range lease” so that reads and writes are served by the range’s lease holder, which enables reads to be served locally without going through Raft. It also addresses clock drift by introducing a stasis period before the lease expires.

However, I don’t understand how the range lease solves the stale read problem in the failover case. Say a lease holder fails right after successfully committing a write to its local state machine, but the other replicas may not have committed it yet. I suppose a new node will be selected as the next lease holder, and a read served by the new lease holder may not see the write that was already committed on the previous lease holder? Is that a concern that CockroachDB addresses during the lease transfer stage?

Hi Jia,
Writes still go through Raft and they only complete once they reach a quorum (majority) of replicas. In your example, if the write didn’t make it to the other replicas it could not have completed.
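
To make the quorum rule concrete, here is a minimal sketch in Go. It is only an illustration of majority counting, not CockroachDB's code; the names `quorum` and `isCommitted` are made up for this example.

```go
package main

import "fmt"

// quorum returns the minimum number of replicas that must hold a log
// entry before it counts as committed (a strict majority).
func quorum(replicas int) int {
	return replicas/2 + 1
}

// isCommitted reports whether an entry held by acks replicas
// (the proposer included) has reached a majority.
func isCommitted(acks, replicas int) bool {
	return acks >= quorum(replicas)
}

func main() {
	// Three replicas, and only the lease holder has the entry:
	// the write has not completed, so no client was told it succeeded.
	fmt.Println(isCommitted(1, 3)) // false

	// A second replica has the entry: the write can now complete.
	fmt.Println(isCommitted(2, 3)) // true
}
```

In the failure scenario from the question, the first case applies: the write exists on one replica only, so it never completed and losing it is not a stale read.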

Hi Radu,

Sorry for the confusion, but a write is completed once the log entry is replicated across a quorum of nodes, which does not necessarily mean it has been applied to all of their state machines, right? I am wondering what happens when a read is issued to a replica that is the new lease holder but has not applied the write yet.

Oh, I saw this in the Raft thesis: “The leader waits for its state machine to advance at least as far as the readIndex; this is current enough to satisfy linearizability”. If that’s the case for CockroachDB’s Raft, it makes sense to me.
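
As a rough sketch of the rule quoted from the thesis, written in Go: the `Leader` type and `canServeRead` are hypothetical names for this example, and the leadership-confirmation heartbeat round that the thesis also requires is left out.

```go
package main

import "fmt"

// Leader is a hypothetical, stripped-down Raft leader.
type Leader struct {
	CommitIndex  uint64 // highest log index known to be committed
	AppliedIndex uint64 // highest log index applied to the state machine
}

// canServeRead reports whether a read that captured readIndex may be
// answered from the local state machine.
func (l *Leader) canServeRead(readIndex uint64) bool {
	return l.AppliedIndex >= readIndex
}

func main() {
	l := &Leader{CommitIndex: 7, AppliedIndex: 5}

	// The read captures the commit index at the moment it arrives.
	readIndex := l.CommitIndex

	fmt.Println(l.canServeRead(readIndex)) // false: entries 6 and 7 are not applied yet

	l.AppliedIndex = 7                     // the state machine catches up
	fmt.Println(l.canServeRead(readIndex)) // true
}
```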

Taking a lease is a Raft operation; all the writes will be applied before a new leaseholder is minted.
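
A small Go sketch of why that ordering helps, under the simplifying assumption that the lease request is just another entry in the same log as the writes and that replicas apply entries strictly in order. `Entry`, `Replica`, `applyUpTo`, and `read` are invented for this illustration and are not CockroachDB's actual types.

```go
package main

import "fmt"

// Entry is a hypothetical log entry: either a key/value write or a
// lease request naming the replica that wants the lease.
type Entry struct {
	Key, Value  string // set for writes
	LeaseHolder string // set for lease requests
}

// Replica applies committed log entries strictly in log order.
type Replica struct {
	Name    string
	Applied int               // number of log entries applied so far
	Data    map[string]string // local state machine
	Lease   string            // lease holder as known to this replica
}

// applyUpTo applies entries in order until n entries have been applied.
func (r *Replica) applyUpTo(log []Entry, n int) {
	for ; r.Applied < n && r.Applied < len(log); r.Applied++ {
		e := log[r.Applied]
		if e.LeaseHolder != "" {
			r.Lease = e.LeaseHolder
		} else {
			r.Data[e.Key] = e.Value
		}
	}
}

// read serves a local read only if this replica has applied a lease
// entry that names itself.
func (r *Replica) read(key string) (string, bool) {
	if r.Lease != r.Name {
		return "", false
	}
	return r.Data[key], true
}

func main() {
	log := []Entry{
		{Key: "k", Value: "v1"}, // write committed under the old lease holder
		{LeaseHolder: "n2"},     // n2's lease request, ordered after the write
	}
	n2 := &Replica{Name: "n2", Data: map[string]string{}}

	// n2 is lagging and has applied nothing: it holds no lease, so it refuses the read.
	_, ok := n2.read("k")
	fmt.Println(ok) // false

	// Applying the lease entry forces the earlier write to be applied first.
	n2.applyUpTo(log, len(log))
	v, ok := n2.read("k")
	fmt.Println(v, ok) // v1 true
}
```

In this model a replica cannot start serving reads as lease holder without having applied every write that precedes its lease entry in the log.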

https://github.com/cockroachdb/cockroach/blob/master/docs/design.md#range-leases provides a good explanation of range leases and how CockroachDB avoids stale reads.