The Recommended Production Settings doc says:
This is well in advance of the maximum offset (500ms by default), beyond which serializable consistency is not guaranteed and stale reads and write skews could occur.
Are write skews caused by writing data derived from stale reads? It seems so, since Issue 1305 has the following comment:
The guarantee you lose in this small time window is that a client may fail to read a value that was written before the read started. Reads could see stale data. Writes are mostly unaffected by clocks except to the extent that the application may write data derived from stale reads.
Based on that, I worked out the following detailed scenario to show how a write skew can happen. Consider this CockroachDB setup:
Assume that node1 has a fast clock and node2 has a slow clock, and that the offset between their clocks exceeds the maximum-offset. The initial value of k1 is k1(T0) = 0, where k1(T0) means that k1's value at timestamp T0 is 0. Txn1 updates k1 to 1. Txn2 reads k1 and updates k2 to the sum of k1's value and 10.
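For concreteness, here is a minimal sketch of the two transactions as a Go client might issue them. The table kv(k STRING PRIMARY KEY, v INT), the HAProxy address, and the lib/pq driver are my own assumptions for illustration; none of them come from the issues:

```go
// Sketch of the scenario's two transactions. The kv table and the
// HAProxy connection string are hypothetical.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Both transactions go through HAProxy; which node coordinates
	// each one is up to the load balancer.
	db, err := sql.Open("postgres",
		"postgresql://root@haproxy:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Txn1: update k1 to 1 (coordinated by node1 in the scenario).
	if _, err := db.Exec(`UPDATE kv SET v = 1 WHERE k = 'k1'`); err != nil {
		log.Fatal(err)
	}

	// Txn2: read k1, then write k2 = k1 + 10 (coordinated by node2).
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	var v int
	if err := tx.QueryRow(`SELECT v FROM kv WHERE k = 'k1'`).Scan(&v); err != nil {
		log.Fatal(err)
	}
	if _, err := tx.Exec(`UPDATE kv SET v = $1 WHERE k = 'k2'`, v+10); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("k2 =", v+10) // expect 11; a stale read of k1 yields 10
}
```

With synchronized clocks, Txn2 should read k1 = 1 and write k2 = 11; the question is what happens when the clocks drift apart.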
Here are the detailed steps:
- Client A sends Txn1 to node1 through HAProxy.
- Node1, as Txn1's coordinator, assigns its local clock reading T1 as Txn1's candidate timestamp.
- Txn1 updates key k1 to set k1(T1) = 1 on node1. Node1 holds the Raft leader replica of the range that contains k1 and is also the leaseholder for that range.
- Node1 commits Txn1 with T1 as the commit timestamp.
- Node1 sends Txn1's result to client A through HAProxy.
- Client A sends Txn2 to node2 through HAProxy.
- Node2, as Txn2's coordinator, assigns its local clock reading T2 as Txn2's candidate timestamp.
- Node2 reads k1(T0) = 0 instead of k1(T1) = 1 from node1, since T1 - T2 > maximum-offset (see the sketch after these steps).
- Txn2 updates key k2 to k1(T0) + 10 = 10 on node2. Node2 holds the Raft leader replica of the range that contains k2 and is also the leaseholder for that range.
- Node2 commits Txn2 with T2 as the commit timestamp.
- Node2 sends Txn2’s result to client A through HAProxy.
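If I understand the read-timestamp uncertainty mechanism correctly, the stale read in the steps above comes down to a visibility check like the one sketched below. This is my simplified model, not CockroachDB's actual code; the function name and structure are mine:

```go
// Simplified sketch of the visibility decision I believe node2 makes
// when it reads k1 at its local timestamp T2. Not CockroachDB's code.
package main

import (
	"fmt"
	"time"
)

const maxOffset = 500 * time.Millisecond // the default maximum-offset

// visible reports how a value written at writeTS looks to a read at readTS.
func visible(writeTS, readTS time.Time) string {
	switch {
	case !writeTS.After(readTS):
		return "visible"
	case writeTS.Before(readTS.Add(maxOffset)):
		// Inside the uncertainty window: the writer's clock may simply
		// be ahead, so the reader must restart at a higher timestamp.
		return "uncertain: restart the read"
	default:
		// Beyond readTS + maxOffset the value is treated as a future
		// write and skipped. If the real clock offset exceeds
		// maxOffset, this is exactly where the stale read happens.
		return "ignored as a future write"
	}
}

func main() {
	t2 := time.Now()                      // node2's clock (slow)
	t1 := t2.Add(700 * time.Millisecond)  // node1's clock (fast): T1 - T2 > maxOffset
	fmt.Println(visible(t1, t2))          // "ignored as a future write": Txn2 reads k1(T0) = 0
}
```

When T1 - T2 falls inside the uncertainty window instead, the read is restarted at a higher timestamp rather than returning stale data, which is why the anomaly only appears once the real offset exceeds maximum-offset.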
A write skew happens in the above steps because Txn2 updates k2 with a stale value of k1. Is my scenario correct?
But @bdarnell in Issue 1303 uses lease expiration to illustrate stale reads and write skews. Here is how Ben describes stale reads:
If one node with a slow clock holds the lease for a range, it could expire and a node with a fast clock could acquire it and start making writes while the slow node still thinks its lease is valid and continues to serve reads.
And here is how he describes write skews:
Node A has a slow clock, and it has a lease that expires at 12:00:00. It believes the current time is 11:59:59, so its lease is still valid. Node B has a fast clock, so it believes that the current time is 12:00:01 and node A’s lease is expired. Node B will try to acquire a lease (and if node C also has a fast clock, this lease will be granted). Node A will keep serving reads even as nodes B and C are serving new writes.
Write skew would happen if you had two transactions that each did a read then a write, and their reads saw inconsistent results because of the above issue.
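To make the lease example concrete, here is a toy sketch (entirely my own, using Ben's timestamps) of how two nodes with skewed clocks can disagree about the same lease:

```go
// Toy sketch of Ben's lease example: two nodes with skewed clocks
// evaluate the same lease expiration. My own illustration only.
package main

import (
	"fmt"
	"time"
)

func main() {
	expires := time.Date(2015, 6, 1, 12, 0, 0, 0, time.UTC) // lease expires at 12:00:00

	clockA := expires.Add(-1 * time.Second) // node A (slow): thinks it is 11:59:59
	clockB := expires.Add(1 * time.Second)  // node B (fast): thinks it is 12:00:01

	fmt.Println("node A lease valid:", clockA.Before(expires)) // true: keeps serving reads
	fmt.Println("node B lease valid:", clockB.Before(expires)) // false: tries to acquire the lease
}
```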
So there are two kinds of stale reads caused by clock drift: one involves lease expiration, and one does not. Is my understanding correct?