Clock drift inconsistencies

In the Jepsen report there’s this:

Because CockroachDB assumes semi-synchronous clocks, unless otherwise noted, we only introduce clock offsets smaller than the default threshold of 250 milliseconds. When clock offset exceeds these bounds, as measured by an internal clock synchronization estimator, all bets are off: some nodes will shut themselves down, but not before allowing various transactional anomalies.

Is this by design? I imagine that having data become inconsistent because a node lost its connection to the NTP server may be unacceptable in many situations.

Yes, it is by design. Note that losing the connection to the NTP server is not by itself sufficient to cause problems: the clock then has to actually drift, which (we hope) is unlikely to happen so fast that the operator won’t have time to notice the missing NTP synchronization.
For more information check this:

Well, all clocks are expected to drift according to their skew rate if NTP synchronization is interrupted. We have seen this in production (Google GCE instances once had a bad image that didn’t start the clock synchronization daemon). But the drift is relatively slow. CockroachDB will kill any node whose clock exceeds a 250ms offset from the cluster mean time. Consistency issues can only arise if the offset exceeds 500ms, so there is a very significant cushion against clock-offset-related issues.
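A minimal sketch of those two thresholds (a hypothetical helper, not CockroachDB code; the 250ms/500ms values are the defaults quoted above):

```python
# Hypothetical sketch of the two thresholds (milliseconds): a node that
# drifts more than 250 ms from the cluster mean kills itself, and
# consistency is only at risk beyond 500 ms -- hence the cushion.
SELF_TERMINATE_OFFSET_MS = 250
MAX_OFFSET_MS = 500

def classify_offset(offset_ms):
    offset_ms = abs(offset_ms)
    if offset_ms > MAX_OFFSET_MS:
        return "consistency at risk"
    if offset_ms > SELF_TERMINATE_OFFSET_MS:
        return "node self-terminates"
    return "within bounds"

print(classify_offset(100))  # within bounds
print(classify_offset(300))  # node self-terminates
print(classify_offset(600))  # consistency at risk
```

A slowly drifting clock crosses the 250ms line well before the 500ms line, so the node is killed before it can cause anomalies.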

Keep in mind that even if your clocks go AWOL instantly, there’s only a 3s window before nodes will be forced to shut down. Within that 3s window, a significantly adversarial set of conditions must be present to encounter inconsistencies. The actual inconsistency you would see in this case is a failure to read data that was written by the node with the bad clock (that bad clock must be set forward by more than 500ms).

Recommended Production Settings says:

This is well in advance of the maximum offset (500ms by default), beyond which serializable consistency is not guaranteed and stale reads and write skews could occur.

Are write skews caused by writing data derived from stale reads? It seems so since Issue 1305 has the following comment:

The guarantee you lose in this small time window is that a client may fail to read a value that was written before the read started. Reads could see stale data. Writes are mostly unaffected by clocks except to the extent that the application may write data derived from stale reads.

And I can work out the following detailed scenario to show how write skews happen. Consider this CockroachDB setup:


Assume that node1 has a fast clock and node2 has a slow clock, and that the offset between their clocks exceeds the maximum-offset. The initial value of k1 is k1(T0) = 0, where k1(T0) denotes k1’s value at timestamp T0. Txn1 updates k1 to 1. Txn2 reads k1 and updates k2 to the sum of k1’s value and 10.

Here are the detailed steps:

  1. Client A sends Txn1 to node1 through HAProxy.
  2. Node1 as Txn1’s coordinator assigns its local clock T1 as Txn1’s candidate timestamp.
  3. Txn1 updates key k1 to set k1(T1)=1 on node1. Node1 holds the primary raft replica of the range which contains k1. And node1 is the lease holder for that range.
  4. Node1 commits Txn1 with T1 as the commit timestamp.
  5. Node1 sends Txn1’s result to client A through HAProxy.
  6. Client A sends Txn2 to node2 through HAProxy.
  7. Node2 as Txn2’s coordinator assigns its local clock T2 as Txn2’s candidate timestamp.
  8. Node2 reads k1(T0)=0 instead of k1(T1) from node1, since T1 - T2 > maximum-offset.
  9. Txn2 updates key k2 to k1(T0)+10=10 on node2. Node2 holds the primary raft replica of the range which contains k2. And node2 is the lease holder of that range.
  10. Node2 commits Txn2 with T2 as the commit timestamp.
  11. Node2 sends Txn2’s result to client A through HAProxy.
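The steps above can be sketched as a toy MVCC read (all names and numbers are hypothetical illustrations, not CockroachDB internals): versions of k1 carry commit timestamps, a reader at timestamp ts sees the latest version at or below ts, and it only has to worry about versions inside its uncertainty window (ts, ts + max_offset].

```python
# Toy MVCC sketch of the write-skew scenario. Timestamps in milliseconds.
MAX_OFFSET = 500

# Txn1 (on node1, fast clock) committed k1=1 at T1=1000; T0=0 holds k1=0.
k1_versions = {0: 0, 1000: 1}

def mvcc_read(versions, read_ts, max_offset):
    # Versions in the uncertainty window force a restart; versions beyond
    # it are ignored, and the reader sees an older version instead.
    uncertain = [ts for ts in versions if read_ts < ts <= read_ts + max_offset]
    if uncertain:
        return ("restart", min(uncertain))
    visible = max((ts for ts in versions if ts <= read_ts), default=None)
    return ("value", versions[visible])

# Txn2 runs on node2, whose slow clock assigns T2=400. T1 - T2 = 600 ms,
# which exceeds MAX_OFFSET, so Txn1's write is neither visible nor uncertain.
status, v = mvcc_read(k1_versions, read_ts=400, max_offset=MAX_OFFSET)
print(status, v)   # value 0  -- the stale read
print(v + 10)      # 10      -- k2 written from the stale k1: write skew
```

With the same versions but a read timestamp within max_offset of T1, the same function would instead return a restart, which is the cushion discussed later in the thread.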

Write skew happens in the above steps since Txn2 updates k2 with the stale value of k1. Is my scenario correct?

But @bdarnell in Issue 1303 uses lease expiration to illustrate stale reads and write skews. Is this how Ben describes stale reads?

If one node with a slow clock holds the lease for a range, it could expire and a node with a fast clock could acquire it and start making writes while the slow node still thinks its lease is valid and continues to serve reads.

Is this how Ben describes write skews?

Node A has a slow clock, and it has a lease that expires at 12:00:00. It believes the current time is 11:59:59, so its lease is still valid. Node B has a fast clock, so it believes that the current time is 12:00:01 and node A’s lease is expired. Node B will try to acquire a lease (and if node C also has a fast clock, this lease will be granted). Node A will keep serving reads even as nodes B and C are serving new writes.
Write skew would happen if you had two transactions that each did a read then a write, and their reads saw inconsistent results because of the above issue.
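Ben’s race can be sketched in a few lines (hypothetical numbers; clocks in seconds since midnight): both nodes judge the same lease expiration against their own, offset clocks, so they disagree about who holds the lease.

```python
# Hypothetical sketch of the lease race: both nodes check the same lease
# expiration (12:00:00) against their own clocks.
LEASE_EXPIRATION = 12 * 3600  # 12:00:00, in seconds since midnight

def believes_lease_valid(local_clock):
    return local_clock < LEASE_EXPIRATION

node_a_clock = 12 * 3600 - 1  # slow clock: believes it is 11:59:59
node_b_clock = 12 * 3600 + 1  # fast clock: believes it is 12:00:01

print(believes_lease_valid(node_a_clock))  # True  -> A keeps serving reads
print(believes_lease_valid(node_b_clock))  # False -> B tries to take the lease
```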

So there are two kinds of stale reads caused by clock drift: one involves lease expiration and one does not. Is my understanding correct?

Yes, in this context that’s what write skew means. (Write skew is also a transactional anomaly that can be present when using SNAPSHOT isolation instead of SERIALIZABLE).

Yes, but you also have to consider that only one node can have the lease for a key at a time. For this scenario to happen, you need to add the following:

  • 0.1: node 2 initially has the lease for the range containing k1
  • 0.2: because node 1 has the faster clock, it thinks node 2’s lease is expired, so it takes over the lease.

Then at steps 6-8, node 2 with its slow clock believes its old lease is still valid, so it serves reads from T1. If node 2 had never held the lease, it would simply redirect the request to the leaseholder, node 1.

Currently, since all reads go through the lease holder, all stale reads involve lease expiration. This may change as we implement follower reads.
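The redirect condition can be sketched as follows (a hypothetical helper, not CockroachDB code): a node serves a read itself only if it believes it still holds a valid lease for the range, and otherwise forwards the request to the leaseholder.

```python
# Hypothetical sketch of the redirect condition; clocks and lease
# expirations in milliseconds.
def route_read(believed_lease_expiration, local_clock):
    if believed_lease_expiration is not None and local_clock < believed_lease_expiration:
        return "serve locally"        # possibly a stale read if the belief is wrong
    return "redirect to leaseholder"

# node 2's old lease expires at t=1000, but its slow clock reads 990, so it
# wrongly serves the read itself; a node that never held the lease redirects.
print(route_read(believed_lease_expiration=1000, local_clock=990))  # serve locally
print(route_read(believed_lease_expiration=None, local_clock=990))  # redirect to leaseholder
```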

I think you mean T0 when you say “so it serves reads from T1”.

My original scenario is missing the following conditions:

  1. k1 and k2 are from different ranges.
  2. Node2 has never held the lease for the range which contains k1.

And I should have written step 8 in the following detailed way:

Node2 redirects the request to the leaseholder, node1. At this time, node1 is still the valid leaseholder (no lease expiration issue). Node1 returns k1(T0)=0 instead of k1(T1), since T1 - T2 > maximum-offset: k1(T1) is a value with a newer timestamp far enough in the future, so it is ignored. Here node1 obeys the following rule from the design doc:

Reader encounters write intent or value with newer timestamp far enough in the future: This is not a conflict. The reader is free to proceed; after all, it will be reading an older version of the value and so does not conflict.

What do you think of these additional details?

Yes, I meant it reads from T0.

I see what you mean now, and you’re right. I was forgetting about the role of the gateway node (which is where timestamps are actually assigned). If the gateway node has a slower clock than the lease holder (by more than the max offset), then you can see stale reads without lease expiration races.

The design doc says that under the following condition we can’t ignore the write intent or value and just read an older version; instead, we need to restart the reader:

Reader encounters write intent or value with newer timestamp in the near future

Is the stale read which causes my write skew scenario also the reason for this rule?

Yes. If the difference between T1 and T2 is less than the max offset, then the read transaction will restart at a higher timestamp so that it sees this write (which happened before it in real time but after it in database time).
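The two design-doc rules can be combined into one decision function (a sketch with a hypothetical helper name, not CockroachDB’s implementation): a newer value inside the reader’s uncertainty window forces a restart, while a value far enough in the future is ignored.

```python
# Sketch of the reader's decision on encountering a value with a newer
# timestamp (hypothetical helper; timestamps in milliseconds).
def on_newer_value(read_ts, value_ts, max_offset):
    assert value_ts > read_ts
    if value_ts - read_ts <= max_offset:
        # Near future: inside the uncertainty window, so the write may have
        # happened before this read in real time -- restart to see it.
        return ("restart", value_ts)
    # Far future: safe to ignore; the reader uses an older version.
    return ("ignore", None)

print(on_newer_value(read_ts=400, value_ts=1000, max_offset=500))  # ('ignore', None)
print(on_newer_value(read_ts=600, value_ts=1000, max_offset=500))  # ('restart', 1000)
```

The first call corresponds to the stale-read scenario earlier in the thread (T1 - T2 > max offset); the second shows the cushion working as intended.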