Golang time not monotonic in certain situations - affects CRDB?

Just curious whether CRDB is aware of this issue that time is not monotonic in golang? The hlc library that CRDB uses assumes a monotonic physical time on each node in order for the logical counter to be bounded (i.e. it assumes that the time difference between two consecutive operations on a single node is at least one nanosecond). Is CRDB still able to deal with this golang issue?

Also, for my own benefit, what do you guys think are reasonable numbers for max clock skew and minimal ns between two events in a typical deployment?

Hi @susan,

Yep, we’re aware of the lack of monotonicity for time.Now(). I’m not sure I understand your other point, though. The hlc package does check that each call to time.Now() is increasing, and if it isn’t, it reuses the maximum time seen so far while incrementing the logical component. Can you point to the code that assumes the time difference between two consecutive operations on a single node is at least one nanosecond? We definitely have tests which use hlc timestamps with identical WallTime components that differ by 1 in the Logical component.
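Roughly speaking, the logic looks something like the following sketch (illustrative only, with simplified names and structure; not the actual hlc package code):

```go
// Illustrative sketch of a hybrid logical clock; names are assumptions.
package hlc

import (
	"sync"
	"time"
)

// Timestamp is a hybrid logical clock reading: a physical wall time in
// nanoseconds plus a logical counter to break ties.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Clock tracks the maximum timestamp handed out so far.
type Clock struct {
	mu      sync.Mutex
	maxSeen Timestamp
}

// Now returns a timestamp that is strictly greater than any previous one,
// even if the underlying physical clock stalls or jumps backward.
func (c *Clock) Now() Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()

	physical := time.Now().UnixNano()
	if physical > c.maxSeen.WallTime {
		// Physical clock moved forward: use it and reset the logical counter.
		c.maxSeen = Timestamp{WallTime: physical, Logical: 0}
	} else {
		// Physical clock stalled or jumped backward: reuse the maximum wall
		// time seen so far and increment the logical component instead.
		c.maxSeen.Logical++
	}
	return c.maxSeen
}
```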

Thanks Peter,
I just mean that the TR’s goal is to create a hybrid logical clock with a bounded logical component. As noted in your code, the bound is given by (maximum clock skew)/(minimum ns between events), so the assumption is that the minimum time between events is positive. If golang returns a time that is not monotonically increasing, I agree that your timestamp is still monotonically increasing, but the “proof” of boundedness, so to speak, doesn’t hold without some additional caveats. So to put my question another way: under what circumstances could the logical component wrap around and create a bug? And what would be the max clock skew and the highest logical component observed in a typical deployment?

While the golang clock (i.e. time.Now()) is not guaranteed to be monotonically increasing, we do assume (and check) that the clock skew between nodes is within some maximum offset. That maximum offset also bounds the expected clock jumps. If a clock jumps backward by 500ms, we’ll be incrementing the logical component of the hlc for 500ms. Can we exhaust the 32 bits of the logical component in that time frame? Not on today’s processors. A larger backward clock jump would cause our clock skew detection code to fire. The maximum offset we can tolerate is configurable, though, so I suppose if you configured it too large we could exhaust the logical component.
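As a back-of-envelope check (assuming, generously, one call to Now() per nanosecond, far more than real hardware sustains), a 500ms backward jump uses only a small fraction of the 32-bit logical space:

```go
// Back-of-envelope check. Assumptions: the logical component is incremented
// once per call to Now(), and calls arrive at a (very generous) rate of one
// per nanosecond.
package main

import "fmt"

func main() {
	const (
		callsPerSecond = 1e9     // hypothetical: one Now() call per nanosecond
		jumpSeconds    = 0.5     // a 500ms backward clock jump
		logicalSpace   = 1 << 32 // size of a 32-bit logical component
	)
	incrementsDuringJump := callsPerSecond * jumpSeconds
	fmt.Printf("increments during the jump: %.0f\n", incrementsDuringJump) // 500000000
	fmt.Printf("fraction of the logical space used: %.1f%%\n",
		100*incrementsDuringJump/float64(logicalSpace)) // ~11.6%
}
```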

Note there is a window of vulnerability between a clock jump occurring and our detecting it, which is an opening for wrapping the logical component of the hlc. We should probably put in a check that the logical component doesn’t wrap; better for the node to commit suicide in that case. It might also be worth adding a check that the physical time doesn’t jump backward by more than our configured max offset. Any interest in contributing?
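For concreteness, and continuing the illustrative Clock sketch from earlier in the thread (again, not the actual implementation), the two checks might look something like this:

```go
// A second file in the same illustrative package as the sketch above; names
// and structure are assumptions, not the actual CockroachDB implementation.
package hlc

import (
	"log"
	"math"
	"time"
)

// NowWithChecks is like Now, but crashes the node if the physical clock jumps
// backward by more than the configured maximum offset or if the logical
// component is about to wrap around.
func (c *Clock) NowWithChecks(maxOffset time.Duration) Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()

	physical := time.Now().UnixNano()

	// Check 1: the physical clock must not jump backward by more than the
	// configured maximum offset.
	if delta := c.maxSeen.WallTime - physical; delta > maxOffset.Nanoseconds() {
		log.Fatalf("backward clock jump of %s exceeds max offset %s",
			time.Duration(delta), maxOffset)
	}

	if physical > c.maxSeen.WallTime {
		c.maxSeen = Timestamp{WallTime: physical, Logical: 0}
	} else {
		// Check 2: the logical component must not wrap around.
		if c.maxSeen.Logical == math.MaxInt32 {
			log.Fatal("logical clock component exhausted; terminating node")
		}
		c.maxSeen.Logical++
	}
	return c.maxSeen
}
```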

Thanks for explaining! Now that I understand things in detail, I don’t think those two checks are crucial, but I’ll consider taking them up when I have some spare time; it would be an easy, low-priority fix.

One last question: Is it enough for HLCs to be sent and received only in the underlying Raft messages? Are there any other messages in CRDB that convey the HLC (e.g. gossip) for correctness reasons? It seems that nodes hosting disjoint ranges don’t need to worry about each other’s HLCs; only nodes in the same Raft cluster do, because transactions can just use the timestamp of the first node they touch and deal with conflicts themselves. This blog post seems to imply that, IMO. It should be safe if your database is partitioned into one cluster with a higher HLC and another with a lower HLC, but it could result in lots of aborted transactions that span those two clusters, or in the weirdness mentioned in that post where one transaction commits before another but at a later timestamp, for transactions operating on disjoint Raft clusters.

No, it’s not, for the reasons you cited (it wouldn’t produce incorrect results, but it would produce an unacceptable number of aborted and retried transactions). We transfer HLC timestamps on most messages in the system (including all KV operations and our periodic health checks on every connection) to keep the HLCs as up to date as we can.
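For illustration, building once more on the simplified Clock sketch from earlier in the thread (not the actual CRDB RPC code), piggybacking a timestamp on each message and folding it into the receiver’s clock might look like:

```go
// Another file in the same illustrative package; a sketch of the idea, not
// the actual CockroachDB RPC code.
package hlc

// Update forwards the local clock to a timestamp observed on an incoming
// message, if that timestamp is ahead of anything seen locally.
func (c *Clock) Update(remote Timestamp) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote.WallTime > c.maxSeen.WallTime ||
		(remote.WallTime == c.maxSeen.WallTime && remote.Logical > c.maxSeen.Logical) {
		c.maxSeen = remote
	}
}

// Header is a hypothetical request/response header carrying the sender's
// clock reading.
type Header struct {
	Timestamp Timestamp
}

// The sender stamps every outgoing message:
//
//	header.Timestamp = clock.Now()
//
// The receiver folds the remote reading into its local clock before
// processing the message:
//
//	clock.Update(header.Timestamp)
```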