How to make the most of the Amazon Time Sync Service?

From what I understand, this is an atomic clock service delivered via NTP, see https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-the-amazon-time-sync-service/

Now I’m wondering, is it possible to improve the performance (latency) of Cockroach thanks to this? If so, what settings should I change?

A similar question was actually asked a few days ago - Using PTP instead of NTP?

To summarize, if you have a lot of conflicting transactions, you should theoretically see some performance increase, but likely not much difference in other cases.

If you want to give it a shot, you can add a --max-offset=XXXms flag to your start commands with a smaller value than the default 500ms. That value needs to be consistent across the cluster, however, so if you already have one running, you’ll need to stop all nodes before adding the flag and starting them back up.
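As a sketch of what that looks like in practice (the 250ms value and hostname here are made-up examples — measure your actual offsets before picking a value, and substitute your own existing start flags):

```shell
# Stop every node first -- this flag must match cluster-wide,
# so it can't be changed one node at a time.
cockroach quit --host=node1.example.com   # repeat for each node

# Then restart each node with the same, smaller max offset.
# 250ms is purely illustrative.
cockroach start --max-offset=250ms --host=node1.example.com <other flags>
```
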

I spoke to AWS this week to try to get some more details about the level of service we can expect from the Amazon Time Sync Service.

They were quite tight-lipped so I don’t have any figures I can give you.

They did seem extremely confident that the drift should be very small during normal operation, but I was specifically interested in worst-case behaviour. It did sound like it would be small enough to get full linearizability.

Also, since their service uses leap second smearing just like Google's, it should be a bit friendlier to use cases like this.

We haven’t had a chance to do much testing with Amazon’s time sync service yet, but it looks like a good solution for applications deployed to that platform. One warning, though: they do leap second smearing, which is good but non-standard. If any of your nodes use time sources that smear leap seconds, then all of them must, and they must all smear the leap second the same way. (I haven’t seen any official confirmation of whether Google’s and Amazon’s leap second smearing are compatible with each other.)

In general, better clock synchronization can improve your tail latencies (by reducing retries) but not your mean/median performance. To realize these gains, you’ll need to set the --max-offset flag to something smaller than the default of 500ms. I don’t know what the best value for Amazon Time Sync would be; my recommendation would be to run a test cluster for a while to collect data. (We’re not graphing clock offsets yet, so the results won’t be visible in a current build, but the cluster will be collecting data that can be displayed once the UI is updated.) One nice thing about TrueTime is that it explicitly models the uncertainty in the clocks, so the delay can be adjusted dynamically based on actual conditions instead of a fixed max-offset chosen up front.

On our AWS test clusters (just using standard NTP with external sources, not AWS time sync), we typically see offsets in the single-digit milliseconds with spikes up to ~20ms. The default max-offset of 500ms is pretty conservative.

If the max clock offset gets small enough, you could switch to linearizable mode (which adds the max clock offset to all your read latencies). However, that mode has not been well tested in practice, and I wouldn’t recommend using it yet.

I was interested in this and asked AWS how their Time Sync Service handles leap smear. Their reply:

  1. The smearing algorithm uses a quadratic spline function.
  2. The slewing rate of the smearing server adjusts its local clock to 1000 ppm.
  3. The server time smoothing process will start when the clock gets to 00:00:00 UTC and it will take 17 hours 34 minutes to finish.
  4. The smearing server’s frequency offset will change by 0.001 ppm per second and will reach a maximum of 31.623 ppm.
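Those four numbers are internally consistent, by the way. Here’s a quick back-of-the-envelope check, assuming the frequency offset ramps linearly up to its peak and back down (the symmetric triangular profile implied by a quadratic spline in the time offset):

```python
# Back-of-the-envelope check of the figures in AWS's reply, assuming a
# triangular frequency-offset profile (linear ramp up, linear ramp down).

ramp = 0.001e-6          # item 4: frequency changes 0.001 ppm/s (as a fraction)
peak = 31.623e-6         # item 4: maximum frequency offset of 31.623 ppm

t_peak = peak / ramp                 # seconds to ramp up to the peak
duration = 2 * t_peak                # ramp up plus ramp down
absorbed = 0.5 * duration * peak     # area under the triangle, in seconds

print(duration / 3600)               # ~17.57 hours, matching item 3's 17 h 34 m
print(round(absorbed, 3))            # 1.0 -- exactly one leap second absorbed
```

So the peak rate, ramp rate, and duration all line up with smearing exactly one second.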

My read of this is that it is not compatible with Google’s leap smear. :disappointed: I might be mistaken, but my understanding is that this makes it unsafe for a CockroachDB deployment that spans both AWS and GCP to use the AWS Time Sync Service, unless the --max-offset was very high. Is that correct?
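To put a rough number on the incompatibility, here’s a sketch comparing the AWS parameters above with Google’s documented 24-hour linear smear (noon to noon, centred on the leap second). This model is my own reconstruction, not anything official from either vendor; both smears are expressed as the fraction of the one-second leap absorbed so far, with t measured in seconds from the leap second at 00:00:00 UTC:

```python
# Rough comparison of AWS's quadratic smear (per their reply above, starting
# at 00:00:00 UTC) with Google's 24-hour linear smear (centred on the leap).

AWS_RAMP = 0.001e-6         # frequency change per second: 0.001 ppm/s
AWS_PEAK_T = 31_623         # seconds to reach the 31.623 ppm peak
AWS_END_T = 2 * AWS_PEAK_T  # ~17 h 34 m total window

def aws_offset(t):
    """Seconds of the leap absorbed by the quadratic smear at time t."""
    if t <= 0:
        return 0.0
    if t >= AWS_END_T:
        return 1.0
    if t <= AWS_PEAK_T:
        return 0.5 * AWS_RAMP * t * t           # accelerating half
    r = AWS_END_T - t
    return 1.0 - 0.5 * AWS_RAMP * r * r         # decelerating half

def google_offset(t):
    """Google's linear smear: a 24 h window centred on the leap second."""
    return min(1.0, max(0.0, (t + 43_200) / 86_400))

# Worst-case divergence between the two smears, sampled once a minute.
worst = max(abs(aws_offset(t) - google_offset(t))
            for t in range(-43_200, AWS_END_T + 43_200, 60))
print(round(worst, 3))   # -> 0.567
```

Under these assumptions the two smears disagree by up to roughly 0.57 s mid-smear — more than even the default 500ms --max-offset — which supports the "unsafe unless --max-offset was very high" reading.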

Correct. If you’re spanning AWS and GCP you’ll need to use the same time source for both of them. In practice this means using Google’s time source, since theirs is public and (last time I looked) Amazon’s is only accessible from within AWS.

It’s interesting that Amazon has changed their policy here. In 2015, they did a 24-hour smear that appears to be compatible with Google’s implementation.
