Issues with NTP and 500ms+ clock drift

Overnight, half of my 10-node cluster died after logging a number of clock-related errors indicating nodes were more than 500ms apart:

server/server.go:190  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
goroutine 13 [running]:
	(*loggingT).outputLogEntry(…)
	/go/src/…
	(*Context).runHeartbeat(…)
	/go/src/…
	(*Context).GRPCDial.func1.2.1(…)
	/go/src/…
	(*Stopper).RunWorker.func1(…)
	/go/src/…
created by (*Stopper).RunWorker
	/go/src/…
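For context, the check behind this error can be sketched roughly like this. This is a simplified illustration, not CockroachDB's actual implementation; the 500ms threshold and the "at least half of the known nodes" rule are taken from the error message above:

```python
# Simplified sketch of a clock-sync health check: a node considers itself
# unhealthy if it is more than max_offset away from at least half of its
# known peers. Mirrors the error message above; NOT CockroachDB's real code.

MAX_OFFSET_SECONDS = 0.5  # 500ms, per the error message

def clock_is_healthy(peer_offsets, max_offset=MAX_OFFSET_SECONDS):
    """peer_offsets: measured clock offsets (seconds) to each known peer."""
    if not peer_offsets:
        return True  # no peers to compare against
    within = sum(1 for off in peer_offsets if abs(off) <= max_offset)
    # Healthy only if at least half of the known peers are within the offset.
    return within * 2 >= len(peer_offsets)

# "0 of 1 are within the offset" -> unhealthy, matching the log line above.
print(clock_is_healthy([0.8]))            # False
print(clock_is_healthy([0.1, 0.2, 0.9]))  # True: 2 of 3 within 500ms
```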

This isn’t the first time this has happened: a couple of weeks ago a single node died due to the same 500ms drift. Honestly, we have never paid much attention to the configuration of our NTP servers and clients; they are set up to get time from stratum 1 servers and have faithfully kept clocks synced to at least the second for years. But now that we need sub-500ms offsets between clocks, is there a recommended time synchronization package or ntpd configuration that would solve this issue?

Thanks for reporting this, @somecallmemike. Based on this discussion, we’ve also seen the need to ensure NTP is running correctly in different environments. Unfortunately, we don’t yet have helpful docs here, though improvements are planned for the next major release.

In the meantime, if you can provide more details about your deployment environment, I can try to loop in someone to help.

We usually run ntpd with the default config, and that generally works well.
We have seen issues on certain VMs where the VM manager and ntpd fight to set the clock. Historically this has been a problem on Hyper-V (used mostly by Azure).

One thing to be aware of is that you may need to run ntpdate -b <time server> before starting ntpd to force a sync at startup. Otherwise, ntpd may take a long time to correct the clock, since it slews gradually rather than stepping.
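As a concrete sketch of that startup ordering (the hostname is a placeholder, and the service commands assume systemd; adjust for your init system):

```shell
#!/bin/sh
# Sketch: force a one-shot clock step before starting ntpd, so the daemon
# starts from an already-correct clock instead of slewing slowly toward it.
# "ntp1.example.com" is a placeholder for your own time server.

systemctl stop ntpd          # ntpdate cannot run while ntpd holds port 123
ntpdate -b ntp1.example.com  # -b: step (set) the clock immediately
systemctl start ntpd         # ntpd then keeps the clock disciplined
```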

Can you share some more details about the VMs you are using?

Our CRDB nodes are running the latest CentOS 7 on VMware, split between two DCs about 100 miles apart as the crow flies. We own and operate a 100Gb fiber ring between them, so connectivity is as rock solid as it can be. In each DC we have a physical stratum 2 NTP server running ntpd, and each of them is configured to source its clock from a number of open-access stratum 1 servers. Each CRDB node sources its clock from the stratum 2 physical boxes; we have never relied on VMware for clock synchronization. It’s a pretty basic setup designed to get clocks as close to stratum 0 as possible and avoid any hypervisor-related clock issues.

I’m surprised you’re reaching 500ms of drift with that setup. Assuming reasonably well-behaved VMs, NTP should have no trouble keeping well below 100ms.
Do you have any monitoring for your OS-level clock drift? We use the prometheus node_exporter with -collectors.enabled="conntrack,diskstats,entropy,filefd,filesystem,hwmon,loadavg,mdadm,meminfo,netdev,netstat,ntp,sockstat,stat,textfile,time,uname,vmstat" -collector.ntp.server <your ntp server>. I think they removed the ntp collector after 0.13.0, so we’re still on that version.

You can also see each cockroach node’s view of its clock drift relative to the rest of the cluster through the /_status/vars metrics; this is plotted on the “runtime” dashboard included in our prometheus/grafana example.
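For anyone wiring this up, a minimal Prometheus scrape config for that endpoint might look like the following. The job name and target addresses are placeholders; 8080 is CockroachDB's default admin UI port, so adjust the port and TLS settings to your deployment:

```yaml
scrape_configs:
  - job_name: 'cockroachdb'
    metrics_path: '/_status/vars'
    static_configs:
      - targets: ['node1:8080', 'node2:8080']  # placeholder addresses
```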

@marc, we did some research and found that Red Hat recommends configuring more than two NTP servers on each client so they can form a quorum in the event of a falseticker, which would otherwise skew the weighted time calculation. We added two more NTP servers to our system and configured the clients to read all four. We also added ‘tinker panic 0’ to our config per VMware’s recommended NTP client configuration, so that ntpd doesn’t abandon synchronization if the clock suddenly skews due to a host hardware issue. These changes seem to have eliminated the clock skew; we are now seeing sub-100ms differences between clocks.
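The falseticker point can be illustrated with a toy example. This is a deliberate simplification of ntpd's real intersection/selection algorithm, not its actual code: with only two sources, a skewed server cannot be outvoted, but with four, the agreeing majority identifies it.

```python
# Toy illustration of why more than two NTP sources help: pick the largest
# group of servers whose reported offsets (in seconds) mutually agree within
# a tolerance, and treat servers outside that group as falsetickers.
# A simplification of ntpd's intersection algorithm, not its actual code.

def find_truechimers(offsets, tolerance=0.05):
    """Return the largest subset of offsets that agree within tolerance."""
    best = []
    for anchor in offsets:
        group = [o for o in offsets if abs(o - anchor) <= tolerance]
        if len(group) > len(best):
            best = group
    return best

# Two servers, one skewed by 800ms: every "agreeing group" has size 1,
# so there is no majority and no way to tell which server is wrong.
print(find_truechimers([0.001, 0.8]))
# Four servers, one falseticker: the three agreeing sources form the
# majority and the 800ms outlier is excluded.
print(find_truechimers([0.001, -0.002, 0.003, 0.8]))
```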

Here is a sample config in case anyone else needs one:

tinker panic 0

restrict default kod nomodify notrap nopeer noquery
restrict -6 default kod nomodify notrap nopeer noquery
restrict -6 ::1

server w.w.w.w iburst
server x.x.x.x iburst
server y.y.y.y iburst
server z.z.z.z iburst

driftfile /var/lib/ntp/drift