Slow replication after node replacement

General issue described here:

Essentially, we are replacing nodes for OS upgrades. After a node replacement, the under-replicated ranges take more than 30 minutes to replicate. We’re managing somewhere between 20 and 30 MB of data, so I would expect replication to take less time than that.

Any ideas or strategies for what we could do to decrease the time?

Hi @nate-kennedy,

We do have a cluster setting, kv.snapshot_recovery.max_rate, that can be adjusted to upreplicate faster. However, the default is 8MiB/s, so I doubt that’s your issue here with 30MiB of data. It’d be helpful to see a cockroach debug zip so we can look for other clues to what might be slowing things down. It’d also be helpful to see the output of the network latency report (located at http://<adminurl>/#/reports/network).
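To make the reasoning about the rate limit concrete, here is a rough back-of-the-envelope sketch. It assumes the rate limit is the only constraint and treats the reported data size as 30 MiB; the function and variable names are purely illustrative.

```python
MIB = 1024 * 1024  # bytes per MiB


def transfer_seconds(data_bytes, rate_bytes_per_sec):
    """Time to move data_bytes if throughput is capped only by the rate limit."""
    return data_bytes / rate_bytes_per_sec


data_bytes = 30 * MIB        # upper end of the reported 20-30 MB (assumed MiB)
rate_limit = 8 * MIB         # default kv.snapshot_recovery.max_rate per the thread
observed_seconds = 30 * 60   # the roughly 30 minutes reported

ideal = transfer_seconds(data_bytes, rate_limit)
effective_kib_per_sec = data_bytes / observed_seconds / 1024

print(f"Ideal time at the rate limit: {ideal:.2f} s")
print(f"Effective rate over 30 min: {effective_kib_per_sec:.1f} KiB/s")
```

At the default limit the transfer itself would take only a few seconds, so the observed throughput is orders of magnitude below the cap, which is why something other than the rate limit is the more likely culprit.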

Let me know if you need a private store for the debug zip and I can provide one.

Hey @tim-o, thanks for getting back so quickly. Yes, I can confirm that kv.snapshot_recovery.max_rate is at the default value of 8MiB/s.

Latency does not appear to be a concern either:

I have generated a debug log. Can you provide the store location so I can ship the logs to you?

Hi @nate-kennedy,

Invite sent - let me know if you didn’t receive it.

@tim-o Invite received. I have uploaded a debug bundle from a node that was slow to replicate.

Hey @nate-kennedy - it looks like the zip doesn’t contain log folders - they should be under debug/nodes/<nodeId>/logs/<logfile>. I don’t see the folders at all. Do you see the same on your end after unzipping the archive?

Were you running with logs to stderr?

Hi @tim-o, yes, we were running with the --logtostderr flag. The cockroach process on our nodes runs as a systemd service and thus logs to journald. Unfortunately, the logs have rotated since the node replacement.

We have another node replacement scheduled for Monday. I will make sure to capture the logs from that activity and share the information back to you.

Thanks again for all your help so far.

Hi @tim-o,

I have uploaded a new debug zip and log for the issue. In this instance, it took approximately 15 minutes for the under-replicated ranges to reach 0.

Please let me know if you need any additional info. Thx!

Received! A couple observations:

  • Given only 20-30 MB of data, I’m surprised to see so many ranges. It looks like there are a couple hundred at least. Did you presplit a table, or is it spread out across many tables?
  • It looks like the cluster begins to upreplicate almost immediately after n6 is taken down, as expected.
  • The logs indicate that n9 is busy applying new snapshots (new ranges) for roughly half an hour, as you described. You can see a number of "store

What’s interesting is the length of time it takes to apply snapshots: if you search through the logs for “streamed snapshot” you can see that it takes somewhere between 2ms and 1.5-1.6s to apply each. They vary in size from tens of kv pairs to thousands, and the larger the snapshot in terms of kv pairs, the longer it takes to apply.
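If it helps to repeat that search on a captured log, a minimal sketch might look like the following. It only filters lines containing the phrase and makes no assumption about the rest of the log format; the function names and the sample lines are hypothetical placeholders, not real CockroachDB output.

```python
def snapshot_lines(lines, needle="streamed snapshot"):
    """Return the lines that contain the search phrase, stripped of newlines."""
    return [line.rstrip("\n") for line in lines if needle in line]


def snapshot_lines_in_file(path, needle="streamed snapshot"):
    """Same filter applied to a log file on disk (path supplied by the caller)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return snapshot_lines(handle, needle)


# Made-up placeholder lines, used only to show the filter working.
sample = [
    "placeholder entry one: streamed snapshot to some replica\n",
    "placeholder entry two: unrelated message\n",
    "placeholder entry three: streamed snapshot to another replica\n",
]

matches = snapshot_lines(sample)
print(f"{len(matches)} matching lines")
```

Counting the matches and eyeballing the timings on each one is usually enough to see whether apply time grows with snapshot size.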

How large is each row in your dataset? I can’t see any indication of a problem on n8, at least - the logs are consistent with a cluster trying to upreplicate a significant amount of data. The only part that doesn’t fit is the fact that it’s only 20-30MB. Has the cluster been up and running for a long time? Is there more data in the system ranges? You can check the size of the system and time series data at http://<adminurl>/#/databases/tables.

The only other thing I can think of checking is the log and command commit latency on n9. What kind of disks are you using?

Hey @tim-o, I snapped that from the page you asked me to visit. In regard to your other questions:

  • Did you presplit a table, or is it spread out across many tables? No, there are only 7 + 11 tables across the two DBs.
  • How large is each row in your dataset? Not very large; see screenshot #2.
  • Has the cluster been up and running for a long time? Yes… I’d say the cluster has been around for more than a year now.
  • Is there more data in the system ranges? See the first screenshot. The first box is for our data. The second box is for system.
  • What kind of disks are you using? This is a 100 GB AWS gp2 device, so the baseline is somewhere around 400 IOPS, burstable to several thousand.

One other data point I’m hoping you can help me reconcile is the disparity between the used capacity in the storage view (see screenshot #3) and what is listed in the DBs (screenshot #2). As you can see from the storage screenshot, there are several gigabytes in use. Is that expected?