Cluster is broken after upgrade attempt to 2.1.4

Hi there,

We’ve been evaluating CockroachDB as a replacement for the main storage in our product. For about two months it’s been running flawlessly, even with a relatively high data volume. But today I tried to perform a rolling upgrade from 2.1.3 to 2.1.4 and it all went off the rails. I’ve attached a screenshot from the admin UI, and this is how it’s been looking since the upgrade:

In the logs there are tons of messages like:

W190215 03:28:16.807392 454083 storage/node_liveness.go:558 [n1,s2,r162112/5:/Table/1869/1/154309{68…-73…}] slow heartbeat took 1.0s

and

E190215 03:28:46.864518 454727 storage/replica_range_lease.go:301 [n1,s2,r55335/4:/Table/1848/1/153000{16…-63…}] mismatch incrementing epoch for {NodeID:4 Epoch:9 Expiration:1550201310.486186299,0 Draining:false Decommissioning:false}; actual is {NodeID:4 Epoch:9 Expiration:1550201328.491538477,0 Draining:false Decommissioning:false}

In terms of setup:

  • 5 nodes on google cloud, 12 CPU/45G RAM each, all running Ubuntu 18.04.2
  • 3 SSD disks of 1TB each attached to every node, with all three configured as store paths for the cockroach server (start command sketched below).
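
Roughly, each node is started along these lines (the paths and join list here are illustrative placeholders, not our exact values):

  cockroach start \
    --certs-dir=/opt/cockroach/certs \
    --store=/mnt/ssd1 --store=/mnt/ssd2 --store=/mnt/ssd3 \
    --join=crdb-1,crdb-2,crdb-3,crdb-4,crdb-5 \
    --background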

Trying to run cockroach debug zip fails with lots of errors like debug/nodes/1/logs/cockroach.crdb-1.crdb.2019-02-07T07_49_31Z.006492.log: rpc error: code = DeadlineExceeded desc = context deadline exceeded.

I’ve tried reverting to 2.1.3 but it didn’t help.
Luckily we haven’t switched production traffic to crdb yet, but this outcome is pretty disturbing.

I’d really appreciate any hints on what’s going on and how to tame it.

Hi Max,

I’m sorry you’re having this trouble. Let’s try to get your cluster back on its feet and then figure out exactly what’s going on.

First, to collect some more information since debug zip isn’t working, I’d like to see screenshots of https://localhost:9090/#/raft/messages/all and copies of https://localhost:9090/debug/pprof/goroutine?debug=1 and https://localhost:9090/debug/pprof/profile?debug=1 (the latter will take about 30 seconds to download). You can email these to me (my first name @cockroachlabs.com).
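
For example, something along these lines from one of the nodes should grab the two profiles (add curl’s -k/--insecure or point it at your CA cert depending on your TLS setup, and note that a secure cluster may require an admin UI login for the debug pages):

  curl -k -o goroutine.txt 'https://localhost:9090/debug/pprof/goroutine?debug=1'
  curl -k -o profile.out 'https://localhost:9090/debug/pprof/profile?debug=1'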

Next, to try and bring things back up, stop all the nodes, then restart them with the following environment variables set:

  • COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS=60
  • COCKROACH_SCAN_INTERVAL=60m
  • COCKROACH_SCAN_MIN_IDLE_TIME=1s
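
For example, if you start the nodes by hand, the restart on each node would look roughly like this (the start flags are placeholders for whatever you normally use):

  # After stopping the cockroach process on the node:
  export COCKROACH_RAFT_ELECTION_TIMEOUT_TICKS=60
  export COCKROACH_SCAN_INTERVAL=60m
  export COCKROACH_SCAN_MIN_IDLE_TIME=1s
  cockroach start --certs-dir=<certs> --store=<store1> --store=<store2> --store=<store3> \
    --join=<join-list> --background

If the nodes run under systemd or similar, set the same variables in the unit’s environment instead.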

I think that will stabilize the cluster and allow it to come back up without under-replicated or unavailable ranges. If so, we can make some other configuration changes to bring it into a healthier long-term state. If not, we’ll keep looking (maybe with the help of the files I asked for above).

-Ben

Ben, thank you for your response!

Actually, this morning the cluster seems much better: all the under-replicated and unavailable ranges are gone, the counts are back to normal (from 600k down to 200k), and there’s a significant drop in lots of graphs:

So apart from raft heartbeats, everything seems to be back to normal. Which is great, of course; however, it took 10 hours to stabilize, during which the cluster wasn’t fully available, which wouldn’t be great if it were running in production :frowning:
Is there anything we can do to decrease the recovery time, or maybe do a postmortem to figure out the root cause of this outage?

I can try running debug zip again if it would help with understanding the problem.

I really appreciate your help, thank you, Ben!

Alright, it’s good that it mostly recovered on its own, although it’s surprising that raft heartbeats and network bytes are much higher after the incident than before. Just to confirm, this is all on 2.1.3 now and not using the environment variables I suggested before, right?

Raft ticking time was trending down in that screenshot. Did it go all the way to zero, and if so did raft heartbeats or network bytes change at that time? (I’d expect heartbeats to show a sudden drop when ticking time nears zero, and network bytes to experience a more gradual decline). The Replica Quiescence chart on the replication metrics dashboard would also be interesting.

How exactly did you perform the upgrade? Did you take everything down at once or restart one node at a time? (if one at a time, how long did you wait in between?) We recommend restarting one node at a time, and that might help avoid issues like this, although it looks like certain timeouts may not be tuned well for this many ranges per node.
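
For reference, one round of a rolling upgrade usually looks something like this on each node (this assumes the process is managed by systemd; adapt it to however you run cockroach):

  # Repeat on one node at a time; wait until the node shows as live and the
  # under-replicated range count settles in the admin UI before moving on.
  sudo systemctl stop cockroach
  sudo cp /path/to/cockroach-v2.1.4 /usr/local/bin/cockroach
  sudo systemctl start cockroach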

The root cause here is that you have a bunch of ranges that are normally in a quiescent state, the restart woke them all up, and the resulting per-range overhead ran into some sort of bottleneck. I’m not sure exactly what the bottleneck was in this case, but either way the fix is to reduce the number of ranges. First, run set cluster setting kv.range_merge.queue_enabled=true (this was off by default because it wasn’t quite stable in 2.1.0, but in 2.1.3 it should be fine). You may see the range count start to drop immediately; if it does, wait for it to stabilize before proceeding. (Based on the status pages showing 130k replicas taking 1TB = ~7MB/replica, I don’t expect much of a drop here.)
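
As a sketch, that setting can be changed from any node via the CLI (the connection flags are placeholders for your setup):

  cockroach sql --certs-dir=<certs> --host=localhost \
    --execute="SET CLUSTER SETTING kv.range_merge.queue_enabled = true;"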

Next we’re going to change the zone configurations to allow larger ranges. This can increase memory usage in some cases, but you’re well-provisioned in this area so it shouldn’t be a problem. I’ll give these instructions for the default configuration, but if you’ve created any additional zone configs you’ll need to repeat these steps for each zone (at least the ones containing large tables). First run show zone configuration for range default to verify the current state. It’ll look like this:

  zone_name |                config_sql
+-----------+------------------------------------------+
  .default  | ALTER RANGE default CONFIGURE ZONE USING
            |     range_min_bytes = 1048576,
            |     range_max_bytes = 67108864,
            |     gc.ttlseconds = 100000,
            |     num_replicas = 5,
            |     constraints = '[]',
            |     lease_preferences = '[]'
(1 row)

We’re interested in range_min_bytes (default 1MB) and range_max_bytes (default 64MB). Run alter range default configure zone using range_max_bytes=268435456 (256MB). This will stop new ranges from being created as quickly.
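
As a sketch, both statements from the CLI (connection flags again placeholders):

  cockroach sql --certs-dir=<certs> --host=localhost \
    --execute="SHOW ZONE CONFIGURATION FOR RANGE default;" \
    --execute="ALTER RANGE default CONFIGURE ZONE USING range_max_bytes = 268435456;"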

Now, unless the range count dropped significantly as soon as you enabled the merge queue, we need to merge more of the ranges that have already split at the smaller size. (We’re looking for the replica count to drop from 130k/node to around 50k). The fastest way to do this is to set range_min_bytes to 128MB, but I think it’s a bit risky to take that big of a jump at once. I recommend going in stages, setting range_min_bytes to the values 16MB, 32MB, 64MB, 96MB, 128MB in turn. Watch the Merge Queue chart on the queues metrics page as you do this; wait for successful and pending actions to reach zero and stay there for a few minutes before taking the next step. If the replicas/node count reaches 50k, go ahead and stop there.
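
As a sketch, the first of those steps (run each subsequent one only after the merge queue has drained and the replica count has settled):

  # 16MB first; then 33554432 (32MB), 67108864 (64MB), 100663296 (96MB),
  # 134217728 (128MB). Stop early if replicas/node reaches ~50k.
  cockroach sql --certs-dir=<certs> --host=localhost \
    --execute="ALTER RANGE default CONFIGURE ZONE USING range_min_bytes = 16777216;"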

After all of this is done, I recommend setting range_min_bytes back to its original value of 1MB.

Alright, it’s good that it mostly recovered on its own, although it’s surprising that raft heartbeats and network bytes are much higher after the incident than before. Just to confirm, this is all on 2.1.3 now and not using the environment variables I suggested before, right?

Correct, the cluster is still on 2.1.3 and I have not set the env variables you suggested. Would it still make sense to set those now, btw?

Raft ticking time was trending down in that screenshot. Did it go all the way to zero, and if so did raft heartbeats or network bytes change at that time? (I’d expect heartbeats to show a sudden drop when ticking time nears zero, and network bytes to experience a more gradual decline). The Replica Quiescence chart on the replication metrics dashboard would also be interesting.

Yup, those graphs normalized 5-6 hours after my post. Overall the entire cluster seems to be completely back to normal, which can be seen very clearly in the graphs:

How exactly did you perform the upgrade? Did you take everything down at once or restart one node at a time? (if one at a time, how long did you wait in between?) We recommend restarting one node at a time, and that might help avoid issues like this, although it looks like certain timeouts may not be tuned well for this many ranges per node.

Yes, this is exactly how I did it - one node at a time, with minimal intervals (1-2 minutes) in between.

First, run set cluster setting kv.range_merge.queue_enabled=true

Done, but there wasn’t any visible decrease in the range count.

Now, unless the range count dropped significantly as soon as you enabled the merge queue, we need to merge more of the ranges that have already split at the smaller size. (We’re looking for the replica count to drop from 130k/node to around 50k). The fastest way to do this is to set range_min_bytes to 128MB, but I think it’s a bit risky to take that big of a jump at once. I recommend going in stages, setting range_min_bytes to the values 16MB, 32MB, 64MB, 96MB, 128MB in turn. Watch the Merge Queue chart on the queues metrics page as you do this; wait for successful and pending actions to reach zero and stay there for a few minutes before taking the next step. If the replicas/node count reaches 50k, go ahead and stop there.

I’ve increased the min_bytes to 16MB and indeed there was a spike in the Merge Queue graph.
The processing speed, however, was pretty slow (or the number of ranges is just too big), and at that pace bringing the total number of ranges down to 50k would have taken days. Since I had to drop and re-ingest this data anyway (for internal reasons), that’s what I’m going to do, but I’ll leave these parameters set for the new data.

After all of this is done, I recommend setting range_min_bytes back to its original value of 1MB.

How will this impact future data? The current amount of data in crdb is not even half of what we have atm, and it’s growing every single day. What will happen when the number of ranges reaches these high values again at some point in the future, even with the adjusted range_max_bytes?
Is there some practical limit on how much data (expressed as the number of ranges) a CockroachDB cluster can handle? In our case we were expecting to migrate around 5TB worth of data, with potential growth to dozens of TB in the year(s) to come. Can crdb handle this much data in theory and in practice?

Thank you, Ben!

No. Those were temporary measures aimed at getting the cluster healthy enough that the cluster settings and zone config statements could work.

Cool. Thanks for those graphs. It’s interesting that pending heartbeats and raft time recovered on a different slope than leaseholders and quiescence.

OK. Yeah, merging is fairly slow when you’re merging ranges that are already large, so if you need to re-import it makes sense to drop and start over. But in case this happens in production you have the option of letting the merge queue do its thing.

If you haven’t dropped the data yet, do ALTER TABLE t CONFIGURE ZONE USING gc.ttlseconds=600 before running DROP TABLE to allow the old data to be deleted sooner (it’s harder to change this setting once the table is already in the trash).
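
For a hypothetical table t, that’s:

  cockroach sql --certs-dir=<certs> --host=localhost \
    --execute="ALTER TABLE t CONFIGURE ZONE USING gc.ttlseconds = 600;" \
    --execute="DROP TABLE t;"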

In theory, absolutely. In practice, as you’ve seen, there are some rough edges. In version 2.1 there are issues when the number of replicas per node is higher than 50k. This can be mitigated either by running more, smaller nodes (instead of 5 12-CPU machines with 3TB of storage each, you could run 15 4-CPU machines with 1TB each), or by increasing the range_max_bytes setting. This is something that improves with each release. We’re doing final testing now to see how many replicas our spring 2019 release will be able to support, and we plan to make improvements in this area a focus of the fall 2019 release.

By the way, I’ve filed this issue to track improvements that apply to this incident.

Thank you very much, Ben!

Would it help to debug this issue more deeply if I provided the debug zip output?

I think I have enough to go on now, but it can’t hurt to have more if you’d like to send a debug zip (my first name @cockroachlabs.com).

Okay, I’ll send you a link shortly. It’s pretty big (the archive) o_O

Thanks for the debug zip. One thing I found in the logs is that you were having clock synchronization problems while all this was going on, especially on node 5.

$ fgrep "clock synchronization error" */logs/*
1/logs/cockroach.crdb-1.crdb.2019-02-14T21_57_03Z.000865.log:F190214 23:04:09.286860 313641 server/server.go:248  [n1] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (3 of 6 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_14_00Z.016046.log:F190214 21:15:16.066938 253 server/server.go:248  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (2 of 5 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_15_26Z.030286.log:F190214 21:15:27.792021 152 server/server.go:248  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_15_38Z.030351.log:F190214 21:15:38.697344 141 server/server.go:248  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_15_49Z.030417.log:F190214 21:15:49.603640 47 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_15_59Z.030477.log:F190214 21:16:00.371418 104 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_16_10Z.030538.log:F190214 21:16:11.123190 138 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T21_16_21Z.030602.log:F190214 21:16:21.878268 127 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
2/logs/cockroach.crdb-2.crdb.2019-02-14T22_59_33Z.001355.log:F190214 22:59:41.656859 166 server/server.go:248  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 2 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_01_24Z.013411.log:F190214 22:51:37.965972 186 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (3 of 6 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_51_49Z.028729.log:F190214 22:51:55.646650 122 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_52_05Z.028791.log:F190214 22:52:06.570454 129 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_52_16Z.028852.log:F190214 22:52:17.347029 73 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_52_58Z.001261.log:F190214 22:53:10.686125 165 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (1 of 2 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_53_21Z.001983.log:F190214 22:53:21.747053 118 server/server.go:248  [n3] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_53_32Z.002230.log:F190214 22:53:32.674559 47 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_53_42Z.002301.log:F190214 22:53:43.431622 87 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_53_53Z.002387.log:F190214 22:53:54.147412 47 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
3/logs/cockroach.crdb-3.crdb.2019-02-14T22_54_04Z.002447.log:F190214 22:54:04.896929 119 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
4/logs/cockroach.crdb-4.crdb.2019-02-14T22_31_20Z.006646.log:F190214 22:43:49.913208 3395 server/server.go:248  [n4] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (3 of 6 are within the offset)
4/logs/cockroach.crdb-4.crdb.2019-02-14T22_44_01Z.023488.log:F190214 22:44:06.345632 149 server/server.go:248  [n4] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
4/logs/cockroach.crdb-4.crdb.2019-02-14T22_44_16Z.023597.log:F190214 22:44:17.192194 113 server/server.go:248  [n4] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_08_53Z.027185.log:F190214 21:25:20.429245 13547 server/server.go:248  [n5] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (2 of 4 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_25_32Z.010935.log:F190214 21:25:34.045071 29 server/server.go:248  [n5] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_25_44Z.010999.log:F190214 21:25:44.812063 58 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_25_54Z.011060.log:F190214 21:25:55.586441 151 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_27_10Z.011139.log:F190214 21:27:10.986500 167 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T21_27_21Z.011196.log:F190214 21:27:21.812277 45 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_06_15Z.012021.log:F190214 22:19:17.654778 1860 server/server.go:248  [n5] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (3 of 6 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_19_29Z.012277.log:F190214 22:19:31.242642 195 server/server.go:248  [n5] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_19_41Z.012349.log:F190214 22:19:42.005516 109 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_19_52Z.012410.log:F190214 22:19:52.762335 125 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_02Z.012472.log:F190214 22:20:03.514569 57 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_13Z.012536.log:F190214 22:20:14.260512 119 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_24Z.012597.log:F190214 22:20:25.010744 95 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_35Z.012656.log:F190214 22:20:35.749293 167 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_45Z.012720.log:F190214 22:20:46.513739 156 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_20_56Z.012798.log:F190214 22:20:57.266440 134 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_21_07Z.012865.log:F190214 22:21:08.019384 143 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_21_18Z.012954.log:F190214 22:21:18.774574 57 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_21_28Z.013021.log:F190214 22:21:29.542843 139 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_21_39Z.013087.log:F190214 22:21:40.296674 59 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_21_50Z.013151.log:F190214 22:21:51.049127 107 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_22_01Z.013227.log:F190214 22:22:01.794585 124 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_22_11Z.013292.log:F190214 22:22:12.534979 120 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)
5/logs/cockroach.crdb-5.crdb.2019-02-14T22_22_22Z.013355.log:F190214 22:22:23.262381 169 server/server.go:248  [n?] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (0 of 1 are within the offset)

Node 5 was crash-looping for almost an hour, from 21:25 to 22:22, and other nodes were crashing too (node 3 hit this problem somewhat later than the rest).

Clock sync problems are generally independent of anything else you’re doing with the database. I don’t know of any way that this attempted upgrade and restart could cause clock problems, although the timing is certainly suspicious. Verify that your ntp settings are consistent with our recommendations and check out the clock offset chart on the runtime page of the UI.
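
A few quick ways to check clock sync on each node (which one applies depends on whether those Ubuntu hosts run ntpd, chrony, or systemd-timesyncd):

  timedatectl status   # shows whether the system clock is synchronized
  ntpq -p              # peer list and offsets, if ntpd is installed
  chronyc tracking     # current offset and skew, if chrony is installed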

For reference, one test cluster we have on GCP usually experiences clock offsets of under 100µs, although this week a couple of nodes have had offsets as high as 180ms. These events correspond with GCE’s “live migrations” which show up in the GCE web UI under “Operations” as “Automatically migrate an instance”. Did any of those happen to your cluster at around the time of the incident? 180ms is a higher offset than we’ve seen with live migrations in the past, and it’s high enough that there’s a significant risk of clock synchronization panics in CRDB.

Yes, I’ve seen these errors. At the same time as the crdb upgrade I was also updating the underlying OS packages (apt-get upgrade), so I guess that may have caused the clock synchronization issues, since all the VMs were properly configured NTP-wise.