We’ve been running 2.1-rc1 and slowly upgrading to 2.1-rc2, but we’ve now seen numerous occasions where the cluster becomes completely unavailable for an unknown reason. I’m not sure the issue is specific to 2.1, since we only started using CockroachDB with the 2.1 betas.
We run 20 n1-standard-2 servers across 6 regions on 2 continents. Usually the issue is isolated to a single continent because we have a table per continent and use zone constraints to limit which servers hold data for each table (rough sketch below). Typically what happens is that a single server has some short, momentary issue (yet to be identified), which then causes a cascading failure: the cluster tries to transfer leases away from that server, but that process overloads every server and pins them at 100% CPU utilization, which in turn causes more lease transfers. During this time our application starts failing as queries begin to take multiple seconds, and it stays that way until we stop all instances of the application, wait for the dust to settle, and then start it up again.
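For context, the per-continent pinning is done with zone constraints along these lines (the table names and locality tiers here are simplified placeholders, not our exact schema):

```sql
-- Placeholder sketch of the per-continent setup: each table is constrained
-- to nodes started with a matching --locality flag, e.g.
--   --locality=continent=europe,region=europe-west3
ALTER TABLE requests_europe CONFIGURE ZONE USING
    constraints = '[+continent=europe]';

ALTER TABLE requests_america CONFIGURE ZONE USING
    constraints = '[+continent=america]';
```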
I’ve been poring over the logs, but I never seem to see anything around the time that things start to slow down.
In the latest occurrence, there was an issue with a server in europe-west3 (still unknown what the issue was), which caused QPS to drop significantly around 13:02 UTC as the 95th percentile of query latency went over 4 seconds. QPS then started to drop in europe-west1 and europe-west2 around 13:05 UTC.
I don’t see anything in the logs until 13:03, when it said:
“batch [2/205/0] commit took 575.40404ms (>500ms)”
as well as a bunch of:
“slow heartbeat took …”
I’m not sure which graphs would help diagnose this, but here are some of the application metrics:
Here’s a graph of the KV transactions spiking, though this started after the QPS had already begun to drop:
The queries to one of the servers in europe-west3 shot up, and I think this is what perpetuated the problem:
There were lots of lease transfers; I can only assume the other nodes thought the server above was dead, since it was stuck at 100% CPU and could barely respond to anything:
Let me know what else I can provide to help. During the outage I tried setting kv.raft_log.synchronize=false
to see if that would help with the raft log commit times being so high, but it didn’t seem to make a difference. I also grabbed a debug zip as things started to return to normal.
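For the record, the setting change was just this, run from a SQL shell:

```sql
-- Tried temporarily during the outage in case fsync latency was behind the
-- high raft log commit times; it didn't appear to make a difference.
SET CLUSTER SETTING kv.raft_log.synchronize = false;
```

If it would help, I can also dump the leaseholder distribution for the affected table with SHOW EXPERIMENTAL_RANGES FROM TABLE <table> and attach that along with the debug zip.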