Unrecoverable performance degradation with 2.1

(James Hartig) #1

We’ve been running 2.1-rc1 and slowly upgrading to 2.1-rc2, but we’ve now seen numerous occasions where the cluster becomes completely unavailable for an unknown reason. I’m not sure the issue is specific to 2.1, since we’ve only been using CockroachDB since the 2.1 betas.

We run 20 n1-standard-2 servers across 6 regions and 2 continents. Usually the issue is isolated to a single continent, because we have one table per continent and use constraints to limit which servers hold each table’s data. Typically, a single server has some short momentary issue (cause not yet identified), which triggers a cascading failure: the cluster tries to transfer leases away from that server, but that process overloads every server, pinning them at 100% CPU utilization, which then causes more lease transfers. During this time our application starts failing as queries take multiple seconds, and it stays that way until we stop all instances of the application, wait for the dust to settle, and then start it up again.

I’ve been poring over the logs, but I never see anything around the time that things start to slow down.

In the most recent occurrence, there was an issue with a server in europe-west3 (cause unknown) that caused QPS to drop significantly around 13:02 UTC, as the 95th percentile of query latency went over 4 seconds. QPS then started to drop in europe-west1 and europe-west2 around 13:05 UTC.

I don’t see anything in the logs until 13:03 where it said:
“batch [2/205/0] commit took 575.40404ms (>500ms)”
as well as a bunch of:
“slow heartbeat took …”

I’m not sure what graphs to show to help diagnose but here’s some of the application metrics:

Here’s a graph of the KV transactions spiking but this started after the QPS already started to drop:

The queries to one of the servers in europe-west3 shot up and I think this is what perpetuated the problem:

There were lots of lease transfers; I can only assume the other nodes thought the server above was dead, since it was stuck at 100% CPU and could barely respond to anything:

Let me know what else I can provide to help. During the outage I tried setting kv.raft_log.synchronize=false to see if that helped with the raft log commit times being so high, but it didn’t seem to make a difference. I also captured a debug zip as things started to return to normal.
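For anyone following along: that cluster setting controls whether Raft log writes are synced to disk before being acknowledged, so it is worth flipping back once things stabilize. A minimal sketch, assuming the 2.1 setting name:

```sql
-- Restore the default (synchronous Raft log writes) after the experiment;
-- leaving this off risks data loss if a node crashes.
SET CLUSTER SETTING kv.raft_log.synchronize = true;
```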

(James Hartig) #2

One odd thing I noticed is that the number of “lease holders” has dropped (red line):

The above screenshot is the last 6 hours.

(Tim O'Brien) #3

@fastest963, can you provide the output of cockroach debug zip? I’ll open an issue for you so you can send it privately, keep an eye out for it momentarily.

(Tim O'Brien) #4

Just sent. Let me know if you didn’t receive it.

(James Hartig) #5

Yup! I received it, thanks! I responded with a link to the zip, since it was too large to attach to a reply.

(James Hartig) #6

I left the application in the europe-west3 region offline until last night, when I tried to re-enable it. Almost immediately, the QPS in all 3 Europe regions dropped to almost 0 until I stopped the europe-west3 application again. This happened after upgrading the Europe servers to 2.1 stable.

Here’s the queries from each application server in the Europe continent:

Query percentile graph:

Oddly enough, node 8 had a massive spike in queries despite being in us-west1 and not holding any ranges for the table that the Europe application talks to:

We set up the table with:

ALTER TABLE eu_sessions CONFIGURE ZONE USING constraints = '[+continent=eu]', gc.ttlseconds = 3600;
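To rule out misplaced replicas as the cause of node 8’s spike, the zone config and range placement can be inspected directly. A sketch assuming 2.1 syntax (SHOW EXPERIMENTAL_RANGES reports each range’s replicas and current lease holder):

```sql
-- Confirm the zone config actually applied to the table:
SHOW ZONE CONFIGURATION FOR TABLE eu_sessions;

-- List each range of the table with its replica nodes and lease holder;
-- node 8 should not appear if the constraint is being honored:
SHOW EXPERIMENTAL_RANGES FROM TABLE eu_sessions;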

So the server getting a lot of queries might just be a coincidence?

The distributed queries graph showed a big spike which is interesting because I don’t think any of our typical queries are distributed:

It does seem like the same spike in distributed queries happened when the original issue happened as well.

I can send over a new debug zip for this second incident if you would like; let me know whether I should use the same email as before or a new one.

(James Hartig) #7

I noticed a lot of the following in the running queries:

15630845463d33cf000000000000000b |      11 | root      | 2018-11-01 15:03:04.324157+00:00 | SELECT "hashedPassword" FROM system.public.users WHERE (username = $1) AND ("isRole" = false) | <admin>        | $ internal-get-hashed-pwd |    true     | executing

If I switched to certificates instead of a password, would that help?
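For context on the certificate idea: client certificates authenticate during the TLS handshake, so they would avoid that internal hashed-password lookup on each new connection. A rough sketch of issuing one with the cockroach CLI (the user name "myapp" and the paths are assumptions, not from this thread):

```shell
# Create a client certificate/key pair for the application's SQL user
# (requires access to the cluster's CA key):
cockroach cert create-client myapp \
  --certs-dir=certs \
  --ca-key=my-safe-directory/ca.key
```

The application would then connect with the sslcert/sslkey connection parameters instead of a password.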

(Ron Arévalo) #8

Hey James,

Can you send over the debug zip from today where you experienced the same issue? Same method as yesterday would be fine.

(James Hartig) #9

I sent the zip over yesterday from the second incident. I also upgraded all of the servers from 2 CPUs to 4 CPUs, so I’ll be slowly ramping traffic back up and seeing how they hold up today.

(Tim O'Brien) #10

Thanks @fastest963 - you can keep communication entirely on the zendesk ticket if that’s easier, I’ll update the community once we find the root cause.