High service latency

I have a 3-node cluster running. I ran a very small test load on the cluster and let it run for a couple of hours. By the end, cluster performance was severely degraded.

Even after stopping the test and running just one query per second, the response takes a very long time (my client times out after 60s, so I don’t know exactly how long).

This issue has happened to me once before, on a single-node local instance.

The service latency graph shows a latency of 10.2 seconds with almost no variation: it’s either exactly 10.2 seconds or under 100ms. I have attached an image. (The node with 0ns is decommissioned.)

What is happening here? What can I do to fix this or diagnose this?

@pooper If you run SET CLUSTER SETTING sql.trace.txn.enable_threshold = '1s' in a SQL console, it will start printing out detailed performance information for the queries that take longer than the threshold (I plugged in 1s for you). This information shows up in the CockroachDB log and will help us debug your slow queries. This cluster setting slows down every query while it’s on, so you don’t want to leave it on in production, but it’s great for these load tests.
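
For reference, a minimal sketch of turning this tracing threshold on and back off; the 0s value for turning it off is an assumption based on a zero duration meaning “disabled”:

-- Trace transactions that take longer than 1 second and log the details.
SET CLUSTER SETTING sql.trace.txn.enable_threshold = '1s';

-- Turn tracing back off when finished (assumption: a zero duration disables it).
SET CLUSTER SETTING sql.trace.txn.enable_threshold = '0s';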

Thanks, Dan.

Here is one of the results of a long query: http://termbin.com/gvxu

The table being queried is empty.

Here is the node status table: https://pastebin.com/raw/F4YYutbY

Shouldn’t the replicas_leaseholders values sum up to be the same as ranges?

(Also, how do I turn off the trace setting?)

Edit:
When I run SHOW QUERIES, there are a lot of queries still running from hours ago, and the list isn’t getting any shorter.
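
A minimal sketch of inspecting and canceling the stuck statements, assuming a CockroachDB version that supports CANCEL QUERY; the query ID below is a placeholder standing in for a value copied from the query_id column of the SHOW QUERIES output:

-- List the queries currently executing across the cluster.
SHOW QUERIES;

-- Cancel one stuck query by its query_id (placeholder value shown).
CANCEL QUERY '15f92b1cabc8e6b10000000000000001';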

@pooper The “waiting for overlapping request” lines indicate that you have contention, though I’m surprised by how long the waits are. Can you share a little more about your workload?

Also, is there any chance that your client code is leaving transactions open for a while (more than a second)?

During the load test, rows were being added to an inputs table at a rate of 6-12 per second.

During and after the load test (after the db stopped responding):

This query was running exactly once per second:
SELECT * FROM inputs WHERE queue = $1

This query was running roughly every 1-2 seconds:
DELETE FROM inputs WHERE (queue = $1) AND (round < $2) RETURNING NOTHING

There are also many other reads happening irregularly at the same time.

I have not wrapped any of my queries in a transaction.

In the list of long-running queries, the delete query was the one that still had not finished after hours. After manually canceling the long-running queries, response times returned to normal.
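
A minimal two-session sketch of the contention Dan described, using the queries from this workload (the literal values are placeholders): while the long-running DELETE is still executing, an overlapping scan on the same queue can end up waiting behind it.

-- Session 1: a long-running range delete over one queue's rows.
DELETE FROM inputs WHERE (queue = 'q1') AND (round < 100) RETURNING NOTHING;

-- Session 2: issued while session 1 is still running; its scan overlaps the
-- same span and waits ("waiting for overlapping request") until session 1
-- finishes or is canceled.
SELECT * FROM inputs WHERE queue = 'q1';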

I removed RETURNING NOTHING from the delete query and updated my client to wait for it to finish before running another one. Monitoring the active queries, even when there are few or no active queries, the service latency sometimes still goes up to 10-16 seconds.

Looking at the logs, I get a bunch of these errors:

W171109 18:31:13.801586 229 storage/node_liveness.go:399 [n2,hb] slow heartbeat took 2.3s
W171109 18:31:13.801610 229 storage/node_liveness.go:342 [n2,hb] failed node liveness heartbeat: context deadline exceeded
W171109 18:31:14.778698 27413233 storage/node_liveness.go:399 [n2,s2,r403/3:/System/tsd/cr.node.exec.late…] slow heartbeat took 10.3s
E171109 18:31:14.778725 27413233 storage/replica_range_lease.go:192 [n2,s2,r403/3:/System/tsd/cr.node.exec.late…] heartbeat failed on epoch increment
E171109 18:31:14.778848 27414385 storage/replica_range_lease.go:197 [timeSeriesMaintenance,n2,s2,r116/3:/System/tsd/cr.store.queue.gc…] mismatch incrementing epoch for {NodeID:1 Epoch:27 Expiration:1510252263.844544343,0 Draining:false Decommissioning:false}; actual is {NodeID:1 Epoch:27 Expiration:1510252268.344588194,0 Draining:false Decommissioning:false}

Sometimes there are bunches of new range lease messages (50+ in a row).

And sometimes there are connection errors from attempts to connect to the decommissioned node.

Not sure if that’s relevant or not.

Edit: some more query traces: http://termbin.com/7wh9

Could this be something to do with the leases constantly being lost/transferred?
Node status is showing ranges being unavailable for short periods of time: https://pastebin.com/raw/5cULrdh1

@pooper The " slow heartbeat took" message usually means that the disk is
slow (or has bizarre performance characteristics). We’ve seen this on some
cloud vms. What sort of hardware are you running on?

The delete query running for hours is concerning. What’s the schema of your table? About how big is each row? How many rows are returned by the SELECT * FROM inputs WHERE queue = $1 query (and deleted by the DELETE FROM inputs WHERE (queue = $1) AND (round < $2) query)?

You might try putting a LIMIT on the DELETE query and running it until it doesn’t need to delete any rows. Maybe start with LIMIT 100 and tune from there.
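
A minimal sketch of the batched delete Dan suggests, assuming a CockroachDB version that accepts LIMIT on DELETE; the batch size of 100 is just the starting point mentioned above:

-- Delete at most 100 matching rows per statement.
DELETE FROM inputs WHERE (queue = $1) AND (round < $2) LIMIT 100;

-- Re-run the statement (for example, from a client-side loop) until it
-- reports 0 rows deleted, then stop.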

I am using dedicated servers which are only running CockroachDB: 4 cores, 8 threads, 32 GB RAM. Two are using SSDs and one is using an HDD.

Here are the schemas:

CREATE TABLE queues (
    id STRING NOT NULL,
    round INT NOT NULL DEFAULT 0:::INT,
    start_time TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    end_time TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    supervisor STRING NOT NULL DEFAULT '':::STRING,
    last_update TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
    timeout INTERVAL NOT NULL DEFAULT '10s':::INTERVAL,
    "interval" INTERVAL NOT NULL DEFAULT '5s':::INTERVAL,
    task_size INT NOT NULL DEFAULT 500:::INT,
    task_timeout INTERVAL NOT NULL DEFAULT '10s':::INTERVAL,
    team_size INT NOT NULL DEFAULT 5:::INT,
    input_max_age INT NOT NULL DEFAULT 5:::INT,
    CONSTRAINT "primary" PRIMARY KEY (id ASC),
    FAMILY "primary" (id, round, start_time, end_time, supervisor, last_update, timeout, "interval", task_size, task_timeout, team_size, input_max_age)
);

CREATE TABLE inputs (
    id INT NOT NULL DEFAULT unique_rowid(),
    queue STRING NOT NULL,
    round_inserted INT NOT NULL,
    round INT NOT NULL,
    input STRING NOT NULL,
    type INT NOT NULL,
    size INT NOT NULL,
    role STRING NOT NULL,
    score INT NOT NULL,
    metadata STRING NULL,
    deleted BOOL NOT NULL DEFAULT false,
    CONSTRAINT "primary" PRIMARY KEY (id ASC),
    CONSTRAINT fk_queue FOREIGN KEY (queue) REFERENCES queues (id),
    UNIQUE INDEX unique_inputs_idx (queue ASC, input ASC, round ASC),
    INDEX queue_round_score_idx (queue ASC, round ASC, score ASC),
    INDEX queue_deleted_idx (queue ASC, deleted ASC, input ASC),
    FAMILY "primary" (id, queue, round_inserted, round, input, type, size, role, score, metadata, deleted)
);

Around 100-300 rows are returned by the SELECT, and about the same number are deleted by the DELETE.

By the way, I noticed you have been replying via email. I have actually been editing my replies on the forum with more information and logs so I think you missed those.

Ah, thanks for the heads up. I was indeed responding via email.

The range unavailability is because the node liveness heartbeats are slow. The other things you’re seeing are likely downstream failures from that. I’m noticing that our production documentation doesn’t mention this, but CRDB is much happier on nodes with similar performance characteristics. Can you try running it on three nodes all with SSDs or all with HDDs?

I also have a similar problem to pooper’s, although my use case is mostly upserts. My opinion is that we need to be gentle with CRDB for now and only do functional tests, not load tests, and wait for https://github.com/cockroachdb/cockroach/issues/19699 to be finished.

I tried again on a new cluster with 3 HDDs and I haven’t had the service latency issue since.

However, I still have delete queries that get stuck and never complete. Is it possible this is caused by the connection being timed out by HAProxy while the query is running? Or by the client crashing?

I will try again with a limit on the delete.

Actually, I think my client is timing out because the query never finishes, not the other way around.

When I canceled the stuck queries, the remaining queries continued normally, including the more recent deletes, which deleted about 1000 rows in one go.

Glad to hear it! I suspect most of your problems were ultimately derived from the hardware being heterogeneous. I’ll make sure we call this out in our documentation.