Query does not end

This is a follow-up to my previous thread: High service latency

This occurred on a single node running with almost no load. A delete query was running every 5-10 seconds against a table with 7 rows in it; no rows matched the query, so nothing was ever actually deleted.
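Roughly, the client loop looks like this (the table and column names are illustrative, not my real schema, and I've abstracted the driver's query call into a `queryFn` parameter):

```javascript
// Rough sketch of the workload described above. Table and column names
// are illustrative assumptions, and queryFn stands in for the driver's
// query call (e.g. pool.query from node-postgres).

// Build the parameterized DELETE; with a cutoff in the distant past it
// matches none of the 7 rows, so nothing is ever deleted.
function deleteStatement(cutoff) {
  return {
    text: 'DELETE FROM events WHERE created_at < $1',
    values: [cutoff],
  };
}

// Fire the delete once every 5-10 seconds.
function startDeleteLoop(queryFn) {
  const tick = () => {
    queryFn(deleteStatement(new Date(0)))
      .catch((err) => console.error('delete failed:', err));
    setTimeout(tick, 5000 + Math.random() * 5000);
  };
  tick();
}
```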

After running for several hours, I got the same issue I had in the previous thread, where the queries do not finish.

Here is the result of show queries: https://pastebin.com/raw/69DeEqnd

Along with the queries that no longer respond, there appear to be many duplicates of the same query with the same timestamp, even though there should only be one query per timestamp.

I have since killed my client so there should be zero activity, but more duplicates of these queries continue to show up in show queries, with the same timestamps as before: https://pastebin.com/raw/mXNe2V9k (truncated from 200 rows)

Running the same query manually does not have any issues.

Around the time the queries stopped responding, the only messages in the log that looked out of place (everything else was regular info logs) were some warnings, which are still occurring: https://pastebin.com/raw/nrs4ujRE

Does anybody have any insight into my problem?

Thanks

Additional info: I manually cancelled three random queries and it cleared all the remaining queries: https://pastebin.com/raw/4wCHsEsd

There were over 300 queries in the result of show queries.

Hi Pooper,

This is indeed very strange behavior. I’m curious about your observation that it doesn’t happen when you run the query manually. Can you provide more details about the client configurations you’re using?

I’m assuming by running manually you mean that you’re running it through cockroach sql. Are you using a client driver or ORM the rest of the time?

Thanks,
Andrew

Yes, I meant cockroach sql.

My client is Node.js using the node-postgres library. I am not using an ORM.
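In the meantime, so stuck statements surface as errors instead of hanging the client forever, I can race each query against a timer. This is a generic sketch, not node-postgres-specific (if I understand the docs correctly, the driver also accepts `query_timeout`/`statement_timeout` options that do something similar):

```javascript
// Race a query promise against a timer so a stuck statement surfaces
// as an error instead of hanging forever. The timeout value is an
// arbitrary choice for illustration.
function withTimeout(queryPromise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`query timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([queryPromise, timeout]).finally(() => clearTimeout(timer));
}
```

Used as `withTimeout(pool.query(stmt), 10000)` in place of a bare `pool.query(stmt)`.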

Reducing GC TTL to 10 minutes seems to have fixed my issues.
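For reference, I changed it with a zone configuration along these lines (the exact syntax depends on the CockroachDB version, and the table name here is a placeholder):

```sql
ALTER TABLE mydb.mytable CONFIGURE ZONE USING gc.ttlseconds = 600;
```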

Hello again Pooper,

Glad to hear that reducing the GC interval seems to fix your issue. That’s also helpful information for us.

We’d still like to get to the bottom of this. I’ve so far been unable to repro it myself using the queries from your other post. Can you confirm whether the schema and DELETE query are the same as you wrote there? You said here “almost no load”; can you clarify what other load is on this cluster? Also, you said no rows are matched by the delete query. Is that the case every time, or are there sometimes rows to delete?

Thanks for your patience here, hopefully we can get a firm answer for you on this.

Andrew

Schema is the same, yes. The query is also the same, but it isn’t always stuck on the same query. Sometimes it’s a different delete query, sometimes a select.

“Almost no load” means a couple of selects and updates every couple of seconds.

In the case that prompted this thread, no rows were being deleted at any time. But I had this issue before when there were rows to be deleted.

I’m able to force the issue pretty reliably now. Here is my test script: https://pastebin.com/raw/Tx5mXUkg
I run 4-5 instances of this, and the scripts get stuck awaiting a response almost immediately.
And here is the result of show queries after I kill the scripts and wait a couple of minutes: https://pastebin.com/raw/vc4k9vgd
And 7 minutes later: https://pastebin.com/raw/eb96zYCS

Running the exact same query with cockroach sql also gets stuck. Adding a LIMIT of 10 to the delete query succeeds, but a LIMIT of 100 gets stuck.
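Since the small LIMIT works, one workaround is to issue the delete in small batches until nothing is left. A sketch (as before, `queryFn` stands in for the driver's query call, and the table name and batch size are illustrative assumptions):

```javascript
// Workaround sketch: run the delete in small batches so each statement
// touches at most batchSize rows, looping until a short batch signals
// that nothing is left. queryFn stands in for the driver's query call.
async function batchedDelete(queryFn, batchSize) {
  let total = 0;
  for (;;) {
    const res = await queryFn({
      text: 'DELETE FROM events WHERE created_at < $1 LIMIT $2',
      values: [new Date(0), batchSize],
    });
    total += res.rowCount;
    // A short batch means there is nothing left to delete.
    if (res.rowCount < batchSize) break;
  }
  return total;
}
```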

Thanks for providing that test script, that should be very helpful as we try to track this down. Interestingly, I am still unable to repro the behavior you’re seeing. I’ve had eight instances of your script running against a three node cluster on my little laptop for twenty minutes, and it looks like the 99th percentile latency has settled in at about 800ms. None of the scripts seem to be getting stuck. I’ll keep this load running for a little while to see if I can get it to happen to me.

Let’s assume there’s some other discrepancy between our configurations. Can you confirm the version of CockroachDB you’re using?

Also, let’s gather a bit more information about what’s happening when the queries are stuck. Could you force the issue, then screenshot these pages in the admin UI:

http://cockroach-node:8080/#/reports/nodes
http://cockroach-node:8080/#/reports/problemranges

Perhaps one of them will have some clues as to what’s going wrong.