Query extremely slow when joining

sql

(Elliot C.) #1

I have two tables, products and variations. A simplified version of the tables:

CREATE TABLE products (
  product_id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  account_id                  UUID NOT NULL,
  product_title               STRING(500) NULL,
  INDEX ix_products_primary (product_id,account_id) STORING (product_title),
  INDEX ix_products_secondary (account_id) STORING (product_id),
  CONSTRAINT fk_products_account_id FOREIGN KEY (account_id) REFERENCES accounts (account_id) ON DELETE RESTRICT
);

CREATE TABLE variations (
  variation_id                UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  account_id                  UUID NOT NULL,
  product_id                  UUID NOT NULL,
  sku                         STRING(50) NOT NULL,
  INDEX ix_variations_primary (variation_id,product_id) STORING (sku),
  CONSTRAINT fk_variations_account_id FOREIGN KEY (account_id) REFERENCES accounts (account_id) ON DELETE RESTRICT,
  CONSTRAINT fk_variations_product_id FOREIGN KEY (product_id) REFERENCES products (product_id) ON DELETE RESTRICT
);

I have about 11 million rows in each table, and when I query each table individually with one of the queries below I get a response in ~20ms.

SELECT
  products.product_id
FROM products
WHERE
  products.account_id='5cf9c65a-6dec-4ac6-96a6-5dc562b50c73'
LIMIT 10;

SELECT
  variations.variation_id
FROM variations
WHERE
  variations.account_id='5cf9c65a-6dec-4ac6-96a6-5dc562b50c73'
LIMIT 10;

But if I combine them into a single join query, it takes ~30 seconds.

SELECT
  products.product_id,
  variations.variation_id,
  variations.sku
FROM products
INNER JOIN variations ON variations.product_id=products.product_id
WHERE
  products.account_id='5cf9c65a-6dec-4ac6-96a6-5dc562b50c73'
LIMIT 10;

And this is the EXPLAIN output for the join query:

limit
 └── render
      └── join
           │   type      inner
           │   equality  (product_id) = (product_id)
           ├── scan
           │   table     products@ix_products_secondary
           │   spans     /"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"-/"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"/PrefixEnd
           └── scan
               table     variations@primary
               spans     ALL

What can I change about my query, or what indexes should I create, to help speed this up? I’m quite stumped.

Thank you very much


(Raphael 'kena' Poss) #2

Hi Elliot,

Before I can help you further, can you tell us whether the account_id column in the variations table contains the same information as in products? That is, is the following query equivalent:

SELECT
  products.product_id,
  variations.variation_id,
  variations.sku
FROM products
INNER JOIN variations ON variations.product_id=products.product_id
WHERE
  products.account_id='5cf9c65a-6dec-4ac6-96a6-5dc562b50c73'
AND variations.account_id='5cf9c65a-6dec-4ac6-96a6-5dc562b50c73'
LIMIT 10;

(Elliot C.) #3

Yes, that is correct. While it is redundant, I figured having that data present on both records would help when I want to query one or the other individually.


(Raphael 'kena' Poss) #4

OK, then just to start: can you confirm that adding the condition on account_id to both the variations and products tables somewhat alleviates your performance issue? We can certainly look further, but I want to verify we are on the right path.


(Elliot C.) #5

This is the EXPLAIN output for the query you posted:

limit
 └── render
      └── join
           │   type            inner
           │   equality        (product_id) = (product_id)
           │   mergeJoinOrder  +"(product_id=product_id)"
           ├── scan
           │   table           products@ix_products_secondary
           │   spans           /"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"-/"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"/PrefixEnd
           └── scan
               table           variations@ix_variations_primary
               spans           /"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"-/"\\\xf9\xc6Zm\xecJƖ\xa6]\xc5b\xb5\fs"/PrefixEnd

This confused me even more. I used the exact query you posted, and the first time or two I ran it I got a response in ~150ms. But now it’s back up to 2-3 seconds?

Edit:
It seems if I disconnect and reconnect the query time drops back to ~150ms.

Edit 2:
It seems inconsistent as to when it is and isn’t slow to respond. What other things can I check to help diagnose this issue?


(Raphael 'kena' Poss) #6

I’m paging in some colleagues who may be able to help you further.


(Andrew Dona-Couch) #7

Glad to hear we’re (at least sometimes) getting better performance there. As you can see from the last scan node in the two query plans, adding the condition on account_id allows the database to jump straight to the relevant rows rather than doing a full table scan. Future improvements to joins (using a lookup join instead of a merge join) should help make this even faster, but it doesn’t look like that’s an option right now.
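As a sketch of the kind of index that could let the variations scan be both constrained and covering (the index name is invented here, and this is untested against your data), you could lead with account_id and store the columns the query reads:

```sql
-- Hypothetical covering index: with account_id leading, the WHERE clause
-- constrains the scan on variations, and STORING the join column and sku
-- means the query can be answered from this index alone.
CREATE INDEX ix_variations_account ON variations (account_id, product_id) STORING (sku);
```

Whether this actually beats the merge-join plan above depends on your data distribution, so it is worth comparing EXPLAIN output with and without it.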

I can’t comment on the inconsistency, but I’m confident that @knz and the experts he pulls in can get to the bottom of it.


(Elliot C.) #8

I think I’ve found the inconsistency.

I have a cluster of 3 towers, and my code targets a load balancer that just does a round-robin between the three nodes.

I’ve found that when I run the query directly against nodes 1 and 2, it is extremely slow. But when I run it directly against node 3, it’s extremely fast: 50-150ms now.

Is this an issue with how I have my data replicated between the nodes?


(Andrew Dona-Couch) #9

It will be useful to have a few more details about your setup:

  • What is the network setup for the cluster, particularly what’s the latency between the three nodes? (You can see a latency report by going to /#/reports/network on the web UI)
  • What version of CockroachDB are you running, and how are you starting the nodes?
  • Have you configured the replication in some way or are you running with defaults?
  • Is there anything else unique about your configuration?

(Elliot C.) #10

I’m running v2.0.0 on all 3 of my nodes, and I’m starting them using the following command in /etc/rc.local:

cockroach start --insecure --store=/home/ready/scale-node1 --host=192.168.0.20 --port=26257 --http-port=8080 --background --join=192.168.0.16:26257,192.168.0.17:26257,192.168.0.20:26257

And that command is on all 3 nodes (Ubuntu 16.04) with the only difference being the host IP.

I’m using default settings for everything.

And there is nothing unique that I am aware of.

Edit:
n5 is the node that responds quickly to that query; the other 2 nodes are the slow ones.


(Andrew Dona-Couch) #11

Thanks for the details, nothing really stands out to me here. I’m going to need to find someone with deeper knowledge of this to help investigate further.


(Andrew Dona-Couch) #12

One other report you could look at is the Problem Ranges: /#/reports/problemranges. If anything is non-zero that might point to the source of your issue.


(Elliot C.) #13

It looks like there aren’t any.

Something else I’m about to try: I’ve not power cycled these towers in probably 5 or 6 months. I’ve stopped and started CockroachDB on all 3 of them multiple times during that time, though. Would doing regular power cycles on these machines possibly help? n5 was the last one to be completely shut down and restarted, granted that was several months ago.


(Andrew Dona-Couch) #14

I’d be really surprised if power-cycling the machines made the query run faster, but I suppose it doesn’t hurt to try and see?

I’m still trying to find someone who can help debug this, unfortunately most of our team is at an offsite this week, so they’re available only inconsistently.


(Elliot C.) #15

It did not help unfortunately. I really appreciate the help so far though, hopefully we’ll be able to figure it out.


(Elliot C.) #16

Is there anything else I can try in the meantime? Or what could be causing this issue independent of CockroachDB, like a hardware problem?


(Elliot C.) #17

So I ran SHOW TRACE FOR on the query, and I’m getting this message:

pq: root: memory budget exceeded: 7821168 bytes requested, 132150848 currently allocated, 134217728 bytes in budget

While running the query might take less memory than tracing it, is it possible that the slowness is due to insufficient memory on my machines? And if that is the case, what would be the recommended memory for queries joining tables with 11 million rows each?
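For what it’s worth, the 134217728-byte budget in that error is exactly 128 MiB, which is CockroachDB’s default --max-sql-memory. A sketch of restarting a node with a larger SQL memory budget, extending the start command above (the 25% value is illustrative, not a tested recommendation for this workload):

```shell
# Same flags as before, plus an explicit SQL memory budget.
# 25% of system RAM is an illustrative value; adjust per machine.
cockroach start --insecure --store=/home/ready/scale-node1 --host=192.168.0.20 \
  --port=26257 --http-port=8080 --background \
  --max-sql-memory=25% \
  --join=192.168.0.16:26257,192.168.0.17:26257,192.168.0.20:26257
```

Each node would need the same flag (with its own --store and --host values, as before) and a restart for it to take effect.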