Queries performance issues


(y) #1

Hey all,
I’m trying to run some benchmarks against a randomly - filled cockroach db, with just one table.
My table has a text column, named id, which for now just contains the values “1”… “10000”, another column named accountid (a string as well), and a few more timestamp / date / text colums

I have 2 clusters - one running locally with 3 dockers (following the instructions you posted), and another one runs in a similar way, on AWS cloud.

In both cases, queries like select * from table take 4-5 seconds until they return the first result (my code is written in GO, using the pq driver). Running the query from one of the nodes directly (using your sql client) is just a bit faster, reducing the query to 3-4 seconds.

If i run more specific queries, such as select id from table where accountid = 'myaccount' - it performs better, but only for cases where the index covers the columns im asking for. In all the cases my query included columns not covered by the particular index the db has used, performance was slow again.

Am i missing something very basic? This DB is supposed to be used as a storage for millions of rows, with a basic logical partition by the accountid column. The queries can be based on different permutations of the columns, therefor im not sure it’ll make sense to try and build specific indices for all the cases.

Thanks for your help!


(jordan) #2

Hi @ypatent,

The time-to-first-byte of large queries will be much improved in our upcoming 1.1 release, thanks to a new streaming result feature that was added. Previously, the entire result set was buffered by the gateway node before sending it to the client. You can try that feature out if you’d like in one of our alpha releases.

As far as your other question, could you make your example a little more concrete? What is the latency difference you’re experiencing between a covering and non-covering query? It’s expected that query performance will be worse in this case, since extra key-values must be fetched from the database.

If the performance is unacceptable, consider using an index with stored columns. That colocates a copy of the table data with the index, which should speed up your use case. Take a look at the docs for some more detail.

Jordan


(y) #3

Hey, thanks for your reply!

The difference between covering and non covering is around x10 (300ms for covering vs 3s for non covering).

If i choose to index the table - is there any limit i should be aware of? My table has 10 or so different columns, all of which can be used in the where clauses. Does creating 10 indices sound like an acceptable solution?


(jordan) #4

Hmm, 300ms for an indexed single row retrieval is quite worse than expected. You can use EXPLAIN on a query to check what it’s doing under the hood. Mind sharing the query you’re running, the indexes available, and the EXPLAIN output?


(John A Lauro) #5

Performance hit on too many indexes is generally (SQL in general, not certain about cockroach) is during inserts and updates. If you do not have performance issues with that many indexes on inserts or updates, then adding 10 indices might be the best option if each of those columns are hit on a where clause at different times and without other columns. If multiple columns are included, then it’s probably not the best way to setup the indexes…


(y) #6

Hey all,

This is the full version of the table im querying (what i posted earlier was a shortened version):

  email_id                   STRING    NOT NULL,
  batch_id                   STRING    NULL,
  email_address              STRING    NULL,
  email_type                 STRING    NULL,
  reminder_num               INT       NULL,
  order_id                   STRING    NULL,
  order_timestamp            TIMESTAMP NULL,
  product_id                 STRING    NULL,
  sku                        STRING    NULL,
  review_type                STRING    NULL,
  coupon_code                STRING    NULL,
  email_sent_timestamp       TIMESTAMP NULL,
  email_sent_date            DATE      NULL,
  app_key                    STRING    NULL,
  unsubscribed_timestamp     TIMESTAMP NULL,
  clicked_through_timestamp  TIMESTAMP NULL,
  arrived_early_timestamp    TIMESTAMP NULL,
  opened_timestamp           TIMESTAMP NULL,
  marked_spam_timestamp      TIMESTAMP NULL,
  failed_timestamp           TIMESTAMP NULL,
  invalid_address_timestamp  TIMESTAMP NULL,
  content_type               STRING    NULL,
  content_creation_timestamp TIMESTAMP NULL,
  review_form                STRING    NULL,
  platform                   STRING    NULL,
  CONSTRAINT "primary" PRIMARY KEY (email_id ASC)
);

as of indices, i currently have the following:
index on app_key, which stores (by default i guess) the email_id
another index on app_key and email_sent_timestamp, which also stores the email_id

I have ~100,000 records of random data, all of which use the same app_key - “randomappkey1”.

Now, when i run a query that can be covered entirely by one of the indicies, such as SELECT email_sent_timestamp, count(*) AS c FROM emails WHERE email_sent_timestamp >= '2017-08-18' AND email_sent_timestamp <= '2017-01-13' AND app_key = 'randomappkey1' GROUP BY email_sent_timestamp ORDER BY email_sent_timestamp;, it takes around 0.8 seconds to get the result set back to the client.
This is the execution plan:

0,"sort","",""
0,"","order","+email_sent_timestamp"
1,"group","",""
2,"render","",""
3,"scan","",""
3,"","table","emails@emails_app_key_email_sent_timestamp_idx"
3,"","spans","/""randomappkey1""/2017-01-13T00:00:00Z-/""randomappkey1""/2017-08-18T00:00:00.000001Z"

If to this query i add another condition, not covered by index, such as AND content_creation_timestamp is not null, it takes ~5seconds to get the data back to the client.
This is the execution plan for that query:

0,"sort","",""
0,"","order","+email_sent_timestamp"
1,"group","",""
2,"render","",""
3,"index-join","",""
4,"scan","",""
4,"","table","emails@emails_app_key_idx"
4,"","spans","/""randomappkey1""-/""randomappkey1\x00"""
4,"scan","",""
4,"","table","emails@primary"

So, to my understanding, the issues can solved by changing the way buffering, you mentioned in the last response, works and by indexing completely all the fields required by queries (which will always include app_key and email_sent, and can also optionally include one of the other timestamp fields, and the email_type and reminder num fields)