Optimizing select performance


(Max) #1

Hi there,

I’m looking for help to better understand the select performance I’m observing on our test crdb cluster.

In terms of setup, there are 15 nodes (4 CPUs, 15 GB RAM, and a 1 TB SSD each) running crdb 2.1.5 on a Kubernetes cluster on GCP.

There’s a single database with a couple of thousand tables, each having a very simple schema:
key - an integer, basically just a timestamp as a number
payload - just a JSONB blob
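
Roughly, each table looks like this (the table name is a placeholder):

```sql
-- Sketch of the per-table shape described above; "series" is an assumed name.
CREATE TABLE series (
    key INT PRIMARY KEY,  -- timestamp stored as a number
    payload JSONB
);
```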

Ingested data is pretty much of time-series shape; that is, keys are mostly sequential and monotonically increasing, with occasional out-of-order updates.

Each table has anywhere between a couple of hundred and hundreds of millions of records, about 1 TB of raw data in total (3 TB with replication).

The select queries we run are very simple; they always involve the key field, which is the primary key. For example:

SELECT key,payload FROM <table> WHERE key <= 1551419999000 ORDER BY key DESC LIMIT 1;
SELECT key,payload FROM <table> WHERE key >= 1551023760000 order by key limit 100;

or just retrieve the last record:

SELECT key,payload FROM <table> ORDER BY key DESC LIMIT 1;

That is, most of our queries simply select contiguous ranges of records starting from some key, constrained by an inequality filter.

When the cluster was not loaded, these queries were pretty fast, tens of milliseconds on average.
However, when we added some load (~500 readers and writers), select performance dropped significantly. Now a query can take seconds or tens of seconds when run for the first time; subsequent runs are faster, presumably because of caching, but in our use cases queries vary a lot.

So I guess my question is: is this something one would expect given our situation (data amount and shape)? That is, maybe it doesn’t make sense to expect sub-second queries with crdb (again, in our specific case) at all? Or is there anything we can do to improve performance, either at the query or the schema level?

Not sure what kind of information might be required from my side; please let me know if there’s anything I can provide to help debug this issue.

Thank you!

P.S. Attaching a screenshot with the query rate and latency

The huge peak between 21:50 and 22:00 was due to an increased number of readers we ran for a test.


(Rich Loveland) #2

Hi @syhpoon,

It sounds like you might be running into some transaction contention due to using contiguous ranges of integers as your primary key.

We have a bit of documentation on this at understanding and avoiding transaction contention.

Quoting from that page:

Transaction contention occurs when the following three conditions are met:

  • There are multiple concurrent transactions or statements (sent by multiple clients connected simultaneously to a single CockroachDB cluster).
  • They operate on the same data, specifically over table rows with the same index key values (either on primary keys or secondary indexes, or via interleaving) or using index key values that are close to each other, and thus place the indexed data on the same data ranges.
  • At least some of the transactions write or modify the data.

That could explain why read performance is OK but things go badly once you start doing writes.

There are multiple suggestions in the doc for how to address this. Again quoting from that page:

To avoid contention, multiple strategies can be applied:

  • Use index key values with a more random distribution of values, so that transactions over different rows are more likely to operate on separate data ranges. See the SQL FAQs on row IDs for suggestions.
  • Make transactions smaller, so that each transaction has less work to do. In particular, avoid multiple client-server exchanges per transaction. For example, use common table expressions to group multiple SELECT and INSERT / UPDATE / DELETE / UPSERT clauses together in a single SQL statement.
  • When replacing values in a row, use UPSERT and specify values for all columns in the inserted rows. This will usually have the best performance under contention, compared to combinations of SELECT , INSERT , and UPDATE .
  • Increase normalization of the data to place parts of the same records that are modified by different transactions in different tables. Note however that this is a double-edged sword, because denormalization can also increase performance by creating multiple copies of often-referenced data in separate ranges.
  • If the application strictly requires operating on very few different index key values, consider using ALTER ... SPLIT AT so that each index key value can be served by a separate group of nodes in the cluster.
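
For instance, the UPSERT and SPLIT AT suggestions above could look roughly like this (the table and column names here are hypothetical):

```sql
-- Replace a full row in one statement instead of SELECT + INSERT/UPDATE:
UPSERT INTO events (key, payload) VALUES (1551419999000, '{"bar": "baz"}');

-- Pre-split a hot key range so the pieces can be served by different nodes:
ALTER TABLE events SPLIT AT VALUES (1551000000000), (1552000000000);
```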

I hope that helps!


(Ron Arévalo) #3

Hey @syhpoon,

Just wanted to follow up as well regarding the example queries you provided.

Taking a look at the EXPLAIN for each: the first two examples with the LIMIT scan through a lot of data, and if you query with a recent datestamp you’ll end up scanning a lot of historical data. The last query you provided scans the entire table; under any concurrency it will block the whole table and any inserts. Could you provide some more information on what you’re trying to do and what those queries are needed for specifically? Having more information on what you’re looking to accomplish will allow us to provide a better solution.

Could you also try running those queries using AS OF SYSTEM TIME to see if this improves performance a bit? I would say '-10s' would be a good place to start.
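
For example, your first sample query with a 10-second-stale read would look something like this (table name assumed):

```sql
-- Historical read 10 seconds in the past; avoids contending with in-flight writes.
SELECT key, payload
FROM series AS OF SYSTEM TIME '-10s'
WHERE key <= 1551419999000
ORDER BY key DESC
LIMIT 1;
```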

Thanks,

Ron


(Max) #4

Hey Ron,

So our flow is pretty simple: there’s a single writer producing data to a single table. Most of the data is produced in batches with an ever-increasing key value. That is, the writer inserts the first batch, say with keys from 1 to 1000; then after some period (seconds to hours) it inserts another batch, with keys from, say, 1300 to 2500; and so forth. Occasionally a record may be inserted with a key lower than the most recent ones (say, a record with key 900 inserted after 2500), but most of the time keys are inserted in increasing order.

There can be one or more readers for every table; they start from the beginning (key-value-wise) and move towards the most recent records. That is, when a new reader connects, it executes something like SELECT key,payload FROM <table> WHERE key >= 0 order by key limit 1000;
and then it does the same, but replaces the 0 in the condition with the key of the last received record, and so on until it reaches the most recent record, at which point it sleeps for a while and then retries. Effectively it is polling for any new records.
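
A sketch of that polling sequence (the table name is a placeholder, and the timestamp is just an example last-seen key):

```sql
-- Initial read from the beginning of the key space:
SELECT key, payload FROM series WHERE key >= 0 ORDER BY key LIMIT 1000;

-- Follow-up reads substitute the last key received; a strict > avoids
-- re-reading the boundary row:
SELECT key, payload FROM series WHERE key > 1551023760000 ORDER BY key LIMIT 1000;
```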

In addition to the queries above, readers may ask for the latest available record at the moment, and this is why we need SELECT key,payload FROM <table> ORDER BY key DESC LIMIT 1;. Are you saying this requires a full table scan, even though it’s using the primary key index? Is there maybe a more effective way to retrieve the last record in this case?

Running the query for the last record with AS OF SYSTEM TIME '-10s' doesn’t seem to make much difference: the very first time I run it, execution can take 1.5s, and once I repeat it, it’s ~30ms.


(Max) #5

Hey Rich,
Thank you for the info! It may indeed be the problem. Even though there are not many queries with exactly the same primary key, most of the queries operate on the latest data, which, I presume, translates into the same data range on the crdb side.

Is there maybe a graph (or any other visible metric) which could show if this is indeed what’s going on?


(Rich Loveland) #6

most of the queries operate on the latest data, which, I presume, translates into the same data range on the crdb side.

That’s my hypothesis too, especially given that you said “there’s a single writer producing data to a single table.”

Is there maybe a graph (or any other visible metric) which could show if this is indeed what’s going on?

Unfortunately we don’t yet have a graph for that, but see https://github.com/cockroachdb/cockroach/issues/26611 for an issue tracking that and other visualizations that are TBD.

For now it might help to use SHOW EXPERIMENTAL_RANGES FROM TABLE <table> (docs here). It will show you how the table’s data maps to ranges, which might help confirm the hypothesis above.
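
For example (table name is a placeholder):

```sql
-- Shows the ranges backing the table (start/end keys, range IDs, lease holders),
-- which lets you check whether all the recent keys land in a single range.
SHOW EXPERIMENTAL_RANGES FROM TABLE series;
```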


(Max) #7

So after some additional profiling, it seems the main latency issue is with queries of the type SELECT key,payload FROM <table> ORDER BY key DESC LIMIT 1 OFFSET 0, that is, requesting the latest available record ordered by key. I assumed that because there’s a primary key (index) and a small limit involved, this query would be very fast, but apparently it’s not that simple. In general, is there a way to optimize (at the schema or crdb settings level) for this type of request?


(Rich Loveland) #8

the main latency issue is with queries of type SELECT key,payload FROM <table> ORDER BY key DESC LIMIT 1 OFFSET 0

I think the culprit for this query in particular is a combination of:

  • Schema (using INT primary key leads to hotspots, as mentioned above)
  • Unnecessary use of OFFSET (if you just want the latest N records), which appears to affect index selection / the query plan, at least for the schema I propose below

is there a way to optimize (either on schema or crdb settings level) for such a request type?

I think so, but please let me know what you think. I played around with your schema a bit and here’s what I came up with. It’s a variation that:

  1. uses a UUID primary key to sprinkle data across multiple ranges, and thus nodes (for better write perf/less contention)
  2. adds an INT “timestamp” value which you needed for the expected output
  3. adds an index on the above INT timestamp value (to make the selection query faster)

Below, we will see that removing the OFFSET (which AFAICT does not affect the output), makes the query plan more efficient (less depth plus index usage – see more below).

The new schema looks like (roughly):

CREATE TABLE foo (
       id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
       tstamp INT,
       payload JSONB
       );

Inserting some dummy data:

INSERT INTO foo (tstamp, payload) VALUES
(1551805281, '{"bar": "baz01"}'),
(1551805282, '{"bar": "baz02"}'),
(1551805283, '{"bar": "baz03"}'),
...

Then I ran an EXPLAIN against a query similar to your sample query, but against this new schema, and it looks like it will do a full table scan in this case, which we don’t want (it’s slow).

> EXPLAIN SELECT tstamp, payload FROM foo ORDER BY tstamp DESC LIMIT 1 OFFSET 0;
      tree      | field  | description 
----------------+--------+-------------
 limit          |        | 
  │             | count  | 1
  │             | offset | 0
  └── sort      |        | 
       │        | order  | -tstamp
       └── scan |        | 
                | table  | foo@primary
                | spans  | ALL

I added an index on tstamp to try to avoid the full table scan:

CREATE INDEX ON foo (tstamp);

However, it’s still doing a full table scan (I think due to the OFFSET).

EXPLAIN SELECT tstamp, payload FROM foo ORDER BY tstamp DESC LIMIT 1 OFFSET 0;
      tree      | field  | description 
----------------+--------+-------------
 limit          |        | 
  │             | count  | 1
  │             | offset | 0
  └── sort      |        | 
       │        | order  | -tstamp
       └── scan |        | 
                | table  | foo@primary
                | spans  | ALL
(8 rows)

Without OFFSET, it looks like we DO use the new tstamp index, which should be faster. AFAICT if we just want “the latest N values”, the offset isn’t needed (unless I misunderstood your use case?).

EXPLAIN SELECT tstamp, payload FROM foo ORDER BY tstamp DESC LIMIT 1;
     tree     | field |    description     
--------------+-------+--------------------
 index-join   |       | 
  │           | table | foo@primary
  └── revscan |       | 
              | table | foo@foo_tstamp_idx
              | spans | ALL
              | limit | 1
(6 rows)

Note in addition to using an index, this query plan uses fewer “levels” than the earlier one without an index. Per the EXPLAIN docs:

  • Queries with fewer levels execute more quickly. Restructuring queries to require fewer levels of processing will generally improve performance.
  • Avoid scanning an entire table, which is the slowest way to access data. You can avoid this by creating indexes that contain at least one of the columns that the query is filtering in its WHERE clause.

Would you be willing to try a schema/index setup like this against your test cluster and let us know how it goes?


(Max) #9

How would I query for the Nth record from the end without an offset? For the very last one I can omit the offset, of course.

On my current schema, this is what EXPLAIN shows:

   tree   | field |               description
+---------+-------+------------------------------------------+
  revscan |       |
          | table | <table>@primary
          | spans | ALL
          | limit | 1
(4 rows)

which is even fewer levels than with the additional indexed field :thinking:

So, because the usage pattern will stay the same (readers will be polling for the newest records and the writer will be appending to the same region), the only major difference is that ids will be random and more uniformly distributed. Hopefully this can make some visible difference :\

I’m going to try changing the schema as you suggested; however, I’d need to re-import all the data from the beginning, and it’s going to take a while before it’s done and I can compare the performance difference.

Thank you very much, Rich, for your help! I really appreciate it.


(Rich Loveland) #10

Thank you very much, Rich, for your help! I really appreciate it.

You’re welcome! I think there may be a couple of errors in my attempts at EXPLAIN tea-leaf-reading above, but I hope better data distribution of writes with a new schema will get you where you want to go. If it doesn’t, please let us know.