Hi there,
I’m looking for any help to better understand select performance I’m observing on our test crdb cluster.
In terms of setup there are 15 nodes, 4CPU, 15G RAM and 1TB SSD each running on crdb 2.1.5, on kuberentes cluster on GCP.
There’s a single database with a couple of thousands of tables, each having a very simple schema:
key - integer, basically just a timestamp as a number
payload - just a JSONB blob
ingested data is pretty much is of time series shape, that is keys are mostly sequential and monotonically incremented, with occasional out-of-order updates.
each table has anywhere betwen couple of hundres to hundreds of millions of records, about 1TB of raw data in total (3TB with replication).
The type of select queries we run are very simple, they always involve key
field, which is primary key, for example:
SELECT key,payload FROM <table> WHERE key <= 1551419999000 ORDER BY key DESC LIMIT 1;
SELECT key,payload FROM <table> WHERE key >= 1551023760000 order by key limit 100;
or just retrieve the last record:
SELECT key,payload FROM <table> ORDER BY key DESC LIMIT 1;
that is most of our queries simply select contiguous ranges of records starting from some key constrained by some inequality filter.
When the cluster was not loaded, these queries were pretty fast, tens of milliseconds on average.
However when we added some load, ~500 readers and writers, the select performance dropped significantly. Now a query, when run for the first time can execute for seconds or tens of seconds, when run subsequently it’s faster, presumably because of cache, but in our use cases queries vary a lot.
So I guess my question is: is this something one would expect given our situation (data amount and shape)? That is maybe it doesn’t make sense to expect sub-second queries with crdb (again, in our specific case) at all? Or is there anything we can do to improve the performance, either on query or schema levels?
Not sure what kind of information might be required from my side, please let me know if there’s anything I can provide to help debugging this issue.
Thank you!
P.S. Attaching a screenshot with the query rate and latency
The huge peak between 21:50 and 22:00 was due to increased number of readers we run for a test.