Testing large data sets

I’m trying to evaluate CockroachDB as a production database, but I’m struggling to get large datasets (10^7 rows) into candidate table structures. For example, to evaluate query performance the table needs a primary key, and since (as far as I can tell) a PK can’t be specified when creating a table via the IMPORT option, I can’t evaluate the read performance.

I’ve tried importing into a PK-less copy of the table and then INSERT…SELECT into the final table, but I get the error:
transaction is too large to complete; try splitting into pieces

I’ve also tried chunking the query down, but to keep each transaction small enough to succeed it would literally take days to port the data over, which is unacceptable.
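(For reference, the chunked copy looked roughly like the following, one revision per statement at ~46K rows each; ckm_dimjsonb_pkless here just stands in for whatever the PK-less copy is called:)

-- One chunk of the copy, one revision at a time (~46K rows);
-- repeat with revision = 2, 3, ..., 215.
INSERT INTO ckm_dimjsonb (id, composite_key, revision, value)
SELECT id, composite_key, revision, value
FROM ckm_dimjsonb_pkless
WHERE revision = 1;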

An example structure is:

CREATE TABLE IF NOT EXISTS ckm_dimjsonb (
id INT8 NOT NULL DEFAULT nextval('cell_key'),
composite_key JSONB NOT NULL,
revision INT8 NOT NULL,
value DECIMAL(18,8) NULL,
CONSTRAINT ckm_pk PRIMARY KEY (id ASC, revision DESC));
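(For completeness, the id default assumes the sequence already exists, created beforehand with something like:)

-- Sequence backing the id column's default value.
CREATE SEQUENCE IF NOT EXISTS cell_key;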

For each incrementing value in the “revision” column there are ~46K rows, and there are 215 revisions in the test dataset (~46K × 215 ≈ 10M rows in total, hence the 10^7 figure above).

Do you have any suggestions on how I can achieve this?

Thanks

Hey @mark_balmer,

I have a few questions.

  1. What version of CRDB are you running?
  2. What’s your hardware setup? AWS/GCE/local? What kinds of disks are you using, and how many CPUs do your machines have? Are you running just one node or multiple?
  3. Is there a particular reason that you can’t import using a primary key? Did you receive an error?
  4. How big is the JSON payload?

Thanks,

Ron

Hey @ronarev,

  1. The db version is v2.1.5
  2. The db is on a GKE cluster with 5 nodes, on non-SSD disks at the moment, with each node using the n1-highcpu-16 machine type.
  3. I thought the IMPORT command would not allow primary key definitions; is this not true?
  4. The JSON data in each row is tiny, e.g. {"1": 1, "2": 24, "3": 16, "4": 31888, "5": 15480}, although it could potentially be 5x this in future.

Also, is it possible to import with that table structure given that the id column is from a sequence?

Thanks

Hey @mark_balmer,

Thanks for answering those questions.

What are your expectations regarding performance? Also, you can use IMPORT even if you have a primary key. Could you try using IMPORT and let us know how that goes?
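As a rough sketch (the CSV path is just a placeholder; and if the sequence default gets in the way of the import, one option is to supply explicit id values in the file and leave the default out of the import-time definition):

-- Import with the primary key declared inline.
IMPORT TABLE ckm_dimjsonb (
    id INT8 NOT NULL,
    composite_key JSONB NOT NULL,
    revision INT8 NOT NULL,
    value DECIMAL(18,8) NULL,
    CONSTRAINT ckm_pk PRIMARY KEY (id ASC, revision DESC)
)
CSV DATA ('gs://your-bucket/ckm_dimjsonb.csv');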

Thanks,

Ron

Thanks, I must have misread the documentation on primary key definitions during import. I’ve managed to import the large dataset now.

Hey @mark_balmer,

Sounds good! If you have any other questions please feel free to reach out!

Thanks,

Ron