Insert question

Hello there,

I have tried to perform a batch insert into CRDB using…
INSERT INTO [table_name] (cols) VALUES (v1), (v2), … ;
Data is uploaded @ https://ufile.io/w5c6m (the file is 1.4 MB of text data containing 10k rows in a batch).

schema:

CREATE TABLE product_order_table (
    country STRING,
    region STRING,
    order_id INTEGER,
    id UUID DEFAULT gen_random_uuid(),
    order_date DATE,
    ship_date DATE,
    sales_channel STRING,
    order_priority STRING,
    item_type STRING,
    units_sold INTEGER,
    unit_price FLOAT4,
    unit_cost FLOAT4,
    total_revenue FLOAT8,
    total_cost FLOAT8,
    total_profit FLOAT8,
    PRIMARY KEY (country, order_id, id)
);
Time: 879.121506ms   <--- time reported by CRDB


real	8m55.326s
user	12m9.521s   <-- time measured from the shell
sys	0m8.550s
/tmp/product_order/part-00001

The following is what I did to write to CRDB.
time ./cockroach sql --insecure --database=sales < $file
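
For context, each statement in the file is a multi-row INSERT of the form above; a small illustrative example against this schema (the values below are made up just to show the shape - the real data is in the linked file) would look like:

INSERT INTO product_order_table
    (country, region, order_id, order_date, ship_date, sales_channel,
     order_priority, item_type, units_sold, unit_price, unit_cost,
     total_revenue, total_cost, total_profit)
VALUES
    -- illustrative values only; id is omitted so DEFAULT gen_random_uuid() fills it in
    ('India', 'Asia', 101, '2017-05-01', '2017-05-04', 'Online', 'H',
     'Cereal', 500, 205.70, 117.11, 102850.00, 58555.00, 44295.00),
    ('France', 'Europe', 102, '2017-05-02', '2017-05-06', 'Offline', 'M',
     'Fruits', 200, 9.33, 6.92, 1866.00, 1384.00, 482.00);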

My questions…

  1. There seems to be quite a bit of overhead time before the insert operation. I am not sure where the 12 minutes of wait time is going. I do understand that reading the file takes a bit of time, but this seems like quite a lot.
  2. I did notice that writing the same data to an interleaved table seems at least 10x slower.
    To improve performance for interleaved tables, should the insert statements be batched per parent table key?

Please advise,
Muthu

Hey Muthu,

Thanks for letting us know about this - I can confirm something seems wrong; I see the same latency on the command line. I’ve created this issue for tracking, please follow it if you’d like to see progress: https://github.com/cockroachdb/cockroach/issues/26790

As a workaround, you can use psql to bulk insert the data. Using the database “test” on a local cluster, I’ve confirmed that this returns you to the command line in the expected 500ms:

psql "postgresql://root@localhost:26257/test" < part0000.sql

Hope that helps - if you run into any issues using psql as a temporary workaround, let me know.

Hello Tim,

Could you also let me know whether inserts into an interleaved table may differ in performance compared with regular tables?

I can also confirm that

./psql "postgresql://root@localhost:26257/sales" < 0.04s user 0.02s system 4% cpu 1.282 total
Much better :).

Please advise,
Muthu

Hey Muthu,

When inserting into an interleaved table, we recommend batching the data on the parent ID. This will result in fewer range writes overall, and less overhead from splitting batches up across many ranges. We also recommend sending reasonably sized batches for multiple parent keys in parallel.

So, for instance, if your table has key_ids 1, 2, and 3, you’re better off sending a batch of 1000 rows for each key_id (in parallel) than a single batch of 3000 rows that spans all three key_ids.
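
As a rough sketch of that pattern (the table and column names below are hypothetical, since the interleaved schema isn’t shown in this thread): each INSERT carries rows for only one parent key, and the per-key batches are sent on separate connections in parallel.

-- Hypothetical interleaved child table; each batch touches a single parent key_id.

-- Batch for key_id = 1 (send on one connection):
INSERT INTO order_items (key_id, item_id, quantity) VALUES
    (1, 1, 10),
    (1, 2, 25),
    (1, 3, 7);   -- ... up to ~1000 rows, all with key_id = 1

-- Batch for key_id = 2 (send on another connection, in parallel):
INSERT INTO order_items (key_id, item_id, quantity) VALUES
    (2, 1, 12),
    (2, 2, 30);  -- ... and a separate batch for key_id = 3, likewise in parallel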

Hope that helps!