Issues with bulk inserts & data distribution

I have installed CockroachDB on a Linux server and inserted sample data into a table. In the process I came across two main problems. The command used to start the single node is included at the end of this post.

  1. Bulk INSERT statements like the one below fail when there is a large number of records (100,000). Any suggestions on how to get this to work?

    root@:26257/> INSERT INTO my_db.target
    (a,b,c,…)
    SELECT
    a,b,c,…
    FROM my_db.landing;
    pq: command is too large: 153279228 bytes (max: 67108864)

  2. Using an IMPORT statement, I loaded an 80GB data set into a landing table from a file. Although the node is configured with multiple HDDs, only one HDD was used to store the data. How can the data be distributed across multiple HDDs within a single node?

Command used to start the node (note that this test was done using a single CockroachDB node):
    cockroach start \
    --insecure \
    --host=localhost \
    --store=/datan1/cockroachdb,attrs=hdd \
    --store=/datan2/cockroachdb,attrs=hdd \
    --store=/datan3/cockroachdb,attrs=hdd \
    --store=/datan4/cockroachdb,attrs=hdd \
    --store=/datan5/cockroachdb,attrs=hdd \
    --store=/datan6/cockroachdb,attrs=hdd \
    --store=/datan7/cockroachdb,attrs=hdd \
    --store=/datan8/cockroachdb,attrs=hdd \
    --store=/datan9/cockroachdb,attrs=hdd \
    --store=/datan10/cockroachdb,attrs=hdd \
    --attrs=ram:96gb \
    --cache=25% \
    --max-sql-memory=25% \
    --locality=region=lk,datacenter=pdc \
    --log-dir=/datan1/cockroach_log \
    --background

Hi @shanenleen, thanks for your questions.

We’d suggest batching this sort of bulk insert and experimenting with various batch sizes to find the one that performs best. See https://www.cockroachlabs.com/docs/stable/performance-best-practices-overview.html#use-multi-row-insert-statements-for-bulk-inserts-into-existing-tables and https://www.cockroachlabs.com/docs/stable/import-data.html#import-from-generic-sql-dump :

Grouping each INSERT statement to include approximately 500-10,000 rows will provide the best performance. The number of rows depends on row size, column families, number of indexes; smaller rows and less complex schemas can benefit from larger groups of INSERTS, while larger rows and more complex schemas benefit from smaller groups.
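
For illustration, here is a rough sketch of what that batching could look like, assuming the landing table has a unique integer primary key named id (that column name is just a placeholder for whatever key your schema actually uses). A client script would issue one statement per key range until the whole table has been copied:

    -- Copy rows in slices of roughly 10,000 by primary-key range;
    -- tune the window size after measuring a few batch sizes.
    INSERT INTO my_db.target (a, b, c)
    SELECT a, b, c
    FROM my_db.landing
    WHERE id > 0 AND id <= 10000;

    INSERT INTO my_db.target (a, b, c)
    SELECT a, b, c
    FROM my_db.landing
    WHERE id > 10000 AND id <= 20000;

    -- ...and so on, advancing the id window each iteration.

Paging on a key range keeps each statement's scan bounded, which tends to hold up better than LIMIT/OFFSET paging as the table grows.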

You could also use IMPORT directly, since that will be the fastest method.

I believe that by default we won't put replicas of a range on different stores of the same node. Since rebalancing works by replicating ranges to new stores and removing the old copies, this basically prevents rebalancing on your single-node cluster. From https://www.cockroachlabs.com/docs/stable/frequently-asked-questions.html#how-does-cockroachdb-scale :

When your cluster spans multiple nodes (physical machines, virtual machines, or containers), newly split ranges are automatically rebalanced to nodes with more capacity.

I’d recommend running three nodes, ideally on different hardware. This will allow the cluster to rebalance the ranges, and will have the additional benefit of being resilient to node failure.
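
If you want to experiment before provisioning separate machines, a three-node cluster on a single host can be started roughly like this. This is only a sketch based on the flags in your start command; ports, store paths, and flag spellings may need adjusting for your CockroachDB version, and for real fault tolerance each node should live on its own hardware:

    # Node 1 (first node; the others join it)
    cockroach start --insecure --host=localhost \
      --store=/datan1/cockroachdb,attrs=hdd --background
    # Node 2
    cockroach start --insecure --host=localhost --port=26258 --http-port=8081 \
      --store=/datan2/cockroachdb,attrs=hdd --join=localhost:26257 --background
    # Node 3
    cockroach start --insecure --host=localhost --port=26259 --http-port=8082 \
      --store=/datan3/cockroachdb,attrs=hdd --join=localhost:26257 --background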

Hi @couchand, thanks for the reply and for the useful links.

This will be a very cumbersome process when we have billions of records, and IMPORT might not be the ideal solution when it comes to recovery. May I know whether there are any solutions or improvements for this on your future roadmap?

Could you help me better understand your concern here? What about IMPORT is less than ideal for your use case? Which of your questions are you looking for solutions and improvements on?