I’ve been actively using IMPORT for a while now and am trying to figure out the optimal way to import a large amount (100+ TB) of CSV data.
I’ve learned from experience that putting too much data in a single IMPORT is a bad idea, since it’s likely to time out. In my case I was importing gzipped CSV files from S3 and would get “connection reset by peer” errors. Regardless of the cause, errors happen, so splitting the data into many small-ish import files seems like the way to go.
The issue is: supposing I have 32 CRDB nodes, how many files should I include in a single IMPORT statement to make the best use of the cluster? I’d prefer fewer files per statement so that a failed import isn’t as costly to revert and rerun, but I’m not sure whether CRDB imports any faster with 1 file vs. 32 files vs. 1000 files.
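For reference, the statements I’m running look roughly like this (bucket, file names, and credentials below are placeholders, and IMPORT INTO assumes the target table already exists):

```sql
-- Hypothetical example of the kind of statement in question: a single IMPORT
-- listing N gzipped CSV files from S3 (paths and credentials are placeholders).
IMPORT INTO my_table (id, col_a, col_b)
CSV DATA (
  's3://my-bucket/part-0001.csv.gz?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>',
  's3://my-bucket/part-0002.csv.gz?AWS_ACCESS_KEY_ID=<key>&AWS_SECRET_ACCESS_KEY=<secret>'
  -- ...how many files should go in this list for a 32-node cluster?
)
WITH decompress = 'gzip';
```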
I’d appreciate any insight into how CRDB distributes import files across the cluster for processing (if it does).
This would also be helpful since I’m using hosts that have 6 disks each (JBOD), and I’m wondering whether it would be fastest to run 1 CRDB node per disk (so 6 nodes on a single host) just for the import, then afterwards do a rolling reconfiguration/rebalance to migrate the cluster to the recommended setup of 1 node with 6 stores. This really depends on how CRDB parallelizes CSV import.
I’m aware of kv.bulk_io_write.max_rate and have set it to a higher value, but I generally don’t see CRDB use all the resources available to it during an import.
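What I’m doing is along these lines (the value here is just an illustration, not a recommendation):

```sql
-- Raise the per-store bulk write rate limit; the exact value is only an example.
SET CLUSTER SETTING kv.bulk_io_write.max_rate = '512MiB';
```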