Optimal import of very large datasets

I’ve been actively using IMPORT for a while now and am trying to figure out the optimal way to import a large amount (100+ TB) of CSV data.

I’ve figured out from experience that putting too much data in a single import is a bad idea and likely to time out. In my case I was importing gzipped CSV files from S3 and I would get “connection reset by peer” errors. Regardless of the error cause, errors happen, so many small-ish import files seem like the way to go.

The issue is, supposing I have 32 CRDB nodes, how many files should I run in a single IMPORT statement to optimally use the cluster? I would prefer fewer so it isn’t so costly to do the revert and run it again. But I’m not sure if CRDB imports any faster when importing 1 file vs 32 files or 1000 files.

I’d appreciate any insight into how CRDB distributes import files across the cluster for processing (if it does).

This would also be helpful since I’m using hosts that have 6 disks each (JBOD), and I’m wondering if it would be fastest to actually run 1 CRDB node per disk (so 6 nodes on a single host) just for the import, then after the fact do a rolling reconfiguration/rebalance to migrate the cluster to the recommended method of 1 node using 6 stores. This really depends on how CRDB parallelizes CSV import.

I’m aware of kv.bulk_io_write.max_rate and set this to a higher number, but I generally don’t see CRDB use all the resources available to it on an import.

All of the specified files will be split (round robin) across all nodes. So, your best bet is to have a multiple of 32 files of roughly equal size. Assuming 100T, you could have 1600 files of ~64GB each.

See https://www.cockroachlabs.com/docs/v20.2/import-performance-best-practices.html

Since you’re importing a lot of data, I would suggest creating the table(s) first, and then running the import. I would also highly recommend not creating any indexes on the table other than the primary, of course; you can add secondary indexes/constraints later. That’s because the primary index is sorted, while secondaries are not – and the underlying LSM deals with sorted data much better. On that subject, you almost certainly want to run the pebble engine with cockroach (which should be on by default in newer cockroach versions).
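A rough sketch of that (the table and column names here are made up, not from your schema):

CREATE TABLE big_table (
  id INT8 PRIMARY KEY,   -- primary index only for the import
  a STRING,
  b STRING
);
-- ... run the IMPORT batches here ...
CREATE INDEX big_table_a_idx ON big_table (a);   -- add secondary indexes only after the data is in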

If you have enough RAM available on the nodes:
SET CLUSTER SETTING kv.bulk_ingest.max_pk_buffer_size = '2GiB';
SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 5;
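Once the import is done, it’s probably worth putting these back, e.g. something like:

SET CLUSTER SETTING kv.bulk_ingest.max_pk_buffer_size = DEFAULT;              -- back to the built-in default
SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = DEFAULT;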

CRDB runs CSV imports in parallel:

  1. Each file is distributed to a node, round robin.
  2. While processing a file, there is a single reader but many converters, which emit natively encoded data.
  3. The data is buffered in memory, and SSTables are added directly to the LSM.

Finally, I’m concerned about the “connection reset by peer” errors you’ve reported. We have built-in protection for these types of errors in the Google/Azure storage implementations, but up until this point we didn’t think we needed it for AWS. Can you share more information on those errors? Perhaps file a bug report and include whatever information you can share?

Thanks @yevgeniy that’s super helpful. I’ve been in the docs a lot and am very surprised I never came across the page you linked to (maybe it’s a newer doc page?) Knowing that it sends files around round-robin, and what it does within each importer, is awesome.

Also great to know about those secret cluster settings. Is there a place in the docs where we can find the full list of possible cluster settings? This isn’t the first time I’ve been given a helpful cluster setting in this forum that isn’t in the docs.

Regarding “connection reset by peer”, I expected it might be our own firewall killing connections; I’m pulling data down from S3 into our datacenter. If CRDB is supposed to be robust to that, then I suppose it’s a bug and I’ll file it next time I see it. It has been very easy to reproduce, especially with large files that take hours to import.

Yes, the import performance page is new-ish. You can use SHOW CLUSTER SETTINGS.

However, the reason these settings are sometimes not documented is that they can be dangerous to use. For example, set one or both of the settings I mentioned too high and you can OOM nodes and fail the import. We could probably do a better job documenting some of the less dangerous settings.
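That said, you should still be able to read a specific setting by name even when it isn’t in the public list, e.g.:

SHOW CLUSTER SETTINGS;                                                  -- lists the documented, public settings
SHOW CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests;  -- reads one setting by name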

The reason I mention pebble is that it addressed some of the performance pain points, particularly around import. You might find this blog post interesting: cockroachlabs.com/blog/bulk-data-import/

My suggestion to keep only the primary key is based on knowledge of the inner workings of the import… Some of that is described in the blog post.

Re conn reset: I want to know either way. It may be your firewall, or it may be S3 itself; I don’t think it should matter. I’m concerned about the scenario where you open a stream, but because it’s a huge (e.g. 1T) file, it takes a long time to read. In the meantime, the import process might be busy actually importing the data into cockroach, and it stops reading the stream for a while. The connection gets killed; we should be resilient to that. I just want to confirm that’s what you’re experiencing. Also, more files would definitely help, as it should reduce the probability of a connection reset.

Given the size of your import, another thing I would suggest is to create multiple jobs for the import.
So, the process might be:

  1. Create 1000s of files.
  2. Create table(s): primary index only.
  3. Run IMPORT INTO, importing a few hundred files at a time (see the sketch below).
  4. Repeat step 3.

The reason is that if you run into any of those connection reset errors, you probably don’t want to restart from scratch. Currently, when an import fails, it tries to undo everything it has done.
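A minimal sketch of one batch in step 3 (the table, bucket, and file names here are placeholders, and your S3 auth query parameters are elided):

-- one IMPORT INTO per batch; rerun with the next set of files until everything is loaded
IMPORT INTO big_table (id, a, b)
  CSV DATA (
    's3://my-bucket/batch-001/part-0001.csv.gz',
    's3://my-bucket/batch-001/part-0002.csv.gz'
    -- ... a few hundred files per statement, ideally a multiple of the node count;
    -- gzip is detected from the .gz extension, and auth params go on each URI
  );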

FYI, SHOW CLUSTER SETTINGS doesn’t show those settings you mentioned; as far as I can tell it only shows the publicly documented ones, but they did work.

In any case, thanks for the info and I’m sure it’ll be useful to people googling this stuff later.

I’m currently importing into a 5-node cluster, with the data split into 50GB chunks and then gzipped, and I’m running 5 files per IMPORT statement. I’ll play with those two tuning parameters and see how it goes. These nodes each have about 48 vCPUs and 3 SSD stores.

The first import, with max_pk_buffer_size and concurrent_addsstable_requests left at their default, took 62 minutes.


For my second attempt I tried some aggressive numbers, max_pk_buffer_size = '2 GiB' and concurrent_addsstable_requests = 10. I definitely don’t recommend this for online clusters since it seemed to cause instability right when it started. Then about an hour in it failed with “communication error: rpc error: code = Canceled desc = context canceled” and reverted. The performance profile was also very spiky/cyclic. It seemed to use a bit more CPU/disk but not by much, certainly not stressing the machines.

I’ll be doing a few more iterations.

Running with max_pk_buffer_size = '128 MiB' and concurrent_addsstable_requests = 10, it failed after an hour and a half with ERROR: <s3 url of one of my data files>: unexpected EOF. Performance didn’t look that different from the first run, maybe a little worse.

I just tried with more moderate numbers, max_pk_buffer_size = '512 MiB' and concurrent_addsstable_requests = 5, and this time it completed but still took an hour, no change from default values.

@yevgeniy one last question, do you know if decompressing downloaded data is done concurrently with parsing and loading it? I’m wondering if importing uncompressed data would be faster.

I doubt it would be faster to have uncompressed data.

Try setting max_index_buffer_size = '512MiB' (or larger). See if that makes a difference.
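That would be something like this (assuming the full name sits under the same kv.bulk_ingest prefix as the PK buffer setting):

SET CLUSTER SETTING kv.bulk_ingest.max_index_buffer_size = '512MiB';  -- assumed full setting name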

Unexpected EOF almost certainly indicates the connection getting closed. What version of cockroach are you using?
Do you use temporary credentials for aws?

These are permanent credentials. v20.2.3

I’m importing a single table with just a few columns and no indexes, so would max_index_buffer_size make a difference?

@yevgeniy I just had this error twice in a row: <my s3 url>: read tcp <my IP>->52.219.120.25:443: read: connection reset by peer – if as you say this should not be happening I could send some debug logs. Where should I send those?

One last question as well: would import be sped up by much if I reduce replicas from 3 to 1 and then increase it at the end?

Can you file an issue with the error message and whatever details you have?

Yes, the import will speed up if you have just 1 replica.
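A sketch of doing that with a zone config on the target table (the table name is a placeholder):

ALTER TABLE big_table CONFIGURE ZONE USING num_replicas = 1;   -- before the import
-- ... run the import batches ...
ALTER TABLE big_table CONFIGURE ZONE USING num_replicas = 3;   -- restore 3 replicas afterwards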

I filed a support issue with the debug logs.

Just another quick note, I’ve observed that once I have a fair amount of data, reverting a failed import job takes MUCH longer than the import itself. An import that takes maybe an hour when it succeeds took about 20x as long to revert.