Support for importing shared GCS files?

Hi:

I’m doing a CockroachDB POC in which I’m trying to bulk import, via CSV, a roughly 2 TB table from Hive that is stored on GCS in shards. The table currently has a bit under 700 shards, and that number will likely grow in the future.

I have read that the IMPORT command can take a comma-delimited list of URLs (https://www.cockroachlabs.com/docs/v2.1/import.html#import-a-table-from-multiple-csv-files), and I suppose I could write a script that generates that command by iterating over the available shards (a rough sketch of what I mean is below). However, that seems rather unwieldy.
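For illustration, something like this Python sketch is what I have in mind. The bucket name, prefix, and table definition here are placeholders, not our real setup, and I’ve left out the GCS auth parameters on the URLs:

```python
# Hypothetical sketch: list the Hive shard files in GCS and emit an IMPORT
# statement with an explicit, comma-delimited URL list. Bucket, prefix, and
# table schema are placeholders; GCS auth query parameters on the URLs are
# omitted.
from google.cloud import storage

BUCKET = "my-hive-exports"        # placeholder bucket name
PREFIX = "warehouse/my_table/"    # placeholder path to the table's shards

client = storage.Client()
urls = [
    f"'gs://{BUCKET}/{blob.name}'"
    for blob in client.list_blobs(BUCKET, prefix=PREFIX)
    if blob.name.endswith(".csv")
]

# Placeholder column definitions; the real table has many more columns.
import_stmt = (
    "IMPORT TABLE my_table (id INT PRIMARY KEY, payload STRING)\n"
    "CSV DATA (\n    " + ",\n    ".join(urls) + "\n);"
)
print(import_stmt)
```

That would work, but it ties the import to an external listing step, which leads to my questions: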

Would IMPORT support that many URLs? Is there a limit on the size of the URL list? If so, what is that limit?

Does the IMPORT statement support a wildcard, or could it, so that the end user does not have to generate the list of shard URLs dynamically?

Also, regarding import performance: is it better to import from multiple shards or from a single CSV file, or does that not matter?

For context, the goal of this POC is to see whether CockroachDB might be a good replacement for Postgres in a particular use case of ours, where exporting the data from Hive into Postgres is a substantial bottleneck due to Postgres’s vertical-only scaling (essentially, the number of CPUs on a single Postgres box). We are hoping that CockroachDB’s horizontal scalability will eliminate that bottleneck and streamline the process substantially. This process will be repeated frequently, so the speed and scalability of the import are of critical importance.

Thanks!

Hi @ken_h,

Welcome to the Cockroach Forums!

Sorry for the delay in response.

Regarding your questions about the IMPORT command URL list:

  • In version 19.2.3, I was able to import 706 explicitly listed files. There is no known limit on the size of that list.

  • In version 19.2.3, I was not able to use a wildcard in the URL list. This feature is expected to be included in the upcoming major release 20.1.

  • Import performance depends on the number of nodes in the cluster, the number of files, and the size of the files. Ideally, the number of files should equal the number of nodes in the cluster; beyond that, additional files only add overhead (which can become a problem in very long-running imports). For example, on a 5-node CockroachDB cluster, 5 x 2 GB files will import much faster than 1 x 10 GB file, but 10 x 1 GB files won’t be any faster than the 5 x 2 GB, and 1000 x 10 MB files would actually be worse. (A rough sketch of this trade-off follows below.)
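To make that trade-off concrete, here is a toy model only; the per-node throughput and per-file overhead numbers below are invented for illustration, not measured:

```python
# Toy model of import time (not a benchmark): files are assigned
# round-robin to nodes, each node processes its files sequentially, and
# every file carries a fixed coordination overhead. The throughput and
# overhead constants are made up purely to illustrate the shape of the
# trade-off described above.
def import_time_s(num_files, file_gb, num_nodes,
                  node_gb_per_s=0.1, per_file_overhead_s=5.0):
    files_per_node = [num_files // num_nodes + (1 if i < num_files % num_nodes else 0)
                      for i in range(num_nodes)]
    # The import finishes when the busiest node finishes.
    return max(n * (file_gb / node_gb_per_s + per_file_overhead_s)
               for n in files_per_node)

for files, size_gb in [(1, 10), (5, 2), (10, 1), (1000, 0.01)]:
    t = import_time_s(files, size_gb, num_nodes=5)
    print(f"{files:>5} x {size_gb} GB on 5 nodes -> ~{t:,.0f} s")
```

The absolute numbers are meaningless, but the pattern matches the guidance above: matching the file count to the node count helps, while far more files than nodes mostly adds overhead.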

Please let me know if the above information answers your questions.

Regards,
Florence
Technical Support Engineer

Hi, @florence-crl:

Thanks for the response. I’m glad to hear wildcarding is coming in a future release.

Your information about import performance has helped me think through the issue relative to our needs. It sounds like you’re saying that, ideally, we would want a roughly 700-node CockroachDB cluster for this case, which would certainly be a substantial cluster. Is that correct?

One thing that may be somewhat unique about our case is that, while we need very large horizontal scale to import the data into CockroachDB, our needs when actually serving the data are likely to be much, much more modest than they are during ingestion.

Would it be realistic to scale up a very large (say, 700-node) Kubernetes CockroachDB cluster for data ingestion and then scale it back dramatically for general serving of the data? Is that kind of dynamic scaling something CockroachDB can handle and support? (I assume it would entail a substantial shuffling of data across the remaining nodes.)

Thanks,
-Ken