I’m doing a CockroachDB POC in which I’m trying to bulk import, via CSV, a roughly 2 TB table exported from Hive and stored on GCS in shards. The table currently comprises a bit under 700 shards, and that number will likely grow in the future.
I have read that the IMPORT command can take a comma-delimited list of URLs (https://www.cockroachlabs.com/docs/v2.1/import.html#import-a-table-from-multiple-csv-files), and I suppose I could write a script that iterates over the available shards and generates that command. However, that seems rather unwieldy.
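For concreteness, here is a minimal sketch of what I imagine such a generated statement would look like, based on that docs page (the table schema and bucket paths are illustrative placeholders, not our real ones):

```sql
-- Sketch only: the schema and GCS paths are placeholders for illustration.
IMPORT TABLE my_table (
    id INT PRIMARY KEY,
    payload STRING
)
CSV DATA (
    'gs://my-bucket/hive-export/shard-000.csv',
    'gs://my-bucket/hive-export/shard-001.csv',
    -- ... roughly 700 entries in total, one per shard, emitted by the script ...
    'gs://my-bucket/hive-export/shard-699.csv'
);
```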
Would IMPORT support that many URLs? Is there a limit on the size of the URL list, and if so, what is it?
Does the IMPORT statement support a wildcard, or would it be possible to add one, so that the end user does not have to generate the list of shard URLs dynamically? Something along the lines of the sketch below is what I have in mind.
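This is hypothetical syntax on my part; I don’t see anything like it in the v2.1 docs:

```sql
-- Hypothetical wildcard syntax; not (as far as I know) currently supported.
IMPORT TABLE my_table (
    id INT PRIMARY KEY,
    payload STRING
)
CSV DATA (
    'gs://my-bucket/hive-export/shard-*.csv'
);
```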
Also, is import performance better with many sharded CSV files or with a single large CSV file, or does that not matter?
For context, the goal of this POC is to see whether CockroachDB might be a good replacement candidate for Postgres in a particular use case of ours, where the need to export data from Hive into Postgres presents a substantial bottleneck due to Postgres’s vertical-only scaling (i.e., essentially the number of CPUs available on a single Postgres box). We are hoping that CockroachDB’s horizontally scalable nature will let us essentially eliminate that bottleneck and streamline the process substantially. This process will be repeated frequently, so the speed and scalability of the import are of critical importance.