Import observations (v1.2-alpha.20171026 (darwin))

  1. I was able to get my data in :smiley:

  2. empty csv files crash the import :disappointed_relieved:

  3. crdb is slightly quicker than postgres (crdb using 1 node (same as postgres))

  4. the import seems to consist out of 2 stages:

    import TABLE trends_uint (
    itemid bigint NOT NULL,
    clock integer NOT NULL,
    num integer NOT NULL,
    value_min numeric(20) NOT NULL,
    value_avg numeric(20) NOT NULL,
    value_max numeric(20) NOT NULL,
    PRIMARY KEY (itemid,clock)
    csv data (‘nodelocal:///Users/ronr/Documents/werk/zabbix/data/trends_uint.csv’)
    with temp=‘nodelocal:///tmp/tmp.txt’, nullif = ‘\N’;
    job_id status fraction_completed rows index_entries system_records bytes
    293697078517694465 succeeded 1 50913362 0 0 451790910

    1 row

    Time: 5m57.847518s

    293697078517694465 RESTORE csv.* FROM ‘nodelocal:///tmp/tmp.txt’ root 12 minutes ago
    Duration: 00:01:28
    293696195666771969 IMPORT TABLE trends_uint (itemid BIGINT NOT NULL, clock INTEGER NOT NULL, num INTEGER NOT NULL, value_min NUMERIC(20) NOT NULL, value_avg NUMERIC(20) NOT NULL, value_max NUMERIC(20) NOT NULL, PRIMARY KEY (itemid, clock)) CSV DATA (‘nodelocal:///Users/ronr/Documents/werk/zabbix/data/trends_uint.csv’) WITH “nullif” = e’\N’, temp = ‘nodelocal:///tmp/tmp.txt’, transform_only root 17 minutes ago
    Duration: 00:04:29

during the first phase the crdb process shows CPU usage of up to 800% on my MBP:sunglasses:
6) the second phase gives about 100% CPU usage

  1. Running the same import in postgres (9.6) gives:
    ronr=# copy trends_uint from ‘/Users/ronr/Documents/werk/zabbix/data/trends_uint.csv’ with (format csv);
    COPY 50913362
    Time: 645972.118 ms

  2. the postgres process hardly touches the 100% CPU usage, mostly beneath that value.
    (during this postgres run, crdb was still consuming some 150% CPU (after completing it’s own import, after which I started the postgres import))

I understand crdb is hard to compare with postgres but for this task I expected better elapsed time based on the CPU spent in phase 1.

There is room for improvement … I am as always willing to help, when needed.

@ik_zelf Thanks for the feedback!

@mjibson and I worked on this feature quite a bit over the last few months, and are really happy to see people trying it out and letting us know how it works for them.

As you identified, it works in two main phases – the first reads the csv data to produce cockroach data and stores that data in a backup, the second restores that backup to load it into your cluster.

I’m happy to hear that the total run time was shorter than expected, and I’m not entirely surprised at the increased total CPU used – we use a good deal of concurrency and pay pretty significant coordination overhead as a result (If you’re familiar with Go, we’re using goroutines and pushing the CSV rows over channels to the workers that convert them, which made it very easy to spread that work around but adds some significant overhead while doing so).

As you identified, phase two isn’t quite as computationally heavy – there’s a good deal of IO re-reading and loading the produced backup, and loading it involves some computation because we do some last-minute transforms of data while it is loaded, but it is generally far less computation than the production of that data initially.

I think we intentionally were rejecting empty files, but I’ll have to let @mjibson weigh in as to whether we likely to change that.

I don’t recall a design decision to support or not support empty files. We can change the behavior about them in the next major release if needed.

I created an issue to track support for importing empty csv files:

@ik_zelf I’m curious, can you tell me more about why and how you attempted to import an empty csv file? My assumption was that if a user specified an empty file, it was probably an error on their part. However if your workflow was intended that way, I’m curious to know some thoughts behind it.

Hi Matt,

depending on how the app is configured and used, there are more or less tables used. So, a simple procedure is to dump all tables and import them all. Doing so will see a few empty tables that create empty csv files.
Keeping it all KISS, would say: just read the empty files inserting all 0 rows😂
and complete the procedure.

Thanks for the feedback. We added support for empty files that will be in the current or next alpha release.