Incomplete import

I’m importing a 6 million line tsv file, but only 4.5 million records are stored in the table.
I’m doing it entirely on my laptop, single node, from inside the Cockroach repl.
Creating a tsv file with a single record, missing from the table, imports ok.
Creating a tsv file with ten records, including a record missing from the table, imports ok.
Converting the tsv file to a csv file imports 100% ok.
Every record in the file has the correct number of \t and an even number of double quotes.
The command I use to import is:
> import table title_basics (
tconst text primary key,
title_type string,
primary_title string,
original_title string,
is_adult int,
start_year int,
end_year int,
runtime_minutes int,
genres text
) CSV DATA (‘nodelocal:///title.basics.tsv’)
with delimiter = e’\t’
, skip = ‘1’
, nullif = ‘\N’
;

The result shows *succeeded*:
        job_id       |  status   | fraction_completed |  rows   | index_entries | system_records |   bytes
+--------------------+-----------+--------------------+---------+---------------+----------------+-----------+
  478675833741344769 | succeeded |                  1 | 4519699 |             0 |              0 | 543053447
(1 row)

Is there something I might be able to see in logs to identify what’s happening?

Thanks.

Hey @russ,

Could you double check what SELECT COUNT(*) FROM title_basics; returns?

Also, what version of CRDB are you running?

Thanks,

Ron

Hi Ron,

Server version: CockroachDB CCL v19.1.4 (x86_64-apple-darwin18.6.0, built 2019/08/13 19:41:13, go1.12.7) (same version as client)

select count(*) from title_basics;
count
±--------+
4519699
(1 row)

Time: 2.948499s

wc -l title.basics.tsv
6064557 title.basics.tsv

Thanks.
Russ

Seems to be double quotes interfering with tab and newline parsing.
Importing into postgres breaks on every record that has a single double quote in a field eg feet symbol.

Te reason I couldn’t find errors in the Cockroach logs is because Cockroach doesn’t consider there to be any errors. Confirmed by changing data to incorrect types, in which case Cockroach breaks on import in a similar manner to postgres.

So far, it seems postgres has a more robust tsv parser, reporting double quote inconsistencies, whereas Cockroach powers over them continuing to import subsequent rows as part of the field until it hits a balancing closing double quote. Resulting in no errors reported and missing rows.

Hi @russ, thanks for helping debug this. Can you send an example of a malformed input line that silently succeeds in Cockroach but fails in Postgres, or that causes CockroachDB to swallow the rest of the input until it hits another double-quote?

I just tried importing the following CSV with a CockroachDB 19.2 alpha, and each line ended up in its own row; I may be misunderstanding your schema, though:

1,6" heels,rest of line
2,another line,blah blah
3,another single " quote,third field
4,"quotes " in quotes",end
root@:26257/defaultdb> import table test (a string, b string, c string) csv data ('nodelocal:///2.txt');
        job_id       |  status   | fraction_completed | rows | index_entries | system_records | bytes
+--------------------+-----------+--------------------+------+---------------+----------------+-------+
  479517404915302401 | succeeded |                  1 |    4 |             0 |              0 |   170
(1 row)

Time: 46.285ms

root@:26257/defaultdb> select * from test;
  a |           b            |      c
+---+------------------------+--------------+
  1 | 6" heels               | rest of line
  2 | another line           | blah blah
  3 | another single " quote | third field
  4 | quotes " in quotes     | end
(4 rows)

Time: 8.097ms

Hi Roland,

I’m using a Tsv; here’s an example line which ends up being imported with the third field (primary_title) containing the subsequent 39026 lines (3112640 characters in length).

Fields:

tconst | title_type | primary_title | original_title | is_adult | start_year | end_year | runtime_minutes | genres

Faulty imported record (dataset line 32548) starts with:

select tconst, left(primary_title, 120) as left_of_primary_title from title_basics where tconst = ‘tt0033122’;
tconst | left_of_primary_title
±----------±---------------------------------------------------------------------------------------------+
tt0033122 | Swing it" magistern “Swing it” magistern 0 1940 \N 92 Comedy,Music
| tt0033123 short Swinging the Lambeth Walk Swinging th
(1 row)

And ends with the start of dataset line 71574:

select tconst, right(primary_title, 120) as right_of_primary_title from title_basics where tconst = ‘tt0033122’;
tconst | right_of_primary_title
±----------±---------------------------------------------------------------------------------------------+
tt0033122 | ,Sci-Fi
| tt0073044 short Giger’s Necronomicon Giger’s Necronomicon 0 1975 \N 40 Documentary,Short
| tt0073045 movie "Giliap
(1 row)

Length of field:

select tconst, length(primary_title) as length_of_primary_title from title_basics where tconst = ‘tt0033122’;
tconst | length_of_primary_title
±----------±------------------------+
tt0033122 | 3112640
(1 row)

The dataset is here:
https://www.imdb.com/interfaces/
title.basics.tsv.gz (103MB)

Thanks for your help.

I’ve found the root cause. I’ll report back with details in a day or two to spare you the running commentary.

@russ Fantastic, thank you for the detailed reproduction info. I understand the issue and have filed a more minimal reproduction here. It looks like this could be mitigated by allowing users to suppress our default CSV quotation mark escaping behavior. Feel free to chime in on the ticket if you have any additional details to provide.

@russ we have introduced a new data format DELIMITED that should solve your issue. It’s a simpler format than csv and it allows any quote characters in fields that are not quoted. Please try running from https://github.com/cockroachdb/cockroach/commit/d8c99dcf5fbf5f772f3350edfe429a9236057a56

IMPORT TABLE title_basics (
  tconst text primary key, title_type string, primary_title string, original_title string, is_adult int,
  start_year int, end_year int, runtime_minutes int, genres text
) DELIMITED DATA ('nodelocal:///title.basics.tsv') WITH fields_escaped_by ='\';

SELECT COUNT(*) FROM title_basics;
   count
+---------+
  6137393