Import in 1.2-alpha.20171204


(Ronald) #1

Hi,
I was a bit surprised with the change about the prefix of the nodelocal location but found my way out.

In the extern subdirectory in ds1 I created a few symbolic links pointing to the original locations:
[cockroach@ip-172-31-39-252 extern]$ ls -l
total 0
lrwxrwxrwx. 1 cockroach cockroach 28 Dec 5 15:16 imp_tmp -> /media/nfs/cockroach/imp_tmp
lrwxrwxrwx. 1 cockroach cockroach 27 Dec 5 15:03 zabbix -> /media/nfs/cockroach/zabbix
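
For reference, the links were created with something along these lines (same paths as in the listing above):

    cd /var/lib/cockroachdata/ds1/extern
    ln -s /media/nfs/cockroach/imp_tmp imp_tmp
    ln -s /media/nfs/cockroach/zabbix zabbix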

Then I changed the import command to:
\set show_times
drop table if exists maintenances;

import TABLE maintenances (
maintenanceid bigint NOT NULL,
name varchar(128) NOT NULL,
maintenance_type integer NOT NULL,
description text NOT NULL,
active_since integer NOT NULL,
active_till integer NOT NULL,
PRIMARY KEY (maintenanceid)
)
csv data ('nodelocal:/zabbix/data/maintenances.csv')
with temp='nodelocal:/imp_tmp/maintenances', nullif = '\N';

rm -rf /media/nfs/cockroach/imp_tmp/*
cockroach sql -d zabbix <imp_all
DROP TABLE

Time: 694.531µs

pq: importing 3 ranges: fetching "1.sst": open /var/lib/cockroachdata/ds1/extern/imp_tmp/maintenances/1.sst: no such file or directory
Error: pq: importing 3 ranges: fetching "1.sst": open /var/lib/cockroachdata/ds1/extern/imp_tmp/maintenances/1.sst: no such file or directory
Failed running "sql"

The strange thing is that the file is present:
[cockroach@ip-172-31-39-252 zabbix]$ ls -l /var/lib/cockroachdata/ds1/extern/imp_tmp/maintenances/
total 8
-rw-r--r--. 1 cockroach cockroach 1808 Dec 5 15:20 1.sst
-rw-r--r--. 1 cockroach cockroach 714 Dec 5 15:20 BACKUP

Another strange thing is that it did work, once … and after a retry this failure came.
I clean up the imp_tmp directory after every run…

Any ideas about what I am doing wrong here?


(Diana Hsieh) #2

Sorry about that nodelocal change, by the way. Some of our users were using it in a way it wasn't intended for, so we made a change to prevent that.

Regarding this issue, it looks like @mjibson would have the most context to provide a good answer.


(Matt Jibson) #3

You say it worked once, so I’m wondering about the backing store of that directory. It looks like you are using an NFS mount. I wonder if there’s some race condition in your NFS drive where a write isn’t immediately readable, which is causing this to happen. If you are only on a single node cluster, what happens when you don’t use an NFS directory?
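
To make that test concrete, one hedged option (paths taken from the earlier listing; adjust to yours) is to stage the CSVs on local disk under extern instead of going through the NFS-backed symlink:

    # hypothetical: copy the CSV onto local disk instead of using the NFS symlink
    mkdir -p /var/lib/cockroachdata/ds1/extern/zabbix/data
    cp /media/nfs/cockroach/zabbix/data/maintenances.csv \
       /var/lib/cockroachdata/ds1/extern/zabbix/data/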


(Ronald) #4

I have to retest this, with a little luck tomorrow. I only created the symlinks on the node where I started the import. That could be the cause, if all nodes have to be able to access the same path.


(Matt Jibson) #5

Yes, that is a requirement. If you use nodelocal, then all nodes must have exactly the same view of the files as all the others. This is because any node can be assigned a segment of the data to import (i.e., not just the node the SQL command was issued on).
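
One rough sketch of how that can look (paths as above; flags are an assumption, adjust to your version) is to either create the same symlinks under extern on every node, or start every node with its external IO directory pointing at the shared NFS mount:

    # run on every node so they all resolve nodelocal paths to the same files
    cockroach start --store=/var/lib/cockroachdata/ds1 \
      --external-io-dir=/media/nfs/cockroach \
      <other flags as usual>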


(Ronald) #6

I got a little further. I changed the startup and specified the external IO dir pointing to the directory on NFS, for all nodes.

drop table if exists alerts;
import TABLE alerts (
alertid bigint NOT NULL,
actionid bigint NOT NULL,
eventid bigint NOT NULL,
userid bigint NULL,
clock integer NOT NULL,
mediatypeid bigint NULL,
sendto varchar(100) NOT NULL,
subject varchar(255) NOT NULL,
message text NOT NULL,
status integer NOT NULL,
retries integer NOT NULL,
error varchar(128) NOT NULL,
esc_step integer NOT NULL,
alerttype integer NOT NULL,
PRIMARY KEY (alertid)
)
csv data ('nodelocal:/zabbix/data/alerts.csv')
with temp='nodelocal:/imp_tmp/alerts', nullif = '\N';
pq: IO error: While open a file for appending: /var/lib/cockroachdata/ds1/cockroach-temp073974449/000006.log: No such file or directory
Error: pq: IO error: While open a file for appending: /var/lib/cockroachdata/ds1/cockroach-temp073974449/000006.log: No such file or directory
Failed running "sql"

Quite a list of tables was imported before this failure came.


(Ronald) #7

I got a little further. I changed the startup and specified the external IO dir pointing to the directory on NFS, for all nodes.

drop table if exists alerts;
import TABLE alerts (
snip
PRIMARY KEY (alertid)
)
csv data ('nodelocal:/zabbix/data/alerts.csv')
with temp='nodelocal:/imp_tmp/alerts', nullif = '\N';
pq: IO error: While open a file for appending: /var/lib/cockroachdata/ds1/cockroach-temp073974449/000006.log: No such file or directory
Error: pq: IO error: While open a file for appending: /var/lib/cockroachdata/ds1/cockroach-temp073974449/000006.log: No such file or directory
Failed running "sql"

Quite a list of tables was imported before this failure came.

fwiw:

  1. the cockroach-temp073974449 directory from the "No such file or directory" error does not exist on any of my nodes
  2. the imp_tmp/alerts/ directory is empty after this error (and was created during the import)

(Matt Jibson) #8

The cockroach-temp directory there is used to store intermediate values. It is cleared on node startup. Is the ds1 directory also on NFS? Is it possible another process is deleting that directory? The error message strongly suggests that something other than the cockroach process that created the file deleted it.


(Ronald) #9

The ds1 directory is on local storage, not on nfs.
Afaik there are no other processes interfering with contents of ds1. I will try to retest tomorrow.


(Ronald) #10

I ran a test with only 1 node, insecure.
(for some reason I seem to be unable to create a secure config now; could be me)

> CREATE INDEX history_1 ON history (itemid,clock);
CREATE INDEX

Time: 15h23m16.027320431s

> drop table if exists history_uint;
DROP TABLE

Time: 32.211197ms

> import TABLE history_uint (
        itemid                   bigint                                    NOT NULL,
        clock                    integer                        NOT NULL,
        value                    numeric(20)                    NOT NULL,
        ns                       integer                        NOT NULL
)
             csv data ('nodelocal:/zabbix/data/history_uint.csv')
            with temp='nodelocal:/imp_tmp/history_uint', nullif = '\N';
pq: result is ambiguous (job lease expired)
Error: pq: result is ambiguous (job lease expired)
Failed running "sql"

In the job log, the restore is marked as completed:
303863216947167233 RESTORE csv.* FROM 'nodelocal:///imp_tmp/history_uint' root 2 hours ago
Succeeded
Duration: 00:25:52
303847896803409921 IMPORT TABLE history_uint (itemid BIGINT NOT NULL, clock INTEGER NOT NULL, value NUMERIC(20) NOT NULL, ns INTEGER NOT NULL) CSV DATA ('nodelocal:///zabbix/data/history_uint.csv') WITH "nullif" = e'\N', temp = 'nodelocal:///imp_tmp/history_uint', transform_only root 3 hours ago
Succeeded
Duration: 01:17:55
303666374881509377 CREATE INDEX history_1 ON zabbix.history (itemid, clock) root 19 hours ago
Succeeded
Duration: 15:23:11

I ran this with cache size 50% and SQL cache 25% (1G memory and 4G swap). Compared to running with 25% cache, this was a lot slower …
I run every test with a freshly created cluster.
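
For completeness, the start flags behind those percentages looked roughly like this (a sketch, assuming the percentages map onto the usual --cache and --max-sql-memory start flags):

    cockroach start --insecure --store=/var/lib/cockroachdata/ds1 \
      --cache=50% --max-sql-memory=25%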


(Ronald) #11

a new retest failed again, this time one table further on the road:

> CREATE INDEX history_uint_1 ON history_uint (itemid,clock);
CREATE INDEX

Time: 42h57m3.574148574s

> drop table if exists history_str;
DROP TABLE

Time: 9.611227ms

> import TABLE history_str (
        itemid                   bigint                                    NOT NULL,
        clock                    integer                        NOT NULL,
        value                    varchar(255)                    NOT NULL,
        ns                       integer                        NOT NULL
)
             csv data ('nodelocal:/zabbix/data/history_str.csv')
            with temp='nodelocal:/imp_tmp/history_str', nullif = '\N';
pq: result is ambiguous (job lease expired)
Error: pq: result is ambiguous (job lease expired)
Failed running "sql"

Would it make sense to proceed to the next alpha release?
And again, the job listing shows the last restore succeeded:
304578216009138177 RESTORE csv.* FROM 'nodelocal:///imp_tmp/history_str' root 13 hours ago
Succeeded
Duration: 00:02:40
304577495475093505 IMPORT TABLE history_str (itemid BIGINT NOT NULL, clock INTEGER NOT NULL, value VARCHAR(255) NOT NULL, ns INTEGER NOT NULL) CSV DATA ('nodelocal:///zabbix/data/history_str.csv') WITH "nullif" = e'\N', temp = 'nodelocal:///imp_tmp/history_str', transform_only root 13 hours ago
Succeeded
Duration: 00:03:39
304070826410147841 CREATE INDEX history_uint_1 ON zabbix.history_uint (itemid, clock) root 2 days ago
Succeeded
Duration: 42:56:59


(Matt Jibson) #12

The error there says "result is ambiguous", which is actually true. It means that the node you were connected to had a (possibly minor) liveness outage and decided it didn't know whether it was still connected to a quorum. If you look in your admin UI under the jobs tab, you will see the job and whether it is still running. IMPORT happens in 2 phases: IMPORT then RESTORE. If it had progressed to the RESTORE phase, then this ambiguous result error is fine and it will complete eventually (RESTORE is resilient to node failures). If it was still in the IMPORT phase then it needs to be restarted, because IMPORT is not resilient to node failures.

We're fully aware this is a bad situation and already have plans on how to improve it. But for now, check the admin UI jobs tab to see what's going on.
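
If you prefer the command line over the admin UI, something like this shows the same job list (assuming your build supports SHOW JOBS):

    cockroach sql --insecure -e "SHOW JOBS;"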


(Matt Jibson) #13

Ah! I’ve been able to reproduce the “No such file or directory” bug above by running two imports at the same time. So definitely our fault and not yours. Investigating.


(Ronald) #14

It also did not reproduce when running with a 1-node ‘cluster’.
The error there says “result is ambiguous”, which is actually true … I have my doubts: running locally on a 1-node ‘cluster’ there is no connectivity problem. I have been monitoring memory and swap usage, and there are no problems there, so there is more to it.
Also strange: the error occurred after the restore part of the import had completed according to the jobs overview, yet the client did not report a completed import. In the last example the restore part of the history_str table succeeded, but the last report on the client was the echo of the import command.


(Matt Jibson) #15

The error can happen on a 1-node cluster, even though it is fairly weird sounding. It’s not a connectivity issue between your client and server. Instead, it’s a message that the server isn’t exactly sure what happened with itself, sometimes due to a disk write taking longer than expected.


(Ronald) #16

If there is anything I can do to help debugging, just let me know.


(Matt Jibson) #17

I think I figured out the “No such file or directory” error. If you are running an IMPORT and then accidentally run “cockroach start” in the same directory, it will delete the temp store used by IMPORT, which will produce that error message. It happened to me because I mistyped something, and I would wager the same thing happened to you. Does that explanation seem likely?
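
Roughly, the accident looks like this (a hypothetical shell session illustrating the scenario above, not your exact commands):

    # node already running with this store, IMPORT in progress
    cockroach start --insecure --store=/var/lib/cockroachdata/ds1 &
    # ... accidentally reuse the same command from shell history ...
    cockroach start --insecure --store=/var/lib/cockroachdata/ds1
    # the second start clears the cockroach-temp* scratch space, and the
    # running IMPORT then fails with "No such file or directory"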


(Ronald) #18

Oops… I won't say it's impossible. But when using bash command history over and over again … it could happen. This was when running 4 nodes… :thinking:
Started all 4
Returned to node1
Started import job in background…
It could be possible.:flushed:


(Ronald) #19

The ‘ambiguous’ error is worrying me more. Just try to import 120 tables …
I never reached the end 😳
Memory conditions are suboptimal but never hit zero. Index creation is slow, but I assume that is caused by the low memory; it does succeed.
At the end of the restore phase this error pops up. In the jobs overview it says ‘succeeded’; in the client logfile there is no mention of a result at all, just the error.


(Matt Jibson) #20

Yes, that's exactly how the ambiguous error works. We have a plan to solve it, which we hope to implement by 2.0 (the next major release).