Cockroach node dies on delete from table

I'm evaluating CockroachDB for my company. I want to use it as a key-value store, so I set up a table with two binary columns, as suggested in the documentation. As part of a test, I inserted 1 million rows into a newly created table, then ran a number of ranged deletes (`delete from table where key >= b'foo' and key < b'foo2'`), and on the first delete, the node I was talking to just up and died. Has this happened to anyone else? Is it a known issue? I'm concerned about trusting my data to a system where such a simple command can kill off a node.

Can you provide more information about what happened? Do you have log files from the misbehaving node that you could share with us?

There should be a trace of why it died, either in Cockroach's log file or in the kernel log (for example, if Cockroach was using too much memory and was killed). Cockroach does seem to have some memory issues with moderately sized tables; I had to increase my nodes' memory from 8 GB to 16 GB to handle tables of only a few hundred MB…

How many of those million rows were covered by `delete from table where key >= b'foo' and key < b'foo2'`? Large deletions are currently a known weak spot, although there are supposed to be protections in place so they don't completely kill a node. The logs would be helpful for understanding what happened here.

Hello,
here is the tail of the log from when the node stopped running.

```
I170519 11:58:04.101010 91 server/status/runtime.go:227 [n1] runtime stats: 73 MiB RSS, 73 goroutines, 17 MiB/2.6 MiB/29 MiB GO alloc/idle/total, 16 MiB/31 MiB CGO alloc/total, 82.88cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (0x)
I170519 11:58:14.098174 90 gossip/gossip.go:452 [n1] gossip status (ok, 1 node)
gossip client (0/3 cur/max conns)
gossip server (0/3 cur/max conns, infos 0/0 sent/received, bytes 0B/0B sent/received)
I170519 11:58:14.098357 91 server/status/runtime.go:227 [n1] runtime stats: 73 MiB RSS, 74 goroutines, 13 MiB/6.3 MiB/29 MiB GO alloc/idle/total, 16 MiB/31 MiB CGO alloc/total, 77.22cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (1x)
I170519 11:58:21.287960 123 vendor/google.golang.org/grpc/transport/http2_client.go:1231 transport: http2Client.notifyError got notified that the client transport was broken EOF.
I170519 11:58:24.100742 91 server/status/runtime.go:227 [n1] runtime stats: 74 MiB RSS, 75 goroutines, 13 MiB/6.5 MiB/29 MiB GO alloc/idle/total, 16 MiB/31 MiB CGO alloc/total, 115.27cgo/sec, 0.01/0.00 %(u/s)time, 0.00 %gc (1x)
I170519 11:58:34.101168 91 server/status/runtime.go:227 [n1] runtime stats: 67 MiB RSS, 75 goroutines, 9.9 MiB/8.7 MiB/29 MiB GO alloc/idle/total, 9.2 MiB/25 MiB CGO alloc/total, 87.40cgo/sec, 0.04/0.00 %(u/s)time, 0.00 %gc (1x)
I170519 11:58:44.099434 91 server/status/runtime.go:227 [n1] runtime stats: 67 MiB RSS, 75 goroutines, 15 MiB/4.1 MiB/29 MiB GO alloc/idle/total, 9.2 MiB/25 MiB CGO alloc/total, 82.21cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (0x)
I170519 11:58:54.098395 91 server/status/runtime.go:227 [n1] runtime stats: 67 MiB RSS, 75 goroutines, 12 MiB/8.7 MiB/30 MiB GO alloc/idle/total, 9.2 MiB/25 MiB CGO alloc/total, 78.41cgo/sec, 0.01/0.00 %(u/s)time, 0.00 %gc (1x)
I170519 11:59:04.102166 91 server/status/runtime.go:227 [n1] runtime stats: 67 MiB RSS, 75 goroutines, 17 MiB/4.4 MiB/30 MiB GO alloc/idle/total, 9.2 MiB/25 MiB CGO alloc/total, 76.87cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (0x)
I170519 11:59:14.099118 90 gossip/gossip.go:452 [n1] gossip status (ok, 1 node)
gossip client (0/3 cur/max conns)
gossip server (0/3 cur/max conns, infos 0/0 sent/received, bytes 0B/0B sent/received)
I170519 11:59:14.099145 91 server/status/runtime.go:227 [n1] runtime stats: 67 MiB RSS, 75 goroutines, 12 MiB/8.2 MiB/30 MiB GO alloc/idle/total, 9.2 MiB/25 MiB CGO alloc/total, 83.13cgo/sec, 0.01/0.00 %(u/s)time, 0.00 %gc (1x)
I170519 11:59:24.100013 91 server/status/runtime.go:227 [n1] runtime stats: 68 MiB RSS, 75 goroutines, 18 MiB/3.1 MiB/30 MiB GO alloc/idle/total, 9.6 MiB/25 MiB CGO alloc/total, 157.39cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (0x)
I170519 11:59:34.098536 91 server/status/runtime.go:227 [n1] runtime stats: 68 MiB RSS, 75 goroutines, 13 MiB/7.5 MiB/30 MiB GO alloc/idle/total, 9.6 MiB/26 MiB CGO alloc/total, 77.21cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (1x)
```

The delete itself should have only covered a few rows. Thanks for any
thoughts.

And the node just crashes there, with the runtime stats as the last line of the log? I don’t suppose you can get the exit status of the process at this point, can you?
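For example, if you started the node in the foreground from a shell, the exit status is in `$?` right after the process dies, and any value above 128 means death by signal (status − 128 gives the signal number). A sketch of what to look for (the `cockroach start` invocation is shown as a comment since it depends on your setup; the `kill` line below just demonstrates the convention):

```shell
# Hypothetical: if you launched the node in the foreground, e.g.
#   cockroach start --insecure
# then right after it dies you can inspect the shell's $?:
#   echo "exit status: $?"
# A status above 128 means the process was killed by a signal
# (status - 128 = signal number). For example, a process killed
# with SIGKILL -- which is what the Linux OOM killer sends -- exits
# with 137 (= 128 + 9):
sh -c 'kill -KILL $$'
echo "exit status: $?"   # prints 137
```

An exit status of 137 would point strongly at the OOM killer rather than a crash inside Cockroach itself.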

What platform is this? If it's Linux (and the machine hasn't rebooted since this happened), can you look in the output of `dmesg` to see if the process was killed by the kernel's OOM killer? You're looking for a line containing the name cockroach, like `[4151526.006699] Out of memory: Kill process 55007 (cockroach) score 918 or sacrifice child`.
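A quick way to scan for that (assuming Linux; these are standard system commands, nothing Cockroach-specific):

```shell
# Search the kernel ring buffer for OOM-killer activity mentioning cockroach.
# dmesg may require root on some systems; 'journalctl -k' is an alternative
# on systemd machines.
dmesg 2>/dev/null | grep -iE 'out of memory|oom-kill' | grep -i cockroach
```

No output means no recorded OOM kill (or the ring buffer has since wrapped around).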

If you restart the node, does it work, or does it crash again? If it comes back up, were the rows in question deleted or are they still there? Can you crash it again with the same query or a similar one?

Hi, this is a MacBook Pro. This is a test I was asked to write for a company that would run a cluster of Linux boxes if it works out. It's a new job for me and I'm not familiar with this environment, but I can try to look up anything that might help you. I can tell you that when the node was restarted, all the rows were still there and appeared to be correct. This is all running from inside a big test program, so I'm not sure issuing the command from the shell is quite equivalent, but I will try it when I have a little time. They already have me working on something else, so I may not get to it today.