Crashed a node. What to do?

One of my three roaches on Google Cloud instances died while I was using cockroach sql to access it from home.

I don’t have details of what I did except that it involved hitting Control-C when I realized my query would return too many results to be reasonable.

I logged into the instance and restarted it with:

$ cockroach start --certs-dir=certs --background

Was that the right thing to do?

Anyway, now it is up again.

I don’t see any error messages in any logs.

Is there somewhere else to look for crash information?

Are there any logging options I should enable to get more info if this happens again?

In general, it’s a good idea to use some type of process manager to auto-restart cockroach nodes (or any server, really).

Anyway, the cool thing to do is to find anything you can to help us diagnose and fix the problem.

If the logs don’t say anything at all (just normal-looking entries, followed by the next startup), it is most likely due to either running out of memory (you can check dmesg for OOM killer logs) or running out of disk (a simple df should show how full it is). Although out-of-disk would crash again pretty quickly.
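If it helps, those two checks can be run like this (a minimal sketch; the data directory path is an assumption, so point df at wherever your store lives):

```shell
# Look for OOM-killer activity in the kernel log
# (may print nothing if there was no OOM kill; needs read access to the kernel ring buffer)
dmesg 2>/dev/null | grep -iE 'out of memory|killed process' || true

# Check how full the filesystem holding the store is
# (replace . with the cockroach data directory, e.g. ~/cockroach-data)
df -h .
```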

Diagnosing OOM issues usually starts with looking at the existing logs and any information you have about running queries and existing schemas. On our instances we also enable regular memory profiles, if you’re interested we can let you know how to get those (they’re not currently documented).

Actually, I was about to ask what the best way is to maintain a roach with systemd, as I do for all our other services. I was about to try to create a service file that just executed:

/pathto/cockroach start --certs-dir=the/certs/directory
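Something like this is what I had in mind (paths, user, and certs directory are placeholders; note there is no --background flag, since systemd expects to manage the foreground process itself):

```ini
# /etc/systemd/system/cockroach.service  (hypothetical unit file)
[Unit]
Description=CockroachDB node
After=network.target

[Service]
Type=simple
User=cockroach
ExecStart=/pathto/cockroach start --certs-dir=/pathto/certs
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

Then presumably a `systemctl daemon-reload` and `systemctl enable --now cockroach` would get it auto-restarting.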

I’m all up for doing the cool thing and will provide any info I can find.

So far I’ve found nothing like an out-of-memory (OOM) kill in dmesg or anywhere else.

Disk should not be a problem. There is 17G free. The database is only a few megabytes.

One interesting message I found was this:

TCP: request_sock_TCP: Possible SYN flooding on port 26257. Sending cookies.  Check SNMP counters

I’d certainly be interested in the memory profiles thing.

I can quite reliably crash a cockroach node.

Basically by starting a query that will return hundreds of thousands of rows and then hitting Control-C.

For what it’s worth, this:

FROM series_1
INNER JOIN sources
    ON series_1.source_id = sources.source_id;
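In script form, the repro is roughly this (host and certs path are placeholders, and I’ve put SELECT * where my real column list goes):

```shell
# Hypothetical reproduction sketch: start a large join, then interrupt the client.
cockroach sql --certs-dir=certs --host=crdb-1 \
  -e "SELECT * FROM series_1 INNER JOIN sources ON series_1.source_id = sources.source_id;" &
sleep 5
kill -INT $!   # the equivalent of hitting Control-C
```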

Before I start expanding on the details I think I’ll update to the latest cockroach release.

I upgraded my three roaches to v1.1.2 and ran my killer query above a few times.

That managed to kill 2 nodes at the same time!

I have not finalized the upgrade yet. When I have done that I’ll try again and start digging harder for clues if the problem persists.

I finalized the upgrade to 1.1.2. Not patient enough to wait a day as the instructions suggest.
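(For reference, finalizing meant setting the cluster version from a SQL shell; a sketch, assuming certs in ./certs:)

```shell
# Finalize the upgrade to 1.1 (run once, from any node; certs path is an assumption)
cockroach sql --certs-dir=certs -e "SET CLUSTER SETTING version = '1.1';"
```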

My killer query above still brings down a node or two when aborted with Control-C.

Luckily I found some “panic” messages in the logs:

I171127 12:17:52.708796 13 server/server.go:1090  [n1] done ensuring all necessary migrations have run
I171127 12:17:52.708817 13 server/server.go:1092  [n1] serving sql connections
I171127 12:17:52.708962 13 cli/start.go:582  node startup completed:
CockroachDB node starting at 2017-11-27 12:17:52.708870966 +0000 UTC (took 5.3s)
build:      CCL v1.1.2 @ 2017/11/02 19:32:03 (go1.8.3)
admin:      https://crdb-1:8080
sql:        postgresql://root@crdb-1:26257?application_name=cockroach&sslmode=verify-full&sslrootcert=certs%2Fca.crt
logs:       /home/rsm/cockroach-data/logs
store[0]:   path=/home/rsm/cockroach-data
status:     restarted pre-existing node
clusterID:  32b1e5c6-39bc-434b-abec-72ffecb1efaa
nodeID:     1
I171127 12:17:52.711541 42 storage/replica_proposal.go:453  [n1,s1,r103/1:/System/tsd/cr.node.sql.mem.c…] new range lease repl=(n1,s1):1 start=1511785034.827629451,0 epo=31 pro=1511785070.350425519,0 following repl=(n1,s1):1 start=1511785034.827629451,0 epo=31 pro=1511785034.827632665,0
I171127 12:17:52.712341 43 storage/replica_proposal.go:453  [n1,s1,r24/1:/System/tsd/cr.node.sys.c…] new range lease repl=(n1,s1):1 start=1511785054.198543741,0 epo=31 pro=1511785070.365554103,0 following repl=(n1,s1):1 start=1511785054.198543741,0 epo=31 pro=1511785054.198547089,0
panic: double close

goroutine 1808 [running]:*distSQLReceiver).ProducerDone(0xc420e46b40)
        /go/src/ +0x85, 0xc420ce7f20, 0x2d5f0c0, 0xc420e46b40, 0x0, 0x0, 0xc4227ebf90, 0x2, 0x2)
        /go/src/ +0x23b*mergeJoiner).Run(0xc421b94000, 0x7fa5f3ee3420, 0xc4203a24c0, 0xc42115a718)
        /go/src/ +0x28e
created by*Flow).Start
        /go/src/ +0x3fd


info: {SettingName:version Value:1.1 User:root}
I171127 12:14:49.527539 75 server/status/runtime.go:223  [n1] runtime stats: 245 MiB RSS, 141 goroutines, 40 MiB/20 MiB/72 MiB GO alloc
/idle/total, 125 MiB/142 MiB CGO alloc/total, 321.30cgo/sec, 0.04/0.01 %(u/s)time, 0.00 %gc (1x)
I171127 12:14:59.527606 75 server/status/runtime.go:223  [n1] runtime stats: 246 MiB RSS, 141 goroutines, 42 MiB/19 MiB/72 MiB GO alloc
/idle/total, 125 MiB/142 MiB CGO alloc/total, 326.70cgo/sec, 0.03/0.01 %(u/s)time, 0.00 %gc (1x)
I171127 12:15:09.527657 75 server/status/runtime.go:223  [n1] runtime stats: 247 MiB RSS, 141 goroutines, 43 MiB/18 MiB/72 MiB GO alloc
/idle/total, 133 MiB/150 MiB CGO alloc/total, 306.30cgo/sec, 0.04/0.01 %(u/s)time, 0.00 %gc (1x)
panic: too many ProducerDone() calls

goroutine 592120 [running]:*MultiplexedRowChannel).ProducerDone(0xc4219dc180)
        /go/src/ +0x98, 0xc422d85590, 0x2d5f300, 0xc4219dc180, 0x0, 0x0, 0xc420cfbf90, 0x2, 0x2)
        /go/src/ +0x23b*mergeJoiner).Run(0xc4211a8000, 0x7f9b12e7b310, 0xc4219dc240, 0xc42261f398)
        /go/src/ +0x28e
created by*Flow).Start
        /go/src/ +0x3fd


Let me know if there is any other info I could pull out of this.

Ah, glad you found one.
This looks like the bug fixed in

As luck would have it, this patch has been cherry-picked into 1.1.3, which is scheduled to be released today.

Can you wait a few hours and try again when 1.1.3 is out? I can ping here when it’s ready.

Wow, you guys are hot!

That certainly looks like it could be the issue.

I’ll be eagerly standing by, ready to run 1.1.3 up the flagpole(s).

v1.1.3 is now available.


Thanks. Amazing. The upgrade went smoothly.

I ran my killer query ten or twenty times. No problem. No more crashes.

I let it complete. It only took 12 minutes! No problem.

I’m prepared to say my issue is fixed.

Only slightly annoying thing is that when I hit Control-C to abort a query the whole session is killed and I have to start again.

Well done everybody.

Now it’s time to start stressing my roaches…