One node is dead after importing CSV into CockroachDB cluster

Hi,

After I imported the CSV into the CockroachDB cluster, one node in the cluster died. My cluster is deployed on three hosts. I am seeing the error below in my log.

I171107 18:42:13.660024 409 server/status/runtime.go:223  [n3] runtime stats: 55 GiB RSS, 634 goroutines, 131 MiB/94 MiB/379 MiB GO alloc/idle/total, 50 GiB/55 GiB CGO alloc/total, 434.64cgo/sec, 0.15/0.06 %(u/s)time, 0.00 %gc (1x)
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
SIGABRT: abort
PC=0x7fdc4b59c5d7 m=15 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 221579 [syscall, locked to thread]:
non-Go function
    pc=0x7fdc4b59c5d7
non-Go function
    pc=0x7fdc4b59dcc7
non-Go function
    pc=0x19781d2
non-Go function
    pc=0x18ecdb0
non-Go function
    pc=0x18ecdd3
non-Go function
    pc=0x18e9dc6
non-Go function
    pc=0x18ea2ea
non-Go function
    pc=0x18e91ba
non-Go function
    pc=0x17ee177
runtime.cgocall(0x1666810, 0xc430ca4b28, 0x1d431a7)
    /usr/local/go/src/runtime/cgocall.go:131 +0xe2 fp=0xc430ca4ae8 sp=0xc430ca4aa8
github.com/cockroachdb/cockroach/pkg/storage/engine._Cfunc_DBIterNext(0x7fdbe7a36758, 0x7fdbe7a36700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    github.com/cockroachdb/cockroach/pkg/storage/engine/_obj/_cgo_gotypes.go:453 +0x69 fp=0xc430ca4b28 sp=0xc430ca4ae8
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Next.func1(0x7fdbe7a36758, 0x275bf3f5199dbe00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1663 +0xae fp=0xc430ca4bd8 sp=0xc430ca4b28
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBIterator).Next(0xc43c71e000)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1663 +0x5a fp=0xc430ca4c88 sp=0xc430ca4bd8
github.com/cockroachdb/cockroach/pkg/storage.(*ReplicaDataIterator).Next(0xc43041c1e0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_data_iter.go:107 +0x34 fp=0xc430ca4ca0 sp=0xc430ca4c88
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).sha512(0xc421ebdc00, 0x162, 0xc42029e230, 0xa, 0x10, 0xc4212fbda0, 0x9, 0x10, 0xc4222619b0, 0x3, ...)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_command.go:2416 +0x3a7 fp=0xc430ca4e70 sp=0xc430ca4ca0
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).computeChecksumPostApply.func1(0x7fdc4c18f000, 0xc4224219e0)
    /go/src/github.com/cockroachdb/cockroach/pkg/storage/replica_proposal.go:424 +0x101 fp=0xc430ca4f68 sp=0xc430ca4e70
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc4208a8bd0, 0x7fdc4c18f000, 0xc4224219e0, 0xc422421ce0, 0x2b, 0x2adb4a0, 0xc42023d5c0, 0xc42fe4e000)
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:269 +0xe6 fp=0xc430ca4fa0 sp=0xc430ca4f68
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:2197 +0x1 fp=0xc430ca4fa8 sp=0xc430ca4fa0
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
    /go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:270 +0x133

I have no idea what to do next. I need some help. Thanks.

Hi @going. That std::bad_alloc error usually means that you’ve run out of
memory. How much RAM do your machines have?
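If you want to see how memory looked leading up to the crash, the node's log has periodic "runtime stats" lines like the one you pasted. A rough way to check, assuming the default cockroach-data/logs location (adjust the path to wherever your node writes its logs):

# OS-level view of memory on the host
free -h

# process-level view: RSS and CGO (RocksDB) allocations from the periodic runtime stats lines
grep "runtime stats" cockroach-data/logs/cockroach.log | tail -n 20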

If you restart the node, do you get this error again or does it work then?

Hi:
Thanks for your help. When I started the cluster, I used the command shown below. Were the --cache and --max-sql-memory parameters set too high, causing my machine to crash? I then restarted my cluster with --cache=10% and --max-sql-memory=10% and it has worked well since.

cockroach start --insecure --host=172.16.50.102 --join=172.16.50.101:26257,172.16.50.102:26257,172.16.50.103:26257 --cache=25% --max-sql-memory=25% --background
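
For comparison, the same command with the lower limits mentioned above looks like this (only the two percentages changed):

cockroach start --insecure --host=172.16.50.102 --join=172.16.50.101:26257,172.16.50.102:26257,172.16.50.103:26257 --cache=10% --max-sql-memory=10% --background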

[root@ceph2 ~]# free
              total        used        free      shared  buff/cache   available
Mem:      263863720    52790212   142086768       35680    68986740   210210552
Swap:       4194300      346852     3847448
[root@ceph2 ~]#
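
(free reports kibibytes by default, so the 263863720 total above is about 252 GiB, i.e. a machine with a nominal 256 GB of RAM, with roughly 200 GB still showing as available.)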

Hi @going.

We recommend --cache=25% --max-sql-memory=25% so that should have been
okay.

You have 256 GB of RAM? That should obviously be plenty and this would be
very unexpected, so I’d like to hear more about how it happened. Was it a
new cluster? How much csv data did you import? Was there any other query
traffic during the import?
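
For a rough sense of scale: 25% of ~256 GB is about 64 GB each for --cache and --max-sql-memory, so together the two limits only cover about half of the machine's memory.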

Hi @dan:

Thanks. The details: I initialized a cluster with three nodes and then imported CSV data into it. The dead node is not a new node. The CSV data I imported was generated by the BenchmarkSQL tool; when I generated the data, I set the parameter -warehouse=1000. The total size of the CSV files is 70 GB. I had imported seven of the nine tables, about 40 GB in total, before the node died. The node died at three o'clock in the morning. There was no import or any other operation running when the node went down, and there was no other query traffic during the import. The next day I found the node dead, looked at its log, and saw the record I described at the beginning. I hope I have described it clearly. Do you have any other details you'd like to know?

Thanks @going, hopefully that will be enough for us to try to reproduce the
problem. What you did should have worked. I filed an issue to track it,
https://github.com/cockroachdb/cockroach/issues/19968, and we'll follow up
if we have any more questions.