/var/log/messages file is huge

I have a 5-node cluster running v19.1.3.

I am not exactly sure what the cluster is doing, but about every 8 minutes the nodes crash with

std::bad_alloc

The servers are KVM CentOS 7 Linux servers with 64GB of RAM. It looks like at the time of the crash the output of free -m shows:

total: 64265
used: 22246
free: 342
shared: 3207
buff/cache: 41294
available: 37912

My /var/log/messages file is rather large at 2.4GB just for the day. I see a lot of messages like this (sorry, I have to type everything manually; no copy/paste ability):

node: dbserver01 type=SYSCALL msg=audit arch=c000003e syscall=263 success=no exit=-2 items=1 comm="cockroach" exe="/data/cockroach-v19.1.3.linux-amd64/cockroach" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="delete"

There are tons of these that fill the log. I am working with the developer to see if there are deletes going on.

I saw this post

and it seems very similar to what we are seeing. However, I do not see any DELETE statements under Statements in the CockroachDB admin UI. My data directory is 250GB in size.

Is there a way to reduce the logging to /var/log/messages? As for the std::bad_alloc, it suggests we ran out of memory. However, the available memory under free -m always shows about 30GB free, then the node crashes; available jumps back to 50GB, counts back down to ~30GB, and the loop continues.

I have a 1GB swapfile. It seems like my first step might be to increase that, and then maybe increase the RAM on one of the members of the cluster to see if that node survives.
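
For reference, this is roughly what I'd run on CentOS 7 to add a bigger swapfile (the file name and size here are just examples):

sudo dd if=/dev/zero of=/swapfile2 bs=1M count=8192   # create an 8GB file
sudo chmod 600 /swapfile2
sudo mkswap /swapfile2
sudo swapon /swapfile2
# add "/swapfile2 swap swap defaults 0 0" to /etc/fstab to keep it across reboots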

I’m new to cockroachdb so any help is welcome.

thanks

So I was able to suppress all of the messages going into the /var/log/messages file. They were not SQL deletes, so something else is going on that keeps crashing my db nodes.
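
For anyone else who needs to quiet these: the key="delete" field on the records suggests they come from an auditd watch rule, so one option is to remove that rule (the path below is a placeholder; use whatever auditctl -l actually reports):

auditctl -l                                          # list active audit rules; look for one ending in -k delete
auditctl -W /data/cockroachdb-data -p wa -k delete   # remove that watch rule (the mirror of the -w that added it)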

When these memory settings are configured

cache .25
max-sql-memory .25

cockroachdb will crash once the process hits 22 GB. This happens consistently. I got this value using

pmap $(pgrep cockroach)
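
(If it helps anyone, pmap -x prints a summed total on its last line, which is a bit easier to read, assuming a single cockroach process:)

pmap -x $(pgrep cockroach) | tail -n 1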

If I modify the --cache parameter to .50, cockroach crashes at 28GB with std::bad_alloc

I’ve tested setting max-sql-memory to .50 but cockroach still crashed at 22GB

I'm a bit confused here because I would imagine my .50 setting would make the cache 32GB, yet I am crashing at 28GB with bad_alloc. I would think I'd have some more memory available before crashing.

In reading this: https://www.cockroachlabs.com/blog/memory-usage-cockroachdb/

it seems that max-sql-memory should keep my node from crashing due to poor SQL. I am guessing that the crash is caused because the cache setting does not have enough RAM (yet it appears there is RAM available).

I think there is probably some bad SQL somewhere. I am digging through it now to make sure everything is properly indexed.

My nodes still crash every 10 - 15 minutes.

I am still seeing the restarts. They are less frequent and my nodes stay up longer. I don’t understand.

my max-sql-memory is set to 32GB (.50 of 64GB)
cache is set to 16GB (.25 of 64)

Before getting terminate called after throwing an instance of 'std::bad_alloc', I see this in the log:

runtime stats: 22 GiB RSS, 364 goroutines, 303 MiB/172 MiB/613 MiB GO alloc/idle/total, 17 GiB/21 GiB CGO alloc/total, 399938.5 CGO/sec, 350.7/53.7 %(u/s)time, 0.1 %gc (9x), 32 MiB/36 MiB (r/w)net

So looks like 22GB allocated to the cockroachdb process. Shouldn’t there be RAM available based on my memory settings?

I have set my parameters as such

cache .25
max-sql-memory .50

and have not seen a crash in over an hour (nodes were crashing every 10-15 minutes). I think I may have found the sweet spot
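
For reference, this is roughly how the flags look on the start command now (the store path and join list are placeholders for my real ones, plus whatever security flags you normally pass):

cockroach start --cache=.25 --max-sql-memory=.50 --store=/data/cockroachdb-data --join=dbserver01,dbserver02,dbserver03 --background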

Well, CockroachDB is still crashing on the nodes. I'm not sure where else to look. The only really useful info from the logs is the std::bad_alloc.

I still do not know how to fully understand this line

runtime stats: 22 GiB RSS, 364 goroutines, 303 MiB/172 MiB/613 MiB GO alloc/idle/total, 17 GiB/21 GiB CGO alloc/total, 399938.5 CGO/sec, 350.7/53.7 %(u/s)time, 0.1 %gc (9x), 32 MiB/36 MiB (r/w)net

I am guessing it is crashing when a certain memory allocation hits 22GB?

With max-sql-memory at .50, I should have 32GB of RAM for SQL operations and 16GB for the RocksDB cache, so this 22GB number does not make sense to me, nor do the other numbers in that line.

We are going to try to increase the RAM on one of the servers and see if this fixes the issue or extends the length of time between crashes.

Increased system RAM from 64GB to 128GB and the CockroachDB nodes are still crashing with

cache .25
max-sql-memory .50

I don’t get it. Am I tweaking the wrong parameter? Should I be increasing cache?

5 nodes. ~295GB data per node

hey @jackweed,

Sorry for the delay in response. If you've increased the RAM and the nodes are still crashing, we'd like to look into this further; can you send over a debug zip and memory profiles? Could you also run sudo dmesg | grep -iC 3 "cockroach"? This will let us know for sure whether the process was killed because of insufficient memory.
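
For the debug zip, running something like the following against any node should work (swap --insecure for your certificate flags if the cluster is secure):

cockroach debug zip ./cockroach-debug.zip --insecure --host=dbserver01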

Lastly, could you access your admin UI and do the following:

  1. Navigate to Metrics > Runtime dashboard, and check the Memory Usage graph.
  2. Hover over the graph to see the values for the following metrics, and note the values you see when your nodes crash:
RSS: Total memory in use by CockroachDB.
Go Allocated: Memory allocated by the Go layer.
Go Total: Total memory managed by the Go layer.
CGo Allocated: Memory allocated by the C layer.
CGo Total: Total memory managed by the C layer.

You can send all of that here.

Thanks!

Ron,

Thanks for the reply. Unfortunately, I am unable to send any files; anything I send has to be typed by hand.
I know this complicates troubleshooting :frowning:

I will generate this debug zip, and if there is anything in particular I can look for, let me know.

All servers have 64GB of RAM except for one that has 128GB of RAM

The only server that does not have any output from the dmesg command is the server with 128GB of RAM.

Most output looks like

type=1307 audit(15793893.218:153049237): cwd="/home/cockroachdb"
type=1302 audit(15793893.218:153049237): item=0 name="/data/cockroachdb-data/auxiliary/sideloading/r6XXXX/r68901" objtype=UNKNOWN cap_fp=000 cap_fi=000 cap_fe=0 cap_fver=0
bash (5839): drop_caches: 1

From the UI, when the node crashes, the high point of the graph looks like this:

RSS 126.3 GiB
Go Allocated 1.7 GiB
Go Total 3.1 GiB
CGo Allocated 99.7 GiB
CGo Total 124.0 GiB

It looks like your CGo Allocated is much larger than your cache size; if cache is 25% of 128GiB, your cache is 32GiB, so this definitely indicates that something is wrong. Could you try the following (a condensed example session is sketched after these steps):

  1. stop the cockroach process
  2. in a terminal, start cockroach with gdb --args prepended to the cockroach start command (and --background removed if you are using it).
  3. this will give you a prompt where you can run b std::bad_alloc::bad_alloc() followed by run (or it may be continue? Please try both)
  4. wait for the crash, then run bt and look at the output
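
A condensed example of what that session looks like (the start flags below are placeholders for whatever you normally pass):

gdb --args cockroach start --cache=.25 --max-sql-memory=.50 --store=/data/cockroachdb-data
(gdb) b std::bad_alloc::bad_alloc()
(gdb) run
... wait for the crash, then ...
(gdb) bt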

Regards,

Ron

Thanks. So I am trying to test this but getting an error running gdb.

I need to run debuginfo-install glibc

however, I do not have any debuginfo repositories available at the moment. I'm looking to get in touch with the sysadmins to find out if we have any that I can use internally; my systems are not public-facing.
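
If it turns out we do have an internal mirror, my understanding is it would just be a matter of enabling it and pulling the glibc debuginfo, something like the following (base-debuginfo is the stock CentOS 7 repo id; an internal mirror may use a different one):

sudo yum --enablerepo=base-debuginfo install glibc-debuginfo
# or, with yum-utils installed:
sudo debuginfo-install glibc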