Kernel:[3211908.377411] BUG: soft lockup - CPU#0 stuck for 30s! [cockroach:28447]

A few days ago I set up a low-spec 3-node cluster (3 servers) running Cockroach 1.0.2 on a Vultr VPS. I left an SSH session open with the command-line client connected. When I switched back to that terminal window, I found the following errors:

Message from syslogd@db2 at Jun 19 00:46:19 …
kernel:[3211908.377411] BUG: soft lockup - CPU#0 stuck for 30s! [cockroach:28447]

Message from syslogd@db2 at Jun 19 07:45:50 …
kernel:[3237079.650472] BUG: soft lockup - CPU#0 stuck for 43s! [cockroach:28469]

  • ERROR: [n2] Reported as error 9aa4cb8ea8e54799ba7c634d93dc89db

I’ve been using Vultr for database, web, and other hosting for a couple of years and have never run into this sort of message.

What hardware configuration are you using? One cause of this message is an overcommitted VM. We recommend at least 2GB of RAM per node, and have seen issues when running with only 1GB. Running out of memory most often means that the process just dies, although depending on how the hosting provider handles provisioning I could imagine seeing this error instead (especially for lower-spec VMs).

If you’re not already doing so, try upgrading to VMs with more RAM. You might also be able to make it work by setting the --cache flag to a smaller value (it defaults to 25% of physical memory), but we don’t have any specific configuration recommendations at this time for machines with less than 2GB.
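To make the cache arithmetic concrete, here is a rough Python sketch. It only computes byte counts: the 25% default comes from the description above, while the 12.5% fraction and the helper name `cache_bytes` are hypothetical choices for illustration, not a tested recommendation.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def cache_bytes(total_ram_bytes: int, fraction: float = 0.25) -> int:
    """Return a cache size in bytes as a fraction of physical memory.

    The 0.25 default mirrors the "25% of physical memory" default
    described above; any other fraction is an assumption to tune.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return int(total_ram_bytes * fraction)


# A 1GB node: the default cache versus a hypothetical smaller one.
default_cache = cache_bytes(1 * GIB)         # 25% of RAM
reduced_cache = cache_bytes(1 * GIB, 0.125)  # 12.5% of RAM, illustrative only

print(f"default cache: {default_cache} bytes")
print(f"reduced cache: {reduced_cache} bytes (candidate value for --cache)")
```

The reduced byte count is the kind of value you could try passing to the --cache flag on a small node. Check the flag's accepted formats for your version before relying on it.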

That makes sense. I thought it would be OK to use low-spec VMs for some initial testing (there was zero load on the server when the error occurred), but I will upgrade the servers and see if the error goes away.


On a related note (I think this is related enough not to warrant the creation of a new thread), how would a Cockroach cluster fare on small-ish VMs with large available swap space? Wouldn’t this fix the crash issue, if the OS is terminating the cockroach process once it runs out of RAM?

What is the recommendation when it comes to swap space?

For production usage, you don’t want much swap (it’s debatable whether you want to disable swap completely or run with a small amount of swap). For development/testing purposes, though, swap should be fine, but slow. It’s better to reduce the --cache flag than to let the cache grow to its maximum size and rely on swap to keep things from dying.
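If you want to check how much RAM and swap a test node actually has, a small Python sketch along these lines may help. It assumes a Linux host that exposes `/proc/meminfo`. The function names and the roughly-2GB threshold are hypothetical, chosen only to mirror the 2GB-per-node guidance in this thread.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_report(text: str, min_ram_kb: int = 1_900_000) -> dict:
    """Summarize RAM and swap, flagging nodes below a rough RAM threshold.

    min_ram_kb is a hypothetical value: "about 2GB", with some slack
    because MemTotal is usually a little below the nominal VM size.
    """
    info = parse_meminfo(text)
    ram = info.get("MemTotal", 0)
    swap = info.get("SwapTotal", 0)
    return {
        "ram_kb": ram,
        "swap_kb": swap,
        "below_recommended_ram": ram < min_ram_kb,
    }


# Sample input resembling a 1GB VM with 2GB of swap (made-up numbers).
SAMPLE = """MemTotal:        1016164 kB
MemFree:          123456 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""

report = memory_report(SAMPLE)
print(report)

# On a real Linux node you could instead run:
#   with open("/proc/meminfo") as f:
#       print(memory_report(f.read()))
```

A node that reports plenty of swap but falls below the RAM threshold is the case discussed above: it may survive for testing, but reducing --cache is still the better first step.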