Deleting a large table crashes the cluster

I have set up a 3-node cluster to evaluate the use of CockroachDB for storing a large table. In my testing, I found a way to reproducibly crash the cockroach daemon on every node in the cluster.

I created a table called “bucket”, and generated 165 million rows of artificial data. When I execute a “DELETE FROM bucket” SQL command via the psycopg2 Python driver, the memory usage of the cockroach process increases rapidly until the machine runs out of memory and the process crashes with a std::bad_alloc error. Here are a few lines of context from the error log around the crash:

terminate called after throwing an instance of 'terminate called recursively
terminate called recursively
terminate called recursively
  what():  std::bad_alloc
SIGABRT: abort

The cluster consists of 3 virtual machines running CentOS 7.4. Each machine has 4 vCPUs, 8GB RAM, 10GB of swap space, and 100GB of disk space. The size of the “bucket” table at the time of the crash is 8GB with 462 ranges. Memory overcommitting is disabled at the kernel level (sysctl vm.overcommit_memory=2).

I also discovered that when I try running the same DELETE query via the cockroach CLI, it rejects the query as being unsafe.

Here’s the error message:

$ cockroach sql --insecure
# Welcome to the cockroach SQL interface.
# All statements must be terminated by a semicolon.
# To exit: CTRL + D.
# Server version: CockroachDB CCL v19.1.3 (x86_64-unknown-linux-gnu, built 2019/07/08 18:24:39, go1.11.6) (same version as client)
# Cluster ID: d1a47f5c-a202-484d-845f-799e94f9e4df
# Enter \? for a brief introduction.
root@:26257/defaultdb> delete from bucket;
pq: rejected: DELETE without WHERE clause (sql_safe_updates = true)

Hey @heily,

Thank you for bringing this up.

I’ve created a GitHub issue for this error.

For large deletes like this (where you are deleting all rows), TRUNCATE is the better option.
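When the table has to survive (so TRUNCATE isn’t an option), the usual workaround is to delete in bounded batches so no single transaction has to hold all the row intents in memory. Here’s a minimal sketch of that loop, assuming your CockroachDB version accepts LIMIT on DELETE; the “bucket” table name comes from this thread, and execute is a stand-in for a real psycopg2 cursor (e.g. cursor.execute(stmt) followed by cursor.rowcount):

```python
# Delete a large table in bounded batches so no single transaction
# has to buffer millions of row intents in memory.

def batched_delete_statements(table, batch_size):
    """Yield DELETE statements that each remove at most batch_size rows."""
    # CockroachDB extends DELETE with LIMIT; plain PostgreSQL would need
    # a subquery on the primary key instead.
    while True:
        yield f"DELETE FROM {table} WHERE true LIMIT {batch_size}"

def delete_in_batches(execute, table, batch_size=10000):
    """Call execute(sql) until it reports fewer rows than the batch size."""
    total = 0
    for stmt in batched_delete_statements(table, batch_size):
        deleted = execute(stmt)   # real code: cursor.execute(stmt); cursor.rowcount
        total += deleted
        if deleted < batch_size:  # short batch: the table is now empty
            return total

# Stub executor standing in for a live cursor, to show the loop shape.
remaining = [25000]
def fake_execute(sql):
    n = min(10000, remaining[0])
    remaining[0] -= n
    return n

print(delete_in_batches(fake_execute, "bucket"))  # prints 25000
```

Each batch commits as its own implicit transaction, which is what keeps the memory footprint flat compared to one 165-million-row DELETE.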

As for the CLI error, this is a safeguard against performing a DELETE without a WHERE clause (it is rejecting a potentially dangerous SQL statement).

You can disable this check by changing the session setting: set sql_safe_updates = false;

However, please note that you will then no longer be protected from queries like this one that delete the whole table. You can read more about it here; see the section about sql_safe_updates.


Hi Matt,

Thanks for looking into this. I will switch to TRUNCATE for my personal testing in the future.

However, in a typical production database deployment, we give users the ability to run arbitrary queries, and they expect that big queries will be slow but eventually complete. We can’t control how many rows are affected by each query, or the size of the transaction.

Are there other situations where a badly designed query can cause CockroachDB to consume all available memory?

Hey Mark,

Not that I am aware of; the issue you raised is a bug.

However, if you come across a poorly designed query that is consuming all available memory, please let us know.


I was trying to make a copy of the big table before truncating it, so I ran the following query:

create table bigtest as select * from bucket;

This caused one of the cockroach nodes to crash. I was running it in the foreground, and it didn’t show any error messages on stdout, but the process exited with return code 2.

Here’s the backtrace from the logs on the server node that crashed:

fatal error: runtime: out of memory
runtime stack:
runtime.throw(0x32ec9f0, 0x16)
        /usr/local/go/src/runtime/panic.go:608 +0x72
runtime.sysMap(0xc2a8000000, 0x90000000, 0x5bba258)
        /usr/local/go/src/runtime/mem_linux.go:156 +0xc7
runtime.(*mheap).sysAlloc(0x5b74ec0, 0x90000000, 0x40014b86ba, 0x7fe0f6ffe9a0)
        /usr/local/go/src/runtime/malloc.go:619 +0x1c7
runtime.(*mheap).grow(0x5b74ec0, 0x47d2b, 0x0)
        /usr/local/go/src/runtime/mheap.go:920 +0x42
runtime.(*mheap).allocSpanLocked(0x5b74ec0, 0x47d2b, 0x5bba268, 0x2030a900000000)
        /usr/local/go/src/runtime/mheap.go:848 +0x337
runtime.(*mheap).alloc_m(0x5b74ec0, 0x47d2b, 0x720100, 0x7fe0481ba000)
        /usr/local/go/src/runtime/mheap.go:692 +0x119
        /usr/local/go/src/runtime/mheap.go:759 +0x4c
runtime.(*mheap).alloc(0x5b74ec0, 0x47d2b, 0x7fe048010100, 0x10a)
        /usr/local/go/src/runtime/mheap.go:758 +0x8a
runtime.largeAlloc(0x8fa56000, 0xc000060001, 0xaa)
        /usr/local/go/src/runtime/malloc.go:1019 +0x97
        /usr/local/go/src/runtime/malloc.go:914 +0x46
        /usr/local/go/src/runtime/asm_amd64.s:351 +0x66

Here are the logs from the client side:

root@:26257/defaultdb> create table bigtest as select * from bucket;
driver: bad connection
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the transaction status: dial tcp connect: connection refused
warning: connection lost!
opening new connection: all session settings will be lost
warning: error retrieving the database name: dial tcp connect: connection refused
root@:26257/? ?>

Thanks for letting me know!

We have some significant improvements for create table <table1> as select * from <table2> in our upcoming 19.2 release.

We are aware of this performance issue; you can view it here.
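Until those 19.2 improvements land, one way to sidestep the CTAS memory blow-up is to create the destination table up front and copy rows over in primary-key-ordered batches, so no single transaction materializes the whole table. This is a rough sketch of the pagination logic, not the product’s fix: the table names echo this thread, the id key column is an assumption, and execute stands in for a real psycopg2 cursor that returns the number of rows copied plus the last key seen (e.g. via a RETURNING clause):

```python
# Keyset-paginated table copy: many small INSERT ... SELECT transactions
# instead of one giant CREATE TABLE ... AS SELECT.

def copy_batch_sql(src, dst, key, after, limit):
    """Build one page of the copy, paginated on a unique `key` column."""
    where = f"WHERE {key} > {after} " if after is not None else ""
    return (f"INSERT INTO {dst} SELECT * FROM {src} "
            f"{where}ORDER BY {key} LIMIT {limit}")

def copy_in_batches(execute, src, dst, key="id", limit=10000):
    """execute(sql) must return (rows_copied, last_key_in_batch)."""
    after, total = None, 0
    while True:
        copied, after = execute(copy_batch_sql(src, dst, key, after, limit))
        total += copied
        if copied < limit:      # short page: the source is exhausted
            return total

# Stub executor standing in for a live connection, to show the loop shape.
state = {"left": 23000, "last": 0}
def fake_execute(sql):
    n = min(10000, state["left"])
    state["left"] -= n
    state["last"] += n
    return n, state["last"]

print(copy_in_batches(fake_execute, "bucket", "bigtest"))  # prints 23000
```

Keyset pagination (WHERE key > last ORDER BY key LIMIT n) is preferable to OFFSET here because each page is an indexed range scan, so later pages don’t get progressively slower.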


After the crash reported above, I was unable to restart the cockroach daemon on the crashed node. Here’s the error it gave me:

* ERROR: [n1,s1] a panic has occurred!
tocommit(29897) is out of range [lastIndex(28814)]. Was the raft log corrupted, truncated, or lost?

I pressed Control-C to stop the other two nodes, and they both failed to stop cleanly. The error message was something like:

^CNote: a second interrupt will skip graceful shutdown and terminate forcefully
initiating graceful shutdown of server
* ERROR: time limit reached, initiating hard shutdown - node may take longer to restart & clients may need to wait for leases to expire
Failed running "start"

After all three nodes were shut down, I was able to start them up using:

/usr/local/bin/cockroach start --insecure --advertise-addr=$(hostname -I | awk '{print $1}') --join=$node1_ip,$node2_ip,$node3_ip --cache=.25 --max-sql-memory=.25 --store=/data/cockroach-data

What you’re reporting is unexpected behavior; I would like to get to the bottom of this.

Is there any other information you can provide me regarding your deployment? Are you using a container?

Can you provide me with a copy of the log files?

You can email it to the support@ address. If the file is too large, I can send you a private link to upload it to me.

Can you confirm that node 1 is the crashed node?


Thanks for the offer of help. I’ve emailed a file to the support@ address.

The node with NodeID=1 is the one that crashes. It is the node that my psql client is communicating with when I issue the problematic query.

I can’t think of anything unusual about the deployment. I’m not using containers, these are three generic VMs that were built specifically for testing new databases, and CockroachDB was the first database I tested on them. For the filesystem underneath the database, I’m using ext4 with the mount options “noatime,barrier=0”.

Hey Mark,

Regarding this error:

* ERROR: [n1,s1] a panic has occurred!
tocommit(29897) is out of range [lastIndex(28814)]. Was the raft log corrupted, truncated, or lost?

You hit this when trying to restart the cockroach daemon on the crashed node.

What flags were you using on the start command? Are you using --logtostderr?


Additionally, the fix for your original issue should be in the next 19.1 release.

Just wanted to give you an update.

Hi All,

I’m on the latest version, v19.1.5, and doing a CTAS is still an issue on a large table. I’m doing the same:

create table history_load_test_itemid_all as select * from history ;

The HISTORY table is 2.1 GiB across 28 ranges (5 columns), and the query still crashes the system.

I was not able to see the CTAS create a JOB for the processing, but when monitoring via the UI you can clearly see how rapidly the memory consumption climbs above the available resources on the server.


Hi Ignacio,

I’ve responded to you on our support portal.