Weird disk usage under constant load

I’ve encountered strange disk usage twice while running load generators against a 3-node cluster of CockroachDB 1.1.2 on CloudStack instances with 10GB disks. The hosts have spinning hard drives.

First encounter:
Running the Yahoo! Cloud Serving Benchmark (YCSB) with 95% reads and 5% writes against node1, it eventually filled that node’s disk and the node shut down. The other nodes still had plenty of disk and kept operating.
In the admin UI the ycsb database looked very modest in size, tens of megabytes.
I manually deleted all files in the data directory of node1 and restarted it. It came back as a new node and data was replicated back to it, filling its disk to about 90%. Now the admin UI shows the size of the ycsb database as 8 gigabytes, spread evenly across all nodes. I deleted the database and the disk space was returned after 25h by compaction.
Decommissioned the old node1.

Command: ./tpcc -drop=true -load=true postgres://root@roach1:26257/tpcc?sslmode=disable

Why did the disks not fill up evenly?
Why did the admin ui show a smaller size than it actually was?
Why did the disk not fill to 100% when node1 was brought back and the data replicated to it? Might data have been lost here?

Second encounter:
Running TPC-H, same setup, doing 100% analytical reads this time. After all test data has been inserted and the benchmark reads start, the disk of node1 fills up quickly: in about 4 hours it grows from 1.2 to 8.2 GB, at which point the node shuts down. The admin UI shows the size of the database as 2.8 GB. Node2 has 2.4 GB of data and node3 has 8.2 GB.
I did the same thing: deleted all data on disk and restarted the node. Data is replicated back to the new node, but this time it ends up with 2.3 GB of data. The admin UI still shows the size of the database as 2.8 GB.

How could the reads increase disk usage?
Is there a way to delete the data for one node but keep its identity? For example, delete all SSTs?

Command: ./tpch -drop=true -load=true postgres://root@roach1:26257/tpch?sslmode=disable

Thanks for the report.

Most of the behavior you have seen is expected, but not all.

“Admin UI shows smaller size” and “reads increase disk usage” are mostly explained by internal data usage. We store our own timeseries for use in the admin UI, which results in some overhead even when the cluster is idle. See our operational FAQ for details.

“Why did the disk not fill to 100%”: the exact disk usage is a bit hard to determine, partially due to our extra data and partially due to rocksdb’s representation. I would definitely not expect a new node to fill up to the level the just-deleted node was at.

“Is there a way to delete the data but keep its identity”: not currently. The only way to delete data is to delete it and wait for the garbage collector (tweakable through zone configs) to catch up. A method bypassing the normal process would be a reasonable candidate for a cockroach debug command, but we do not currently have one.
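For what it’s worth, the garbage-collection interval mentioned here is the zone config’s `ttlseconds` setting. A minimal sketch, assuming the 1.1-era CLI syntax and an insecure cluster; the database name and value are only examples, and lowering the TTL also shrinks the window available to time-travel queries:

```shell
# Example only: set the GC TTL for the ycsb database to 10 minutes
# (the default is 86400s, i.e. 24h). The zone config is YAML read from stdin.
echo 'gc: {ttlseconds: 600}' | ./cockroach zone set ycsb --insecure -f -
```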

“Why did the disks not fill up evenly” is the oddest part of your summary. Your “first encounter” does not mention the disk used by each node, but the second one seems to say: “node 1: full, node 2: 2.4GB, node 3: 8.2GB”. This amount of discrepancy is troubling. Could you provide a few more details? Was this starting from an empty cluster? Is the hardware identical for all three nodes?

Hello, thanks for the quick reply :slight_smile:

It is not a completely fresh cluster. It has been running for some weeks and upgraded from 1.0.0 to 1.1.1 and 1.1.2. I have been running load tests on it a couple of times but mostly it sees no traffic at all. We use it to back our Grafana dashboard. Hardware is identical afaik.

So the first round of YCSB filled up all nodes, but filled node1 slightly more because of internal data, which tipped it over the edge and stopped the test. That explains why the other two nodes survived.
But it is still odd that the size of the database was reported as <100MB until a third node was reintroduced.

Second occasion:
It seems surprising that performing only reads would fill the disk that quickly. I’ve cut and pasted some screenshots from the admin UI showing disk usage for the three nodes from before the first disruption until today.

11/17 Started running ycsb
11/18 node1 crashes because disk is full
11/20 node1 is brought back up
11/21 compaction, then tpc-h starts, disk fills up, node1 crashes
11/22 node1 is brought back up

A three-node cluster with different available space on each node will definitely cause one node to run out of disk sooner. No surprises there, and nothing we can do as long as you stick to the default replication factor of 3.

The < 100MB reported size is a little odd. It would be good to try and reproduce this, comparing the reported sizes for DB/stores/total.

For your second scenario, the tpc-h data is about 5GB on disk per node for a 3 node cluster according to the readme. Just that would get you pretty close to your per-node capacity. The disk used graph doesn’t show an increase after that, just the initial load.

Any idea why the nodes have such different amounts of available disk space? They all have equal disks and only run CockroachDB.

The graphs in the screenshot show the three nodes with the following used/available capacity on 11/16:
roach1: used ~1.5GB, available: ~7GB
roach2 and roach3: used ~1.5GB, available: ~9GB

The “used by cockroach” numbers are the same, but the available disk space is 2GB less on roach1. This is likely due to non-cockroach data on the disk. Could you check the disk for:

  • data outside the cockroach data directory
  • logs (I don’t think we include those in the used space metric)
  • backups/archives (rocksdb has those, but they have to be turned on explicitly so this is unlikely)
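One way to check the items above, as a rough sketch — the store path is an assumption, so point it at your actual `--store` directory (the `mkdir` only exists so the snippet runs anywhere):

```shell
# Compare what lives under the CockroachDB store with what the filesystem reports.
# STORE is a placeholder for the real data directory.
STORE=${STORE:-/tmp/cockroach-data}
mkdir -p "$STORE/logs"
du -sh "$STORE"            # total on-disk footprint, SSTs included
du -sh "$STORE/logs"       # log files, possibly absent from the UI's used-space metric
df -h "$STORE" | tail -1   # what the filesystem itself sees for that volume
```

If `du` on the store and `df` on the volume disagree by gigabytes, the difference is data outside the store directory.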

You are correct that there were some garbage files on the disk of roach1 to begin with. I’ve removed them now, but that is not what I’m worried about. At the end of the graph, roach3 is using ~5GB more than roach1 and roach2. I’ve checked and it is all roach data, mostly sstables. I waited well over 25h for compaction, then took that node down and ran cockroach debug compact. When I did, the storage was freed up.

So that’s all good now, but I am curious about why the data was not compacted automatically.
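For reference, the sequence I ran looked roughly like this — a sketch, with hypothetical host and store path, since `debug compact` needs exclusive access to the store and must run while the node is down:

```shell
# Hypothetical node name and store path; adapt to your deployment.
./cockroach quit --insecure --host=roach3        # stop the node cleanly
./cockroach debug compact /mnt/cockroach-data    # offline full rocksdb compaction
./cockroach start --insecure --store=/mnt/cockroach-data --join=roach1:26257
```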

We don’t force rocksdb compactions, they happen based on rocksdb’s internal logic.
Running cockroach debug compact manually is also not something we particularly recommend.

As for the disk discrepancy between node 1 dying and compactions being forced, it’s really difficult to tell what’s going on from a single week-long graph. For example, did you run compactions on nodes 1 and 2 as well? It kinda looks like it from the graph, but it’s tough to tell.