Storage on multiple disks

I’m in the planning and research phase of using Cockroach. Right now I’m researching what hardware I need. I’m planning to start with 1 TB SSDs on 3 nodes, but obviously data will grow and someday I will need more.

  1. If I have multiple disks per server and I target them using the --store parameter, how does Cockroach work with that? How does it split up the data files, and which files go to which disk?

  2. Can you host specific databases on specific disks, or should I just let CDB do its thing?

  3. Do you recommend getting one large drive, like a 1 TB NVMe, or several smaller 240 GB SSDs? The smaller SSDs are very cheap and would spread I/O across multiple drives.

  4. If a disk becomes full, can another disk be added and included in the --store?

  5. If a disk dies, can I restore the data to a replacement disk and restart CDB?

I hope these are good questions. I searched Google as much as I could but couldn’t find the answers. My data will grow indefinitely, and I just want to make sure I understand how the disks are being used.

Assuming you run at least 3 different crdb nodes, if a disk dies you don’t need to restore the data manually. Just replace the dead disk and restart crdb; once the node rejoins the cluster, it will copy what it needs on its own.

I don’t know enough to answer the other questions.
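As a rough illustration of “it will copy what it needs on its own”, here is a toy Python model. It is not CockroachDB code: each range maps to the set of nodes holding a replica, a wiped disk simply loses its replicas, and a hypothetical `repair_plan` helper lists which surviving replica each missing copy could be streamed from. The replication factor of 3 and all names are assumptions for illustration.

```python
def repair_plan(ranges, live_nodes, rf=3):
    """Toy model, not CockroachDB code: for every under-replicated range, pick
    live nodes that lack a replica (targets) and a surviving replica to copy from."""
    plan = []
    for range_id, holders in sorted(ranges.items()):
        missing = rf - len(holders)
        if missing <= 0:
            continue  # already fully replicated
        if not holders:
            raise RuntimeError(f"{range_id}: no surviving replica to copy from")
        targets = sorted(n for n in live_nodes if n not in holders)
        if missing > len(targets):
            raise RuntimeError(f"{range_id}: not enough nodes to restore {rf} replicas")
        source = min(holders)  # any surviving replica holds the full range data
        for target in targets[:missing]:
            plan.append((range_id, source, target))
    return plan


# Three nodes; n2's disk died and was replaced with an empty one.
ranges = {"r1": {"n1", "n2", "n3"}, "r2": {"n1", "n2", "n3"}}
after_wipe = {r: holders - {"n2"} for r, holders in ranges.items()}
print(repair_plan(after_wipe, live_nodes={"n1", "n2", "n3"}))
# [('r1', 'n1', 'n2'), ('r2', 'n1', 'n2')]
```

The repair only works because the other two nodes still hold full copies of every range, which is why the reply assumes at least 3 nodes.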

You can always use a volume manager. Doing so lets you add disks to a volume group and grow a filesystem beyond the size of any single disk. This is a pretty common practice, since having more spindles spinning for your database can improve performance a lot. Even with SSDs, it can be smart to use multiple adapters so that a single adapter does not become the bottleneck.

Thanks, I’m definitely going to research that. I’m still curious, though, how the --store flag works with multiple disks. I’m hoping someone can clarify that.

The unit of storage in CockroachDB is the range, which can grow to approximately 64 MB (configurable). By default, ranges are balanced across the stores on each node, up to each store’s capacity. So without any configuration, the data will more or less spread across the disks.
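To make “balanced across stores, up to the store capacity” concrete, here is a toy Python placement routine. It is a sketch under simplifying assumptions, not CockroachDB’s allocator (which weighs more signals): each replica goes to the emptiest store that still has room, and a range never gets two replicas on the same node. `Store`, `place_range`, the 1 GB store sizes and the flat 64 MB range size are all hypothetical.

```python
from dataclasses import dataclass, field

RANGE_SIZE_MB = 64  # assumed: every range is a full 64 MB, for simplicity


@dataclass
class Store:
    node: str
    name: str
    capacity_mb: int
    used_mb: int = 0
    ranges: list = field(default_factory=list)

    def fraction_used(self) -> float:
        return self.used_mb / self.capacity_mb


def place_range(range_id, stores, replication_factor=3):
    """Toy placement: each replica goes to the emptiest store with room left,
    and no node ever receives two replicas of the same range."""
    used_nodes = set()
    for _ in range(replication_factor):
        candidates = [
            s for s in stores
            if s.node not in used_nodes
            and s.used_mb + RANGE_SIZE_MB <= s.capacity_mb
        ]
        if not candidates:
            raise RuntimeError(f"no valid store for a replica of {range_id}")
        # Emptiest store first; node/store names only break ties deterministically.
        best = min(candidates, key=lambda s: (s.fraction_used(), s.node, s.name))
        best.used_mb += RANGE_SIZE_MB
        best.ranges.append(range_id)
        used_nodes.add(best.node)


# 3 nodes with two 1 GB stores each (sizes are made up for the example).
stores = [Store(n, d, capacity_mb=1024) for n in ("n1", "n2", "n3") for d in ("A", "B")]
for i in range(6):
    place_range(f"r{i}", stores)
for s in stores:
    print(f"{s.node}:{s.name}", s.ranges)
```

In this toy run each of the six stores ends up with three replicas (store A on every node holds r0, r2 and r4; store B holds r1, r3 and r5). Initial placement spreads data across disks; moving an already-placed replica between two disks of the same node is a separate problem.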

You can definitely let CockroachDB “do its thing”. You can also fine-tune data placement using store attributes and zone configurations.
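Store attributes are attached per store on the command line (for example `--store=path=/mnt/ssd1,attrs=ssd`), and a zone configuration constraint such as `+ssd` requires replicas to live on stores with that attribute, while `-ssd` forbids them. The Python sketch below only mimics that matching logic to show the idea. The paths and the `parse_store_flag` and `matches` helpers are hypothetical, and the exact zone configuration syntax depends on your CockroachDB version, so check the docs for your release.

```python
def parse_store_flag(value):
    """Split a --store value such as 'path=/mnt/ssd1,attrs=ssd:nvme' into
    its path and its set of (colon-separated) attributes."""
    fields = dict(part.split("=", 1) for part in value.split(","))
    attrs = set(fields.get("attrs", "").split(":")) - {""}
    return fields["path"], attrs


def matches(attrs, constraints):
    """'+x' requires attribute x on the store; '-x' forbids it."""
    for c in constraints:
        kind, attr = c[0], c[1:]
        if kind not in ("+", "-"):
            raise ValueError(f"unsupported constraint {c!r}")
        if kind == "+" and attr not in attrs:
            return False
        if kind == "-" and attr in attrs:
            return False
    return True


flags = [
    "path=/mnt/nvme1,attrs=ssd:nvme",
    "path=/mnt/ssd1,attrs=ssd",
    "path=/mnt/hdd1,attrs=hdd",
]
stores = dict(parse_store_flag(f) for f in flags)
print(sorted(p for p, a in stores.items() if matches(a, ["+ssd"])))
# ['/mnt/nvme1', '/mnt/ssd1']
print(sorted(p for p, a in stores.items() if matches(a, ["+ssd", "-nvme"])))
# ['/mnt/ssd1']
```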

Yes, several smaller SSDs will let you spread I/O. However, if write performance is the bottleneck in your application, NVMe will provide more write throughput.

Yes, another disk can be added as an additional store; see above.

Yes. Alternatively, as another reply suggested, you can wipe everything and start a new node; it will repopulate automatically from the data on the other nodes.

I hope this helps. Let us know if you have more questions or comments.


Everything @knz has said is correct, but I have one caveat to add.

Cockroach will not place two replicas of the same piece of data on the same node, even if the two copies would be on separate disks. As a result, rebalancing does not work very well on a 3-node cluster where each node has multiple disks, because there is no good built-in mechanism for moving data from one disk to another on the same node. If the cluster has at least 4 nodes, rebalancing will work more like you expect across both disks and nodes.

This is a limitation of our current implementation that we’d like to improve in the future, but for now it’s an unfortunate fact that a 3-node cluster with multiple disks per node won’t balance data particularly well across the disks. For more details, see this issue:


Thanks, this was extremely helpful.

@a-robinson are you talking about instances where people want to balance the same data on 2 disks?

My use case is not that. I’m just trying to figure out how to buy and maintain hardware that is best for CDB. Does your comment still apply if there is only one copy of the data per node?

@bladefist I’m saying that if your cluster has only 3 nodes and each node has multiple disks, the data will not be balanced very well across all 6+ disks. Clusters with multiple disks per node only balance data well across the disks if there are at least 4 nodes.
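One short way to see the 3-node problem: with 3 replicas per range on a 3-node cluster, every range already has a replica on every node, so the rule “no two replicas of a range on one node” leaves no store that can receive the extra copy a rebalance needs. In this hypothetical Python sketch (the `candidate_stores` helper and store names are illustrative), the filter comes back empty on 3 nodes and offers n4’s stores once a 4th node exists.

```python
def candidate_stores(replica_nodes, stores):
    """Stores that may receive an extra replica of a range: any store on a
    node that does not already hold a replica of that range."""
    return [(node, store) for node, store in stores if node not in replica_nodes]


three_nodes = [(n, s) for n in ("n1", "n2", "n3") for s in ("A", "B")]
four_nodes = three_nodes + [("n4", "A"), ("n4", "B")]
replicas_on = {"n1", "n2", "n3"}  # a range replicated on every node of the 3-node cluster

print(candidate_stores(replicas_on, three_nodes))  # []
print(candidate_stores(replicas_on, four_nodes))   # [('n4', 'A'), ('n4', 'B')]
```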

@a-robinson Can you explain, or link to an explanation of, why 4 is needed? The recommended production setup has 3 nodes. I figured 3 was needed for a quorum. Thanks!

You need 3 nodes for quorum. With only one store per node, that is sufficient and the data will be balanced.

The reason data cannot rebalance well when there are 2+ stores per node and only 3 nodes is the following: to move a range replica from store A to store B on a single node (say, node 1), the cluster must first create a new replica on store B and then remove the replica on store A. Unfortunately, our consensus protocol does not allow CockroachDB to have 2 replicas of the same range on the same node at the same time, so a direct migration between stores is not possible (yet). To rebalance between stores, CockroachDB must instead create the new intermediate copy on a different node, one that doesn’t have the data yet. This is why a 4th node is needed.

For example: say you have 3 nodes n1, n2, n3, each with stores A and B.
Initially range rx is replicated across n1:A, n2:A, n3:A.
Now we want to rebalance from n1:A to n1:B.
It is not possible to create an additional replica on n1:B directly because otherwise at that moment you would have a replica on both n1:A and n1:B, and we don’t support multiple replicas per node yet.
It is not possible to create the additional replica on n2 or n3 either, because they already have a replica.

So instead we add a new node n4. Then the copy of n1:A can be made on n4.
Then the replica on n1:A can be deleted.
Then the copy from n4 can be migrated to n1:B.

Then n4 can be removed if not needed any more.
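The walk-through above can be checked mechanically. The hypothetical Python sketch below runs a breadth-first search over the placements of one range under two rules assumed from this explanation: a rebalance adds a replica before removing one (so the range temporarily has 4 replicas), and no node may ever hold two replicas of the same range. With 3 nodes the search finds no plan; with a 4th node it finds the same four steps described above.

```python
from collections import deque


def plan_move(start, goal, stores, rf=3):
    """Breadth-first search over replica placements of one range.

    Assumed rules: a replica is added only while the range has exactly rf
    replicas and removed only while it has rf + 1 (add before remove), and a
    node may never hold two replicas of the range."""
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps
        if len(state) == rf:
            nodes = {node for node, _ in state}
            moves = [("add", st, state | {st}) for st in stores if st[0] not in nodes]
        else:
            moves = [("remove", st, state - {st}) for st in sorted(state)]
        for action, st, nxt in moves:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [(action, st)]))
    return None  # no legal sequence of moves exists


stores3 = [(n, s) for n in ("n1", "n2", "n3") for s in ("A", "B")]
stores4 = stores3 + [("n4", "A"), ("n4", "B")]
start = {("n1", "A"), ("n2", "A"), ("n3", "A")}
goal = {("n1", "B"), ("n2", "A"), ("n3", "A")}

print(plan_move(start, goal, stores3))  # None
for step in plan_move(start, goal, stores4):
    print(step)
# ('add', ('n4', 'A'))
# ('remove', ('n1', 'A'))
# ('add', ('n1', 'B'))
# ('remove', ('n4', 'A'))
```

The search only models which placements are legal; it says nothing about how the real system copies data or runs consensus for each step.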

Note that this is a limitation of the current protocol, and we may remove it in the future.

Does this clarify?

@knz It does. Perfect. This thread was very helpful, and you guys are great. Thanks for your prompt and detailed answers!