Hardware recommendations and big fat nodes

I’ve read the hardware recommendations and the other posts on this forum about node size, and I still wonder how best to handle a situation where the only storage available is on-premises servers with big JBODs attached. In some cases we’re simply limited to the specific hardware we’re given (e.g. the ops department has limited rackspace and wants to buy hosts with 22 disks attached). It would be nice to know the best way to operate under these conditions, even if it isn’t the ideal recommendation.

The documentation states that individual cockroach nodes should not exceed 4TB of data, and that we shouldn’t run multiple instances on the same physical host, but we clearly can’t satisfy both of those conditions with this kind of hardware.

It looks like putting multiple cockroach instances on a single host can work okay as long as we use locality effectively and include the hostname as the last tier in our locality spec. That should ensure we don’t end up with all the replicas of a range on the same host, right? It seems like a better way to run than trying to have one cockroach instance handle all that data.

Hi Dan,

Unfortunately this sort of deployment is not something cockroach is likely to handle supremely well. Firstly, we’re not tuned for spinning disks, though if you turn the cluster setting rocksdb.min_wal_sync_interval up to a few milliseconds we might do okay. Secondly, we only recently got decent at rebalancing across disks (see #6782).
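
For reference, that would look something like the sketch below; the exact interval is a guess you’d want to experiment with for your disks.

```
# Batch WAL fsyncs so slow disks aren't hammered by tiny syncs.
# The 2ms value is only a starting point to experiment with.
cockroach sql --insecure --host=db-host1 \
  --execute="SET CLUSTER SETTING rocksdb.min_wal_sync_interval = '2ms';"
```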

I have long wanted to see such a deployment work. In particular, I’ve generally assumed that a JBOD like this commonly includes a relatively small SSD, which could be used for the commit log while the larger disks hold the data. That is something I would emphatically like to see, but my guess is we’re about a year away from it. Some of the internals of that idea are discussed in #38322.

A final issue is that disks in JBODs die from time to time, and you’d probably like to be able to swap them out with minimal downtime. A blocker for that is #39415.

Thanks @ajwerner, and sorry, I was actually referring to SSDs, not HDDs, but didn’t specify. That said, it’s worth pointing out that individual SSDs on the order of 10TB+ are not uncommon.

Assuming the disks are SSDs, I’m wondering what issues you see with a deployment where, if a host has N disks, we run N cockroach instances on it, each using a locality like region=...,dc=...,host=db-host1 so that all the replicas for a range don’t land on the same host. It seems like this setup wouldn’t suffer from the multi-disk rebalancing issue (#6782) or from #39415.
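
Concretely, something like the sketch below is what I had in mind for two of the N instances on a host; the locality values, store paths, ports, and join list are just placeholders, and a real deployment would use certificates instead of --insecure.

```
# Instance 1 on db-host1, pinned to the first SSD.
cockroach start \
  --store=/mnt/disk1/cockroach \
  --locality=region=us-east,dc=dc1,host=db-host1 \
  --listen-addr=db-host1:26257 --http-addr=db-host1:8080 \
  --join=db-host1:26257,db-host2:26257,db-host3:26257 \
  --insecure --background

# Instance 2 on the same host, second SSD, different ports.
cockroach start \
  --store=/mnt/disk2/cockroach \
  --locality=region=us-east,dc=dc1,host=db-host1 \
  --listen-addr=db-host1:26258 --http-addr=db-host1:8081 \
  --join=db-host1:26257,db-host2:26257,db-host3:26257 \
  --insecure --background
```

My understanding is that with host as the lowest locality tier, the allocator would prefer to spread a range’s replicas across different hosts before doubling up on one.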

The note about rocksdb.min_wal_sync_interval is really helpful, though, since there may be workloads we need to run on HDDs for now.

Cockroach actually supports running multiple disks per node out of the box; see the first bullet in https://www.cockroachlabs.com/docs/stable/recommended-production-settings.html#topology. That lets the process share a single block cache across all of the stores, which makes better use of memory. We also ensure that no range is ever replicated more than once on a given node, so there’s no need for the locality trick.
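
For reference, a multi-store node looks roughly like the sketch below; the store paths and addresses are placeholders.

```
# One node, one --store flag per disk; replicas are spread across the
# stores, and a range is never replicated twice on the same node.
cockroach start \
  --store=/mnt/disk1/cockroach \
  --store=/mnt/disk2/cockroach \
  --store=/mnt/disk3/cockroach \
  --locality=region=us-east,dc=dc1 \
  --listen-addr=db-host1:26257 --http-addr=db-host1:8080 \
  --join=db-host1:26257,db-host2:26257,db-host3:26257 \
  --insecure --background
```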

In general, the primary problem with more data is that there are more ranges, and ranges require a certain amount of background work. In 20.1 we increased the default range size to an average of 256MiB from 32MiB. As your data volume grows, you may want to consider increasing it a bit further with zone configurations. Another trick is to reduce the rate at which the consistency checker runs. We have upcoming fixes for that in 20.2 (see #49763).
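
Roughly, that tuning would look like the sketch below; the specific sizes and interval are just examples, and it’s worth checking SHOW ALL CLUSTER SETTINGS for what your version exposes around consistency checking.

```
# Raise the default max range size (here to 1 GiB) via the default zone config.
cockroach sql --insecure --host=db-host1 --execute="
  ALTER RANGE default CONFIGURE ZONE USING
    range_min_bytes = 134217728,
    range_max_bytes = 1073741824;"

# Run the consistency checker less often than the default.
cockroach sql --insecure --host=db-host1 \
  --execute="SET CLUSTER SETTING server.consistency_check.interval = '48h';"
```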

I start to worry once you’re running more than ~10-20k ranges per node (depending on core count).
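
If it helps, one way to keep an eye on that is the query below, assuming your version exposes per-store range counts in crdb_internal.kv_store_status.

```
# Per-store range counts; column availability may vary by version.
cockroach sql --insecure --host=db-host1 \
  --execute="SELECT node_id, store_id, range_count FROM crdb_internal.kv_store_status;"
```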

Yeah, I’m aware it can support multiple stores; I just figured you wouldn’t want a single cockroach node handling 100TB+ of data, hence the idea of putting multiple instances on a host.

Thanks for the info; the tips about changing the range size and the consistency checker rate are helpful.

For now I’ll try running a single cockroach instance with multiple --store flags, and experiment with tuning the range size and the consistency checker rate.