Design considerations for hosting thousands of range replicas per node

Hi, I’d like to reason about the choice of small (MB) vs. large (GB) ranges at the per-host level, and it would be great if you could share your thoughts.

Suppose we have a fixed number of hosts. Since CockroachDB has a small range size (64MB), a single host can have thousands of ranges, which form thousands of raft groups with other nodes. With a small range size, range repair and movement are faster, especially if snapshots are streamed over the network. However, what is the performance (throughput) implication of having small ranges in this scenario? Our preliminary experimental results with raft are that the WAL sync is expensive, and running multiple raft groups per node does not improve per-node throughput at all. I’m not sure if you had a similar experience when benchmarking CRDB. If so, can I say that hosting thousands of replicas per node is mostly about reducing the total number of nodes? This affects many important design decisions for me, such as multi-raft.

The biggest concern with range size is that sometimes (e.g. while receiving snapshots) we need to hold an entire range’s data in memory at once, so having ranges that are too big means you have bigger swings in memory consumption. However, we have already converted some of these operations (e.g. sending snapshots) to use a streaming interface, and once we can use streaming interfaces everywhere it will be feasible (and probably faster) to use fewer, larger ranges.

Note that we have batching mechanisms in rocksDBBatch.Commit that allow multiple ranges to be flushed in a single fsync, so multiple ranges shouldn’t compete too much for the same disk I/O resources.
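As a sketch of what this kind of batching looks like in general (hypothetical names, not CockroachDB’s actual implementation): writers from many ranges enqueue their raft-log batches, and a single syncer goroutine makes everything queued durable with one simulated fsync, so the number of fsyncs can be far smaller than the number of per-range writes.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingWrite is one range's raft-log batch waiting to become durable.
type pendingWrite struct {
	done chan struct{}
}

// groupCommitter is a hypothetical sketch of group commit: many ranges
// enqueue writes, and one syncer goroutine flushes everything pending
// with a single (simulated) fsync.
type groupCommitter struct {
	mu     sync.Mutex
	cond   *sync.Cond
	queue  []pendingWrite
	fsyncs int // simulated fsync calls issued by the syncer
	writes int // per-range writes made durable
}

func newGroupCommitter() *groupCommitter {
	g := &groupCommitter{}
	g.cond = sync.NewCond(&g.mu)
	go g.syncLoop()
	return g
}

// commit enqueues one range's write and blocks until it has been fsynced.
func (g *groupCommitter) commit() {
	done := make(chan struct{})
	g.mu.Lock()
	g.queue = append(g.queue, pendingWrite{done: done})
	g.cond.Signal()
	g.mu.Unlock()
	<-done
}

func (g *groupCommitter) syncLoop() {
	for {
		g.mu.Lock()
		for len(g.queue) == 0 {
			g.cond.Wait()
		}
		batch := g.queue
		g.queue = nil
		g.mu.Unlock()

		// One fsync covers every write that queued up while the previous
		// sync was in flight, regardless of which range it came from.
		g.fsyncs++
		g.writes += len(batch)
		for _, w := range batch {
			close(w.done)
		}
	}
}

// run issues one raft-log write per range for n ranges (one goroutine each)
// and reports how many writes became durable and how many fsyncs that took.
func run(n int) (writes, fsyncs int) {
	g := newGroupCommitter()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.commit()
		}()
	}
	wg.Wait()
	return g.writes, g.fsyncs
}

func main() {
	writes, fsyncs := run(1000)
	// writes is always 1000; fsyncs is typically far smaller.
	fmt.Printf("writes=%d fsyncs=%d\n", writes, fsyncs)
}
```

The exact fsync count depends on scheduling, but every write is covered by some fsync, and concurrent writers from different ranges share syncs instead of each paying for their own.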

Suppose a host has 100k ranges and is receiving writes for each range. It seems to me that the batching CRDB does for raft log writes happens within a single range. So it seems like you would have to do 100k fsyncs to write the raft log for these 100k requests?

For example, suppose there are two writes, one for range 1 and one for range 2. Those writes map to two distinct etcd/raft Nodes. When you do the batch commit I linked to, those will be separate invocations: one for the batch with the range 1 prefix and one for the batch with the range 2 prefix. If this is true, how do you avoid the performance hit of all these ranges invoking fsync independently of each other?



As an experiment, I did the following:

echo "CREATE DATABASE benchmark;" | ./cockroach sql
echo "CREATE TABLE kv_store (key BYTES, value BYTES, PRIMARY KEY(key));" | ./cockroach sql
for i in $(seq 0 255); do
  tmpi=$(printf '%02x' $i)
  for j in $(seq 0 255); do
    tmpj=$(printf '%02x' $j)
    echo "ALTER TABLE benchmark.kv_store SPLIT AT VALUES (b'\x$tmpi\x$tmpj');" | ./cockroach sql
  done
done

I then did a lot of inserts of keys drawn from a uniform distribution, and I noticed a throughput drop after a certain point (it looks like throughput went down by more than 80% after creating 40k ranges). I would expect that introducing 2^16 ranges under a random key distribution would create the fsync problem I alluded to. Can you explain how you are able to batch raft log commits with fsync across different etcd/raft Node instances?

Also, I should note that I explicitly set:

SET CLUSTER SETTING kv.raft_log.synchronize = true;

It’s only with this setting that I would expect much worse performance with 2^16 ranges.

Here’s a graph showing how throughput decreases as a function of the number of ranges. Originally the QPS is 2700/s with 10 ranges; with 2^16 ranges, throughput has decreased to 560/s, a drop of ~80%.

This code in rocksDBBatch.Commit batches writes across ranges.

For clarity: rd.Entries will only contain the entries within a single replicated range (a.k.a. split), right, since you are using one etcd rawNode per replicated range? So each new rocksDBBatch only contains entries from a single split?

Can you point me to the code where it’s clear that rocksDBBatch actually picks up entries to fsync from other splits? Looking at the code, it seems like each batch deals only with entries in a single split, rather than fsyncing across splits.

Each rocksDBBatch contains data from a single range, but in the code I linked above, the actual commit to disk is coordinated through r.parent, which is shared across all ranges on the store.
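In other words, the structure is roughly the following (a simplified sketch with hypothetical names, standing in for the rocksDBBatch/r.parent relationship): each batch belongs to exactly one range, but Commit only hands its data off to a parent engine shared by every replica on the store, and it is the parent’s sync that pays for the fsync, once, covering every range’s batch handed off in that window.

```go
package main

import "fmt"

// parentEngine plays the role of the shared parent (r.parent in the linked
// code): every range's batch on a store hands its data here, and durability
// is achieved with a single simulated fsync for all of them.
type parentEngine struct {
	pending []string // data handed off by per-range batches
	fsyncs  int      // simulated fsync calls
}

// sync makes everything handed off so far durable with one fsync.
func (p *parentEngine) sync() {
	if len(p.pending) == 0 {
		return
	}
	p.fsyncs++ // one fsync covers batches from many ranges
	p.pending = nil
}

// rangeBatch holds writes for exactly one range.
type rangeBatch struct {
	rangeID int
	parent  *parentEngine // shared across all ranges on the store
	writes  []string
}

func (b *rangeBatch) put(key string) { b.writes = append(b.writes, key) }

// Commit hands the batch to the shared parent; durability is deferred to
// the parent's next sync, so many ranges end up sharing it.
func (b *rangeBatch) Commit() {
	b.parent.pending = append(b.parent.pending, b.writes...)
}

func main() {
	store := &parentEngine{}
	b1 := &rangeBatch{rangeID: 1, parent: store}
	b2 := &rangeBatch{rangeID: 2, parent: store}
	b1.put("/range1/key")
	b2.put("/range2/key")
	b1.Commit()
	b2.Commit()
	store.sync() // both ranges' batches become durable with a single fsync
	fmt.Printf("fsyncs=%d\n", store.fsyncs)
}
```

So even though each batch is per-range, the fsync boundary is per-store, which is what keeps 100k ranges from turning into 100k independent fsyncs.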