Per-split replication design choice

I am curious why CRDB chose to assign a Raft cluster to every split (multi-Raft) when exploring the replication design space. The major downside is that the network and storage layers must be tightly coupled to the Raft implementation for replication efficiency and performance.

An alternative design would be to allow a Raft cluster to host many splits. If you split a range into two subsplits, the resulting splits could still be replicated by the same Raft instance. One would need to add a protocol that moves splits among Raft clusters.
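To make the two designs concrete, here is a minimal sketch of the bookkeeping only. All names (SplitID, GroupID, Placement, and the functions) are hypothetical and not taken from CRDB. MoveSplit stands in for the extra protocol mentioned above and is reduced to a map update, which hides all of the real difficulty.

```go
package main

import "fmt"

// SplitID and GroupID are hypothetical identifiers used only in this sketch.
type SplitID string
type GroupID int

// Placement records which Raft group replicates each split.
type Placement map[SplitID]GroupID

// SplitMultiRaft models multi-Raft: the new split gets its own Raft group.
func SplitMultiRaft(p Placement, child SplitID, newGroup GroupID) {
	p[child] = newGroup
}

// SplitShared models the alternative: the new split stays in its parent's group.
func SplitShared(p Placement, parent, child SplitID) {
	p[child] = p[parent]
}

// MoveSplit reassigns a split to another group. A real system would need a
// protocol here; this sketch only updates the map.
func MoveSplit(p Placement, s SplitID, to GroupID) {
	p[s] = to
}

// GroupCount returns the number of distinct Raft groups in use.
func GroupCount(p Placement) int {
	seen := map[GroupID]bool{}
	for _, g := range p {
		seen[g] = true
	}
	return len(seen)
}

func main() {
	multi := Placement{"a-m": 1}
	SplitMultiRaft(multi, "m-z", 2)

	shared := Placement{"a-m": 1}
	SplitShared(shared, "a-m", "m-z")

	fmt.Println("multi-Raft groups:", GroupCount(multi))
	fmt.Println("shared-group groups:", GroupCount(shared))
}
```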

What is the main benefit of multi-Raft? I guess the common case is that ranges are split because of their size. With multi-Raft, after a split you have two Raft clusters, and perhaps you then use Raft's add and remove configuration operations to migrate those clusters to other nodes. Multi-Split-Raft (the alternative described above) would separate replication from data organization, which seems to provide more flexibility.

A protocol to move splits among Raft clusters would be quite tricky to design and implement. Do you have a design in mind for this? All the approaches that come to my mind would probably result in exactly the kind of tight coupling between network and storage layers that you’re trying to avoid here.

Splitting and merging are already the trickiest parts of the system because they involve interactions between multiple Raft groups. You could take that complexity out of the split operation itself, but it still has to go somewhere. With splits, we have the advantage that we’re creating a new raft group out of thin air, which simplifies things quite a bit. If we split ranges in place and then later moved them into other existing raft groups, I think that would be more difficult.

What is the benefit of splitting a range while leaving it in the same raft group? The main reason we split is to reduce the size of the raft group because that is the unit of replication and recovery (so it controls the size of snapshots that must be sent and the log that must be replayed). If you’re willing to let one raft group grow larger, you can just increase the maximum range size and not split it at all. We plan to increase the maximum range size over time as we move towards streaming interfaces as much as possible and reduce the burstiness of memory usage when ranges are large.
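The trade-off between group count and per-group size can be sketched with simple arithmetic. The function names and byte sizes below are illustrative assumptions, not CockroachDB defaults, and the sketch assumes every range is filled to the limit, so the count is a lower bound.

```go
package main

import "fmt"

// RangeCount returns the minimum number of ranges (and so Raft groups)
// needed to hold totalBytes when each range is capped at maxRangeBytes.
func RangeCount(totalBytes, maxRangeBytes int64) int64 {
	if totalBytes <= 0 {
		return 0
	}
	return (totalBytes + maxRangeBytes - 1) / maxRangeBytes // ceiling division
}

// WorstCaseSnapshotBytes returns an upper bound on the data a single
// snapshot must carry: one full range, or all the data if there is less.
func WorstCaseSnapshotBytes(totalBytes, maxRangeBytes int64) int64 {
	if totalBytes < maxRangeBytes {
		return totalBytes
	}
	return maxRangeBytes
}

func main() {
	const MiB = int64(1) << 20
	total := 10 * 1024 * MiB // 10 GiB of data, chosen only for illustration

	for _, limit := range []int64{64 * MiB, 512 * MiB} {
		fmt.Printf("limit=%d MiB groups>=%d snapshot<=%d MiB\n",
			limit/MiB,
			RangeCount(total, limit),
			WorstCaseSnapshotBytes(total, limit)/MiB)
	}
}
```

Raising the limit reduces the number of Raft groups but raises the amount of data a single snapshot may have to carry, which is the cost referred to above.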