Multi-raft as a standalone module/library

Curious how hard it might be to run multi-raft as a standalone module? I realize that approach was discarded at one point. I'm interested in building a simple key-value service and would like to use multi-raft as a library.

How hard would it be to then take multi-raft and add splitting? My understanding is that CRDB does not yet support merging - how long before that is added?

We’ve discovered that multi-raft requires too much integration with the surrounding application to be easily implementable as a library. For example, splitting is handled almost entirely on the cockroach side and there’s not much that could be packaged into a multiraft library. We have no plans to factor this out into a more shareable form, although the underlying single-group raft implementation is available as etcd/raft.RawNode (and we do have plans to help factor this out of the etcd codebase).
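For anyone who wants to experiment with that single-group building block, here's a minimal sketch of driving a `RawNode` by hand. It assumes the `go.etcd.io/etcd/raft/v3` module; the snapshot-based bootstrap and some signatures differ between etcd versions, and the `"hello"` proposal is just a stand-in for whatever your k-v service would propose:

```go
package main

import (
	"log"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
	// In-memory log storage for a single raft group.
	storage := raft.NewMemoryStorage()
	// Seed membership so node 1 is the only voter (single-node group).
	if err := storage.ApplySnapshot(raftpb.Snapshot{
		Metadata: raftpb.SnapshotMetadata{
			Index:     1,
			Term:      1,
			ConfState: raftpb.ConfState{Voters: []uint64{1}},
		},
	}); err != nil {
		log.Fatal(err)
	}

	rn, err := raft.NewRawNode(&raft.Config{
		ID:              1,
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	})
	if err != nil {
		log.Fatal(err)
	}

	// The caller drives everything: ticks, proposals, and the Ready loop.
	rn.Campaign()               // single voter, so this wins immediately
	rn.Propose([]byte("hello")) // would come from your k-v service

	for rn.HasReady() {
		rd := rn.Ready()
		// Persist entries and hard state before sending messages in a real system.
		storage.Append(rd.Entries)
		if !raft.IsEmptyHardState(rd.HardState) {
			storage.SetHardState(rd.HardState)
		}
		// rd.Messages would go to peers over your transport and
		// rd.CommittedEntries would be applied to your state machine;
		// both are no-ops in this toy example.
		rn.Advance(rd)
	}
}
```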

Merging is halfway implemented: we can merge two ranges that live on the same nodes, but we don't yet have the integration with the rebalancing system to move two adjacent ranges onto the same nodes and keep them there long enough for the merge.

Thanks Ben,
At a high level, how is the etcd/raft implementation used in the CRDB multi-raft implementation? It looks like you maintain a RawNode for every key range? You mention that only three goroutines are used to drive all these state machines forward - can you point me to them? Lastly, what does the log look like? I assume you do not have 100k logs if the node has 100k ranges (and 100k RawNodes).

Thanks for your help

Yes, we have a RawNode for every key range. It’s not exactly three goroutines anymore. There’s a pool of workers in storage/scheduler.go and two more in the Store. There are 100k logs all multiplexed into a single RocksDB instance (see append in storage/replica_raftstorage.go).
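As a rough illustration of that shape (a toy sketch, not CockroachDB's actual storage/scheduler.go): keep a map of RawNodes keyed by range ID, and have a small, fixed pool of workers pull "this range has work" notifications off a channel and run the Ready loop for just that range. The `Store`, `workCh`, and `enqueue` names here are invented, and the sketch glosses over details like deduplicating notifications and making sure only one worker touches a given range at a time:

```go
// Illustrative only: a toy multi-raft scheduler shape.
package multiraft

import (
	"sync"

	"go.etcd.io/etcd/raft/v3"
)

type Store struct {
	mu       sync.Mutex
	replicas map[uint64]*raft.RawNode // one RawNode per range, keyed by range ID
	workCh   chan uint64              // range IDs that have pending raft work
}

// enqueue marks a range as needing processing (tick, proposal, incoming message).
func (s *Store) enqueue(rangeID uint64) {
	select {
	case s.workCh <- rangeID:
	default: // channel full; a real scheduler tracks pending ranges instead of dropping
	}
}

// worker is one of a small, fixed pool of goroutines that drive all RawNodes.
func (s *Store) worker() {
	for rangeID := range s.workCh {
		s.mu.Lock()
		rn := s.replicas[rangeID]
		s.mu.Unlock()
		if rn == nil || !rn.HasReady() {
			continue
		}
		rd := rn.Ready()
		// Here a real implementation would persist rd.Entries/rd.HardState to the
		// shared log storage, send rd.Messages, and apply rd.CommittedEntries.
		rn.Advance(rd)
	}
}

// Start launches the worker pool; numWorkers stays small regardless of range count.
func (s *Store) Start(numWorkers int) {
	for i := 0; i < numWorkers; i++ {
		go s.worker()
	}
}
```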

In general, the main places to start if you’re interested in raft stuff are Replica.handleRaftReady, replica_raftstorage.go, the goroutines linked above, and Store.getOrCreateReplicaLocked.

Hi Ben,
Is this the correct venue to ask these questions - would it make more sense to use gitter.im?

Do I understand you correctly that you are using the RocksDB commit log to do the fsync to disk for the actual raft log? So in RocksDB, I guess the initial key prefix encodes the raft group and there is a descending stream of keys (with the timestamp being the subsequent key part)? Is the underlying raft state machine the same RocksDB instance or a different one? Sorry if I totally missed the architecture here - let me know if there are any pictures that show the big picture.

With respect to scaling beyond 64 nodes - what are you targeting in your spring release? How big a database will you support? Does multi-raft messaging end up being the bottleneck of the system?

Thanks,

M

Yes, this is the best place for this discussion.

You’re right that we’re using RocksDB to do the fsync for the raft log, and we encode the keys as you’ve described (but with a logical log index instead of a timestamp). Each range is a raft “state machine”, and they are all stored in the same RocksDB instance (one per store/disk).
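To make the key scheme concrete, here's a sketch of the kind of encoding that implies (illustrative only, not CockroachDB's actual key layout; the `0x01` prefix byte and the `raftLogKey` helper are made up): a per-range prefix followed by a big-endian log index, so all of a range's entries sort together and in log order inside the shared RocksDB instance, and can be scanned with a single prefix iteration.

```go
package raftlog

import "encoding/binary"

// raftLogKey builds an illustrative key for one raft log entry:
// a fixed prefix byte, then the range ID, then the log index, both
// big-endian so keys for a range are contiguous and ordered by index.
func raftLogKey(rangeID, index uint64) []byte {
	key := make([]byte, 1+8+8)
	key[0] = 0x01 // hypothetical "raft log" key prefix
	binary.BigEndian.PutUint64(key[1:9], rangeID)
	binary.BigEndian.PutUint64(key[9:17], index)
	return key
}
```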

We’re targeting up to 64 nodes in the spring release. I’m not sure what that translates to in size. Between coalesced heartbeats and quiesced ranges, raft-related traffic is not a scaling bottleneck at this point, although the raft commit is the limiting factor on latency.

Wanted to follow up on this thread. It looks like CRDB will be moving the raft log out of RocksDB soon, according to this ticket. What’s the plan for the data structure that will be used for the raft log? Will it be a new RocksDB instance, as suggested by smetro? It seems like it can’t just be a simple log file due to the range-centric raft ordering, and it can’t be many log files either.

Thanks,
S

We don’t know yet. We will certainly need to do some sort of cross-range batching of the log writes, but we haven’t settled on a design yet. There will be an RFC when we’ve got something to discuss, so following that issue is the best way to stay up to date.
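Purely as an illustration of what “cross-range batching” could mean (not a CockroachDB design, since as noted above none exists yet): collect the pending appends from many ranges, write them into one batch under their per-range keys, and pay for a single fsync covering all of them. The `pendingAppend` and `batchWriter` types are invented, and `raftLogKey` is the hypothetical helper from the earlier sketch.

```go
package raftlog

// pendingAppend is one range's slice of new log entries waiting to be persisted.
type pendingAppend struct {
	rangeID    uint64
	firstIndex uint64   // log index of entries[0]
	entries    [][]byte // encoded raft entries
}

// batchWriter is a hypothetical interface over whatever log store gets chosen.
type batchWriter interface {
	Put(key, value []byte) error
	Commit(sync bool) error // one fsync covering everything in the batch
}

// flush writes the appends from many ranges as a single synced batch, so the
// cost of the fsync is amortized across ranges instead of paid per range.
func flush(b batchWriter, appends []pendingAppend) error {
	for _, pa := range appends {
		for i, ent := range pa.entries {
			key := raftLogKey(pa.rangeID, pa.firstIndex+uint64(i))
			if err := b.Put(key, ent); err != nil {
				return err
			}
		}
	}
	// One synced commit covers every range's entries in this batch.
	return b.Commit(true)
}
```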