Scaling questions with respect to replication


  1. A node in CRDB can hold many (100k+) ranges, so it seems common that any two nodes in a CRDB cluster will both be members of at least one common Raft range. Thus the per-node communication overhead for replication is at least (1/heartbeat_interval_ms) x (NumberOfNodes-1), i.e. the cluster-wide total is the sum over all nodes of (1/heartbeat_interval_ms) x (NumberOfNodes-1). And this assumes the MultiRaft batching - otherwise the overhead would be (1/heartbeat_interval_ms) x NumberOfRanges?

  2. How does the auto-rebalancing algorithm work? Is there a preference for trying to reduce the communication overhead mentioned above? What tradeoffs are taken into consideration for rebalancing optimization?

  3. Couldn’t a transaction that spans many ranges end up creating O(NumberOfNodes) overhead, per the first question? It seems possible that I could be interested in some dataset such that the Raft leaders that own that dataset include every node in the cluster?

  4. What’s the fundamental scaling bottleneck that keeps CRDB from scaling past 64 nodes? Is it the gossip mechanism (seems unlikely)? Is it multi-raft with O(NumberOfNodes) overhead (seems unlikely)? Is it the possibility of pathological behavior from my hypothetical third point?
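To put rough numbers on the batching question in #1, here is a back-of-the-envelope sketch. The function and parameter names, the 500ms interval, and the 3x replication factor are my own illustrative assumptions, not CRDB's actual configuration:

```python
def heartbeats_per_sec(num_nodes, num_ranges, replication_factor,
                       heartbeat_interval_ms, multiraft_batching):
    """Rough count of cluster-wide heartbeat messages per second.

    With MultiRaft-style batching, each node sends one coalesced
    heartbeat per peer per interval; without it, every range leader
    heartbeats each of its followers independently.
    """
    per_interval = 1000.0 / heartbeat_interval_ms
    if multiraft_batching:
        # each of N nodes heartbeats its N-1 peers
        return num_nodes * (num_nodes - 1) * per_interval
    # each range leader heartbeats its (replication_factor - 1) followers
    return num_ranges * (replication_factor - 1) * per_interval

# 64 nodes, 100k ranges, 3x replication, 500ms heartbeat interval
batched = heartbeats_per_sec(64, 100_000, 3, 500, True)      # 8,064/s
unbatched = heartbeats_per_sec(64, 100_000, 3, 500, False)   # 400,000/s
```

The gap (roughly 8k vs. 400k messages per second in this toy configuration) is why per-range heartbeats don't scale and batching per node pair matters.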



  1. We have several more layers of optimizations now. Quiescent ranges mean that we no longer send heartbeats at all for most ranges, and epoch-based range leases limit the amount of traffic to maintain range leases and detect failures.

  2. That’s a big question; the best docs we have on this are the rebalancing v2 RFC and the more recent leaseholder locality RFC. This will be an active area of development for the foreseeable future. We do not currently attempt to minimize the heartbeat overhead of #1 via replica placement; we assume that each node will be connected to every other.

  3. Transactions have costs proportional to their size, and transactions that touch many nodes are more expensive than transactions of the same size that touch fewer nodes. But none of these costs are directly related to the heartbeat traffic discussed in #1.

  4. We don’t believe there are fundamental scaling bottlenecks that will keep CRDB from scaling past 64 nodes. However, our testing with larger clusters has been limited, so until we have tested more at that scale we have chosen to recommend that CRDB be used in clusters smaller than 64 nodes. Once we achieved stability with 64 nodes, we decided it was better to improve performance for 64-node (and smaller) clusters than to keep pushing forward on the number of nodes.
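To illustrate the quiescence idea from answer #1: a range can stop ticking its Raft group when there is no in-flight work and every follower has acknowledged the leader's latest log entry. The sketch below is a hedged illustration of that predicate only; the types and field names are my own, not CRDB's actual data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Follower:
    match_index: int  # highest log index this follower has acked

@dataclass
class Leader:
    last_index: int                                    # last entry in the leader's log
    pending_proposals: list = field(default_factory=list)
    followers: list = field(default_factory=list)

def can_quiesce(leader: Leader) -> bool:
    """A range may stop sending heartbeats when nothing is in flight
    and all followers have caught up to the leader's last entry."""
    return (not leader.pending_proposals
            and all(f.match_index == leader.last_index
                    for f in leader.followers))
```

A quiesced range wakes back up on the next write proposal, so the steady-state heartbeat cost tracks the number of *active* ranges rather than the total range count.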


Thanks Ben,
One follow up question:

  1. I imagine all CRDB nodes have to store the same information with respect to schema, data placement, and other cluster-wide state? That design would seem to bottleneck the database at the capacity of a single node. Are there any such single-node capacity constraints in CRDB that affect its ability to scale? E.g. can I have at most 100 CRDB tables, or at most 1M distinct key ranges, or at most 10k CRDB nodes, due to such limitations?



The “data placement” information has been designed to scale. Not all nodes need to store all of it; the source of truth for data placement is the database itself: there’s a two-level index containing “RangeDescriptors”. A node only caches in memory the descriptors it has used recently and the descriptors of ranges for which it’s currently storing a replica.
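As an illustration of that caching scheme, a node-side descriptor cache might look something like the sketch below. The class, method, and field names are my own, not CRDB's actual implementation; `meta_lookup` stands in for consulting the two-level meta index stored in the database itself:

```python
import bisect

class RangeDescriptorCache:
    """Sketch of a per-node cache of range descriptors.

    Descriptors are kept sorted by (exclusive) end key; a lookup
    binary-searches for the first descriptor whose span could cover
    the key, and a miss falls back to the meta index.
    """

    def __init__(self, meta_lookup):
        self.meta_lookup = meta_lookup  # consults the two-level meta index
        self.end_keys = []              # sorted end keys of cached spans
        self.descriptors = {}           # end_key -> (start_key, end_key, replicas)

    def lookup(self, key):
        # Find the first cached descriptor with end_key > key.
        i = bisect.bisect_right(self.end_keys, key)
        if i < len(self.end_keys):
            start, end, replicas = self.descriptors[self.end_keys[i]]
            if start <= key < end:
                return (start, end, replicas)  # cache hit
        # Cache miss: consult the meta ranges and cache the result.
        desc = self.meta_lookup(key)
        _, end, _ = desc
        bisect.insort(self.end_keys, end)
        self.descriptors[end] = desc
        return desc
```

The point of the design is that a node's memory tracks the descriptors it actually uses, not the full key space, so total cluster metadata is not bounded by one node's RAM.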

With respect to schema information, that’s currently a grayer area. A node will generally keep in memory “TableDescriptors” for all the tables it has ever accessed. This cache is sometimes cleared on schema changes and similar events, but otherwise doesn’t expire. So there is some limit there that we have yet to quantify, but it surely isn’t 100 tables. I wouldn’t worry about it unless you’re planning on having extreme numbers of tables - e.g. running some sort of multi-tenant system where tables provide namespacing between gazillions of tenants.

Beyond this, I don’t know of any per-node in-memory state that needs to scale with the amount of data. There are things that scale with the number of nodes, but we’re far from the point where a node’s ability to track the other n-1 is our limitation. Other types of scaling bottlenecks, besides memory bloat, will surely show up as we run larger and larger clusters.