[Bootstrapping] minimum number of nodes

Hi
Is there a way to define a minimum number of nodes to enable read/write to a roach cluster?

We are designing automatic bootstrapping of a cockroach cluster in our system. We have one roach node on each machine.
One of the failure scenarios we thought of:

  1. node 1 bootstraps, sees that no other roach is up, grabs a Consul lock, and starts a roach process (without a join command).
  2. Some data is written to node 1 (no other roach has booted yet).
  3. node 1 dies.
  4. node 2 boots up, sees that no other roach is up, grabs the Consul lock (which is now free because of the lock health check), and starts a roach process (without a join command).
  5. other nodes boot up and join node 2.
  6. node 1 restarts, sees that there are other nodes, tries to join, and fails because of data corruption.

In this scenario we have data loss (the data that was written to node 1) and a node that can’t run a roach process (node 1).

A way to solve this is to enforce a minimum number of nodes in a cluster before enabling read/write operations. This would ensure that we can survive (N-1) node failures before we have data loss.

Do you have any other way to solve this?
Any general thoughts about it?

Cheers!


The recommended way is to start all nodes with the same --join flags. They will sit and wait, because no cluster exists until you run the cockroach init command (see the manual deployment docs for an example) against one of them.

The init command tells one of the nodes to initialize a new cluster. As long as that node is in the --join flags, other nodes talking to it will join the cluster.
After that, no new cluster will be created, nodes will always try to join an existing one.
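
For illustration, here is a minimal sketch of that flow, assuming three machines reachable as roach1, roach2, and roach3 and an insecure cluster (the hostnames and store path are placeholders):

  # On each of the three machines, start a node with the same --join list.
  # None of them will serve anything yet, because no cluster exists.
  cockroach start --insecure \
    --join=roach1:26257,roach2:26257,roach3:26257 \
    --store=/mnt/data/cockroach

  # Run exactly once, against any node in the --join list.
  # This creates the cluster; the waiting nodes then join it.
  cockroach init --insecure --host=roach1:26257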

In the scenario you described, you would ideally have started some number of nodes and then run the init command.
After that, if all nodes are down and node 1 starts, either:

  • it was previously part of the cluster: in this case, it will have to wait for other nodes to be up before serving anything
  • it was never part of the cluster: it will keep pinging the nodes in its --join flag until one responds

In all scenarios, cockroach requires a quorum of live nodes to be able to serve data. With the default replication factor of three, this means that for every range (the unit of replication in Cockroach), you need two of its replicas to be alive.
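
If it helps, one way to watch this from the outside is the node status command, which can report per-node range counts, including under-replicated and unavailable ranges (a sketch, assuming the insecure cluster above and a version where node status supports --ranges):

  # One row per node; with --ranges the output includes counts of
  # under-replicated and unavailable ranges. Zero unavailable ranges means
  # every range still has a quorum of live replicas.
  cockroach node status --ranges --insecure --host=roach1:26257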

WOW, thank you so much for the quick reply!

Are you assuming that the “init” node is constant? What happens if my system is dynamic and there is no constant “init” node?
It is bad for us, since the “init” node might fail and then we won’t have any roach cluster… and this would require manual intervention.

About the quorum: the problem is that before any other node joins, while we have only one node in the cluster, that single node is considered a quorum and we can write data to it.

Once a cluster is initialized, no node is special. You run the init command only once in the entire lifetime of a cluster; it just tells a node (any node) to create the new cluster. You should then have enough nodes in the cluster to ensure that your data survives. The minimal recommended cluster is 3 nodes, in which case you can lose no more than one.

If you start with a single node, initialize it, then take it down, your cluster is obviously gone. Any other node will have nothing to connect to. The options then are to either 1) wait for the cluster to come back up or 2) create a whole new cluster.

A single node is a quorum because there is only 1 replica. The moment you add a second node, you’ll have two replicas and you need both to be up. When you add a third node, you’ll finally reach the desired replication factor of three and you can lose one node.

Yes, but there will still be a period of time during which data loss is possible, assuming that a node that went down might not come back.

Do you really think that adding a flag for a minimum number of nodes is not useful?

Data loss is possible if you lose your only node. It’s really up to you to make sure that you start a three-node cluster. Once your data has been replicated to three nodes, it can’t go back down.

Having a hard limit on the minimum number of nodes would need special-casing to start a cluster, meaning we would still have to allow data starting at 1 replica and being up-replicated to 3.

I see,

Thank you so much!

You guys rock! :slight_smile: