Is there a way to define a minimum number of nodes to enable read/write to a roach cluster?
We are designing automatic bootstrapping of a cockroach cluster in our system. We have one roach node on each machine in our system.
One of the failure scenarios we thought of:
- node 1 bootstraps, sees that no other roach is up, grabs a consul lock, and starts a roach process (without join command).
- Some data is written to node 1 (no other roach was booted yet)
- node 1 dies.
- node 2 boots up, sees that no other roach is up, grabs the consul lock (that is now free because of the lock healthcheck), and starts a roach process (without join command)
- other nodes boot up, join node 2.
- node 1 restarts, sees that there are other nodes, tries to join, and fails because of data corruption.
In this scenario we have data loss (the data that was written to node 1) and a node that can’t run a roach process (node 1).
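To make the scenario concrete, here is a rough Python sketch of the boot logic described above. Everything in it is hypothetical: `Node`, `boot`, and `run_scenario` are illustration names, and the `cluster_id` field is only a stand-in for whatever makes node 1's old data incompatible with the cluster that node 2 started. It does not call consul or cockroach at all.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    alive: bool = False
    cluster_id: Optional[str] = None  # assumed to survive restarts (on-disk state)


def boot(node: Node, nodes: List[Node]) -> str:
    """Hypothetical boot logic: bootstrap if no peer is up, otherwise join."""
    peers = [n for n in nodes if n is not node and n.alive]
    node.alive = True
    if not peers:
        # No peer visible: take the (free) lock and start without join.
        if node.cluster_id is None:
            node.cluster_id = uuid.uuid4().hex
        return "bootstrapped"
    target = peers[0].cluster_id
    if node.cluster_id not in (None, target):
        # The node already holds data that belongs to a different cluster.
        node.alive = False
        return "join failed"
    node.cluster_id = target
    return "joined"


def run_scenario() -> List[str]:
    n1, n2, n3 = Node("node1"), Node("node2"), Node("node3")
    nodes = [n1, n2, n3]
    log = [boot(n1, nodes)]      # node 1 bootstraps alone, data is written
    n1.alive = False             # node 1 dies, the lock becomes free
    log.append(boot(n2, nodes))  # node 2 sees nobody and bootstraps again
    log.append(boot(n3, nodes))  # other nodes join node 2
    log.append(boot(n1, nodes))  # node 1 restarts with its old data
    return log


print(run_scenario())
```

The point of the sketch is that nothing in the "no peer is up, so bootstrap" rule distinguishes a first boot from a boot after the original node died.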
One way to solve this is to enforce a minimum number of nodes (N) in the cluster before enabling read/write operations. This would ensure that we can survive (N-1) node failures before we have data loss.
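The kind of gate we have in mind looks roughly like the sketch below. This is only our proposed check, not an existing cockroach feature; `may_serve` and `min_nodes` are hypothetical names, and how the list of live nodes is obtained (for example from consul) is left out.

```python
from typing import Iterable


def may_serve(live_nodes: Iterable[str], min_nodes: int) -> bool:
    """Return True once at least min_nodes distinct nodes are up."""
    if min_nodes < 1:
        raise ValueError("min_nodes must be at least 1")
    # Deduplicate so a node reported twice is not counted twice.
    return len(set(live_nodes)) >= min_nodes


# With a minimum of 3, a lone node would refuse reads and writes.
print(may_serve(["node1"], 3))
print(may_serve(["node1", "node2", "node3"], 3))
```

With such a check in place, step 2 of the scenario (data written to node 1 while it is alone) could not happen.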
Do you have any other way to solve this?
Any general thoughts about it?