What happens when we init a cluster of only 2 nodes?

I’m putting together a roach cluster on Raspberry Pis.

I managed to fry one of my Pis while tackling another project over Christmas, so I only have two for the cluster at the moment.

I started cockroach on them both and issued an init command from my laptop.

Everything went without error and appears to be up and running nicely.

But the UI shows 20 under-replicated ranges. When I start creating a database and filling it up on one, that increases to 23.

Both nodes show up as healthy but there is no replication going on.

Is this normal for 2 nodes?

CockroachDB is a new thing for me, so this is just a hunch. Check out the replication factor, which defaults to 3 replicas: https://www.cockroachlabs.com/docs/stable/demo-data-replication.html#step-5-increase-the-replication-factor
Try changing it to 2 (instructions in the link above at ‘Step 5: Increase the replication factor’) and see if this helps.

Hey @heater,

Replicating data consistently two ways is worse for availability than leaving it unreplicated. You can explicitly specify that you want two replicas (what Vlad suggested), in which case it’ll still do it, or you can add a third node.



Ah, thanks.

I only want to see 2 nodes working as a cluster because I blew up my third Pi the other day. As I’m deep in the forest 600 km east of Helsinki for the winter break, I can’t replace it quickly.

I’ll try the replication factor of 2.

Hmm… the command to set num_replicas as given in the docs does not work:

$ echo 'num_replicas: 2' | cockroach zone set .default --insecure -f -
Error: unable to connect or connection lost.

Please check the address and credentials such as certificates (if attempting to
communicate with a secure cluster).

dial tcp [::1]:26257: getsockopt: connection refused
Failed running "zone"

If I add a host to connect to, it refuses:

$ echo 'num_replicas: 2' | cockroach zone set .default --insecure --host= -f -
Error: pq: could not validate zone config: at least 3 replicas are required for multi-replica configurations
Failed running "zone"

Two-replica configurations are not allowed because they are less reliable than a single replica. If either of your two replicas fails, the cluster is unable to recover, since you need a majority of nodes to be alive (and a majority of 2 is 2).
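The quorum arithmetic behind this is easy to check. Here is a quick sketch (hypothetical helper functions, not CockroachDB code) of how many failures each replication factor can survive:

```python
# Raft needs a strict majority of replicas to commit a write.
def majority(n: int) -> int:
    """Smallest number of replicas that forms a quorum out of n."""
    return n // 2 + 1

def tolerable_failures(n: int) -> int:
    """How many replicas can fail while the group stays available."""
    return n - majority(n)

for n in (1, 2, 3, 5):
    print(n, majority(n), tolerable_failures(n))
# 1 replica tolerates 0 failures; 2 replicas also tolerate 0
# (a majority of 2 is 2), so the second replica adds failure
# modes without adding availability. 3 replicas tolerate 1.
```

This is why the minimum useful replication factor is 3: it is the smallest odd number whose quorum leaves room for a failure.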

If you just want something to play around with while you’re waiting for a third device to arrive, you could comment out the line that gives that error and rebuild, or run two cockroach processes on each device (on different ports). But for any production usage, you’ll need at least three nodes (or perhaps a single node with reliable storage such as RAID or EBS).

Thanks Ben,

That is the conclusion I came to over the New Year.

On the other hand I’m not quite sure I get it.

If two out of two nodes are running they can reach a consensus and continue with reading and writing. A majority of 2 is 2, right?

If one of the two fails there is no consensus possible and the service is unavailable.

That is indeed less reliable than one node, because we have doubled the probability of a node failure and hence of a total outage.
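That back-of-envelope doubling checks out. Assuming independent node failures with some probability p over a given period (illustrative numbers, not measurements):

```python
# Availability comparison, assuming independent node failures
# with probability p during some period.
p = 0.01  # illustrative failure probability for one node

# Single node: down only if that one node fails.
single_outage = p

# Two nodes at replication factor 2: a quorum needs both nodes,
# so the cluster is down if EITHER node fails.
two_node_outage = 1 - (1 - p) ** 2

print(single_outage)   # 0.01
print(two_node_outage) # ~0.0199, roughly double
```

For small p, 1 - (1 - p)^2 = 2p - p^2 ≈ 2p, i.e. about twice the outage probability of a single node.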

But for my testing here I don’t care.

Anyway, I’m going to leave it till I get back to civilization and can pick up another Pi.

My production setup is currently 3 nodes in three different Google data centers. It has been running very well for some months now, and happily survived Google rebooting one of the nodes, for whatever reason, while I was not looking.

Whilst we are here: from what I understand, CockroachDB is based on Raft, and Raft does not cater for Byzantine faults. I was kind of wondering what happens when all nodes seem to be up and running fine but one of them is busy corrupting my data somehow. Would that ever get detected anywhere?

What are you not quite sure you get? You’ve summarized the situation accurately.

In this case you don’t care, but most of the time it would be very bad if you put yourself into a state where the loss of either of your two nodes would lead to data loss, so we don’t let you do it. (It would be possible to add a “yes, I really mean it and understand the risk” option, but there’s no real use case for this kind of configuration, so we don’t think it’s worth the complexity.)

We have multiple levels of checksums to protect against accidental corruption caused by things like bad disks (for example, see this blog post). However, if one of the nodes were replaced by malicious software (which still had access to the necessary TLS keys), it could corrupt your data without being detected. Using a byzantine fault-tolerant consensus algorithm would not be enough to prevent this: the malicious software could simply work at other layers of the system (for example, by modifying SQL queries on the way in before they reach the consensus layer).

Thanks again Ben,

I think that is the part I don’t understand. In a two-node system, consensus (a majority) can be reached when all is well. When a node goes bad there is no consensus, so no reads or writes can be done. At this point we have lost availability, but I don’t see why that would cause data loss (except that new data cannot be written).

On the whole though I think you have made the right decision in not allowing a two node system. I’m all for avoiding complexity.

In a two-node raft cluster, both nodes must have a copy of each new command (log entry) in order for it to commit. So everything that has been written will be present in the log on every node. However, what can be lost is the knowledge of which log entries are committed and which are not. It is possible that a log entry is present on both nodes and committed on the leader, but the follower doesn’t know about this yet. If the leader dies before telling the follower that this entry was committed, the follower won’t be able to find out.

You could imagine a kind of “catastrophic recovery” mode in which the single remaining replica tries to recover what it can from its log, making assumptions about whether any remaining log entries committed or not (in a two-node system, it would be safer to assume that every uncertain entry eventually committed; in a larger cluster, this wouldn’t be safe and you’d probably want to err on the side of dropping anything that wasn’t definitely committed). However, this process would play out on a range-by-range basis, and no matter how you resolve problems at the range level, you can end up violating higher-level SQL invariants.
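The “lost knowledge of what committed” situation can be sketched as a toy model (hypothetical names, nothing like the real Raft implementation):

```python
# Toy model of a two-node Raft group: the follower holds every
# log entry, but its commit index can lag the leader's.
leader = {"log": ["a=1", "b=2"], "commit_index": 2}
follower = {"log": ["a=1", "b=2"], "commit_index": 1}  # not yet told b=2 committed

# The leader dies before advancing the follower's commit index.
survivor = follower

# Every entry that was ever written is still present...
assert survivor["log"] == leader["log"]

# ...but the survivor cannot tell whether the tail entries committed.
uncertain = survivor["log"][survivor["commit_index"]:]
print(uncertain)  # ['b=2']

# In a two-node group it is safe to assume uncertain entries
# committed (the leader could not have committed anything without
# this node's acknowledgement); in a larger cluster it is not.
recovered = survivor["log"]
```

The uncertainty lives entirely in the commit index, not in the log contents, which is exactly why the two-node case is the one where optimistic recovery would be safe.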

We’re interested in providing some sort of recovery from a single replica, but we haven’t yet decided how that might work. Until then, it’s very important to replicate your data widely enough to avoid losing a quorum of nodes, and to make backups.

Thanks for the clear explanation Ben.

Funny you should mention “recovery from a single replica”. I was just pondering the nature of backups for a CockroachDB system.

Wisdom has it that for backups:

  1. They should be off site.

  2. There should be more than one copy, or at least rolling versions.

  3. They should be verified for integrity.

  4. One should check that one can actually restore the system from them. Which implies a second roach cluster in this case.

These are all things that a roach cluster does already, in real time, on live data. It keeps copies of the data at multiple, remote locations; that data is consistent and usable; and so on.

Arguably then, a roach cluster should not need any separate backup system. It is its own backup.

Only, that is, if one can recover the data from a remaining node, or nodes, when disaster occurs and consensus is no longer possible.

And lo, I find you are already contemplating such a recovery scheme.