How to build a cluster with full data replication between the nodes?

How can I build/configure a CockroachDB cluster so that all data is on all nodes? I'm not talking about having pieces of data spread across the nodes.

I've tried starting a cluster with the locality flag, but as soon as I kill the cockroach process on the machine that first received the data, I'm no longer able to execute a SELECT; the command only returns data once I start that instance again.

In other words, I want a setup where, if I have a total of 3 nodes and I lose 2 of them or they become unavailable, the cluster still works normally.

If CockroachDB doesn't support this, what is the best way to achieve it?

After some time, and without the locality flag, I was able to connect to the second node and run a SELECT on a table with the first node down. However, now I'm trying to SELECT from another table with all 3 nodes up, and no data is returned; the query just waits forever with no errors.
The Cockroach console shows 37 replicas on each node. A few minutes later, that last SELECT came back with an error: SQL Error [08006]: An I/O error occurred while sending to the backend.
Node 1: macOS, cockroach 1.5; nodes 2 and 3: Debian 9, cockroach 1.6, both on the same machine; no containers at all.

Hi @Linux10000,

CockroachDB replicates your data 3 times (by default) and uses the Raft consensus algorithm to guarantee consistency between replicas. This means that a majority of replicas needs to be available for the cluster to make progress. In the case of a 3-node cluster, therefore, losing 2 nodes means losing consensus: the cluster becomes unresponsive.

There are other node/replication configurations that may get you closer to what you want, but you’ll never be able to have a 3-node cluster survive the loss of 2 nodes.
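
For example (a sketch of my own, using the insecure local-cluster setup from the docs; ports and store paths are placeholders), a 5-node cluster with the default zone's num_replicas raised to 5 keeps a majority, and therefore keeps serving queries, after losing any 2 nodes:

```
# Sketch: a local, insecure 5-node cluster (1.x-era flags).
cockroach start --insecure --store=node1 --port=26257 --http-port=8080 --background
cockroach start --insecure --store=node2 --port=26258 --http-port=8081 --join=localhost:26257 --background
cockroach start --insecure --store=node3 --port=26259 --http-port=8082 --join=localhost:26257 --background
cockroach start --insecure --store=node4 --port=26260 --http-port=8083 --join=localhost:26257 --background
cockroach start --insecure --store=node5 --port=26261 --http-port=8084 --join=localhost:26257 --background

# Raise the default replication factor from 3 to 5 (1.x-era zone CLI, if I
# remember the syntax right). A majority of 5 is 3, so 2 nodes can fail.
echo 'num_replicas: 5' | cockroach zone set .default --insecure -f -
```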

It might be helpful for you to go through these intro training modules. They explain the reasoning and architecture behind CockroachDB and then guide you through some hands-on labs to experience how replication and fault tolerance work using a simple local cluster.

Best,
Jesse


Thanks for the answer.
Perhaps I need to rethink what I'm trying to achieve.
For now, I'll think about what you said.

Fair enough.
But what if a cluster (of 3 nodes) has been running at my site and, for whatever reason, I lose 2 of those machines?
Theoretically, the single remaining machine doesn't need consensus. And since I can start with a single node, I should be able to go back to one. Makes sense, right?
In other words, I want to go back from a cluster to a single machine (maybe permanently). How should I proceed?

Currently, your only option would be to restore the cluster from a backup. However, for a future release, I believe we are investigating other disaster recovery options. Here’s a related GitHub issue.
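
If it helps, here's a rough sketch of the backup path (my own example; `mydb` and the host are placeholders): `cockroach dump` writes a database's schema and data as SQL, which you can replay into a fresh single-node cluster later:

```
# Dump schema + data to a SQL file while at least one node is still up.
cockroach dump mydb --insecure --host=<any-live-node> > mydb_backup.sql

# Later, on a brand-new single node:
cockroach sql --insecure -e 'CREATE DATABASE mydb'
cockroach sql --insecure --database=mydb < mydb_backup.sql
```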

@dianasaur323 or @bdarnell, are we looking into ways for a cluster that loses consensus to come back online at a lower replication factor (or no replication)?


What's the implication of increasing num_replicas for a cluster as you add nodes? Does it play any part in the fault tolerance formula, i.e., (n - 1)/2?

For example, in a cluster spread across 3 AZs (single region), does increasing the number of nodes or num_replicas beyond a 3-node cluster provide any benefit if I want to tolerate a single AZ failure? Is there any way to tolerate the loss of 2 AZs?
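
For what it's worth, here's how I understand the arithmetic (my own worked sketch, not an official answer), together with the --locality flag that tells the replica allocator which AZ each node lives in:

```
# Quorum math: a range with n replicas tolerates floor((n - 1) / 2) lost replicas.
#   num_replicas = 3 -> quorum 2 -> tolerates 1 lost replica
#   num_replicas = 5 -> quorum 3 -> tolerates 2 lost replicas
# Across 3 AZs, 5 replicas land roughly 2-2-1 per AZ: losing one AZ removes at
# most 2 of 5 replicas and quorum survives; losing two AZs can remove up to 4
# of 5, which breaks quorum no matter how many nodes you add.

# Tag each node with its AZ so replicas are spread across failure domains
# (the region/az key names and values are just examples):
cockroach start --insecure --store=node1 \
  --locality=region=us-east-1,az=us-east-1a \
  --join=<existing-node>:26257
```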