Understanding replication semantics on a single node deployment

Hi there!

Firstly, thank you for this awesome system!

I have just started running cockroach db on my local machine and have some basic questions regarding replication of data.

I can see from the admin interface that the replication factor is set to 3; are there 3 replicas of the data being stored on the single instance of cockroach? Or is the data under-replicated?

The general doubt I had was: if there were a 3-node cluster and 2 of the nodes died for some reason, would that situation be identical to my current toy deployment? I can currently write data to my single-node deployment. When will cockroach db stop accepting writes/reads because not enough nodes are running?

Please let me know if there is any documentation section that can help me understand this better.

–
cheers, gaurav

Hi Gaurav,

When you first spin up a cluster, and until you add a 3rd node to it, the cluster works in a “special” mode which allows running with only 1 or 2 replicas for testing purposes. Until that point, the benefits of replication and the mechanism that prevents transactions from committing unless 2 of the 3 replicas are available are not (yet) enabled.
As soon as you add a 3rd node, the full transactional protocol kicks in and you likely won’t be able to operate with only 1 of the 3 nodes alive.

Does this clarify?

Meanwhile it is perhaps also true that our documentation does not highlight this sufficiently. Thanks for asking the question, we will look at how to explain this better!

Thanks, that clarifies things.
Follow-up question: is there an explicit setting that prevents the “experimental” mode from kicking in, so that the cluster only accepts writes/reads when the necessary quorum is available?

EDIT: Something went weird when I was running this the first time and the instructions were incorrect. Updated to reflect actual cluster behavior.

Well, “quorum” in this context means strictly more than half of the nodes must be available. For 1- and 2-node clusters, that means the entire cluster must be available.
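
As a quick illustration (plain arithmetic, not CockroachDB code), here is how the quorum size works out for small clusters, assuming one replica per node:

# Quorum is a strict majority of the replicas: floor(n / 2) + 1.
def quorum(n):
    return n // 2 + 1

for n in range(1, 6):
    print(n, "node(s): quorum =", quorum(n), "| tolerates", n - quorum(n), "node(s) down")

# 1 node(s): quorum = 1 | tolerates 0 node(s) down
# 2 node(s): quorum = 2 | tolerates 0 node(s) down
# 3 node(s): quorum = 2 | tolerates 1 node(s) down
# 4 node(s): quorum = 3 | tolerates 1 node(s) down
# 5 node(s): quorum = 3 | tolerates 2 node(s) down

So a 3-node cluster keeps working with one node down, while 1- and 2-node clusters can’t tolerate losing anything.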

To demonstrate it, you can do the following steps:

$ cockroach start --background
$ cockroach sql
root@:26257> CREATE DATABASE example;
root@:26257> CREATE TABLE example.kv (key STRING PRIMARY KEY, value STRING);
root@:26257> INSERT INTO example.kv (key, value) VALUES ('cockroach', 'rocks!');
root@:26257> ^D

Now that we have a single node cluster and some test data, let’s add some nodes for it to replicate to and see how it behaves when they become unavailable:

$ cockroach start --background --store=cockroach-data-2 --join=localhost:26257 --port=26258 --http-port=8081
$ cockroach sql
root@:26257> SELECT * FROM example.kv;
+-----------+--------+
|    key    | value  |
+-----------+--------+
| cockroach | rocks! |
+-----------+--------+
root@:26257> ^D
$ cockroach quit --port=26258
$ cockroach sql
root@:26257> SELECT * FROM example.kv;

This call hangs: the quorum of a 2-node cluster is 2, so either node becoming unavailable leaves the remaining node unable to establish a quorum, and the cluster stops accepting reads or writes. Now we can go up to 3 nodes to see the replication mechanism in action:

$ cockroach start --background --store=cockroach-data-2 --join=localhost:26257 --port=26258 --http-port=8081
$ cockroach start --background --store=cockroach-data-3 --join=localhost:26257 --port=26259 --http-port=8082
$ cockroach sql
root@:26257> SELECT * FROM example.kv;
+-----------+--------+
|    key    | value  |
+-----------+--------+
| cockroach | rocks! |
+-----------+--------+
root@:26257> ^D
$ cockroach quit --port=26259
$ cockroach sql
root@:26257> SELECT * FROM example.kv;
+-----------+--------+
|    key    | value  |
+-----------+--------+
| cockroach | rocks! |
+-----------+--------+
root@:26257> ^D

This call still works because we only quit 1 of the 3 nodes in our cluster, so we still have a quorum. Let’s kill one more node so that we lose the quorum.

$ cockroach quit --port=26258
$ cockroach sql
root@:26257> SELECT * FROM example.kv;

This call now hangs, since we can’t establish a quorum with only 1 of the 3 nodes up.
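
If you want the cluster to serve queries again at this point, just bring back one of the nodes you stopped, for example the second node with the same command we used earlier:

$ cockroach start --background --store=cockroach-data-2 --join=localhost:26257 --port=26258 --http-port=8081

Once it rejoins, 2 of the 3 replicas are reachable again and queries go through.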

There was another question posted about shrinking a cluster back down to a single node, but long story short, once the replication and quorum requirements start to be applied, doing so isn’t well supported at the moment.

The “special” mode mentioned above is, simply put, one in which the data has fewer copies than it wants (note that when a node is down, the data still “has the same number of copies”, but one copy is unavailable; that’s slightly different from the situation here, where the cluster doesn’t even think there are any copies that are unavailable). When the first node starts, it wants three copies of each Replica but has only one. When you have three nodes and then one dies, after a while the cluster will give up on the third node and “size down” the number of copies it thinks it can realistically expect, i.e. it’s in the “special mode” again where it has two copies (instead of the three it wants).
As @twrobel points out above, you can’t repeat that trick and go back to one node, since a quorum of two nodes is two, i.e. when you kill one of them the other is completely toast.
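
To make that last point concrete, here is a tiny sketch (again just illustrative Python, not actual CockroachDB code) of the availability check a range effectively has to pass:

def can_serve(replicas, live):
    # A range can serve reads/writes only if a strict majority of its replicas is reachable.
    return live >= replicas // 2 + 1

print(can_serve(replicas=3, live=2))  # True: quorum of 3 is 2, so one node down is fine
print(can_serve(replicas=2, live=1))  # False: quorum of 2 is 2, so losing either node blocks the range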

I thought I understood @knz’s reply but after reading the next two posts I am a bit confused. Please bear with me :slight_smile:

  • I tried the exact steps that @twrobel outlined - when I expand the cluster to 2 nodes and then kill one of them, the queries no longer work (they hang). The Admin UI shows 0 nodes in the cluster. [In contrast, it was suggested that going from 2 to 1 is allowed as a “special” case (if the 3rd node was never started).] The linked post also states the same - going from 2 to 1 is not allowed. What am I missing?
  • If this were allowed (going from 2 to 1), would it not be dangerous? Before the third node joined the cluster, if node1 and node2 got partitioned, they could each accept different writes for the same key?
  • Is there some setting that can completely eliminate this automatic switching between “test/special” and “replicated” mode and let the cluster begin in fully replicated mode? I understand this is a very nice feature for trying out the db, but probably not something I would want when spinning up a prod cluster…

$ ./cockroach start
build:      beta-20161013 @ 2016/10/13 20:08:17 (go1.7.1)
admin:      http://localhost:8080
sql:        postgresql://root@localhost:26257?sslmode=disable
logs:       cockroach-data/logs
store[0]:   path=cockroach-data
status:     restarted pre-existing node
clusterID:  {51aed2a3-e654-4e46-85d8-d64b281aad8c}
nodeID:     1

You’re correct, going from two to one node is not possible (I wrote that in my post, but didn’t read the previous post well enough - I think @twrobel might’ve gotten lucky that when he killed the second node, it hadn’t yet “accepted” copies of the data on the first node).

Hmm, right you are. I’m not sure what happened there… It may have just been that I was running through the steps too quickly locally and there wasn’t time for replication to occur. So there is no time when a quorum isn’t enforced, is that right, @tschottdorf?

I’ll edit my above comment to prevent confusion for others :slight_smile:

Also, that screenshot posted by @gaurav is… odd… How would the web console be able to report 0 nodes present? Isn’t the fact that the web console is being served indicative of at least 1 node running (the one the console is being served from)?

I wouldn’t trust the ui too much when the cluster is unavailable. One should be able to trust it, and I think it should still show that one node is up, but it won’t be able to load much of the data it usually displays, and perhaps that clogs it up enough so that it doesn’t get around to updating the node count. Something to look at, though. We haven’t put the ui through many stress situations.


Yes, a quorum is always necessary (i.e. the consensus algorithm never cheats). I also think you ran the commands fast enough that, in reality, you were always running with all the data on the first node only.

Thanks for the clarifications!