Experiments on disaster recovery

replication

(Gaetano) #1

I have created a cluster with 6 nodes, 3 in each datacenter (I’m using version v2.0):
n1 <=== D1
n2 <=== D1
n3 <=== D1
n4 <=== D2
n5 <=== D2
n6 <=== D2

I started an application writing to n1. So far so good. The database I created has a replication factor of 6.
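
For reference, a cluster like this might be started with something along the lines of the sketch below; the hostnames, the --locality tier name (datacenter=d1/d2), and the insecure setup are assumptions for illustration, not the exact commands from this experiment:

# n1 (datacenter 1): first node, initializes the cluster
$ ./cockroach start --insecure --host=n1 --locality=datacenter=d1

# n2, n3 (datacenter 1):
$ ./cockroach start --insecure --host=<hostname> --locality=datacenter=d1 --join=n1

# n4, n5, n6 (datacenter 2):
$ ./cockroach start --insecure --host=<hostname> --locality=datacenter=d2 --join=n1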

I shut down cockroach on n5 and n6 (pkill cockroach) and so far so good. Then the disaster:

I shut down n4 as well and the application got stuck (I expected this), but from that point on I was not able to recover the cluster by re-adding the last removed node: after restarting n4 the cluster was still down. Connecting to the cockroach console on n1 gave me: “Connection to CockroachDB node lost”. At a certain point my client (C++ using pqxx) terminated with:

“terminate called after throwing an instance of ‘pqxx::sql_error’
what(): ERROR: waiting on split that failed: split at key /Table/52/1/335888366539702275 failed: context deadline exceeded”.

Eventually to recover the cluster I had to restart n5 and n6 as well.

Is this expected?


(Peter Mattis) #2

@kalman No, this is not expected. If you set the replication factor to 6 your cluster should be able to tolerate 2 node failures. When you took a 3rd node down, you should expect to see data unavailability, but bringing any of the down nodes back up should restore the cluster. Can you provide the exact command you used to change the replication factor? My suspicion is that you changed the replication factor for only part of the cluster and thus accidentally caused data unavailability by taking down 3 nodes.


(Gaetano) #3

$ cat db.zone
num_replicas: 6

$ ./cockroach zone set test_db -f db.zone --insecure
range_min_bytes: 1048576
range_max_bytes: 67108864
gc:
  ttlseconds: 600
num_replicas: 6
constraints: []


(Peter Mattis) #4

In addition to the database and table zones, there are also zones for the “meta” and “liveness” ranges. I suspect that either the “meta” range or the “liveness” range became unavailable during your experiment. Could you try:

~ echo "num_replicas: 6" | ./cockroach zone set .meta -f - --insecure
~ echo "num_replicas: 6" | ./cockroach zone set .liveness -f - --insecure

I can see that the need to do this is somewhat unexpected. If this resolves the problem, we’ll put some thought into how to make it clearer that you also need to raise the replication factor for the system ranges.
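
If it helps while checking this, the zone subcommand in this version should also let you list the zones that exist (including the system ones) and inspect their current settings; a quick sketch:

~ ./cockroach zone ls --insecure
~ ./cockroach zone get .meta --insecure
~ ./cockroach zone get .liveness --insecure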


(Alex Robinson) #5

echo "num_replicas: 6" | ./cockroach zone set system -f - --insecure may also be needed for the system config tables (e.g. system.descriptor to be available)


(Gaetano) #6

I can confirm that with:
echo "num_replicas: 6" | ./cockroach zone set .meta -f - --insecure
echo "num_replicas: 6" | ./cockroach zone set .liveness -f - --insecure
echo "num_replicas: 6" | ./cockroach zone set system -f - --insecure

everything works as expected. At this point I’d say that num_replicas for the internal system tables should automagically be increased when new nodes are added to the cluster; in the end the final user doesn’t want to have to manage that as well.

The only thing I found glitchy was the internal timeseries; I don’t know where those are stored.


(Peter Mattis) #7

Thanks for the update. I’ll file an issue about this and we’ll investigate how we can make the behavior less surprising.


(Gaetano) #8

I’m going to test what happens if an entire datacenter goes down for whatever reason. I would like to reduce the replication factor from 6 to 3; would that bring my cluster back to life?


(Gaetano) #9

It didn’t work, for the simple reason that the cluster didn’t even accept my replication change command :frowning:


(Peter Mattis) #10

I’m not exactly clear on what you did. If you have 6 nodes in your cluster and you take 3 down, you’re not going to be able to do anything with your cluster if your replication factor is 6. In order to survive a datacenter outage you’ll need 3 datacenters or a zone config setup such that a majority of the replicas are outside of the failing datacenter.


(Gaetano) #11

I cannot ask for a 3rd datacenter :smiley:
By “a majority of the replicas outside” do you mean that with a 6-node cluster (3+3) the replication factor can be higher than 6? I’m thinking of a setup that can work like a Cassandra 3+3 with a local quorum of 2.


(Peter Mattis) #12

CockroachDB won’t add more replicas than the number of nodes. What my statement was intending to convey is that if you need to survive a datacenter failure, you need to have a majority of replicas outside of the failed datacenter. So if you have 2 datacenters, one of the datacenters should contain 3 replicas and the other 2 replicas; then you can survive failure of the datacenter with 2 replicas. Clearly you can’t survive failure of the datacenter with 3 replicas, so this isn’t a great solution. We’ve had other requests to support surviving a datacenter failure in a 2-datacenter setup, but doing so runs against the consensus replication that is at the heart of our replication system.
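
To make the 3+2 placement concrete, here is a sketch of what the zone config might look like. It assumes the nodes were started with a locality flag such as --locality=datacenter=d1 / --locality=datacenter=d2 (the tier name and values are illustrative) and that per-replica constraints are supported by the version in use; treat it as an illustration rather than a recipe:

$ cat db.zone
num_replicas: 5
constraints: {"+datacenter=d1": 3, "+datacenter=d2": 2}

$ ./cockroach zone set test_db -f db.zone --insecure

With this placement a range keeps a quorum (3 of 5) if d2 fails, but not if d1 fails, which is exactly the asymmetry described above.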


(Gaetano) #13

Well, isn’t a 3+3 cluster with a replication factor of 6 and one datacenter completely down the same as a single datacenter with a replication factor of 3?

Having said that, would bringing up a 4th machine in the surviving datacenter fix the entire cluster?


(Peter Mattis) #14

Filed https://github.com/cockroachdb/cockroach/issues/24430 about reducing surprise when increasing the replication factor on a database/table.


(Peter Mattis) #15

Well, isn’t a 3+3 cluster with a replication factor of 6 and one datacenter completely down the same as a single datacenter with a replication factor of 3?

No, it is not the same. Consensus replication requires a majority of nodes to be available to make progress. With 6 replicas, a majority is 4 nodes, not 3. The underlying reason we can’t make progress with 3 nodes is that we don’t actually know what the other datacenter is doing. The reason the other datacenter is down could be a network partition and we do not want to allow a split-brain scenario where both datacenters continue to allow the cluster to be modified.
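
To spell out the quorum arithmetic (standard Raft majorities, nothing specific to this cluster): a range with n replicas needs floor(n/2) + 1 of them available to make progress.

n = 3  =>  quorum = 2  (tolerates 1 lost replica)
n = 5  =>  quorum = 3  (tolerates 2)
n = 6  =>  quorum = 4  (tolerates 2)

So a 3+3 split across two datacenters leaves only 3 of the 6 replicas when either datacenter is down, which is short of the required 4.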

Having said that, would bringing up a 4th machine in the surviving datacenter fix the entire cluster?

Once the cluster is in its unavailable state, adding another machine won’t help. This relates to the split-brain scenario I mentioned above. If we allowed your live datacenter to add a new machine and continue operation, what is to prevent the other datacenter from doing the same? For consensus replication protocols like Raft (this is what CockroachDB uses) and Paxos, we need a real majority of nodes to be available.

That said, we are exploring ways to recover when there are insufficient replicas to form a quorum. The details have not been fleshed out yet. Doing so will likely either require manual intervention, or result in the possible loss of some of the most recently written data, or both.


(Gaetano) #16

Ok, I see. It would be good to have a “you know what you are doing” option to force a downgrade of the replication factor; it’s a shame to have a working datacenter with 3 replicas for each range in place and not be able to use it. Thanks for the clarification.


#17

So the moral of the story here is that you need to have 3 data centers to survive a data center outage? That’s a really big deal breaker for most people, myself included. Having two data centers is usually a luxury for most companies, and the expectation will be to have a product that can seamlessly fail over when one is down. Having some method of promoting a single DC to its own cluster should be a top priority, as many people are probably not aware of this limitation.


(Peter Mattis) #18

@somecallmemike Yes, that is the moral of the story. If you’re using another system that provides replication between only 2 datacenters while guaranteeing consistency, either its guarantees are weak or they are exaggerated.


(Gaetano) #19

True. But here I don’t want an automatic replica downgrade; I would like to downsize my cluster in order to make it operable again. Something like “change the replication factor to 3 with the constraint that the replicas be in the alive datacenter”, with a big flag on that command: “yes, I know what I’m doing”.
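
For illustration only, the requested operation might look something like the sketch below. The locality tier datacenter=d1 for the surviving datacenter is an assumption, and, as #9 showed, an unavailable cluster will not accept such a change today; that gap is exactly what is being asked for:

$ cat downsize.zone
num_replicas: 3
constraints: [+datacenter=d1]

$ ./cockroach zone set test_db -f downsize.zone --insecure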


(Peter Mattis) #20

@kalman We’re working on supporting exactly this scenario for the next release (2.1). There will likely be a number of caveats and a “yes I know what I’m doing” flag.