Access to minority / majority side during partition


(Devangana Tarafdar) #1

Hello,

I have a question about the behaviour of the cluster immediately after a net split and before the cluster becomes unavailable on the minority side. When there is a network partition and the leaseholder for some range is on the minority side, it seems that reads can continue for some time (depending on the query) but writes stop at once. How long does this window usually last? And these are stale reads, correct?

Also, what happens to reads and writes on the majority side when the net split happens and the leaseholder is on the minority side? Does the cluster immediately choose a new leaseholder?

Thanks,
Devangana


(Tim O'Brien) #2

Hi @dtaraf,

As an example, let’s assume you have a 3 node cluster. Node 1 becomes unavailable to the rest of the cluster due to a network partition. If n1 is the leaseholder for a range, it will serve reads for that range for the remainder of its lease. If n2 or n3 receives a read request for a range where n1 is the leaseholder, the request will fail, since n2 and n3 cannot contact n1. No node will be able to complete writes to that range while n1 holds the lease: n2 and n3 cannot communicate with the leaseholder, and the leaseholder cannot confirm that a write has reached two replicas.

The reads on n1 are not stale, since no writes are able to complete on either side of the partition.

When the lease expires, either n2 or n3 will conduct an election and become the new leaseholder. At that point, all reads and writes conducted on the majority side will succeed, and since no writes were accepted by n1, there should be no inconsistency when n1 rejoins the cluster.
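
The scenario above can be sketched as a small model. This is only an illustration under simplifying assumptions, not CockroachDB code: the names `RangeState`, `can_read`, `can_write`, and `elect_if_expired` are hypothetical, the lease length uses the 9 second figure quoted elsewhere in this thread, and clock offsets and the real election mechanics are ignored.

```python
from dataclasses import dataclass, replace

LEASE_SECONDS = 9.0  # lease length quoted elsewhere in this thread


@dataclass(frozen=True)
class RangeState:
    replicas: frozenset      # nodes holding a replica of the range
    leaseholder: str         # node currently holding the lease
    lease_expires_at: float  # seconds, measured from the start of the partition


def has_quorum(side, rng):
    """True if this side of the partition holds a majority of the replicas."""
    return len(side & rng.replicas) > len(rng.replicas) // 2


def can_read(side, rng, now):
    """Reads need a reachable leaseholder whose lease is still valid."""
    return rng.leaseholder in side and now < rng.lease_expires_at


def can_write(side, rng, now):
    """Writes need a reachable, valid leaseholder plus a replica majority."""
    return can_read(side, rng, now) and has_quorum(side, rng)


def elect_if_expired(side, rng, now):
    """Once the old lease has expired, a majority side picks a new leaseholder."""
    if now < rng.lease_expires_at or not has_quorum(side, rng):
        return rng
    new_holder = min(side & rng.replicas)  # arbitrary but deterministic choice
    return replace(rng, leaseholder=new_holder,
                   lease_expires_at=now + LEASE_SECONDS)


if __name__ == "__main__":
    minority, majority = frozenset({"n1"}), frozenset({"n2", "n3"})
    rng = RangeState(frozenset({"n1", "n2", "n3"}), "n1", LEASE_SECONDS)
    for now in (1.0, 10.0):
        # Only a side with quorum can move the lease after it expires.
        rng = elect_if_expired(majority, rng, now)
        for name, side in (("minority", minority), ("majority", majority)):
            print(now, name,
                  "read" if can_read(side, rng, now) else "no read",
                  "write" if can_write(side, rng, now) else "no write")
```

In this model the minority side can read but not write while the old lease is valid, the majority side can do neither, and after the lease expires only the majority side serves reads and writes.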

Hope that helps clarify. Let me know if you have any questions.

-Tim


(Devangana Tarafdar) #3

Hi Tim,

Thank you for the patient reply. It's clearer now. I have a couple more questions on this.

  1. How long does it usually take for the lease to expire?
  2. I had some confusion regarding one of the config params. Is the parameter time until store dead involved in this process at all? It seems to me that this parameter is only consulted when rebalancing replicas away from a dead node, so I just wanted to get confirmation.
  3. When a node dies, do the remaining nodes follow the same process? That is, for a brief time they don't serve reads/writes, and then they resume once the dead leaseholder's lease expires?

Thanks,
Devangana


(Tim O'Brien) #4

Hi @dtaraf,

Leases are 9 seconds long, so that’s the maximum time that a range would be unavailable in case of a partition.

time_until_store_dead is the time that a cluster will wait until it considers a node down rather than temporarily unavailable - it’s not the same as the duration of the lease. The lease governs which node can actually complete a read or a write for a given range, while time_until_store_dead is the length of time we’ll wait before trying to upreplicate ranges from an unavailable node to the rest of the cluster.

Yup, same process. If a down node was the leaseholder for a range, reads and writes to that range would fail until the remaining nodes elected a new leaseholder - 9s max. After that, we’d wait until time_until_store_dead has passed, and then attempt to move the dead node’s replicas to a live node, if one was available.
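
To make the two timers concrete, here is a rough sketch of the worst-case timeline for a range whose leaseholder has just died. The function `range_phase` and its phase labels are hypothetical, and the 300 second value for time_until_store_dead is only a placeholder assumption; check the actual setting on your cluster.

```python
def range_phase(seconds_since_failure,
                lease_seconds=9.0,             # lease length quoted above
                time_until_store_dead=300.0):  # placeholder, not from this thread
    """Return the worst-case phase of a range after its leaseholder fails."""
    if seconds_since_failure < lease_seconds:
        # The old lease may still be valid, so no other node can take over yet.
        return "unavailable"
    if seconds_since_failure < time_until_store_dead:
        # A new leaseholder serves traffic, but the lost replica is not yet replaced.
        return "available_under_replicated"
    # The node is now considered dead; its replicas can be moved to live nodes.
    return "re_replicating"


if __name__ == "__main__":
    for t in (3.0, 30.0, 600.0):
        print(t, range_phase(t))
```

The first phase is an upper bound: if the lease was already partway through its term when the node failed, the range becomes available sooner.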

Hope that helps!

-Tim


(Devangana Tarafdar) #5

Thanks! That cleared up a lot of things for me.


(Tim O'Brien) #6

My pleasure, happy to help!