If the node fails to write to the raft log, how does the follow-up work?

(yeriahn) #1

I found this scenario:
https://www.cockroachlabs.com/docs/stable/architecture/reads-and-writes-overview.html#write-scenario

If the node fails to write to the raft log, how does the follow-up work?

Does the replica node keep trying to replicate from the leader node,
or
does the leader node send another request until it receives an ack?


(Ron Arévalo) #2

Hi @yeri,

If one node fails to write to its raft log, that’s fine so long as this node isn’t the leader. If a majority fail, then things will hang and raft will keep retrying internally. If the raft leader fails to write to its raft log, then it will crash.

To be specific, a single failed write is only tolerated when the node that fails to write to its raft log isn’t the leader of that raft group. If the leader is the one that fails to write, you would see that node crash.
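
As a rough illustration of the three cases above, here is a minimal Go sketch. It is not CockroachDB code: `Outcome`, `classify`, and the constant names are hypothetical, and it assumes the leader's own successful write counts as one acknowledgement toward the majority.

```go
package main

import "fmt"

// Outcome is a hypothetical label for what happens to a proposed write.
type Outcome string

const (
	Committed   Outcome = "committed"
	Stalled     Outcome = "stalled, raft keeps retrying"
	LeaderCrash Outcome = "leader node crashes"
)

// classify mirrors the three cases described in the post.
// leaderOK reports whether the leader wrote its own raft log;
// followerOK has one entry per follower.
func classify(leaderOK bool, followerOK []bool) Outcome {
	if !leaderOK {
		return LeaderCrash
	}
	acks := 1 // assumption: the leader's own write counts toward the majority
	for _, ok := range followerOK {
		if ok {
			acks++
		}
	}
	total := len(followerOK) + 1
	if acks >= total/2+1 {
		return Committed
	}
	return Stalled
}

func main() {
	fmt.Println(classify(true, []bool{true, false}))  // one follower failed
	fmt.Println(classify(true, []bool{false, false})) // both followers failed
	fmt.Println(classify(false, []bool{true, true}))  // the leader failed
}
```

With three nodes, one failed follower still leaves 2 of 3 successful writes, so the write commits. Two failed followers leave only the leader, which is not a majority.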

Also, I’d like to get some more clarification on what exactly you’re asking about. I know you mentioned the scenario above, but are you asking more about transactions that do not commit (which could happen for any number of reasons) or are you specifically asking about failures to write to disk?

Thanks,

Ron


(yeriahn) #3

If a majority of nodes write to the raft log, the commit is successful.
If there are three nodes, the majority is 2.
So one node may not have written to the raft log yet.

In that situation, where one node has failed to write to the raft log, how does that node retry the write?
Does the leader node use heartbeats to check whether the raft log write succeeded?


(Ron Arévalo) #4

Hey @yeri,

If 2 out of 3 nodes write to the raft log, the commit is successful and CRDB will continue to try to up-replicate to the third node. If it can’t, we mark the range as under-replicated.
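
To make the "under-replicated" label concrete, here is a small Go sketch. The function name `rangeStatus` and the status strings are hypothetical, not CockroachDB identifiers. The sketch assumes a range is under-replicated when it has fewer live replicas than its target replication factor but still has a majority, and unavailable when it has lost the majority.

```go
package main

import "fmt"

// rangeStatus is a hypothetical helper: live is the number of replicas that
// are currently up to date, target is the configured replication factor.
func rangeStatus(live, target int) string {
	quorum := target/2 + 1
	switch {
	case live < quorum:
		return "unavailable" // no majority, writes cannot commit
	case live < target:
		return "under-replicated" // still has a majority, but missing replicas
	default:
		return "fully replicated"
	}
}

func main() {
	for live := 3; live >= 1; live-- {
		fmt.Printf("%d of 3 replicas: %s\n", live, rangeStatus(live, 3))
	}
}
```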

If the leader A receives a write, it sends that write to B and C, and waits to hear back from either one of them before the write is considered COMMITTED. Once it is committed, the third node will report back as soon as it can, depending on available CPU, network latency, and disk IO.
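
The A/B/C flow above can be sketched in Go as follows. This is a simplified model of generic Raft bookkeeping, not CockroachDB's implementation: the names `commitIndex`, `lagging`, and the `match` map are hypothetical. It assumes the leader tracks, for each replica, the last log index it knows to be written there, and keeps resending entries to any replica that is behind. Real Raft also checks the entry's term before committing, which is omitted here.

```go
package main

import (
	"fmt"
	"sort"
)

// commitIndex returns the highest log index stored on a majority of replicas.
// match holds, per replica (leader included), the last log index known to be
// written there.
func commitIndex(match map[string]uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	idx := make([]uint64, 0, len(match))
	for _, m := range match {
		idx = append(idx, m)
	}
	// Sort descending so that idx[quorum-1] is the highest index that at
	// least quorum replicas have reached.
	sort.Slice(idx, func(i, j int) bool { return idx[i] > idx[j] })
	quorum := len(idx)/2 + 1
	return idx[quorum-1]
}

// lagging lists the replicas whose log is behind lastIndex. In this model the
// leader keeps resending the missing entries to them until they acknowledge.
func lagging(match map[string]uint64, lastIndex uint64) []string {
	var out []string
	for name, m := range match {
		if m < lastIndex {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// A is the leader and wrote entry 5, B acknowledged 5, C is still at 4.
	match := map[string]uint64{"A": 5, "B": 5, "C": 4}
	fmt.Println("commit index:", commitIndex(match))
	fmt.Println("still catching up:", lagging(match, 5))

	// Later, C acknowledges entry 5 and nothing is left to resend.
	match["C"] = 5
	fmt.Println("still catching up:", lagging(match, 5))
}
```

In this model, entry 5 is committed as soon as A and B both have it, and C simply stays on the list of replicas to resend to until it acknowledges the entry.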

It may be helpful to read through some of our trainings on replication, which can be found here:

Data Replication

Fault Tolerance & Recovery

Let me know if you have any other questions.

Thanks,

Ron
