Will CockroachDB block in case of inavailability of a range with PENDING transaction record?


I am trying to understand if there are any situations when CockroachDB may block in case of unfortunate node failure during commit of multi-range transaction. I would appreciate if you help me understand CockroachDB behavior in the following scenario:

  1. There are two ranges in the system with replication factor 3: 1 and 2
  2. There are 7 nodes, 3 for each range plus transaction gateway node: N1_1, N1_2, N1_3, N2_1, N2_2, N2_3, N_TX
  3. Transaction is started on a gateway node
  4. “PENDING” tx record is written to range 1 nodes as first updated key belongs to this range
  5. Some writes are performed to both 1 and 2 ranges
  6. Finally, we loose both gateway node as well as majority of range1 nodes. E.g. N1_1, N1_2 and N_TX are out of topology.

At this point range 1 is unavailable, so we have no clue whether tx was committed or not, range 2 has some dangling write intents. Could you please explain whether there locks will be released or not? If they will be released at some point, then how it is guaranteed that transaction is not in COMMITTED state on unavailable range 1, and that nobody had been able to read updated values from range 1 before it went unavailable?

Thank you.

Hey @devozerov ,

That’s a great question, and I hope I can shed some light on this scenario for you. In order to really understand this question, we need to review how CockroachDB transactions are all atomic, meaning it does all necessary reads/writes/etc, or nothing at all. In this case, if the leaseholder had already comitted the data in the range in question, it must have gotten confirmation from a majority of nodes that the write was performed before the data is committed. If not, then the transaction is aborted and any pending writes are cleaned up.

The specifics of the transactions record and how its processed and handled are described in the documentation and blog post linked.

Let me know if you have any other questions.


Hi Ricardo,

Thank you for your answer. However, I am still confused a bit. I learned the following from CockroachDB documentation and blog posts:

  1. When transaction started, it’s marker record with “PENDING” state is written on the first accessed range.
  2. When a write is performed while transaction is active, it’s stored as a write intent
  3. When transaction is committed, marker record on the first range is updated to “COMMITTED”
  4. Then write intents are propagated to committed writes asynchronously , or by explicitly querying the range containing transaction marker.
    Is this description accurate?

In the previous post I referred to the following situation:

  1. “PENDING” marker is written to range 1
  2. Writes are applied to range 1 and 2, and respective write intents are created
  3. Range 1 goes down (either before commit, or after commit, but before write intent propagation request is sent to replica 2)
  4. Transaction initiator node also goes down.

Now we have a problem: range 2 contains uncommitted write intents and has no way to understand whether respective transaction was committed or not. We can neither commit, nor rollback these intents, hence the system should block writes on affected keys. Am I correct?

Hey @devozerov,

So your description is pretty spot on, however, I do want to emphasize one specific point. I just want to expand a bit further on your description of the asynchronous writes. This asynchronous write only occurs after the majority of nodes have acknowledged the write intent, and have committed the record. This is to ensure that the data has been written atomically and to ensure the durability of the data should a node be lost.

In your scenario, the transaction record PENDING is appended to the first range, range 1. All write intents are then performed along with the first two ranges, as you describe. So let’s break down each scenario when a range goes down (or a majority of the nodes are unavailable, just to be clear).

Before COMMIT:
    If the range goes down before the commit, that would indicate that the majority of nodes is unavailable,
    so the transaction would be cancelled.

    If the data has already been committed, then each node already has that transaction record updated     after having achieved the majority needed, and so the node that is still available (if any) would simply     perform the asynchronous write of those ranges the next time that DistSender goes through the     ranges.

In order to be clear, the write intent is sent to the other nodes (or as you specified replica 2) when the write intent is being committed. If this node, or replica, is needed to reach a majority of nodes, then the transaction cannot be committed and will be aborted or sent to retry, depending on outage reasons. This is absolutely necessary for the distributed data to be atomically written. So the only way that the data on node1 is committed, in a 3 node cluster, as if at least 1 other node also committed the data, and the commit was done by the transaction gateway.

If there is a transaction record on a specific node that does not comply with the majority of the cluster, then the transaction record will simply be aborted and cleaned up. This is how we ensure atomic transactions so that it is all or nothing.


Hi Ricardo,

Do I get it right, that transaction record is written only to the first accessed range? Basically, the source of my confusion is this document [1]. It states that a transaction record is appended only on the first accessed range, and that write intents are resolved asynchronously. Now I imagine the following situation:

  1. Range 1 has a PENDING transaction record and a write intent for key A
  2. Range 2 has a write intent for key B

Then commit and immediate shut down of a range 1 occurs (or a split brain between ranges). And we have the following situation:

  1. Range 1 - has COMMITTED transaction record and a resolved write intent, but is inaccessible from range 2
  2. Range 2 - has unresolved write intent for key B, and cannot resolve it since range 1 is inaccessible

With my current understanding, this is a blocking situation for the key B. We cannot return value from current write intent, because we do not know whether the transaction is committed or aborted. At the same time, we cannot return the previous value either, because there is a chance that the range 1 committed the transaction and someone read the updated value of A before range 1 becomes inaccessible.

I would appreciate if you help me either find a mistake in my reasoning or understand how it could be resolved without blocking?


[1] https://www.cockroachlabs.com/docs/stable/architecture/transaction-layer.html

Hello @devozerov

So in the situation where a majority of the nodes are unavailable and range 1 is inaccessible, then unless the leaseholder of range 2 is available, this is a blocking situation and the value on range 2 cannot be returned. In this type of scenario though, if the majority of the replicas in a range is unavailable, then the data would need to be restored from a previous backup. If you are concerned about losing multiple nodes on your cluster, then it may be better to increase the replication factor to 5. This ensures that in case multiple nodes are lost, the majority of the replicas of the ranges are still available and thus the data can still be accessible.

Let me know if you have any other questions.