Introducing dynamic time stamping to cockroach DB

I am working on a masters project to reduce the abort rates in Cockroach DB under high contentions. I have briefed the idea of my work below. I would really appreciate any insights and discussions on this.

CockroachDB uses a fixed timestamp allocation scheme, and assigns timestamps at the start of the transaction, which are used as transaction’s commit timestamp. This commit timestamp is used to order the transactions in logical timestamp order, and hence enforce serializability. SSI, unlike SI, detects and handles read-write conflict but the restrictive fixed timestamp ordering leads to higher number of aborts under contention.

To reduce the number of aborts at high contention, we plan to employ dynamic timestamps, in contrast to fixed timestamps, for allocating the commit timestamp of a transaction. In this technique, system tries to dynamically allocate a timestamp range to the transaction, based on the conflicts on elements accessed in the transaction. At commit, it finds a timestamp for the transaction from that range, to fit that transaction on a logical serializable timestamp order. We plan to explore the dynamic timestamp allocation technique we designed in our previous work MaaT (paper attached to mail), which is an abbreviation of “Multiaccess as a Transaction" Adapting the dynamic timestamp allocation technique with the SSI implementation in CockroachDB would eliminate some of the aborts. Our technique targets cases where transactions seem non-serializable under fixed a timestamp ordering, while are serializable when timestamps are dynamically allocated to fit them in a serializable timestamp order.

Consider the following two transactions.
T1 : r1(y) r1(x)
T2 : r2(x) w2(x)
Suppose the execution of the two transactions is as follows. H1 : b2 r2(x) b1 r1(y) r1(x) c1 w2(x) c2

At w2(x), the SSI approach in Cockroach DB sees that x is read by T1, which is a transaction with later timestamp. To ensure timestamp ordering, with the fixed timestamp allocation, transaction T2 is aborted since it causes RW conflict with T1.

In case of using dynamic timestamping technique, suppose lowerbound(T1) and
lowerbound(T2) are the lower bounds and upperbound(T1) and upperbound(T2) are the upper bounds of the transactions T1 and T2 respectively. On detecting the RW conflict at w2, the lowerbound(T2) is made greater than upperbound(T1) so that history will be equivalent to the serialization order T1 −→ T2. Rather than aborting the transaction due to the initially allocated timestamp order, the dynamic timestamp allocation can re-order the logical transaction order, and lead to commitment of both the transactions in this case.

The implementation changes would include
1) introducing write soft locks and read soft locks on data item just to track the conflicting operations on the data item ( the existing write intents could be modified to work as soft write lock, but it should be changed to support multiple intents )
2) Introducing field in transaction record to hold list of transactions that have placed write soft lock when it tried reading and list of transactions placed read soft locks when it tried writing.
2) Modifying read timestamp cache to hold just committed reads.
2) Introduction of validation phase at end of transaction to decide on the commit timestamp based on all read and write locks on the data item.

Insights shared by Andrei Mate @andrei,

  • As you’ve seen, CockroachDB’s model is quite different from MaaT’s. Changing it to fit would be a very big undertaking. Do not underestimate the complexity of our code. Things like introducing a validation phase before commit, at the very least, would be quite involved.
  • supporting multiple “write intents” on a key is something that has been discussed a number of times before and never pursued because of complexity. Most recently, this conversation happened that you might want to skim through: https://github.com/cockroachdb/cockroach/issues/5861.
  • One thing you should be aware of - - transactions that consist of a single batch of operations (so there’s no “client” logic) attempt execution as “1-phase-commit” (an atomic, non-transactional batch of commands). If that’s successful (no conflicts) then there’s no extra “commit” Raft command, no intents, etc. See replica.executeWriteBatch. This is a valuable optimization and it might affect your benchmarks.

Andrei Matei’s answers to my questions.

  1. If you were to adapt MaaT to cockroach DB, what all changes would you have done in Cockroach DB. ( just a briefing on design and code)
    Andrei : That’s a tough question :). I have no idea. CRDB’s transaction model is not my main area of work.
    I’d start by thinking how MaaT needs to change for MVCC. Then, the idea of maintaining a window of time to which a txn can be pushed seems to be fundamental to MaaT, so I’d think about how that can be maintained (one thing to keep in mind is that, since we’re a general SQL database, there’s no delimited read/write/validation phases to our transactions, as there seemed to be in MaaT. However I couldn’t tell if the MaaT actually needs these phases to be distinct).

  2. Which among those changes would you think could be a hick-up or would disturb the working of cockroachDB to much.
    Andrei : Well, one thing that again comes to mind are the complications around allowing multiple write intents on a key (https://github.com/cockroachdb/cockroach/issues/5861). In MaaT these are the write locks.

  3. Overall all how much time would you think it would take like newbies to cockroach DB like me do all those changes.
    Andrei : Let’s put it this way - changing anything around the transaction model is an ambitious enterprise. The codebase is large and this is one of the more complex layers - subtleties everywhere. Changing the transaction model completely is not something that I would personally attempt, even though I work on this product and understand most of the code.
    There might, however, be limited cases that can be improved with ideas from MaaT without replacing everything. Perhaps along the lines you were already thinking about. I would personally encourage anyone to start small when it comes to this particular layer and attempt more targeted code changes before deciding to rewrite too much.

  4. If I have to place the validator ( MaaT’s ) in crdb. Where do you think will be appropriate? According to my understanding, it will be good at dist_sender, because it has access to transaction record and will have knowledge on all keys transaction has touched.
    Andrei : Depending on what this “validator” does exactly, I think maybe you want to put assign this role to the replica that processes the commit request (i.e. the EndTransactionRequest). These requests are routed to the range where the txn record lives (which is the same as the range of the first key written to by the transaction). One of the replicas of that range will process this EndTransactionRequest, and it has (local) access to the txn record (see replica_command.go - evalEndTransaction).

Thanks Andrei for the inputs. It really helps.

I would appreciate if anyone can pitch in to share more opnions and insights on my approach. I will be eager to discuss and understand cockroach DB better.

Thanks,
Ravi Kumar
@semicolon

2 Likes

@andrei , you mentioned “One of the replicas of that range will process this EndTransactionRequest, and it has (local) access to the txn record” . If a transaction spans over multiple ranges, EndTransactionRequest will be processed by all those ranges. So, where is the final decision made ?

I believe that EndTransactionRequest only goes to the range which has the txn record. The txn record is an internal key (TxnMeta.Key) to which we “tie” the final decision of the transaction. We choose it so that it is on the same range with the first key modified by the transaction for performance reasons, but it could be anywhere as far as correctness is concerned.

@radu Thanks for responding. You mean to say even though the transaction spans accross multiple ranges, only one among them have the transaction record ? If so, if other ranges( that dont have the transaction record with them) wants to change the status of transaction to “abort” due to conflict in that range how would they do ?

Yes, a transaction has a single txn record - that is how we achieve atomicity. Aborting or committing a transaction can only happen by doing a request to the range that has this record.

In CockroachDB, write intents act as an exclusive lock so that the node that holds a transaction record can perform the final commit step (“flipping the switch” as described in our blog post) without consulting any other nodes. It might be better to think of this as a capability to commit instead of a lock. This is an optimization for the case when there is no contention, since it reduces the second phase of our two-phase commit to a single RPC instead of a round of RPCs to all participating nodes. During a conflict, breaking the lock (or revoking the capability) of an existing intent requires a write to the transaction record (via the PushTxn RPC in abort mode, which ensures that a subsequent commit of this transaction will fail).

Extending this model to allow multiple intents per key is tricky. Either we continue to guarantee that an intent is a capability to commit, in which case adding a new intent that adjusts the time bounds of any existing intents requires a PushTxn-like operation for each intent’s transactions, or we temporarily allow conflicting intents and require an extra round of conflict resolution before the transaction can commit. The latter sounds closer to MaaT’s “validation phase”, but I think the former would be more feasible to implement in cockroach.

“Dynamic timestamping” sounds similar to how CockroachDB handles transactions in snapshot isolation (as opposed to SSI). The transaction’s write timestamp is increased as it runs and it is allowed to commit at that timestamp without restarting. This is because in the single-intent model, a write intent effectively covers the range of timestamps from its original value to infinity. This reduces the isolation of the transaction from SSI to SI because we don’t have a similar adjustment for reads. One way to introduce dynamic timestamping ideas into CockroachDB would be to allow reads to block off a range of time in the timestamp cache (not to infinity, but some reasonable offset from their starting point). This would potentially allow SSI transactions to commit at a higher timestamp without restarting.