Benifits of leaseholders

I was reading docs and papers on CockroachDB, and would like to clarify the benifits of introducing the leaseholder in CRDB.

It seems to me that the leaseholder is the same as the raft leader.

By introducing this extra abstraction, CRDB needs to worry about the co-location of leaseholder and raft leader.

So why CRDB introduce this extra abstraction?

Welcome to our forum @ming! This isn’t an area I work on, but I was curious about the answer myself. This is what I found out.

Basically, this is a consequence of the fact that CockroachDB uses Raft, and doesn’t want to deviate too far from its implementation. Leaseholders are a CockroachDB-specific concept, and have more properties than what a normal generic Raft leader in any distributed system would have.

One specific difference is that Raft leadership relies on the wallclock time, whereas our leases mostly use MVCC timestamps. The two types of timestamps are pretty different.

I’m not sure we’ve ever really written down the reasons why we have leases, so let me try a “five whys” style answer:

  1. Why do we have both leases and leaders?

As Rafi said, leases and leaders deal in completely different kinds of time. Raft is a time-agnostic protocol, dealing primarily in a sort of logical clock defined by term numbers and log positions, with a failure detector that operates in stopwatch time (not wallclock time as in Rafi’s message). Leases, on the other hand, deal in the MVCC timestamps used by our transaction protocol, which are essentially wall-clock timestamps (We’re actually using a hybrid logical clock here, but that’s not an interesting part of the design and I’m going to ignore it for now)

  1. Why does consensus use logical time?

Could a non-raft consensus algorithm unify these two notions of time? That is, could we implement consensus while tying any notion of leadership to ranges of MVCC time? Possibly, but doing so would make it much harder to implement things like epoch-based leases or quorum leases. It’s also harder to guarantee liveness when the consensus protocol depends on wall-clock time.

  1. Why does MVCC use real (wall-clock) time?

Our use of wall-clock time has many consequences, such as the requirement that clocks be reasonably well-synchronized. Was there any way we could have stayed in the abstract space of vector clocks and avoid all the complexity of real time?

The main drawback of vector clocks is that there is no clear definition or easily-accessible value of the “current” time. If a read query arrives, what timestamp should it use? We must either communicate with other nodes or risk using a timestamp that is stale (in some dimensions). The use of a wall-clock timestamp (plus other protocols for handling the uncertainty in clock synchronization) gives us a usable answer to the question of “what time is it?”

  1. Why do we need to know the current time?

This topic is discussed in a slightly different scenario in our blog post and RFC for bounded staleness reads. Any transaction that reads from multiple keys must use the same timestamp for all of them in order to see a consistent view of the data. Unless we have a reliable local indicator of the timestamp to use, a timestamp-negotiation phase may be required before the query can be executed.

  1. Why are multi-key reads so important?

The bounded staleness RFC discusses a fast path for “simple” queries that can avoid a timestamp negotiation. This approach would also work for vector clocks. Could we do something like that for “simple” queries and fall back to a negotiation phase for more complex ones? We don’t think so, because multi-key operations are extremely common in SQL because of indexes (This is the ultimate answer you’ll find behind a lot of CockroachDB’s design decisions - even if you don’t think your application requires transactions or a high level of transaction isolation, unless you have a very limited schema CockroachDB requires these things to maintain indexes and constraints on your behalf)

1 Like