Questions about EXPERIMENTAL CHANGEFEED FOR and cluster_logical_timestamp()

  1. How frequently can I expect timestamps to be emitted relative to the resolved setting? I’ve noticed on idle tables it can be many seconds even if I set resolved much lower (but I understand that’s a “no more frequent than” setting). How frequent is it likely to be on a busy table/db? Not looking for exact numbers or anything, just trying to understand how emitted timestamp frequency works to see if I can leverage them in my application (for keep-alive needs and in general keeping timestamps for changefeed resumption).

  2. For a “updated” or “resolved” timestamp, which is unix nanos + “.” + logical counter of course, I have a use case where I need to collect all updates for the same “unix nanos” (ignoring logical counter) before atomically doing something. Obviously separate events can come across consecutively with the same unix nanos, right? How long of a delay might I expect between two updates with the same nanos? Is it based on synchronization latency or any other sync setting? Not looking for guarantees here, just trying to understand. Another way to phrase this is “is there any way with the changefeed I can atomically operate on all updates of a transaction and know the end boundary”?

  3. Right now, for “key_only”, I can’t tell if it’s a delete like I can by knowing I have an empty “after” on full envelopes. Any suggestions or GH issues to watch on using “key_only” but still knowing it was a delete?

  4. For manually-resolved-set ticks I see { "__crdb__": { "resolved": "someval" }} and for others I see { "resolved": "someval" } (19.1 is my regular db, though I’ve seen it). Just a note, I have prepared for both cases.

  5. In earlier versions of 19.1.x I could check the error message for “replica GC threshold” to see if the initially provided “cursor” was too old, but in later 19.1.x versions and 19.2+ this just returns as “table not exists”. Any suggestions on determining that a cursor is too old to be manually provided? Should I just check for the table existing first, and then if I get a table-doesn’t-exist assume the cursor is too old?

  6. I have read many forum posts about not being able to trust that cluster_logical_timestamp() is uniquely monotonic for each invocation in transactions operating on unrelated rows (much less cluster_logical_timestamp()::INT, which is how I might use it). If I want a monotonically increasing timestamp per insert of a table (meaning unique_rowid() wouldn’t work), can I use cluster_logical_timestamp() to populate the primary key of a common table? Even if I have to detect unique violations and retry again and again, can I still guarantee it’s increasing at least on guaranteed non-concurrent inserts (i.e. the first completes entirely before the second starts)? Also, is cluster_logical_timestamp() related to the updated timestamp that comes over in the changefeed (i.e. if I set that value on the column, what is the likelihood it’s the same as the updated)? I understand my use case is limiting and I will likely have a more consistent opt-in option in my uses for using a sequence.

  7. (EDIT: just added, sorry)…is there any relation to the changefeed and the order of inserts? If I have a table with a column whose default value is a sequence’s nextval, can I guarantee that every changefeed record of the insert will have an increasing sequence? Or can I get changefeed out of order from the sequence?

Thanks in advance!

Thanks for your interest in changefeeds. I’ll take a shot at answering some of your questions; for the others you’ll have to wait for a more experienced expert to chime in.

Today the changefeed mechanism tracks changes using a polling mechanism.
The polling interval is controlled via the cluster setting changefeed.experimental_poll_interval (which we advise against changing!). Polling can also get slightly slower (with additional jitter) if the nodes experience high CPU usage.

Yes, I think you can use the “closed timestamp” (resolved) boundary that’s available in the changefeed. If you see a closed timestamp past your timestamp of interest, you’re guaranteed that there won’t be any other txn committed with that timestamp any more.
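This closed-timestamp gating can be sketched on the consumer side. Below is a minimal Python sketch (the event shape and the `ResolvedBuffer`/`parse_hlc` names are my own, not a CockroachDB API): buffer each event under its “updated” HLC timestamp, and release events only once a resolved timestamp at or past them arrives.

```python
# Minimal sketch: hold changefeed events until a resolved timestamp covers
# them, then release them in timestamp order. HLC timestamps are assumed to
# be "nanos.logical" strings, as seen in "updated"/"resolved" fields.

def parse_hlc(ts):
    """Parse an HLC timestamp string 'nanos.logical' into a sortable tuple."""
    nanos, _, logical = ts.partition(".")
    return (int(nanos), int(logical or 0))

class ResolvedBuffer:
    def __init__(self):
        self.pending = []  # list of (hlc_tuple, event)

    def on_event(self, updated_ts, event):
        self.pending.append((parse_hlc(updated_ts), event))

    def on_resolved(self, resolved_ts):
        """Return all buffered events at or below the resolved timestamp, in order."""
        bound = parse_hlc(resolved_ts)
        ready = sorted(e for e in self.pending if e[0] <= bound)
        self.pending = [e for e in self.pending if e[0] > bound]
        return [ev for _, ev in ready]
```

Once a batch is released this way, the consumer can safely group its entries by the nanos component: the resolved bound guarantees no further events at or below it will arrive.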

(I still would like an expert to confirm this explanation)

(I don’t know how to answer this)

(I don’t know how to answer this)

I did not know that you were relying on that particular error text. I think your use case is legitimate though. Please file an issue for this, we really need to introduce a reliable error code and/or way for you to recognize this condition. Please explain in the issue what you do/need, how you did it before and why it does not work any more.

The forum posts you refer to are about transaction_timestamp() (a.k.a. now()) and the other pg timestamp variants. These are definitely not guaranteed to be related to the txn ordering in any simple way.

In contrast, cluster_logical_timestamp() is guaranteed to evolve monotonically with the actual txn commit order.

There is a pitfall you should be aware of, though: merely using that function in a txn disables a couple of important optimizations inside CockroachDB, makes your workload more sensitive to contention, and increases the likelihood of retry errors. This is because once your client has observed that value, CockroachDB makes a promise that it won’t change before the txn commits; if you did not observe it, crdb would be free to change the txn timestamp “at the end” to re-order it after another conflicting txn and avoid the conflict.

That said, I want to be honest with you and I think that you “want[ing] a monotonically increasing timestamp per insert of a table” is not going to work well with CockroachDB right now. There are two points that I can share with you:

  • in our experience, when other users have expressed this requirement before, it turned out they really wanted something else. It would help if you could step out and give us the big picture, perhaps there’s something else that would help.

  • eventually, CockroachDB should be able to report to you the MVCC timestamp of a committed value after the txn commits. So you know, at a low level the commit timestamp of every value is stored alongside the value. It’s a current limitation that your app cannot see this, but we know of tons of use cases that would benefit from exposing that timestamp to apps. I encourage you to push for this feature (either comment on the issue if one exists already, or file a new one).

No, transaction commit order and sequence order are unrelated.

Here’s some commentary on 1. and 2.:

The changefeed does not provide ordering guarantees for updates from different rows. If you want to be sure that you’ve received all of the messages for some timestamp, then you need to wait for a resolved timestamp. The resolved timestamps are driven underneath by the closed timestamp interval. By default that interval is quite large (~30s) and updated every 6 seconds. Starting in 19.2 it’s safer to move this rate down, which should increase your rate of resolved timestamps. The relevant settings are:

kv.closed_timestamp.close_fraction and kv.closed_timestamp.target_duration. In general you will not receive resolved timestamps more up-to-date than the target_duration, and you will not receive them more frequently than close_fraction*target_duration. Taking these values too low will have serious overhead. A 5 second target_duration on 19.2 seems fine.
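Those two bounds can be written down as simple arithmetic. This is an illustrative sketch (the function name is mine); the 30s/6s defaults quoted above correspond to a close_fraction of 0.2.

```python
# Illustrative sketch of the resolved-timestamp bounds described above:
# resolved timestamps lag real time by roughly target_duration, and are
# emitted no more often than close_fraction * target_duration.

def resolved_bounds(target_duration_s, close_fraction):
    staleness = target_duration_s                       # how far behind real time resolved ts are
    min_interval = close_fraction * target_duration_s   # minimum gap between resolved emissions
    return staleness, min_interval

# The defaults quoted above: ~30s staleness, updated every ~6 seconds.
default_staleness, default_interval = resolved_bounds(30.0, 0.2)
```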

On 19.1 having a low target_duration can put schema changes at risk of never completing.

3., 4., and 5. are all good issues and should get filed.

As for 5. my sense is that it’s not directly related to the 19.1 version you were using but rather is related to how old the cursor is and how quickly GC is happening on your cluster.

Raphael’s answers to 6. and 7. are spot on.

knz and ajwerner have gotten most of these. Let me see if I can answer the remaining ones.

  3. I have a hard time imagining what a deletion would look like in key_only. If you do file an issue, please include an example of how you’d like it to look.

  4. In 19.1, we changed the default format for newly created changefeeds from format=row to format=wrapped. One of these feeds (the one without crdb) was created with a pre-19.1 cluster and still has format=row. If you want them to be the same, you can stop one of them and replace it with a new changefeed that explicitly specifies the other format.

  7. As knz and ajwerner have pointed out, we don’t guarantee anything more than that a single row’s mutations will be in order. But if you actually look at the output, it will likely look mostly sorted, which makes sense given the current implementation. I could see doing something like a special case for seeing them in sequence order as a performance optimization, but you’d still absolutely need to handle the general case of getting them out of sequence order.

Thank all of y’all for the responses! If y’all are curious, my use case is that I’m backing Etcd w/ CockroachDB. Some responses below…

Eek, that’s a bit slow even at the 5 second recommendation. Another question, then: can I at least assume that changefeed events are given to me in order by updated timestamp? Or is that what you’re saying here, that I can’t even guarantee the changefeed is in order? That is, could I get an updated timestamp for one record, then a greater updated timestamp for another record, and then a lesser updated timestamp matching that first one (on unrelated rows)?

Also, what makes a row related? (I know this is a bit of a loaded question, but I’d even accept something simple like “sharing a unique constraint or FK check on a value that was updated”)

For #3, I filed #42602. I think a deleted boolean in the JSON would be fine.

For #4, I filed #42603.

For #5, I filed #42604. As I mentioned in there, I’d also like the error message (or whatever) to tell me what the earliest accepted cursor is. The reason is that in Etcd this is the equivalent of the “revision of last compaction”, which is provided back to the user.

These posts were what I was referencing: Use cockroachdb cluster timestamp in time-travel queries, Using transaction timestamp to get the rows that were updated since the last query, Unique_rowid() monotonicity question, etc. On that last one you said:

So to confirm, while the transactions may be unrelated, that value will always increase monotonically? Is it unique for every invocation (if I include logical counter, obviously not if I ::INT it for just its nanos)?

Yeah, I just want to get as close as I can. I have created my “watch” implementation as buffered, to properly order the updates from the changefeed and these updated timestamps, to make sure it appears monotonically increasing from the caller’s perspective.

The big picture is I want to leverage Cockroach timestamps instead of a sequence, so I can use them with changefeeds, to satisfy etcd’s guarantees as best I can. On the KV guarantees I am basically there, but for watches, which are basically a stream of updates, here are the guarantees:

And if curious, here is my table:

CREATE TABLE key_value (
  key BYTES NOT NULL,  -- assumed: omitted in the original post, but referenced by the indexes below
  version INT NOT NULL,
  create_revision INT NOT NULL,
  mod_revision INT NOT NULL,
  UNIQUE INDEX (create_revision, key),
  UNIQUE INDEX (mod_revision, key)
);
Those revisions I am populating with cluster_logical_timestamp() on insert/update (I would love to keep the logical counter, but I am limited to a 64-bit INT, though I have considered finding bits to reuse to combine them). I want to use a changefeed to stream updates to this table to satisfy those etcd constraints. I now know I have to wait for a “resolved” event to be sure about “Ordered” or “Atomic”. I am open to any alternative implementation approaches. I feel a Cockroach-backed Etcd has a lot of benefits over their existing implementation. However, I am trying to be smarter than using sequences for revisions and polling for the watch impl, even if I have to accept some other tradeoffs. If it’s worth navigating the repo, you can see someone else backing Etcd with Postgres here.
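The “finding bits to reuse” idea mentioned above could look something like the following. This is purely a hypothetical sketch (the 12-bit split and the `pack_hlc` name are my choices, not anything CockroachDB provides): drop low-order wall-clock bits to make room for part of the logical counter in one signed 64-bit value.

```python
# Hypothetical packing of an HLC timestamp (nanos + logical counter) into a
# signed 64-bit INT, trading ~4 microseconds of wall-clock precision for a
# 12-bit logical counter. The bit split is an arbitrary assumption.

LOGICAL_BITS = 12

def pack_hlc(nanos, logical):
    if logical >= 1 << LOGICAL_BITS:
        raise ValueError("logical counter does not fit in %d bits" % LOGICAL_BITS)
    coarse = nanos >> LOGICAL_BITS           # drop low wall-clock bits
    packed = (coarse << LOGICAL_BITS) | logical
    assert packed < 1 << 63                  # fits a signed 64-bit INT column
    return packed
```

Note the tradeoff: two timestamps whose nanos differ only in the dropped low bits end up ordered by logical counter alone, so this preserves ordering only at the coarsened precision.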

Let’s be careful to define “monotonically”. They will be monotonic in txn commit order. Not txn start time or other things. If you use a timestamp() function or a sequence inside the txn, these functions can generate values in a way that is unrelated to the commit ordering.

The only thing that commit order gives you is a before-after relationship between transactions: if A is known to commit before B, then all writes from A are visible from B.

The only way to guarantee that a SQL value in a table will evolve co-monotonically with cluster_logical_timestamp() is to implement your own transactional counter, via e.g. “insert into c(x) select max(x)+1 from c; select x from c order by x desc” inside each txn. Naturally, the contention price would be quite high.

Regarding uniqueness: no, you don’t have a guarantee of uniqueness. It’s possible for two nodes A and B to generate the same value via cluster_logical_timestamp() if their transactions operate on unrelated rows.

Thanks! So I think I can boil this part down to:

If I populate a column with cluster_logical_timestamp() (full value with counter, accepting performance tradeoffs), it will increase per non-concurrent transaction? I.e. transaction A that runs that and completes will always have a lower value than transaction B that runs after A has successfully completed (retries and all)?

Similarly, if I populate a column with cluster_logical_timestamp() and I receive the events of the updates via a changefeed, between the last resolved timestamp and the next, the things that happened in that transaction will all come together before the next resolved?

And, can I guarantee that the column value populated with cluster_logical_timestamp(), when it comes across in the changefeed, will always be less than the next resolved timestamp? (or is cluster_logical_timestamp() at transaction time unrelated to the resolved timestamp I may get in changefeed)

Forgive all my questions and thanks for y’all’s patience.

Yes (but beware that the value of cluster_logical_timestamp() may change across retries).

I don’t know. I hope Dan or Andrew can answer.

I think so but again I am not sure.

I’m super interested in this use case. I’ve been thinking about it myself. What do you have in mind? Is your plan to implement the server protocol?

If you’re just using this database for this use case you could absolutely experiment with setting the durations lower. That will have overhead. We are actively looking at ways to more cheaply maintain closed timestamps on the order of hundreds of milliseconds, maybe even for this release (no promises!). The current design is not optimized for low-latency resolved timestamps. Another thing I want to caution you about is that closed timestamps have very poor interactions with cluster_logical_timestamp(). To reiterate: as soon as a transaction observes its commit timestamp (calls cluster_logical_timestamp()) it cannot be pushed. In order to close out a timestamp we prevent transactions from committing below that timestamp, which in general means that we force transactions to be pushed. I suspect any implementation that relies on both closed timestamps and transactions reading their commit timestamps is generally not going to work at all: transactions will regularly abort, leading to high client-observed latency.

My general advice would be to transactionally use a counter if you need ordering, rather than trying to muck with timestamps. If you know that all writes increment some counter then you also have a way to know whether your feed is caught up before you get a resolved timestamp. All that being said, that single key will have high contention and is likely to experience many restarts. This would be a good use case for SELECT FOR UPDATE if we implemented that properly (also something being worked on). Even that contention can be worked around by using a different row which all writers UPSERT over before attempting to increment the counter.
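A client-side retry loop around such a contended counter txn might look like the following sketch. The error class here is a stand-in for whatever your driver raises for CockroachDB’s retryable SQLSTATE 40001 errors; the names are mine, not a real driver API.

```python
# Sketch: retry a contended transaction body with exponential backoff and
# jitter. RetryableTxnError stands in for a driver error carrying
# SQLSTATE 40001 (CockroachDB's retryable serialization-failure code).

import random
import time

class RetryableTxnError(Exception):
    sqlstate = "40001"

def run_txn(body, max_attempts=10):
    for attempt in range(max_attempts):
        try:
            return body()
        except RetryableTxnError:
            # Back off with jitter to ease contention on the counter row.
            time.sleep(min(0.5, 0.01 * (2 ** attempt)) * random.random())
    raise RuntimeError("transaction gave up after %d attempts" % max_attempts)
```

In practice `body` would open a transaction, increment the counter row, and commit; anything non-retryable propagates out unchanged.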

On a per-key basis they will be in order. Between keys there is no ordering guarantee.

The following order is permitted, for example:

a@2, b@1, a@3, c@1

Where a, b, and c are keys and the @ refers to a timestamp. Note that a’s events arrive in timestamp order, while the interleaving across different keys is arbitrary.
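The per-key guarantee can be stated as a small predicate over the event stream. A sketch (the name is mine), treating events as (key, timestamp) pairs in arrival order:

```python
# Sketch of the ordering guarantee described above: timestamps must be
# non-decreasing per key, while interleaving across keys is unconstrained.

def per_key_ordered(events):
    """events: iterable of (key, timestamp) pairs in arrival order."""
    last = {}
    for key, ts in events:
        if key in last and ts < last[key]:
            return False
        last[key] = ts
    return True
```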

I believe that it is totally possible for the events to straddle a resolved timestamp.

For example, I’d expect a sequence like the following to be valid even if a, b, and c were all written in the same transaction (at timestamp 7, say):

a@7, b@7, resolved@6, c@7
I do not think this is the case. You may receive a resolved timestamp that corresponds to an earlier time after receiving data with an updated timestamp that had a later value, see my previous example. The resolved timestamp tells you that you will never see entries with an updated timestamp earlier than the resolved timestamp that you have not already seen.

Yup, straight up implementing the gRPC server. Really since I only need a limited subset, I am leaving some features out (e.g. leases though they could be emulated). The work could be open sourced one day, right now I’m in research/PoC mode.

What exactly is meant by “pushed” here? Like deferred to the client to retry, or implicitly retried? So if I’m understanding right, you’re saying using cluster_logical_timestamp() makes it not implicitly retryable, but closing a timestamp (i.e. “resolved”) asks not-yet-committed transactions to retry? Assuming you are not talking about the auto-retry that the Go client does, I assume I could detect such an abort and retry? Granted, we’re getting a bit brittle there.

Yeah, I’m going to add that as another “revisioner” impl (I’ve abstracted it). I couldn’t decide if I wanted a sequence or a high-contention single-row table that I increment. I think the sequence is a bit better because I can tolerate gaps so long as it increases. The reason I wanted to muck around with timestamps was to use AOST queries for revisions in the past.

With the rest of your post, even using a counter table or sequence, I still suffer problems with the changefeed across keys. I have to have revisions in order even on separate PKs.

The more I think about it, the more I might just have to approach this in a traditional RDBMS way and just make a table for the keys and values, with the key and the revision in the PK, and poll. Without any globally increasing unique values that I can leverage with ordering in AOST and changefeeds, there seems to be limited benefit using those features.

It’s a bit complicated. If it’s a single-statement transaction then our SQL layer will try to transparently retry the statement. Unfortunately this may get stuck in an infinite loop if the time it takes to run the statement ends up being longer than the closed timestamp interval. Schema changes in geo-distributed clusters prior to 19.2 would hit this sometimes.

If by the Go client you mean that, then yes, it will auto-retry these failures. I am just saying it will be brittle.

That makes sense. Unfortunately, as discussed above, serializing a write timestamp into a record comes with severe limitations and almost certainly should be avoided. The changefeed “updated” timestamps will correspond with AOST. Generally speaking, though, I don’t think you’re going to be able to make the etcd revisions API work on top of our MVCC, sadly.

There have been independent conversations as recently as last week about exposing the MVCC timestamps and temporality in general to SQL but I’d say at best that’s a faraway dream right now.

Using separate rows for each revision is sane. That’s how I’ve assumed I’d abstract it. I assume you’re looking at changefeeds for Watch. The lacking feature on changefeeds as far as I’m concerned is the ability to scope the rows being watched. I’m looking at implementing that functionality soon.

Thanks, that all makes sense. I will back away from timestamps.

It’s not just scoping the rows that’s the problem (I’ve been watching the issue, and I use the CTE approach and crazy stuff like WHERE convert_from(key, 'UTF8')::JSONB->>0 LIKE 'prefix%' in the meantime). I need some kind of ordering, after all I don’t want to see two updates out of order, but since I am now gonna have separate rows for each revision (to support etcd-style time-travel queries), I can’t guarantee that that won’t be the case. The changefeed has to come in order of the revisions or I have to reconcile it. Also, since even the same transaction can straddle “resolved” checkpoints, the Txn feature of Etcd won’t be respected (you’re promised to get all updates of a txn at once, in order).

I’m afraid in its current incarnation I can’t use changefeeds for watches without being able to use timestamps elsewhere and make my PK as just the key.

Ah, forgot about those transactions. One thought is that you could have a separate txns table which stores an entry with all of the keys which were written in a transaction in order. You could then just watch the transaction table and then issue the reads with AOST.

That still doesn’t really help with the time travel queries as you won’t know the time at which to read the other rows. If we had scoped changefeeds you could potentially hack that in by doing a watch without a cursor but with a scope to the transaction row in question (or the revision in question). That, however, would just be a hack to gain access to the MVCC timestamp of a row.

Thanks for that…while I have y’all here, any suggestion on what to use for the cheapest monotonically increasing counter per “completed” transaction. I don’t mind if two concurrent transactions aren’t exactly in order, but I want a number in txn A that is guaranteed less than the one in txn B if A completes before B even starts/retries. The reason I am not worried about concurrent transactions is I believe that my transactions on the same key will conflict and one would have to retry anyways, and I don’t care about unrelated keys.

Do you think I can use unique_rowid() as the revision in an insert-only table like so?

CREATE TABLE key_value (
  key BYTES NOT NULL,  -- assumed: omitted in the original post, but referenced by the primary key below
  revision INT NOT NULL,
  create_revision INT NOT NULL,
  previous_revision INT,
  version INT NOT NULL,
  deletion BOOL NOT NULL,
  PRIMARY KEY (key, revision DESC)
);

Note, the txn with the insert will also be selecting from this table for the key, which is why I believe it’s “locked” to the key. I assume unique_rowid() is re-evaluated on txn restart.

(If this answer would be too long winded, I can open another post…I just didn’t think Unique_rowid() monotonicity question answered this specific question)

I don’t think unique_rowid() is going to do what you want; I think you want to use a sequence. unique_rowid() is only roughly ordered. It doesn’t have enough timestamp bits to safely provide a total ordering. The sequence will ensure that two transactions which do not overlap in time produce totally ordered values. It will incur more overhead, but my sense is that you’re not overly performance-concerned. Please keep us posted on your progress!

If I understand this conversation correctly: if we offered the ability to read MVCC timestamps with rows, and we offered scoped changefeeds, and changefeed latency to resolved timestamps was much lower (10s to 100s of milliseconds), then this project would be much easier.
