Using transaction timestamp to get the rows that were updated since the last query

I am trying to figure out how to get all of the rows that have changed since the last query was issued. I am using this to sync only the rows that have changed since the last sync to my app. I want to make sure that the sync logic doesn’t miss any rows due to concurrent write transactions during the sync query.

Right now I have a setup that roughly looks as follows:

CREATE TABLE sample (
  value INT,
  updated_at TIMESTAMPTZ
);

-- Used to insert data into the table.
INSERT INTO sample (value, updated_at) VALUES ($1, transaction_timestamp());

-- Used to initially sync all data.
SELECT transaction_timestamp() as current_time;
SELECT * FROM sample
    WHERE updated_at < transaction_timestamp();

-- Used to sync data that has changed since the last sync. $1 is the current_time returned from the last sync query.
SELECT transaction_timestamp() as current_time;
SELECT * FROM sample
    WHERE updated_at >= $1
    AND updated_at < transaction_timestamp();

Using this setup, is it guaranteed that, after a sync query from above returns, no rows will subsequently be inserted into the table with updated_at less than the current_time returned from that sync query? (i.e. the syncing logic would not miss any rows due to a row being inserted with updated_at < the current_time returned from the last sync query.)

If not, what is the best approach to implement this type of syncing?

The logic you propose is not sufficient, because the SERIALIZABLE isolation level does not guarantee that the transactions will be serialized in the order of their transaction_timestamp(). The property that gives you that is linearizability.

There are two possible solutions:

  • use a single-row table that maintains the last current_time value you have observed, and make every transaction update that single row. Forcing all transactions through that single row gives you linearizability (at the expense of contention if you’re performing the sync from multiple clients concurrently)
  • use AS OF SYSTEM TIME. This gives you a global order, but may be slightly cumbersome to use.
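A rough sketch of the first option, with illustrative names (the sync_marker table and its columns are my invention, not anything built in):

```sql
-- A single row that every transaction writes; conflicting on this row
-- forces all of the transactions into one linearizable order.
CREATE TABLE sync_marker (
  id INT PRIMARY KEY CHECK (id = 0),
  last_sync TIMESTAMPTZ
);
INSERT INTO sync_marker VALUES (0, now());

-- Every write transaction also touches the marker row:
BEGIN;
INSERT INTO sample (value, updated_at) VALUES ($1, transaction_timestamp());
UPDATE sync_marker SET last_sync = last_sync WHERE id = 0;
COMMIT;

-- The sync transaction advances the marker and reads in the same txn:
BEGIN;
UPDATE sync_marker SET last_sync = transaction_timestamp() WHERE id = 0;
SELECT transaction_timestamp() AS current_time;
SELECT * FROM sample
    WHERE updated_at >= $1
    AND updated_at < transaction_timestamp();
COMMIT;
```

Since every transaction writes the same key, any two of them conflict, which is what serializes them.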

Thanks for the reply.

I’m trying to figure out why these two transactions that use the same rows would not be serialized in the order of their transaction timestamp. I’m using this blog post for my understanding of how the transaction system works.

From my understanding of that blog post, each transaction is assigned a timestamp (I assume this is the timestamp that transaction_timestamp() returns), which is used as the timestamp for each operation in the transaction. When reading a row range, the read’s transaction timestamp is stored in the node-local timestamp cache for that row range. When writing a row, the write only succeeds if the write’s transaction timestamp is greater than or equal to the timestamp recorded for that row in the node-local timestamp cache:

Read-Write Conflicts – Read Timestamp Cache

On any read operation, the timestamp of that read operation is recorded in a node-local timestamp cache. This cache will return the most recent timestamp at which the key was read.

All write operations consult the timestamp cache for the key they are writing; if the returned timestamp is greater than the operation timestamp, this indicates a RW conflict with a later timestamp. To disallow this, the operation (and its transaction) must be aborted and restarted with a later timestamp.

Doesn’t this mean that, as long as the row range read by the “sync query” (from above) includes the row that the “insert query” (from above) writes, the two transactions will be ordered by timestamp?

For example, using the logic from the blog post, if the “sync query” has a later timestamp than the “insert query”, then:

  • If the “insert query” commits first, then the “sync query” will always read the data committed by the “insert query”, since the “sync query” has the later timestamp
  • If the “sync query” commits first, then the “insert query” will be restarted with a higher timestamp, since it will consult the node-local timestamp cache and notice that its timestamp is less than the cached timestamp for the row

Hello Trevor,

Your intuition is in the right direction, and the draft I had written to your original post said it would work based on the same assumptions. But unfortunately there are a few reasons why the assumptions break down here. One critical problem is that the built-in transaction_timestamp() is not the same as CockroachDB’s internal transaction timestamp. It is meant to mirror PostgreSQL’s built-in, and so it returns the same value as now(), which is based on the wall clock time, and doesn’t change if the transaction gets pushed.

As it turns out time in a concurrent system is really hard – it took five engineers here discussing it for an hour to work through the subtleties of your request. In the end, our recommendation is that, for safety, you explicitly force linearizability by referencing a single row, as suggested by @knz above. That has the best chance of having the semantics you’re expecting: never missing a row in the sync.

Let us know if you’ve got any questions. Thanks,

Thanks Andrew for the investigation.

I understand where the issue lies. The problem is that the documentation for transaction_timestamp() states “Returns the current transaction’s timestamp”, which is actually different from all of the other documentation that discusses the transaction’s timestamp (see the examples quoted right below).

In terms of transactions, the gateway node picks a timestamp for the transaction using HLC time. Whenever a transaction’s timestamp is mentioned, it’s an HLC value. This timestamp is used to both track versions of values (through multiversion concurrency control), as well as provide our transactional isolation guarantees.

Retried transactions are also issued at a later timestamp, so the transaction now operates on a later snapshot of the database, so the reads might return updated data.

It’s also important to note that retried transactions are restarted at a later timestamp. This means that the transaction operates on a later snapshot of the database and related reads might retrieve updated data.

It might make sense to note in the documentation for transaction_timestamp() that it’s actually the timestamp at the start of the transaction and that it is not the same as the HLC transaction timestamp that is discussed elsewhere.

I do believe that there is value in exposing the HLC transaction timestamp in addition to the timestamps that are currently exposed. For example, this new function could be called commit_timestamp() and return the HLC transaction timestamp. Another option would be to change the current transaction_timestamp() to return the HLC transaction timestamp; however, doing that may be confusing for users migrating from PostgreSQL.

Doing this would enable not only the use case I mentioned above but also others. For example, if you have a schema with a foo column (or any timestamp column), assigning commit_timestamp() to foo would ensure that the values stored in the foo column are always monotonically increasing for a given row.
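As a sketch of what I mean (commit_timestamp() is my proposed built-in; it does not exist today):

```sql
-- Hypothetical: commit_timestamp() is the proposed built-in returning
-- the transaction's HLC timestamp; it is not an existing function.
CREATE TABLE t (id INT PRIMARY KEY, foo TIMESTAMPTZ);

-- Each write stamps foo with the HLC transaction timestamp, so the foo
-- values stored for a given id would be monotonically increasing.
UPSERT INTO t (id, foo) VALUES ($1, commit_timestamp());
```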

I also do not believe that the alternative provided above (i.e. using a single row that maintains the last observed current timestamp) is reasonable, since it essentially limits the concurrency of the system.

I would definitely be happy and willing to contribute code, documentation updates, etc. to introduce such functionality.

Hi Trevor,

You raise a very good point: the documentation is misleading on this issue. I also thought those two uses of the phrase “transaction timestamp” referred to the same thing when I first read through these docs. I’ve gone ahead and opened an issue to fix the documentation for that built-in.

Regarding your second point, there is a (currently undocumented) built-in called cluster_logical_timestamp(), which is precisely your suggested commit_timestamp() function. Using that correctly is subtle and requires significant care to be taken around the internals of CockroachDB, specifically the timestamp cache. One major concern is exactly the (reasonable, yet incorrect) assumption you made: that the values “will always be monotonically increasing”. The cluster_logical_timestamp() does not guarantee that in the general case.
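For reference, it can be called like any other built-in (note that it returns a DECIMAL HLC value rather than a TIMESTAMPTZ), though given the caveats here I can’t recommend building your sync logic on it:

```sql
-- Returns the transaction's HLC timestamp as a DECIMAL
-- (wall time in nanoseconds, plus a logical component).
SELECT cluster_logical_timestamp();
```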

This is what we spent all that time yesterday debating – there may be a way to meet your requirement, but it’s not necessarily as simple as replacing use of transaction_timestamp with cluster_logical_timestamp. For one thing, there is disagreement as to how an index on your updated_at column would interact with the updates to the timestamp cache based on the sync queries (it might introduce correctness issues), and there’s a question about how (or if) we could make this strategy face less contention than the one recommended above. In the end we couldn’t come up with a solution that I was 100% confident reporting to you as safe to use.

That being said, I’ve opened an issue to discuss documenting the cluster_logical_timestamp() built-in. We may not end up doing it, but at least that will give us an opportunity to have the discussion publicly about why or why not we make that decision.

Thanks for your detailed and thoughtful conversation here. Please do feel welcome to contribute to the discussion on the two issues I linked, and if you’d like to contribute to the documentation I’d be happy to point you in the right direction.