Question about consistency (monotonic timestamps)


(Ivan Dubrov) #1

We have a table for which we need to maintain a log of updates, a history. We are currently using a timestamp to order these updates. Essentially, there are two tables whose only difference is the primary key (data keeps only the latest row, data_history keeps all of them):

CREATE TABLE data(
  id STRING,
  updated_at TIMESTAMPTZ NOT NULL,
  data STRING,
  PRIMARY KEY(id)
);
CREATE TABLE data_history(
  id STRING,
  updated_at TIMESTAMPTZ NOT NULL,
  data STRING,
  PRIMARY KEY(id, updated_at)
);

The constraint we have is that each new update should have a timestamp greater than the previous one, so we need a monotonically increasing timestamp.

Notice that we only need that guarantee per row, not across all rows. It’s okay if the last update to the row with id 2 has a smaller timestamp than the last update to the row with id 1, even though the update to 2 happened later than the update to 1.

However, when updating the same row (say, row 1), updated_at should only grow.

So, when we have an update we want to:

  1. Upsert into the data table.
  2. Insert into the data_history table.
  3. Make sure that the new updated_at is strictly greater than the previous value.

My understanding was that statement_timestamp() is not guaranteed to be “monotonically increasing”, so I ended up doing the following query to enforce that it is monotonic:

INSERT INTO data(id, updated_at, data)
VALUES ('1', statement_timestamp(), 'hello')
ON CONFLICT (id)
DO UPDATE SET
  updated_at = GREATEST(excluded.updated_at,
                        data.updated_at + '1 microsecond'::interval),
  data = excluded.data
RETURNING updated_at;

However, after reading the last blog post, it seems to me that this would not be necessary. If statement_timestamp() ever happens to be less than the previous value, that would mean my current transaction’s timestamp is smaller than the stored value’s timestamp, which falls into the “uncertainty” scenario, and the transaction is restarted.

Am I correct? Can it be replaced with

INSERT INTO data(id, updated_at, data)
VALUES ('1', statement_timestamp(), 'hello')
ON CONFLICT (id)
DO UPDATE SET data = excluded.data
RETURNING updated_at;

?


(Raphael 'kena' Poss) #2

statement_timestamp(), like all other pg-compatible time functions supported by CockroachDB, uses the node-local system clock, which is subject to backward jumps and drift.

It therefore does not guarantee monotonicity, and the replacement you propose is incorrect.

A couple of general recommendations:

  1. It’s not possible in CockroachDB to create a guaranteed-monotonic value that is (a) time-like and (b) different per statement, without forcing serialization through some shared table. This naturally has a serious performance implication, which you should be mindful of.

  2. If you were to drop requirement (a), you could perhaps make do with SQL sequences, which are much faster to access (and increment) than a record in a shared table. Of course, SQL sequences still go through a shared counter in storage, so there will be a serialization effect; see the sketch after this list.

(In general, anything you want to increment per statement will need to access a shared object somewhere, with a non-negligible impact on maximum throughput.)

  3. If you were to drop requirement (b), you could perhaps make do with the internal timestamp CockroachDB generates for an entire transaction. This is not exposed to SQL clients via a supported function (although we have an experimental function for it called cluster_logical_timestamp()), because letting clients use this timestamp disables several important performance optimizations inside CockroachDB and generally increases the probability of transaction conflicts and manual retries.
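
For illustration, option 2 could look like the following sketch. The rev column is hypothetical and replaces updated_at, so the value is no longer time-like; the GREATEST guard mirrors the trick in your query, since two transactions can obtain sequence values in a different order than they commit:

CREATE SEQUENCE data_rev_seq;

-- Hypothetical schema change: a rev INT column replaces updated_at
-- as the per-row ordering key.
INSERT INTO data(id, rev, data)
VALUES ('1', nextval('data_rev_seq'), 'hello')
ON CONFLICT (id)
DO UPDATE SET
  -- never let rev go backward, even if sequence values arrive out of order
  rev = GREATEST(excluded.rev, data.rev + 1),
  data = excluded.data
RETURNING rev;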

Finally, some parting words: our general experience with users and their apps is that the apparent need for a “history table” is usually a symptom of a higher-level business requirement that can be better matched by other approaches.

It would seriously help if you were to share your “big picture” with us so we can see how CockroachDB can better serve your needs.


(Raphael 'kena' Poss) #3

To clarify, the mistake is in this sentence of yours:

“If statement_timestamp() ever happens to be less than the previous value, that would mean my current transaction’s timestamp is smaller than the stored value’s timestamp, which falls into the ‘uncertainty’ scenario, and the transaction is restarted.”

Your assumption is that statement_timestamp() is, under the hood, using a source of time that’s monotonic and increasing in relation to real time on the node, so that comparing two values t1 and t2 conveys meaningful information about the ordering of the two measurements in time and in relation to other things (like txn commit order). If that were the case, your sentence would be true and the proposed replacement would be correct.

My point is that this is not true. It’s possible for statement_timestamp() to say “it’s 3 o’clock” at one moment and then “it’s 1 o’clock” four seconds later, because some sysadmin changed the system time. It’s completely bogus for ordering things.


(Ivan Dubrov) #4

What you are saying is correct, but I do have a shared table and a shared row. All I need is that when I update the row with, say, an id of “123”, every update (“upsert”) generates a timestamp that is higher than the previous value. Then I would return that value using the RETURNING clause and use it to insert that timestamp into a history row.

Yes, history rows are unrelated to each other, and if I were using only a history table, I would have all the monotonicity issues you are mentioning.

However, I’m not doing that. Forget about the history table; the question is simpler: how do I generate a monotonic timestamp-like value when updating a single row? So I DO have serialization happening already: I’m updating the same row. Notice that I’ve already provided a solution that takes the greatest of the new timestamp and the old one plus 1 microsecond, which, I think, guarantees monotonicity. Now, I have a feeling (which could be wrong!) that the timestamp math wasn’t really necessary because of the way CockroachDB uses timestamps internally.
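
Concretely, the full flow I have in mind is something like this sketch (assuming our CockroachDB version allows mutation statements inside CTEs; otherwise the same two statements would run in one explicit transaction):

WITH upsert AS (
  -- Steps 1 and 3: upsert and force updated_at to grow strictly per row.
  INSERT INTO data(id, updated_at, data)
  VALUES ('1', statement_timestamp(), 'hello')
  ON CONFLICT (id)
  DO UPDATE SET
    updated_at = GREATEST(excluded.updated_at,
                          data.updated_at + '1 microsecond'::interval),
    data = excluded.data
  RETURNING id, updated_at, data
)
-- Step 2: record the resulting timestamp in the history table.
INSERT INTO data_history(id, updated_at, data)
SELECT id, updated_at, data FROM upsert;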

I don’t think the timestamp can be fully arbitrary (or maybe I misunderstand how CockroachDB works internally). Wouldn’t a CockroachDB node commit suicide if its clock drifted too far from the other nodes, to avoid data inconsistencies? Isn’t it using timestamps to provide its guarantees?

I got my high-level understanding from https://www.cockroachlabs.com/blog/consistency-model/ and https://www.cockroachlabs.com/blog/living-without-atomic-clocks/, which suggest to me that maybe I can rely on the timestamp here. Again, I’m not talking about updating arbitrary rows; all I need is that updating a single row produces a timestamp value higher than the previous one.

If I’m wrong, what sequence of events would result in the next value being smaller? I don’t agree that an arbitrary change to the node’s clock would achieve that: I expect the node to commit suicide if the clock is that far off. If, however, the difference is small but the new value is still smaller, I expect CockroachDB to decide that this update falls into the “uncertainty” region and restart the transaction. But I could be wrong here.

The business requirement I have is just that: every update to some object should create a history record for that object, and we have to provide an API to read those history records. Saving object “123” with the value “hi” and then with the value “hello” should allow clients to read both “123”=>“hi” and “123”=>“hello” when they use the API to read a particular version of the resource. Naturally, I would like these history records to be 1) ordered (a later update should come after an earlier one) and 2) somewhat related to “real time”.

Ideally, of course, I would want ALL of the history to be ordered, but that’s not possible with CockroachDB without a shared entry for serialization (which would kill performance). I would need FoundationDB for that.


(Raphael 'kena' Poss) #5

CockroachDB uses two separate sources to measure time and produce timestamps:

  • The ordering of transactions and the detection of serializability problems use a hybrid logical clock (HLC), which is guaranteed-monotonic and will, as you say, cause the node to “commit suicide” if it detects drift.
  • All the SQL built-in functions (now(), statement_timestamp(), current_timestamp(), etc.) use the local system clock with no check whatsoever; see the snippet below.
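
To make the distinction concrete (cluster_logical_timestamp() is the experimental function I mentioned earlier; don’t rely on it in production):

-- pg-compatible functions read the node's system clock; the value can
-- jump backward if the system time is changed.
SELECT statement_timestamp(), now();

-- The HLC is only reachable through an experimental function; using it
-- disables optimizations and increases the chance of retries.
SELECT cluster_logical_timestamp();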

The system clock could go backward and CockroachDB would still work fine and resolve transactions consistently. What I’m saying is that you should not base any assumptions on what the system clock of a CockroachDB node returns to your client. It has no relationship to the HLC used for transaction ordering and serializability.

Maybe there’s a case to be made that we need a new feature in CockroachDB, to expose the HLC to clients via SQL built-in functions, so that clients can do grey magic of the sort you suggest. But that’s not yet supported.


(Raphael 'kena' Poss) #6

Also, regarding your application use case: CockroachDB supports “time travel queries”, where you can query your data as of arbitrary points in the past. There’s no need to maintain your own update timestamps.
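
For example, against your data table (a sketch; check which AS OF SYSTEM TIME argument forms your version supports):

-- Read the table as it was 10 minutes ago.
SELECT id, data FROM data AS OF SYSTEM TIME '-10m' WHERE id = '1';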


(Ivan Dubrov) #7

Yes, this is the part where I made some assumptions that seem to be wrong.

Thanks for the clarification.

That seems to me like a useful thing to have. However, the way we use CockroachDB is probably a bit atypical :smiley:

Something similar to Spanner’s commit timestamp or FoundationDB versionstamps would be extremely useful for us.

Right, but that would only work for the last 25 hours (with the default configuration)?

I also need to be able to reference older versions via some key and query history as a list of revisions (both across all the resources I have and for an individual resource). I don’t see how I can get around maintaining my own list of revisions.


(Raphael 'kena' Poss) #8

Fair enough, but if you need to query histories as a list of revisions, why does the time ordering of these matter? Aren’t your revisions numbered in a way that makes sense to your app?


(Ivan Dubrov) #9

We need to be able to query them by timestamp (mandated by the API we implement), including across all of the resources. That has all kinds of issues associated with the concept of time in a distributed system, but it’s what we have to work with.

So, my current design is the following:

  1. We use timestamps for ordering revisions. We DO NOT give any guarantees on linearizability across different resources (so even if you update A earlier than B, the timestamp of A’s revision could still be greater than B’s). That’s not great, but it’s okay because 1) we cannot do better without sacrificing performance and 2) resources are largely independent.
  2. We DO guarantee monotonicity within the same resource (every new history revision we record has a timestamp strictly greater than the previous one). The reason I want this guarantee is that out-of-order revisions for a single resource would be very surprising behavior, and since we are already updating a single row, we might as well use it for linearizing (with that GREATEST timestamp trick).
  3. We allow querying by timestamp, with the caveat that time is very approximate.

Perhaps it would have been possible to design it differently, but my requirements are:

  1. Revisions should be relatable to real time (although not perfectly).
  2. Revisions should be monotonic at least within a single resource.

So it seems like using timestamps is okay? If I had a way to generate a truly monotonic counter without sacrificing concurrency (at the moment it is pure speculation on my part that a single “sequence” table would kill concurrency, but I think it would), I would go with that and find a different way to satisfy #1 (making it relatable to time).

The reason I was thinking of removing that extra bit of logic was to make the query a bit simpler (and maybe a bit faster, too), but if I can’t, that’s okay; I’ll keep it as it is now.


(Raphael 'kena' Poss) #10

What you could consider doing is tagging every revision with some kind of unique ID (either random/UUID or from a sequence), then asynchronously, in a separate process, constructing a mapping from these IDs to timestamps.

You could achieve this by periodically sampling the database with AS OF SYSTEM TIME and collecting the first ASOT timestamp at which a particular ID is reported (using binary search over “busy” periods to distinguish when multiple IDs appeared in the same period); see the sketch below.
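
A sketch of the idea, with a hypothetical rev_id column and an arbitrary example timestamp:

-- Hypothetical unique-ID column, populated on every history insert.
ALTER TABLE data_history ADD COLUMN rev_id UUID DEFAULT gen_random_uuid();

-- Periodic sampler: which revisions were visible at this historical
-- timestamp? An ID seen here for the first time maps to this timestamp.
SELECT rev_id
FROM data_history AS OF SYSTEM TIME '2019-06-01 12:00:00'
WHERE id = '123';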

This would lower the amount of contention on your “main” write queries, at the expense of an additional join on your time-based historical lookups.