Unique_rowid() monotonicity question

Hello all,

I have a question about the monotonicity of the unique_rowid() function. The documentation on SERIAL and unique_rowid implies that it only increases (with gaps). However, it seems that the implementation uses the node id for bits 0-14 and a timestamp for bits 15-62, leaving the final top bit unset.

It looks like there is a check to make sure the timestamp portion only increases on a given node, but is it not possible for row ids for the same table generated on different nodes to go backward, given that their clocks may not be perfectly synced?

It seems like the probability of a collision is very low, but I’m wondering if I can depend on the monotonicity for my application. I’d like to avoid sequences if possible due to the coordination overhead, so any help here would be great. Have really enjoyed using cockroach thus far and want to make sure I’m not misunderstanding something here.



Thank you for your question.
The short answer is no, unique_rowid() is not guaranteed to be monotonic. This is suggested in https://www.cockroachlabs.com/docs/stable/functions-and-operators.html#id-generation-functions – the documentation does not mention monotonicity.

Even SQL sequences (now supported in CockroachDB) do not guarantee monotonicity either: if a transaction A takes a first value X of a sequence and then another transaction B takes another value Y of the sequence, even if Y > X maybe A will be forbidden to commit before B does (because of transaction conflicts) so the observed order of X and Y will be inverted.

(This property of sequences is a general property of optimistic concurrency control and the SERIALIZABLE isolation level, and not specific to CockroachDB)

The only way to guarantee monotonicity is to use a SQL table CREATE TABLE tcounter(c INT); INSERT INTO tcounter VALUES (0); and obtain your sequencial IDs like this:

UPDATE tcounter SET c = c + 1 RETURNING c; -- extract new counter value in client
INSERT INTO yourtable(idcol, ...) VALUES (...value_obtained_above..., ...);

However as you can expect this will cause some contention on the tcounter table as the number of concurrent clients increases.

There are better schemes (e.g. pre-allocate multiple counter values at once per client connection) but this moves into the realm of “advanced” SQL usage.

Can you perhaps explain your use case a little bit more?

Got it. It seems that SQL sequences will at least be monotonic if there aren’t concurrent writes, whereas it would be possible for two sequential writes initiated from the same client process to give SERIAL id values that go backward if they spanned cockroach nodes. Is that accurate?

My goal is essentially to support append only lists keyed off some string. This would provide the interface:

put(key string, value string) -> Cursor   // append value to list for key and return position in list
get(key string, after Cursor) -> []string  // get all values for key after given position 

Because the monotonicity of the cursor is only necessary within a key, if inserts on a given key are infrequent enough, using a serial id for the cursor seems like it would be good enough:

    id SERIAL,
    key STRING(256),
    value STRING,
    PRIMARY KEY (key, id)


INSERT INTO items (key, value) VALUES ( 'somekey' , 'some value' ) RETURNING id


SELECT id, value FROM items WHERE key = 'somekey' AND id > afterval ORDER BY id

However if a single client process were to make successive put calls fast enough, it seems possible that the cursor values could go backward if using SERIAL due to the timestamp concerns. Perhaps it would be better to use your tcounter example with a counter per key, in which case there would only be contention on frequent writes to the same key. Or maybe there’s another mechanism I’m missing.

Again any feedback would be great, and I really appreciate the help here!

I think your problem statement is still incomplete.

What makes two values of Cursor differ after two calls to put?

If I do the following:

begin; x := put('a', 'b'); commit;  // in client A
begin; y := put('a', 'c');  commit; // in client B

s = get('a', x); // in client C
t = get('a', y); // in client D

What values do you expect for s and t?

For concurrent writes on the same key by different clients, there would be no guarantees on the resulting ordering. So in the given case there should be two possible outcomes:

s = ['c'], t = [] (client A’s write resulted in a lower cursor)
s = [], t = ['b'] (client B’s write resulted in a lower cursor)

(get returns all items after the given cursor)

However, if both put calls were executed successively from the same client, there should be only one valid result, which would be:

s = ['c'], t = []

My concern is that using a SERIAL id for the cursor may not guarantee this

Ok we’re narrowing this down now.

Now consider a single client, running sequentially, but with 2 connections open to the same server.

Client does the following, in the following order:

  • using conn 1, opens a new transaction:
  • using conn 1, runs put, gets cursor X
  • using conn 1, commits
  • using conn 2, open a new txn;
  • using conn 2, runs put, gets cursor Y
  • using conn 2, commits

Do you expect given this order that X<Y always? X>Y always? or are you ok if it’s not deterministic?

  • using conn 1, opens a new transaction:
  • using conn 2, open a new txn;
  • using conn 1, runs put, gets cursor X
  • using conn 2, runs put, gets cursor Y
  • using conn 1, commits
  • using conn 2, commits

Do you expect given this order that X<Y always? X>Y always? or are you ok if it’s not deterministic?

What of the following:

  • using conn 1, opens a new transaction:
  • using conn 2, open a new txn;
  • using conn 1, runs put, gets cursor X
  • using conn 2, runs put, gets cursor Y
  • using conn 2, commits
  • using conn 1, commits (note: conn 2 commits before conn 1)

Do you expect given this order that X<Y always? X>Y always? or are you ok if it’s not deterministic?

I think I’d treat a client simultaneously starting transactions on separate connections as separate clients functionally. So:

Case 1: X<Y
Case 2: Non-deterministic
Case 3: Non-deterministic

Appreciate the thoroughness here!

Ok so what you really want is a causality token, to chain the commit time of one txn with the begin time of another (within 1 client or multiple clients).

There is no difference between one or multiple clients btw, because perhaps you will lose a db node, or the network link to it, in between the two txns, and you’ll need to start a txn to another server. Without a causality token, you cannot rely on time to guarantee that a commit made on server A “precedes” a begin on server B for the purpose of your operator (there is no global order!).

Ok so you need a causality token. You are in luck, we have started to work on that in CockroachDB 2.0. The feature is still experimental (and thus subject to change), but you can get your token with the built-in function cluster_logical_timestamp(). It has type DECIMAL.

How this works:

  • in the put txn, obtain/store the value of cluster_logical_timestamp(). This will be your “Cursor”.

  • in the get(after) txn, access your value with "select ... where cursor > after".

For maximum scalability, you should have a primary key on (key, cursor) to ensure that the table is evenly distributed over the key value space.

Also be very sure that you have implemented the client-directed retry protocol documented at https://www.cockroachlabs.com/docs/stable/transactions.html#client-side-intervention – the moment your client starts to use cluster_logical_timestamp() you’ll start getting more frequent retry errors.

Please let me know if you can move forward with that.
Again remember this feature is experimental and the API may change in a future version.

Note that you will not be guaranteed that two unrelated separate transactions see cluster_logical_timestamp() ordered monotonically. The function is “special” because its final value depends on the other things the transaction and the cluster at large does. If this special function discovers that the value your client “saw” for it was wrong by the time the txn commits, it will force the txn to retry (and report a retry error to your client) to indicate that the value reported earlier was wrong.

Got it. While that seems useful, if possible I’d rather not rely on an experimental API for this service. I also expect concurrent writes to the same key to be fairly rare in which case it seems like using cluster_logical_timestamp for all writes could unnecessarily increase the chance of retry errors.

Would it instead be possible to use a sequence per key? Or manually maintain a counter per key as described earlier? I assume the downside here would be slower puts, but that could be preferable to frequent retry errors

Sequences will not get you monotonicity, as I explained initially.

One counter per key would be the way to go for that. The puts would indeed be slower, but at least the performance will be scalable across the keys.

Great. Appreciate all the help!

1 Like


As a follow up here, I was wondering if it’s possible to do the monotonically increasing column without needing a separate table at all. For the table:

    key STRING(256),
    value STRING,
    position INT NOT NULL,
    PRIMARY KEY (key, position)

would the following insert be able to guarantee that (key, position) is never duplicated?

INSERT INTO itmes (key, value, position)
VALUES ('some key', 'some val', (
    WITH p AS (SELECT MAX(position) FROM items where key = 'some key')
    SELECT (IFNULL(max, 0) + 1) FROM p
ON CONFLICT (key, position)
DO UPDATE SET position = items.position + 1 RETURNING position

If I understand correctly, this seems like it could result in a unique condition on (key, position) failing if two concurrent inserts for the same key were to run. However, the docs seem to say that single SQL statements are run as transactions, in which case would this result in a retry automatically?

The intention with this is to avoid the possible round trip to another node to deal with a separate table maintaining the position values.

Any help would be very much appreciated. Thanks!

This will work like you say if there is an index on position. otherwise you’d get an inefficient table scan

Right. And in the case of a collision on (key, position), is that something cockroach will retry automatically or will it be up to the client?

Also if there is a good section in the docs that may clarify, that’d be helpful as well. Thanks again for the help

Given the statement I think it will be retried automatically. Not sure though, you could try it and see what happens.

Ok, will do. Thanks for all the help, I really appreciate it!

1 Like