Best practice for UUID primary keys

Hi,

Are there any best practices around using a UUID field for a primary key vs using a serial id (id SERIAL PRIMARY KEY NOT NULL) and just adding the UUID (ulid STRING UNIQUE NOT NULL) as a field with a separate index?

The former seems like it might yield very large indexes.

Ciao
Charl

Here are the tables definitions we used for testing this.

UUID field for a primary key:

CREATE TABLE tasks_ulid (
ulid STRING PRIMARY KEY NOT NULL, 
data JSONB,
created_at TIMESTAMP, 
completed_at TIMESTAMP, 
INDEX created_at_idx (created_at DESC), 
INDEX completed_at_idx (completed_at DESC), 
INVERTED INDEX detail_idx (data)
);

Serial id:

CREATE TABLE tasks (
id SERIAL PRIMARY KEY NOT NULL, 
ulid STRING UNIQUE NOT NULL, 
data JSONB, 
created_at TIMESTAMP, 
completed_at TIMESTAMP, 
UNIQUE INDEX ulid_idx (ulid ASC), 
INDEX created_at_idx (created_at DESC), 
INDEX completed_at_idx (completed_at DESC), 
INVERTED INDEX detail_idx (data)
);

We then inserted a bunch of rows into each and it looks like tasks_ulid is ~4x larger for the same amount of rows than tasks.

Try making ulid a UUID instead of string type: uuids are 128 bit numbers, so they should be able to be stored in no more than 16 bytes, but the usual string representation is 36 characters, which likely explains part of that difference. Since the PK value has to be stored in the index entries of all the other indexes as well, that could add up.

That said, SERIAL’s 8 bytes is still going to be smaller than even an ideally encoded UUID‘s 16.

Additionally rows with similar primary key values use a special form of key compression, storing just the difference from the previous key. With SERIAL’s ordered-ish key generation, that might be more effective than uuid’s uniform randomness?

We use ULIDs for the UUID and in our testing we noticed the following:

Rows tasks (SERIAL) tasks_ulid tasks_uuid (gen_random_uuid())
100 11.5 KiB 45.4 KiB 287 KiB
1000 561.6 KiB 452.2 KiB 1000 KiB
10000 4000 KiB 3400 KiB 6200 KiB

Observationally the UUID as primary key works better over larger data sets from a storage perspective. I am not too sure what this would mean for query performance though.

Interesting! I hadn’t seen this uuid-scheme before, but since it is UUID-compatible, I think you could still use a UUID-typed column to get the smaller storage size (compared to string). It is actually not entirely unlike our serial generation logic, which is also somewhat time-based (along with the generating node’s ID).

The difference in those numbers (when using the same column type I assume), is pretty interesting though: I’m guessing that at least partially comes down to compression – the key-delta compression and just the block-level snappy compression employed by our storage layer (rocksdb), both of which would obviously can’t do much with random keys but can probably do quite a bit on ordered-ish keys.

One final bit of feedback. Here’s the outlook at 100000 records:

Rows tasks (SERIAL) tasks_ulid tasks_uuid (gen_random_uuid())
100000 47MiB 35.1MiB 72.7 MiB

Observations:

  • UUID-typed column primary keys (tasks_uuid table) perform really badly the larger your data grows.
  • SERIAL int primary keys (tasks table) perform second best.
  • ULID string column primary keys (tasks_ulid) perform the best.

As you mentioned I think the monotonic nature of ULID indexes provide massive opportunity for compression.

1 Like

I’m reviving this interesting conversation in order to hopefully get more insight from @charl on what you mean by “UUID perform really badly the larger your data grows”.

By performance, do you storage performance (it takes up exponentially more space as the data grows?), or do you mean query performance (queries get slower and slower the more rows you have?).

I have to use a string representation of a UUID and can’t store it as a UUID type in the DB due to outside circumstances. So this is quite relevant to my case.

Performance here was related to table size (on disk) for the different UUID types mentioned.

We had not done any thorough testing of real world queries using the different UUID schemes.

We use ULIDs that generate monotonic unique UUIDs that have the property of being sortable. We store these as a STRING in our tables and based on the comment from @david we are likely benefitting from:

"Additionally rows with similar primary key values use a special form of key compression, storing just the difference from the previous key. " .