Speed of a select on primary key

Hi,

I have a 5-node cluster with one database of 4 tables. The tables are really simple and structured like this:

CREATE TABLE title (
      my_id STRING(20) NOT NULL,
      value STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (my_id ASC),
      FAMILY "primary" (my_id, value)
)

CREATE TABLE list (
      my_id STRING(20) NOT NULL,
      value STRING[] NULL,
      CONSTRAINT "primary" PRIMARY KEY (my_id ASC),
      FAMILY "primary" (my_id, value)
)

CREATE TABLE abstract (
      my_id STRING(20) NOT NULL,
      value STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (my_id ASC),
      FAMILY "primary" (my_id, value)
)

CREATE TABLE body (
      my_id STRING(20) NOT NULL,
      value STRING NULL,
      CONSTRAINT "primary" PRIMARY KEY (my_id ASC),
      FAMILY "primary" (my_id, value)
)

(notice the STRING[] type in the list table). Basically I use the 4 tables to store separate sections of documents, all of which have the same structure:

  • title section is a string usually no longer than ~200 characters

  • list section is a list of sentences, usually no more than 10 sentences, each one about 3000 characters long.

  • body section is a string of very variable length, ranging from ~20,000 to ~200,000 characters or more

  • abstract section is a string usually no longer than ~1000 characters

The documents I store usually have an abstract and a title, and far less often a body and a list. Out of something like 100,000,000 documents, I have roughly 100,000,000 titles, 80,000,000 abstracts, 15,000,000 bodies and 15,000,000 lists.

I wanted to extract all of the keys from every table, using a single query per table, so I proceeded to select my_id from <title|abstract|list|body>. What I noticed is counterintuitive: the tables with far higher cardinality are the fastest ones to dump all the keys. It takes hours to extract all of the keys from the title and abstract tables, half a day to extract them from list, and as I’m writing, a query launched a day ago to extract all of the keys from body is still running.
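That is, one query per table:

SELECT my_id FROM title;
SELECT my_id FROM abstract;
SELECT my_id FROM list;
SELECT my_id FROM body;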

So, what’s happening? I have two hypotheses:

  1. the index is too “sparse”, in the sense that the documents that have a value in body don’t have contiguous primary key values, and since the query wants to scan the whole keyspace, it has to deal with holes

  2. the size of the second column (value) has some kind of impact on how the keys are stored in the SSTable, and since the lengths of these values are far from constant, the process is slowed down.

This probably also affects aggregate functions like count(my_id), but since I knew those were not very feasible options, I didn’t mind. My concern here is that I’m probably missing something about how RocksDB and CRDB cooperate to serve simple queries like this, and how this could affect more interesting queries, like performing a select to retrieve the body of a bunch of documents.

Hi @whattheheck—great question! Your second hypothesis is spot on. What matters far more than the cardinality of the table is the amount of data in the index.

Let’s take your title table, and assume it has ID 51. The keys that get stored in RocksDB will look, roughly, like this:

/Table/51/1/{my_id} → {title}

The /1/ between the table ID and the primary key value, {my_id}, is the ID of the primary index. Your other tables have a similar structure and thus a very similar key encoding.

These key-value pairs get packed tightly into SSTs:

/Table/51/1/1 → "The C Programming Language"
/Table/51/1/2 → "The Art of Computer Programming"
/Table/51/1/3 → "The Mythical Man-Month"
...

There’s thus no trouble with sparsity; IDs 50 and 500000000 will be packed right next to each other if there are no intervening IDs. The problem is that even if you’re only interested in the primary key, you’re stuck scanning over all the values.

Assuming every character can be represented in one byte, this is the approximate size of each table on disk, based on the figures you provided:

  • title: 200B × 100 million rows = 20GB
  • abstract: 1000B × 80 million rows = 80GB
  • list: 3000B × 10 sentences × 15 million rows = 450GB
  • body: 200KB × 15 million rows = 3TB

So it’s no wonder that scanning body takes as long as it does! It’s two orders of magnitude larger than the title table.

You can speed up reads, at the cost of making writes more expensive, by creating a secondary index on the my_id column. The title table schema, for example, would look like this:

CREATE TABLE title (
      my_id STRING(20) NOT NULL PRIMARY KEY,
      value STRING NULL,
      INDEX (my_id)
)
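If you’d rather not recreate your existing tables, you can add the index in place instead; for the title table, for example:

CREATE INDEX ON title (my_id);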

Secondary indexes map the index key to the primary index key:

/Table/51/2/1 → 1
/Table/51/2/2 → 2
/Table/51/2/3 → 3
...

Since the index entries are small, with none of the document text in them, scanning this index will be very quick. And, at least in the upcoming v2.1, our optimizer is smart enough to automatically use the secondary index to satisfy a SELECT count(*). I’m not sure if v2.0 does the same, but if not, you can force a query to use a particular index, as shown below.
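For example, assuming the index from the schema above was auto-named title_my_id_idx (SHOW INDEXES FROM title will show the actual name), you could pin the query to it like this:

SELECT count(*) FROM title@title_my_id_idx;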

Hi @benesch,

thanks for the explanation and for the great advice, everything makes much more sense now. This got me thinking about performance on the body table in simpler cases; for example, a query like select value from body where my_id = <id>, which I noticed can in certain cases perform very poorly, taking tens of seconds to succeed. Reading a bit more about primary keys and secondary indexes, I now suppose that a secondary index on my_id could help it perform better. Am I right?

Another question: how much space is consumed by a secondary index on a STRING(20) used as a primary key? How can I estimate a rough number? Thanks again.

a query like select value from body where my_id = <id>, which I noticed can in certain cases perform very poorly, taking tens of seconds to succeed. Reading a bit more about primary keys and secondary indexes, I now suppose that a secondary index on my_id could help it perform better. Am I right?

That query should never take tens of seconds to succeed, actually! While it’s slow to scan through a primary index with large values, it’s still quite fast to seek to a particular location within it.
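You can check this with EXPLAIN, which should show a scan over a single key span of the primary index (exact output varies by version; 'some-doc-id' is a placeholder for one of your IDs):

EXPLAIN SELECT value FROM body WHERE my_id = 'some-doc-id';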

Are you running that query after a long period of inactivity, by chance? We quiesce inactive ranges, so it can sometimes take several seconds to warm things back up. I can give you some workarounds if that is indeed the case.

I wouldn’t expect a secondary index to speed up a point query like that. A secondary index would be strictly worse, in fact, as you’d have to do two lookups: one in the secondary index to get the primary key ID, and then another into the primary index to get the value. In this case you already have the primary key ID, so you should go straight to the primary index.

Another question: how much space is consumed by a secondary index on a STRING(20) used as a primary key? How can I estimate a rough number? Thanks again.

It’s hard to say, as there are quite a few layers involved. SSTs have some overhead, the LSM tree adds additional space amplification, and CockroachDB adds a several-byte table prefix and a 12-byte timestamp to each key. But we also use key-prefix and Snappy compression, which cancels out some of the overhead. If you’re only looking for an order of magnitude, you can use the raw data size directly: the amount of space consumed by a secondary index on a string column will be roughly the average size of the string × the number of rows in the index.
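For example, a secondary index on your body table’s my_id column (at most 20 bytes per key, 15 million rows) comes to roughly 20B × 15 million rows ≈ 300MB of raw data.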

Note that applying a maximum size (e.g., STRING(20) instead of STRING) doesn’t save any space in CockroachDB. Strings always use a variable-length encoding.

Well, yes and no. I have to perform several queries, usually sparse over the ranges, and they are single-shot (I usually retrieve a specific value once in a while). Suggestions are very welcome.

I see, thanks.

Ah, as long as you’re touching the table with a query every few minutes, you’ll keep the lease up-to-date.
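As a sketch, any cheap read against the table will do; for example, run something like this every few minutes (note that leases are per-range, so this only keeps warm the ranges it actually touches):

SELECT my_id FROM body LIMIT 1;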

If this reproduces easily, it would be extremely helpful to see a “trace” of query execution. Instructions are here. Basically, you run SHOW TRACE FOR <query>, and CockroachDB will show where it spends time while executing the query.
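For example, to trace the slow point query from before (again with 'some-doc-id' standing in for one of your document IDs):

SHOW TRACE FOR SELECT value FROM body WHERE my_id = 'some-doc-id';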