Performance considerations on multiple unions vs multiple or clauses

I admit to being a bit lazy and not empirically validating some assumptions here, but I am curious if there’s a simple answer to: how many compound or clauses will the optimizer unroll into index/PK lookups? And what the general performance practices are around choosing when to union to hit the PK vs an in clause.

Say I have the following table:

CREATE TABLE thing_labels (
  id UUID NOT NULL PRIMARY KEY,
  key STRING NOT NULL,
  value STRING NOT NULL,
  thing_id UUID NOT NULL,
  -- notice this is not a unique index
  INDEX (key, value)
);

In general which is the better query on disjointed label lookup? Is it:

SELECT thing_id FROM thing_labels
WHERE (key = 'fookey' AND value = 'foovalue') OR
      (key = 'barkey' AND value = 'barvalue') OR
      (key = 'bazkey' AND value = 'bazvalue')

Or is it:

SELECT thing_id FROM thing_labels WHERE key = 'fookey' AND value = 'foovalue'
UNION ALL
SELECT thing_id FROM thing_labels WHERE key = 'barkey' AND value = 'barvalue'
UNION ALL
SELECT thing_id FROM thing_labels WHERE key = 'bazkey' AND value = 'bazvalue'

Or might the optimizer convert them to the same thing? Does that hold no matter the key/value lookup count I have? I would like a single query to be atomic w/out explicit transaction and client-side retry. My use case doesn’t support inverted index for this use case.

I am wondering if there is general best practice here? E.g. “union is better because you’ll always hit the indexes no matter how many key/values you match” or “OR clauses are better because union does an extra expensive thing” or “the cost-based optimizer is not predictable here and it really depends on table size and key/value lookup count, and there isn’t a specific number for when it switches to scanning vs index use”.

Thanks!

In this case the optimizer does not generate the same query plan. You can see the query plan by prepending EXPLAIN to each query. On v20.2.8:

  1. OR clauses:
EXPLAIN
SELECT thing_id FROM thing_labels
WHERE (key = 'fookey' AND value = 'foovalue') OR
      (key = 'barkey' AND value = 'barvalue') OR
      (key = 'bazkey' AND value = 'bazvalue')
;
     tree    |     field     |
          description
-------------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------
             | distribution  | local
             | vectorized    | false
  index join |               |
   │         | table         | thing_labels@primary
   └── scan  |               |
             | missing stats |
             | table         | thing_labels@thing_labels_key_value_idx
             | spans         | [/'barkey'/'barvalue' - /'barkey'/'barvalue'] [/'bazkey'/'bazvalue' - /'bazkey'/'bazvalue'] [/'fookey'/'foovalue' - /'fookey'/'foovalue']
  1. UNION ALL:
EXPLAIN
SELECT thing_id FROM thing_labels WHERE key = 'fookey' AND value = 'foovalue'
UNION ALL
SELECT thing_id FROM thing_labels WHERE key = 'barkey' AND value = 'barvalue'
UNION ALL
SELECT thing_id FROM thing_labels WHERE key = 'bazkey' AND value = 'bazvalue'
                             -> ;
          tree         |     field     |                  description
-----------------------+---------------+------------------------------------------------
                       | distribution  | local
                       | vectorized    | false
  union all            |               |
   ├── union all       |               |
   │    ├── index join |               |
   │    │    │         | table         | thing_labels@primary
   │    │    └── scan  |               |
   │    │              | missing stats |
   │    │              | table         | thing_labels@thing_labels_key_value_idx
   │    │              | spans         | [/'fookey'/'foovalue' - /'fookey'/'foovalue']
   │    └── index join |               |
   │         │         | table         | thing_labels@primary
   │         └── scan  |               |
   │                   | missing stats |
   │                   | table         | thing_labels@thing_labels_key_value_idx
   │                   | spans         | [/'barkey'/'barvalue' - /'barkey'/'barvalue']
   └── index join      |               |
        │              | table         | thing_labels@primary
        └── scan       |               |
                       | missing stats |
                       | table         | thing_labels@thing_labels_key_value_idx
                       | spans         | [/'bazkey'/'bazvalue' - /'bazkey'/'bazvalue']

The first query is simpler and more efficient. It scans only three values in the index. That being said, I wouldn’t expect the second query to be orders of magnitude slower because it scans the same number of rows in the index. It’s less efficient due to the overhead from the addition index join and union all operators.

In general, I’d expect queries with OR clauses to always be more efficient, as long as each OR clause is constraining the same indexed columns. Keep in mind that if the number of OR clauses grew such that an appreciable percentage of the table was scanned, the optimizer might decide that a full table scan is more performant. I’d only expect this to happen with a high number of OR clauses and few rows in the table.

1 Like

Keep in mind that if the number of OR clauses grew such that an appreciable percentage of the table was scanned, the optimizer might decide that a full table scan is more performant.

Yup, saw this on 21.x for a small table, which is fine and makes sense.

In general, I’d expect queries with OR clauses to always be more efficient, as long as each OR clause is constraining the same indexed columns

Thanks! So my big question then (ignoring that the optimizer may determine % of index covers deserves full scan): is there any reasonable/theoretical limit in the code to the number of OR clauses the optimizer won’t unroll?

So my big question then (ignoring that the optimizer may determine % of index covers deserves full scan): is there any reasonable/theoretical limit in the code to the number of OR clauses the optimizer won’t unroll?

No, there is no such limit.

I imagine there is some practical limit on the size of a single SQL statement, but I don’t know of any hard limit imposed by CockroachDB. And if you hit such a limit, converting OR clauses to UNIONs wouldn’t help you anyway.

1 Like