High costs for relatively simple query

Hello roachers,

I’m experiencing a strange costs problem. At the core, there are campaigns (1) - (*) vouchers (1) - (*) redemptions and I want calculate for every campaign, how many vouchers has at least one redemption in state “locked” or “redeemed”. Relevant pieces:

show create loyalty_voucher_voucher;
        table_name        |                                                      create_statement
--------------------------+------------------------------------------------------------------------------------------------------------------------------
  loyalty_voucher_voucher | CREATE TABLE loyalty_voucher_voucher (
                          |     id UUID NOT NULL DEFAULT gen_random_uuid(),
                          |     code STRING NOT NULL,
                          |     campaignid UUID NOT NULL,
[...]
                          |     CONSTRAINT "primary" PRIMARY KEY (id ASC),
                          |     CONSTRAINT loyalty_voucher_voucher_campaignid_foreign FOREIGN KEY (campaignid) REFERENCES loyalty_voucher_campaign(id),
                          |     UNIQUE INDEX loyalty_voucher_voucher_code_unique (code ASC),
                          |     INDEX loyalty_voucher_voucher_auto_index_loyalty_voucher_voucher_campaignid_foreign (campaignid ASC),
                          |     INDEX ix_loyalty_voucher_voucher__anton_code_campaignid (code ASC, campaignid ASC),
                          |     FAMILY "primary" (id, [...], campaignid)
                          | )

show create loyalty_voucher_redemption;
          table_name         |                                                         create_statement
-----------------------------+------------------------------------------------------------------------------------------------------------------------------------
  loyalty_voucher_redemption | CREATE TABLE loyalty_voucher_redemption (
                             |     id UUID NOT NULL DEFAULT gen_random_uuid(),
[...]
                             |     status STRING NOT NULL,
                             |     vouchercode STRING NOT NULL,
                             |     CONSTRAINT "primary" PRIMARY KEY (id ASC),
                             |     CONSTRAINT loyalty_voucher_redemption_vouchercode_foreign FOREIGN KEY (vouchercode) REFERENCES loyalty_voucher_voucher(code),
                             |     INDEX ix_loyalty_voucher_redemption__status (status ASC),
                             |     INDEX loyalty_voucher_redemption_auto_index_loyalty_voucher_redemption_vouchercode_foreign (vouchercode ASC),
                             |     INDEX ix_loyalty_voucher_redemption__anton_status_vouchercode (status ASC, vouchercode ASC),
                             |     FAMILY "primary" (id, [...], status, vouchercode),
                             |     CONSTRAINT check_status CHECK (status IN ('available':::STRING, 'locked':::STRING, 'redeemed':::STRING))
                             | )

The most efficient query I came with produces following plan:

explain 
SELECT
        campaignid,
        COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
        SELECT 1 FROM loyalty_voucher_redemption
        WHERE
                loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
                AND loyalty_voucher_redemption.status IN (
                        'redeemed',
                        'locked')
)
GROUP BY campaignid;
              tree              |         field         |                                    description
--------------------------------+-----------------------+-------------------------------------------------------------------------------------
                                | distributed           | true
                                | vectorized            | false
  group                         |                       |
   │                            | aggregate 0           | campaignid
   │                            | aggregate 1           | count_rows()
   │                            | group by              | campaignid
   └── render                   |                       |
        └── lookup-join         |                       |
             │                  | table                 | loyalty_voucher_voucher@ix_loyalty_voucher_voucher__anton_code_campaignid
             │                  | type                  | inner
             │                  | equality              | (vouchercode) = (code)
             │                  | equality cols are key |
             │                  | parallel              |
             └── distinct       |                       |
                  │             | distinct on           | vouchercode
                  └── render    |                       |
                       └── scan |                       |
                                | table                 | loyalty_voucher_redemption@ix_loyalty_voucher_redemption__anton_status_vouchercode
                                | spans                 | /"locked"-/"locked"/PrefixEnd /"redeemed"-/"redeemed"/PrefixEnd


explain (opt, verbose)
SELECT
        campaignid,
        COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
        SELECT 1 FROM loyalty_voucher_redemption
        WHERE
                loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
                AND loyalty_voucher_redemption.status IN (
                        'redeemed',
                        'locked')
)
GROUP BY campaignid;
                                                     text
---------------------------------------------------------------------------------------------------------------
  group-by
   ├── columns: campaignid:7 vouchersredeemedcount:19
   ├── grouping columns: campaignid:7
   ├── stats: [rows=12.4293442, distinct(7)=12.4293442, null(7)=0]
   ├── cost: 211277.991
   ├── key: (7)
   ├── fd: (7)-->(19)
   ├── prune: (19)
   ├── project
   │    ├── columns: code:2 campaignid:7
   │    ├── stats: [rows=21.0844401, distinct(2)=21, null(2)=0, distinct(7)=12.4293442, null(7)=0]
   │    ├── cost: 211277.214
   │    ├── key: (2)
   │    ├── fd: (2)-->(7)
   │    ├── prune: (7)
   │    ├── interesting orderings: (+2,+7) (+7)
   │    └── inner-join (lookup loyalty_voucher_voucher@ix_loyalty_voucher_voucher__anton_code_campaignid)
   │         ├── columns: code:2 campaignid:7 vouchercode:16
   │         ├── key columns: [16] = [2]
   │         ├── lookup columns are key
   │         ├── stats: [rows=21.0844401, distinct(2)=21, null(2)=0, distinct(16)=21, null(16)=0]
   │         ├── cost: 211276.993
   │         ├── key: (16)
   │         ├── fd: (2)-->(7), (2)==(16), (16)==(2)
   │         ├── interesting orderings: (+2,+7) (+7)
   │         ├── distinct-on
   │         │    ├── columns: vouchercode:16
   │         │    ├── grouping columns: vouchercode:16
   │         │    ├── stats: [rows=21, distinct(16)=21, null(16)=0]
   │         │    ├── cost: 211149.76
   │         │    ├── key: (16)
   │         │    └── scan loyalty_voucher_redemption@ix_loyalty_voucher_redemption__anton_status_vouchercode
   │         │         ├── columns: status:13 vouchercode:16
   │         │         ├── constraint: /13/16/9
   │         │         │    ├── [/'locked' - /'locked']
   │         │         │    └── [/'redeemed' - /'redeemed']
   │         │         ├── stats: [rows=197336, distinct(13)=2, null(13)=0, distinct(16)=21, null(16)=0]
   │         │         │   histogram(13)=  0 1.9732e+05 0      19
   │         │         │                 <--- 'locked' --- 'redeemed'
   │         │         ├── cost: 207202.81
   │         │         ├── prune: (16)
   │         │         └── interesting orderings: (+13,+16) (+16)
   │         └── filters (true)
   └── aggregations
        └── count-rows [as=count_rows:19]

Usually, the exists check is very optimized in postgres. I’ve tried with following queries:

SELECT campaignid, count(*) FROM loyalty_voucher_redemption
JOIN loyalty_voucher_voucher
ON loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
WHERE
	loyalty_voucher_redemption.status IN (
		'redeemed',
		'locked')
GROUP BY campaignid
;

and

SELECT
	campaignid,
	(
		SELECT count(*) FROM loyalty_voucher_redemption
		WHERE
			loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
			AND loyalty_voucher_redemption.status IN (
				'redeemed',
				'locked')
	)
FROM loyalty_voucher_voucher
;

Be aware that this query is a part from a bigger one with limit 100. And a little bit outdated CRDB version is used: cockroachdb/cockroach:v20.1.9 .

Some statistics:

select count(*) from loyalty_voucher_campaign;
  count
---------
     18

select count(*) from loyalty_voucher_voucher;
  count
---------
   8240

select status, count(*) from loyalty_voucher_redemption group by status;
   status  | count
-----------+---------
  locked   | 197330
  redeemed |      6

select count(*) from loyalty_voucher_redemption;
  count
----------
  197336

select count(*) from loyalty_voucher_redemption group by vouchercode having count(*) > 1;
  count
---------
     27
     73
      3
      2
     71
     11
  97876
  98868
      2
     46
     45
     47
    112
    146

Execution time of the first query from cli:

SELECT
        COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
        SELECT 1 FROM loyalty_voucher_redemption
        WHERE
                loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
                AND loyalty_voucher_redemption.status IN (
                        'redeemed',
                        'locked')
)
GROUP BY campaignid;
  vouchersredeemedcount
-------------------------
                      1
                      1
                      1
                      1
                      1
                      1
                      1
                      1
                      4
                      9
(10 rows)

Time: 140.392981ms

Any ideas?

Adding limit in inner query leads to better costs, but horrible execution time:

explain (opt, verbose)
SELECT
        campaignid,
        COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
        SELECT 1 FROM loyalty_voucher_redemption
        WHERE
                loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
                AND loyalty_voucher_redemption.status IN (
                        'redeemed',
                        'locked')
        LIMIT 1
)
GROUP BY campaignid;
                                                                                                                                                                                                                                                           text
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  group-by
   ├── columns: campaignid:7 vouchersredeemedcount:19
   ├── grouping columns: campaignid:7
   ├── stats: [rows=18, distinct(7)=18, null(7)=0]
   ├── cost: 11060.3475
   ├── key: (7)
   ├── fd: (7)-->(19)
   ├── prune: (19)
   ├── semi-join-apply
   │    ├── columns: code:2 campaignid:7
   │    ├── stats: [rows=8240, distinct(7)=18, null(7)=0]
   │    ├── cost: 10812.9475
   │    ├── key: (2)
   │    ├── fd: (2)-->(7)
   │    ├── prune: (7)
   │    ├── interesting orderings: (+2,+7) (+7) (+13,+16) (+16)
   │    ├── scan loyalty_voucher_voucher@ix_loyalty_voucher_voucher__anton_code_campaignid
   │    │    ├── columns: code:2 campaignid:7
   │    │    ├── stats: [rows=8240, distinct(7)=18, null(7)=0]
   │    │    │   histogram(7)=  0                    1                     13                   999                    2                   999                    0                   999                    0                   999                    0                   999                    2                   999                    0                    30                    0                   999                    0                   200                    0                   999
   │    │    │                <--- '012e1b1e-189f-40ba-ad66-ce45d4b096d7' ---- '583e7a63-6ad6-4e74-a073-bcc07c1c4f0a' --- '68259708-2c86-47a8-9e45-b18ac6b8c3a1' --- '6e32ccfc-7d1f-48b5-a6be-51446f8cf833' --- '6fe3a069-99c9-4e6e-b1ad-85a778b8420b' --- '8fd8072b-8155-49b4-9f6c-f72f1527aa97' --- 'b8a36b7b-2117-4ea1-8e4d-e3869d76d526' --- 'c4cf7ae8-9591-4831-b491-15d7abf3cb2c' --- 'e4bbbb3f-a665-4c5f-a564-8838ac1c7112' --- 'f03914fe-3a8e-4ea3-8be5-4d162af5eeed' --- 'f147cba9-bd6b-4638-bbaf-2da033981aad'
   │    │    ├── cost: 8652.02
   │    │    ├── key: (2)
   │    │    ├── fd: (2)-->(7)
   │    │    ├── prune: (2,7)
   │    │    └── interesting orderings: (+2,+7) (+7)
   │    ├── limit
   │    │    ├── columns: status:13 vouchercode:16
   │    │    ├── outer: (2)
   │    │    ├── cardinality: [0 - 1]
   │    │    ├── stats: [rows=1]
   │    │    ├── cost: 1975.5
   │    │    ├── key: ()
   │    │    ├── fd: ()-->(13,16)
   │    │    ├── interesting orderings: (+13,+16) (+16)
   │    │    ├── select
   │    │    │    ├── columns: status:13 vouchercode:16
   │    │    │    ├── outer: (2)
   │    │    │    ├── stats: [rows=197336, distinct(2)=1, null(2)=0, distinct(13)=2, null(13)=0]
   │    │    │    │   histogram(13)=  0 1.9732e+05 0      19
   │    │    │    │                 <--- 'locked' --- 'redeemed'
   │    │    │    ├── cost: 1975.48
   │    │    │    ├── fd: ()-->(16)
   │    │    │    ├── limit hint: 1.00
   │    │    │    ├── interesting orderings: (+13,+16) (+16)
   │    │    │    ├── scan loyalty_voucher_redemption@ix_loyalty_voucher_redemption__anton_status_vouchercode
   │    │    │    │    ├── columns: status:13 vouchercode:16
   │    │    │    │    ├── constraint: /13/16/9
   │    │    │    │    │    ├── [/'locked' - /'locked']
   │    │    │    │    │    └── [/'redeemed' - /'redeemed']
   │    │    │    │    ├── stats: [rows=197336, distinct(13)=2, null(13)=0]
   │    │    │    │    │   histogram(13)=  0 1.9732e+05 0      19
   │    │    │    │    │                 <--- 'locked' --- 'redeemed'
   │    │    │    │    ├── cost: 2.11
   │    │    │    │    └── limit hint: 1.00
   │    │    │    └── filters
   │    │    │         └── vouchercode:16 = code:2 [outer=(2,16), constraints=(/2: (/NULL - ]; /16: (/NULL - ]), fd=(2)==(16), (16)==(2)]
   │    │    └── 1
   │    └── filters (true)
   └── aggregations
        └── count-rows [as=count_rows:19]


SELECT
        COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
        SELECT 1 FROM loyalty_voucher_redemption
        WHERE
                loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
                AND loyalty_voucher_redemption.status IN (
                        'redeemed',
                        'locked')
        LIMIT 1
)
GROUP BY campaignid;
  vouchersredeemedcount
-------------------------
                      1
                      4
                      1
                      1
                      1
                      1
                      9
                      1
                      1
                      1
(10 rows)

Time: 8.159681076s

Hello!

Is your main concern with the performance of the query or the cost listed in the EXPLAIN (OPT, VERBOSE) output?

The cost you see in EXPLAIN (OPT, VERBOSE) is used by the optimizer internally to determine the most efficient query plan among many potential query plans. Comparing costs between two different queries doesn’t always make sense. So it’s probably not something to worry about.

I’ll point out that your most efficient query returns different results than the simpler query. I believe the simpler one is actually incorrect.

Efficient:

SELECT campaignid, COUNT(*) as vouchersredeemedcount
FROM loyalty_voucher_voucher
WHERE EXISTS (
  SELECT 1 FROM loyalty_voucher_redemption
  WHERE loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
    AND loyalty_voucher_redemption.status IN ('redeemed', 'locked')
)
GROUP BY campaignid;

Simpler:

SELECT campaignid, count(*)
FROM loyalty_voucher_redemption
JOIN loyalty_voucher_voucher
  ON loyalty_voucher_redemption.vouchercode = loyalty_voucher_voucher.code
WHERE loyalty_voucher_redemption.status IN ('redeemed', 'locked')
GROUP BY campaignid;

Notice that the “efficient” query returns the campaigns for which there is a voucher with at least one redeemed or locked redemption, and the number of such vouchers. The “simpler” query returns the campaigns for which there is a voucher with at least one redeemed or locked redemption, and the number of redemptions. This subtle difference is likely the reason why the “simpler” query can’t be further optimized.

Assuming that the results of the “efficient” query are what you want, I think that’s the best way to write such a query.

If you’re still concerned about it’s performance and you think it should execute faster, I might be able to help further if you sent a statement bundle which can be gathered with EXPLAIN ANALYZE (DEBUG) ....

Usually, the exists check is very optimized in postgres.

I’m also curious what you mean by this. Is PG’s query plan better in some way, and if so can you share it? Or are you seeing faster execution times in PG?

1 Like

Hey @mgartner,

thank you very much for your response!

Is your main concern with the performance of the query or the cost listed in the EXPLAIN (OPT, VERBOSE) output?

Performance. Until now the costs output was very reliable indicator for performance implications. The above query is the first query with such discrepancy.

I’ll point out that your most efficient query returns different results than the simpler query. I believe the simpler one is actually incorrect.

Yeah, you are right, I overlooked that. Thanks for pointing it out.

If you’re still concerned about it’s performance and you think it should execute faster, I might be able to help further if you sent a statement bundle which can be gathered with EXPLAIN ANALYZE (DEBUG) ... .

Yes, of course, is there any way to do it more privately?
(BTW: EXPLAIN ANALYZE (DEBUG) created an empty bundle. Only after I did it through UI and executed the query it created meaningful bundle.)

Usually, the exists check is very optimized in postgres.

I’m also curious what you mean by this. Is PG’s query plan better in some way, and if so can you share it? Or are you seeing faster execution times in PG?

I’m playing with hasura, which doesn’t support CRDB. Hasura uses WHERE EXISTS (SELECT 1 FROM very highly for row-level-permissions checks. I did some non-educated benchmarks in the past and found, that it’s very optimized. From my imagination:

  • There are only 8240 vouchers
  • All of redemptions are locked or redeemed
  • Therefore, I assume that looking at 8240 redemptions in worst case should be enough to find vouchers with at least one locked or redeemed redemption
  • This shouldn’t be a big deal for a database
  • (Not important: I’m not sure why ix_loyalty_voucher_redemption__anton_status_vouchercode index is used at all, it seems like the db reads all rows, which is equal to full table scan)

BTW: I think we experience some problems with slow disks, but I don’t think that it’s related to the huge execution time for this query.

Yes, of course, is there any way to do it more privately?

You can send it privately to me via this link: Cockroach Labs

I did some non-educated benchmarks in the past and found, that it’s very optimized. From my imagination:

Interesting… I’m hoping the statement bundle answers some questions here. I might also try to recreate tables with similar vouchers and redemptions and observe the performance.

Hey @mgartner , no big deal if not, did you saw my bundle?

I received the statements bundle you sent. Thanks for providing it.

This looks to be the most efficient query plan to me. The most expensive part is scanning the redemptions table to gather the 21 distinct vouchercode values. Like you mentioned this is basically a full table scan over all ~200k redemptions because every redemption is either locked or redeemed.

I assume that looking at 8240 redemptions in worst case should be enough to find vouchers with at least one locked or redeemed redemption

This assumption is not correct. We must look at all redemptions to find all vouchers with at least one locked or redeemed redemption. If we stopped scanning after looking at 8240 redemptions we would likely miss many vouchers and produce an incorrect result. Maybe you’ve confused this with an alternative query plan that would perform a lookup into redemptions for each of the 8240 vouchers? Our optimizer explores this option, but so many lookups is less efficient than scanning the entire redemptions table.

The only thing I can think of that would make this scan faster is a loose index scan (sometimes called an index skip scan). CRDB doesn’t currently support this, but neither does Postgres.

If you’re able to get this query to execute orders of magnitude faster in another database with the same dataset, I’d love to know. And if you could share the query plan from that database, that would be much appreciated.

Let me know if I can answer any further questions.

@mgartner again, thank you very much for looking at it!

Probably the latter one. I can imagine that this is the same challenge like with distinct on partial columns.

Thanks, I hope to find time for that. It’s very interesting.