Histograms, NULL values? Statistics gathering

Hi!

I would like to ask about following.
Why histogram columns have NULL values? (Tables below.)
Could someone describe when values should come to these histogram columns?

root@<ip>:<port>/test> select * from system.table_statistics;

tableID | statisticID | name | columnIDs | createdAt | rowCount | distinctCount | nullCount | histogram
±--------±-------------------±---------±----------±---------------------------------±---------±--------------±----------±----------+
53 | 481736537047433217 | auto | {1} | 2019-08-29 13:18:44.042574+00:00 | 12288 | 12141 | 0 | NULL
53 | 481736537067716609 | auto | {2} | 2019-08-29 13:18:44.042574+00:00 | 12288 | 0 | 12288 | NULL
53 | 481736537089933313 | auto | {3} | 2019-08-29 13:18:44.042574+00:00 | 12288 | 1 | 0 | NULL
54 | 481740882923192321 | auto | {1} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 332733 | 0 | NULL
54 | 481740882971590657 | auto | {5} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 332179 | 0 | NULL
54 | 481740883050758145 | auto | {2} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 0 | 335872 | NULL
54 | 481740883129368577 | auto | {3} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 1 | 0 | NULL
54 | 481741081969360897 | auto | {1} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 373160 | 0 | NULL
54 | 481741082014253057 | auto | {5} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 370962 | 0 | NULL
54 | 481741082053804033 | auto | {2} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 0 | 376832 | NULL
54 | 481741082110066689 | auto | {3} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 1 | 0 | NULL
54 | 481741281433452545 | auto | {1} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 412568 | 0 | NULL
54 | 481741281481719809 | auto | {5} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 412171 | 0 | NULL
54 | 481741281531428865 | auto | {2} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 0 | 417792 | NULL
54 | 481741281571143681 | auto | {3} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 1 | 0 | NULL
54 | 481741480032993281 | auto | {1} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 447960 | 0 | NULL
54 | 481741480070938625 | auto | {5} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 451204 | 0 | NULL
54 | 481741480110456833 | auto | {2} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 0 | 454656 | NULL
54 | 481741480154300417 | auto | {3} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 1 | 0 | NULL
54 | 481741876549255169 | auto | {1} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 529545 | 0 | NULL
54 | 481741876587888641 | auto | {5} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 528832 | 0 | NULL
54 | 481741876625047553 | auto | {2} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 0 | 532480 | NULL
54 | 481741876667383809 | auto | {3} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 1 | 0 | NULL
(23 rows)

Time: 5.982595ms

root@172.21.0.4:26257/test> show statistics for table abcd_tbl;
statistics_name | column_names | created | row_count | distinct_count | null_count | histogram_id
±----------------±-------------±---------------------------------±----------±---------------±-----------±-------------+
auto | {dn} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 332179 | 0 | NULL
auto | {ts} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 0 | 335872 | NULL
auto | {person} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 1 | 0 | NULL
auto | {id} | 2019-08-29 13:40:50.298704+00:00 | 335872 | 332733 | 0 | NULL
auto | {ts} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 0 | 376832 | NULL
auto | {dn} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 370962 | 0 | NULL
auto | {person} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 1 | 0 | NULL
auto | {id} | 2019-08-29 13:41:51.040138+00:00 | 376832 | 373160 | 0 | NULL
auto | {person} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 1 | 0 | NULL
auto | {id} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 412568 | 0 | NULL
auto | {dn} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 412171 | 0 | NULL
auto | {ts} | 2019-08-29 13:42:51.91464+00:00 | 417792 | 0 | 417792 | NULL
auto | {id} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 447960 | 0 | NULL
auto | {dn} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 451204 | 0 | NULL
auto | {ts} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 0 | 454656 | NULL
auto | {person} | 2019-08-29 13:43:52.522046+00:00 | 454656 | 1 | 0 | NULL
auto | {id} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 529545 | 0 | NULL
auto | {dn} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 528832 | 0 | NULL
auto | {ts} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 0 | 532480 | NULL
auto | {person} | 2019-08-29 13:45:53.529235+00:00 | 532480 | 1 | 0 | NULL

select count(*) from abcd_tbl; count
±-------+
581632

Version of CockroachDB which was in use:
v19.1.4

I have seen following page:

Hey @roachman

Currently, this is expected behavior as the histograms aren’t collected in 19.1. The blog post does point out that “… In the next release we will incorporate histograms into our estimates to handle such cases”. The Cost Based Optimizer can use this table to be able to further estimate what the best query is to maximize performance. These are described here.

Let me know if you have any other questions.

Ok, thank you for the information!

One question, which I would have liked to investigate, is following kind of.
( I cannot see the histograms from CRDB, so I cannot check following by myself currently there.)

We have couple of alternative ways to read all children of a certain parent (child tree).
Such could be achieved with different means. (“With recursive”, with “materialized path”, etc.)

Following tries to use a “materialized path”. But there is a problem with an RDBMS.

When using an other RDBMS, following data has caused problems
(all rows can start with “CDE-CDE”, in DB):

CDE-CDE/A-2/B/D/G…
CDE-CDE/D-56/G…
CDE-CDE/A-654/B2/D/G…
CDE-CDE/A-654/B2/D2/G4…
CDE-CDE/A-654/B2/D3/G5…
CDE-CDE/B-45/F4/…
CDE-CDE/E-33/…
CDE-CDE/C-22/…

An index is created on top of such column and data. In SQL query, such index should be tried to be used.

With that other RDBMS (not CockroachDB), there comes problems with statistics gathering and histograms.
In statistics only the first part was taken in account: “CDE-CDE”, from each row.
The rows do not get distributed in the statistics.
The RDBMS thinks that all rows belong to same bar in a graph.

So, the execution plan will be / can be very bad. The DBMS is not able to make proper execution plan. It may e.g. start to read with with full index scan / full data scan. (Incorrect.)

It should use the index, to read only a child tree from the index. (Correct behavior.)

Does someone know, could this kind of a problem / similar happen with CockroachDB?
(With statistics gathering, histograms.)

(I, or anyone, could open an other discussion chaing for this other question, if we want. )

Hey @roachman

I would say that the biggest factor in determining how the query would run is to determine the definition of the table, and how the Index is configured. These factors will determine whether or not a Full table scan would need to run. We discuss the best practices of discussing indexes here. This will work in tandem with histograms and statistic gathering to speed up queries.

Take a look and let me know if there is any other questions.

Cheers,
Ricardo