Group by query is slow

I have a table with 400+ million rows and a year_id column that has only 3 unique years. I ran a COUNT(*) with GROUP BY year_id, and the query time is 4x slower than on our MySQL instance. I didn't create any index on the table, and the DB runs on an 8-node cluster, each node in a Docker container.

Is it normal?

Hi! Could you run these simpler queries and see how they compare to the MySQL instance (and to the original query)?

SELECT COUNT(*) FROM your_table;
SELECT COUNT(year_id) FROM your_table;
SELECT 1 FROM your_table GROUP BY year_id;

That would give us an idea of whether anything unusual is going on with that particular query.

Correction: 3x, not 4x, slower. Still slow. :frowning:

SELECT COUNT(*): 19m 33s
SELECT COUNT(year_id): 28m 16s
SELECT 1 GROUP BY year_id: 29m 45s
SELECT year_id, COUNT(*) GROUP BY year_id: 30m 5s

Hi @wilfred! We've focused on OLTP performance so far, and large COUNT(*) queries like this are a bit more like OLAP. In that respect, it's not bad that we're already within 3x! There are a bunch of well-known techniques that can be used to speed this sort of thing up, and we haven't really applied any of them yet. Performance is a big focus of ours, and we'll definitely start optimizing OLAP after we get OLTP performance polished.
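To be concrete, one such technique you can already apply yourself today is maintaining a small rollup table of per-year counts and refreshing it periodically. Just a sketch, with placeholder names (your_table stands in for your real table; UPSERT is CockroachDB-specific syntax):

-- Hypothetical rollup table; all names here are placeholders.
CREATE TABLE year_counts (
    year_id BIGINT PRIMARY KEY,
    row_count BIGINT NOT NULL
);

-- Recompute periodically (e.g. from a cron job). With only 3 distinct
-- years the rollup stays tiny and is effectively instant to read.
UPSERT INTO year_counts
    SELECT year_id, COUNT(*) FROM your_table GROUP BY year_id;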

Hi Wilfred,
I had the same issue and found a workaround.
I'm on version 2.0.0, where COUNT(*) queries (including with GROUP BY) should be index-only scans that don't access the values.
But in my testing it loads the values too. Also, by default all values are packed into one "primary" column family, so it has to load every row's data while doing the index scan. I'm not sure whether that's expected behavior or not.
The workaround is to create a secondary index on the id column (or any other small column):
CREATE INDEX id_idx ON table1(id);

and hint that index in the query:
SELECT COUNT(*) FROM table1@id_idx;

Each index row stores only a sentinel key with a value of 0, so the query runs much faster and doesn't use much memory.
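If you want to double-check that the hint is taking effect, the plan should show a scan over id_idx rather than the primary index (the exact EXPLAIN output varies by version):

-- The scan node in the plan should mention id_idx.
EXPLAIN SELECT COUNT(*) FROM table1@id_idx;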
Hope it helps!

Thanks Andrey,

Will give it a try. FYI, I loaded a few tables with millions of rows. On those tables, a simple COUNT(*) without GROUP BY performs comparably to MySQL. It's only the tables that have the year_id column where COUNT(*) with GROUP BY is slow.

You're welcome! Could you show the structure of your tables? (SHOW CREATE TABLE …)
This workaround is effective when you have many columns with significant data.
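To illustrate the column-family point, here is just a sketch with made-up names: if the wide value columns live in their own family, a scan that only needs the key columns doesn't have to load them.

-- Hypothetical layout: small key columns in one family, wide measure
-- values in another, so key-only scans can skip the value family.
CREATE TABLE table1_split (
    id BIGINT PRIMARY KEY,
    year_id BIGINT NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    high_value DOUBLE PRECISION NULL,
    low_value DOUBLE PRECISION NULL,
    FAMILY key_fam (id, year_id),
    FAMILY value_fam (value, high_value, low_value)
);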

CREATE TABLE test_table_type_one (
    measure1_id BIGINT NOT NULL,
    year_start_id BIGINT NOT NULL,
    year_end_id BIGINT NOT NULL,
    measure2_id BIGINT NOT NULL,
    measure3_id BIGINT NOT NULL,
    measure4_id BIGINT NOT NULL,
    measure5_id BIGINT NOT NULL,
    data_type_id BIGINT NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    high_value DOUBLE PRECISION NULL,
    low_value DOUBLE PRECISION NULL,
    INDEX comp_idx (measure2_id ASC, measure5_id ASC, measure4_id ASC, year_start_id ASC, year_end_id ASC, data_type_id ASC, measure1_id ASC, measure3_id ASC),
    FAMILY "primary" (measure1_id, year_start_id, year_end_id, measure2_id, measure3_id, measure4_id, measure5_id, data_type_id, value, high_value, low_value, rowid)
);

COUNT(*) on this table is comparable with MySQL.

The other type of table, on which the GROUP BY is slow, has year_start_id and year_end_id replaced by a single year_id column. The other columns are the same.

The fat index covers all the columns our queries join on.

Please try this on both tables:
CREATE INDEX year_idx ON test_table_type_one(year_start_id);
CREATE INDEX year_idx ON test_table_type_two(year_id);

SELECT COUNT(*) FROM test_table_type_one@year_idx;

SELECT COUNT(*) FROM test_table_type_two@year_idx;
SELECT year_id, COUNT(*) FROM test_table_type_two@year_idx GROUP BY year_id;
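One more thing: building an index over 400+ million rows will take a while. You should be able to watch the backfill's progress while it runs:

-- Lists running jobs (including index backfills) along with a
-- fraction_completed column.
SHOW JOBS;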

Thanks Andrey. It's been over 24 hours and the index on the second table is still being created.

Will post my result after the indexes are created.

After the index was created on year_id, the query time dropped by about 5 minutes: from 33.58 down to 28.33 minutes.

CREATE INDEX

Time: 26h47m38.692927867s