TopN rollup aggregation

sql

#1

I am thinking about using CRDB to store time series data which would contain a numerical value for an interval. I need to be able to query for the TopN or BottomN aggregated over different timespans such as a minute, hour, day, month, year. I could probably use the actual data and sum the values over a smaller timespan like a minute/hour, but I the larger time spans would need to be created as a rollup table. I am just wondering what the most efficient method for implementing a counter column for that rollup would be?


(matt) #2

Hi mike,

I might need you to clarify what you mean by efficiency: do you want efficiency in terms of computation, or in terms of storage?

In storage terms, I might suggest using an ARRAY field for TopN/BottomN rollups like this:

CREATE TABLE topbottomn_1day (
 timestamp TIMESTAMP NOT NULL PRIMARY KEY,
 topN INT[],
 bottomN INT[]
);

INSERT INTO topbottomn_1day (timestamp, topn, bottomn) 
VALUES (<timestamp>, ARRAY[7000, 6000, 5000], ARRAY[-800, -500, -300]);

You can read more about CockroachDB arrays in our documentation.

With this method, there’s no hard limit on the number ‘N’ you can store, if it needs adjusting later. However, keep in mind that once you roll-up, you’ll be setting a maximum value of N - future queries won’t be able to get the Top(N+1) from the rollup.


#3

I wasn’t very specific, let me clarify. I would have a table and insert like the following:

CREATE TABLE rollup (
   interval TIMESTAMP,
   interval_type String,
   user String,
   total Int,
   PRIMARY KEY (interval, interval_type, user)
);

INSERT INTO rollup (interval, interval_type, user, total) 
VALUES ("2018-01-01 00:00:00", <"min"|"hour"|"day"|"month"|"year">, "user", <value>) 
ON CONFLICT (interval, interval_type, user) 
DO UPDATE SET total = total + <value>;

Then I would be able to query the specific interval by type and user and sort by total to get the TopN and BottomN. In this instance I would like to know how the cluster would handle the constant updating of the total column, and if this is the best way to partition this data.

It’s essentially a counter column, which I am not sure will be efficient or if you’ve already implemented this in another fashion.