A certain type of SQL statement executed by one of my programs does a lot of lookups for a relatively small subset of values (e.g. from 1 to 50K values) within a master table that can contain hundreds of millions of rows (e.g. 300M).
select col1, count(*) from table_X where col2 in (<50k_single_explicit_values>) group by col1 order by count(*) desc limit 1
…is in general a bad practice, so previously, with MariaDB+MyRocks, I would first insert those values into a temporary table and then execute…
select col1, count(*) from table_X where col2 in (select col2b from temp_table_subprcs_X) group by col1 order by count(*) desc limit 1
I have already performed a few simple manual tests, and the execution of such a query in CRDB was great (for data present in the block cache: ~2.5 seconds in MariaDB+MyRocks vs. ~1.7 seconds in CRDB), but I'm not sure what to do about the temporary table.
So far I have partially rewritten my program (Python; it uses multiple subprocesses to handle data in parallel) to misuse “normal” tables as temporary lookup storage, with one such table dedicated to each subprocess (still rewriting, I haven't tested any of this yet):
1. The program inserts the lookup values into [normal_table_having_subProcessID_in_its_name].
2. The program runs the above SQL and retrieves the result.
3. The program performs a “rollback” (to avoid any useless writes to disk for the data of step #1); a sketch of this flow is below.
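
To make the flow concrete, here is a minimal sketch in plain SQL. The per-subprocess table name lookup_tmp_42 is an illustrative placeholder for [normal_table_having_subProcessID_in_its_name]; col2b and table_X are the names from the queries above:

```sql
BEGIN;

-- Step 1: stage the lookup values in the per-subprocess “normal” table.
INSERT INTO lookup_tmp_42 (col2b) VALUES (101), (102), (103);  -- up to ~50K values

-- Step 2: run the aggregate against the master table; the staged rows
-- are visible here because we are still inside the same transaction.
SELECT col1, count(*)
FROM table_X
WHERE col2 IN (SELECT col2b FROM lookup_tmp_42)
GROUP BY col1
ORDER BY count(*) DESC
LIMIT 1;

-- Step 3: roll back so the staged values are never committed.
ROLLBACK;
```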
I did read about the experimental temporary tables (here), but the page doesn't mention many details, especially why I should use them instead of “normal” tables.
In general, using such a temporary table just seems to add complexity (the table disappears after disconnecting, plus the associated GC activity).
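
For reference, this is how I understand the experimental feature would be used; the session setting is taken from the CRDB docs, while the table definition is just an example:

```sql
SET experimental_enable_temp_tables = 'on';

-- Visible only to this session; dropped automatically after disconnect.
CREATE TEMP TABLE temp_table_subprcs_X (col2b INT PRIMARY KEY);
```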
Any recommendations about how to handle temporary data?
When dealing with temp data, I primarily want to avoid all kinds of disk writes/reads.
Is it “kind of ok” to do it the way I'm about to (meaning: perform an “insert” and then roll back)? E.g. I'm not sure whether CRDB has a WAL that would be put under pressure by those two operations. I wasn't able to find such information, and I couldn't make any useful observations, because the host I got is remote and quite big, so such “writes” might have gone unnoticed during my initial tests…
Or would it be better to use such an experimental temporary table?
Or can a “normal” table be created so that its definition is persistent, but it always keeps its data in RAM and never writes anything (data, logs, anything) to disk?
E.g. in MariaDB I used to do that by executing “create table blah engine=memory”.
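
For completeness, this is the MariaDB construct I mean; the table and column names are illustrative:

```sql
-- MariaDB: all rows live in RAM; contents are lost on server restart.
CREATE TABLE blah (col2b INT, KEY (col2b)) ENGINE=MEMORY;
```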