There are two tables:
orders is the smaller table, with 70,000,000 rows;
lineitem is the bigger table, with 3,000,000,000 rows.
The cluster runs on 5 machines (with enough CPU/memory/disk/network resources) using the default cluster settings.
The query is:
SELECT l_orderkey, o_orderdate, o_shippriority FROM orders, lineitem WHERE (( (l_orderkey = o_orderkey)) AND (o_orderdate < DATE '1995-03-10')) AND (l_shipdate > DATE '1995-03-10') GROUP BY l_orderkey, o_orderdate, o_shippriority;
Running EXPLAIN ANALYZE xxxxx shows that memory is not enough and the join will spill to disk.
I wonder how CockroachDB handles this situation.
I found some information on the Internet. Some databases partition both tables with the same hash function, keep some partitions in memory and the rest on disk, and then merge the partial results (or use one of several related algorithms, like 'grace hash join' or 'hybrid hash join').
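To make the grace-hash-join idea concrete, here is a minimal toy sketch in Go (this is purely illustrative and is not CockroachDB's implementation; all names and the in-memory "partitions" standing in for disk files are my own assumptions). Both inputs are split into buckets by the same hash of the join key, so matching keys are guaranteed to land in the same bucket pair, and each pair can then be joined with a small in-memory hash table:

```go
package main

import "fmt"

// Row is a toy (key, payload) pair standing in for a table row.
type Row struct {
	Key     int
	Payload string
}

// partition splits rows into k buckets using the same hash function
// for both sides (here simply key % k). In a real grace hash join,
// each bucket would be written out to its own temporary disk file.
func partition(rows []Row, k int) [][]Row {
	parts := make([][]Row, k)
	for _, r := range rows {
		b := r.Key % k
		parts[b] = append(parts[b], r)
	}
	return parts
}

// graceHashJoin joins build and probe on Key. Because both sides were
// partitioned with the same hash, only bucket i of build can match
// bucket i of probe, and each bucket pair is joined independently
// with an in-memory hash table.
func graceHashJoin(build, probe []Row, k int) []string {
	var out []string
	bParts, pParts := partition(build, k), partition(probe, k)
	for i := 0; i < k; i++ {
		ht := make(map[int][]Row) // hash table for one build bucket
		for _, r := range bParts[i] {
			ht[r.Key] = append(ht[r.Key], r)
		}
		for _, p := range pParts[i] {
			for _, b := range ht[p.Key] {
				out = append(out, b.Payload+"-"+p.Payload)
			}
		}
	}
	return out
}

func main() {
	orders := []Row{{1, "o1"}, {2, "o2"}, {3, "o3"}}
	lineitem := []Row{{1, "l1"}, {3, "l3"}, {3, "l3b"}, {4, "l4"}}
	fmt.Println(graceHashJoin(orders, lineitem, 2))
}
```

A 'hybrid' hash join refines this by keeping some buckets resident in memory while only the overflow buckets go to disk, which is essentially what my question below is about.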
I found the function
func maybeSpillToDisk(err) (in pkg/sql/distsqlrun/hashjoiner.go; version 2.1.2), which spills
h.rows[h.storedSide] to disk and uses a
hashDiskRowContainer as h.storedRows.
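The pattern I understand from that function looks roughly like the following toy sketch (my own simplified reconstruction, not CockroachDB's code; the row budget, container types, and function names are all hypothetical): rows go into an in-memory container until a budget error occurs, at which point everything buffered so far is moved into a disk-backed container and all further rows go straight to disk.

```go
package main

import (
	"errors"
	"fmt"
)

// errBudgetExceeded signals that the in-memory container is full.
var errBudgetExceeded = errors.New("memory budget exceeded")

// memContainer holds rows up to a fixed row budget (a toy stand-in
// for a byte-accounted memory monitor).
type memContainer struct {
	rows   []string
	budget int
}

func (m *memContainer) add(row string) error {
	if len(m.rows) >= m.budget {
		return errBudgetExceeded
	}
	m.rows = append(m.rows, row)
	return nil
}

// diskContainer is a toy stand-in for a disk-backed row container
// like hashDiskRowContainer; a real one writes to temporary storage.
type diskContainer struct{ rows []string }

func (d *diskContainer) add(row string) { d.rows = append(d.rows, row) }

// addRowMaybeSpill mirrors the pattern described above: try the
// in-memory container first; on a budget error, move everything
// already buffered into the disk container and continue there.
func addRowMaybeSpill(mem *memContainer, disk **diskContainer, row string) {
	if *disk != nil {
		(*disk).add(row) // already spilled: go straight to disk
		return
	}
	if err := mem.add(row); err != nil {
		d := &diskContainer{}
		for _, r := range mem.rows { // spill the buffered rows
			d.add(r)
		}
		mem.rows = nil
		d.add(row)
		*disk = d
	}
}

func main() {
	mem := &memContainer{budget: 2}
	var disk *diskContainer
	for _, r := range []string{"r1", "r2", "r3", "r4"} {
		addRowMaybeSpill(mem, &disk, r)
	}
	fmt.Println(len(mem.rows), len(disk.rows)) // after spilling: 0 4
}
```

Note that in this sketch the whole stored side ends up in one container, either memory or disk, which is exactly what prompts my question below.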
What I am confused about is: does CockroachDB split the hash table (or hashContainer?) into several parts, keeping some in memory and the rest on disk, or does it use a single hash map that lives either entirely in memory or entirely on disk?