Apply-join and hash-join difference

Hi. Can anyone explain what is the difference between apply-join , hash-join and how CockroachDB choses what to use?
Here is my case:
I have two tables users and user_reactions.

create table users
(
	user_id text not null
		constraint user_id_pk
			primary key,
	cell_id text not null
);

create index cell_id_user_id_index
	on users (cell_id, user_id);
create table user_reactions
(
	user_from text not null
		constraint user_from_users_user_id_fk
			references users
				on delete cascade,
	user_to text not null
		constraint user_to_users_user_id_fk
			references users
				on delete cascade,
	reaction text not null,
	constraint user_from_user_to_pk
		primary key (user_from, user_to),
	constraint user_to_user_from_index
		unique (user_to, user_from)
);

create index user_from_reaction_index
	on user_reactions (user_from, reaction);

create index user_to_reaction_index
	on user_reactions (user_to, reaction);

I want to select all users from table users who match my cell_id range and didn’t receive ‘like’ or ‘dislike’ reaction from the current user;
ID of the current user is ‘a’ in this example.
Here is my query:

explain (verbose)
select *
from users@cell_id_user_id_index u
where (
        (u.cell_id between '2' and '5') and
        u.user_id != 'a' and
        u.user_id not in
        (
            select user_to as user_id
            from user_reactions@user_from_reaction_index
            where user_from = 'a' and (reaction = 'like' or reaction = 'dislike')
        )
    );

The output of analyzer:

""	distributed	true
""	vectorized	false
hash-join	""	""
 │	type	anti
 │	equality	(user_id) = (user_to)
 │	left cols are key	""
 │	right cols are key	""
 ├── scan	""	""
 │	table	users@cell_id_user_id_index
 │	spans	/"2"-/"5"/PrefixEnd
 │	filter	user_id != 'a'
 └── scan	""	""
""	table	user_reactions@user_from_reaction_index
""	spans	/"a"/"dislike"-/"a"/"dislike"/"a" /"a"/"dislike"/"a\x00"-/"a"/"dislike"/PrefixEnd /"a"/"like"-/"a"/"like"/"a" /"a"/"like"/"a\x00"-/"a"/"like"/PrefixEnd
""	filter	(reaction = 'like') OR (reaction = 'dislike')

It uses the hash-join to filter out by user_id from the subquery and the distributed flag is true.

If I change my query a bit, for example to also exclude users who sent ‘like’ or ‘dislike’ reaction. So I will add another subquery and union them:

explain (verbose)
select *
from users@cell_id_user_id_index u
where (
        (u.cell_id between '2' and '5') and
        u.user_id != 'a' and
        u.user_id not in
        (
            (
                select user_to as user_id
                from user_reactions@user_from_reaction_index
                where user_from = 'a' and (reaction = 'like' or reaction = 'dislike')
            )
            union
            (
                select user_from as user_id
                from user_reactions@user_to_reaction_index
                where user_to = 'a' and (reaction = 'like' or reaction = 'dislike')
            )
        )
    );

The out of analyzer:

""	distributed	false	""
""	vectorized	false	""
apply-join	""	""	(user_id, cell_id)
 │	type	anti	""
 └── scan	""	""	(user_id, cell_id)
""	table	users@cell_id_user_id_index	""
""	spans	/"2"-/"5"/PrefixEnd	""
""	filter	user_id != 'a'	""

First of all it doesn’t display any info about subqueries, I hope it uses the user_from_reaction_index and ‘user_to_reaction_index’ indexes and just combines two unions. Secondly, now it’s the apply-join and the distributed flag is false.

In my real workload I expect something around 100K rows in users table after filtering by cell_id and 5K rows in the subquery from user_reactions. What CockroachDB will chose to use in that case? Is it always have to be the apply-join if I union two subqueries?

Also, what if I do it as two separate queries from my application:
Firstly, I extract all user ids from user_reaction who received or sent ‘like’ or ‘dislike’ reaction from/to the current user:

(
    select user_to as user_id
    from user_reactions@user_from_reaction_index
    where user_from = 'a' and (reaction = 'like' or reaction = 'dislike')
)
union
(
    select user_from as user_id
    from user_reactions@user_to_reaction_index
    where user_to = 'a' and (reaction = 'like' or reaction = 'dislike')
);

And then, pass the result as a list to my main query in users table:

explain (verbose)
select *
from users@cell_id_user_id_index u
where (
        (u.cell_id between '2' and '5') and
        u.user_id != 'a' and
        u.user_id not in ('b', 'c', 'd', 'f')
    );

The output of the analyzer:

""	distributed	true
""	vectorized	false
scan	""	""
""	table	users@cell_id_user_id_index
""	spans	/"2"-/"5"/PrefixEnd
""	filter	(user_id != 'a') AND (user_id NOT IN ('b', 'd', 'f'))

The distributed flag is true.

Will this approach, when I do the subquery from the application, perform the same as the hash-join on my real workload?

Hi @georgysavva,

We discussed this over slack, but I thought I’d paste the summary here so others reading this post can benefit.

Regarding hash joins v. apply joins:

The optimizer creates an apply join when it is not possible to de-correlate a correlated subquery in the original query. In this case, it converts the correlated subquery into a join in which the right side of the join may reference variables from the left side.

Apply joins are inefficient because they must be executed one row at a time. The left side row must be used to construct the right side row, and only then can the execution engine determine if the two rows should be output by the join. This corresponds to an O(n*m) time complexity.

Hash joins are much more efficient. A hash table is constructed using rows from the smaller side of the join, and then the larger side of the join is used to probe into the hash table using the ON conditions of the join. This corresponds to an O(n+m) time complexity.

We have a number of transformation rules we use to try to decorrelate subqueries, since apply joins are almost always less efficient than other alternatives. As of this writing, we will always try to decorrelate subqueries if we can, regardless of the number of rows, but we cannot cover all cases. If you see an apply join, it means we were not able to perform decorrelation, and you should probably try to rewrite your query in a different way in order to get better performance.

Regarding the example in this post:

As we discussed in Slack, for this particular query, rewriting

user_id not in ((<suqbquery 1>) union (<subquery 2>))

to

user_id not in (<suqbquery 1>) and user_id not in (<subquery 2>)

allowed the optimizer to create a query plan with hash joins instead of apply joins. This is also a better alternative than the suggestion at the bottom of the post to perform the query in two steps, since the plan at the bottom shows an inefficient filter condition resulting in O(n*m) execution time overall.

1 Like