One to many and many to many solution for best performance

We are searching the best solution for one to many and many to many relations in CockroachDB. We got relatively large tables (~3 billion rows -> in our Messages table).

One to many: we have to list parents (threads) where child (messages) rows match the query. A single join costs >100-300ms for our tables:
select parent.* from parent join child on parent.id = child.parentId where child.a = 1 and child.b = 2 etc…
(Right now they are not interleaved tables, but as I read the documents, interleaved table joins somehow have a little performance hit.)
As we testing it, sometimes two separate queries have better performance than a join. Is there a better join solution for it?

Many to many: we have to add labels to messages. (Label * <—> * Message)

  1. There’s of course a many-to-many join table solution, but your JOIN implementation hasn’t got the best performance yet (or we do it the wrong way).

  2. The ARRAY type would be a good solution, if we could index an ARRAY field, but we can’t.

  3. We could use your JSONB datatype in beta version, but we have to filter for normal columns and this JSONB column as well. But CockroachDB can use 1 index for a query so we can’t do:
    select … where intColumn = 2 and jsonbColumn ?& array[‘label1’, ‘label2’]

2 together: there are queries where we would like to query only the parents, where their children have to match different filters including a many to many query. If we have the best solutions for the first two, then of course we can mix them.

Are there any best practice or recommendation for these cases?
Thanks for any help!

Just one more thing: the main problem with the joins (if I see it right) is the following:
I would like to join the Messages table to the Threads table:

select t.id, t.subject from messages as m join threads as t on (t.id = m.threadId) where m.accountId = 23 and m.folderId = 233

According to the query plan firstly we filter the messages and this is fast, because we have an index for the queried columns (accountId, folderId). Ok, we have the needed rows from Messages table (including the threadId column). But after that the plan shows a full table scan (spans ALL) on Threads table, so it won’t use the primary key index to get the Threads rows. I found a discussion somewhere, where one of your colleague wrote that you don’t propagate filters in joins. Is this the case? Will it change?

Of course when I add where conditions to the Threads table too, it will be (ten times) faster, because I narrow the Threads results and it has to join less rows:

select t.id, t.subject from messages as m join threads as t on (t.id = m.threadId and t.accountId = 23) where m.accountId = 23 and m.folderId = 233

It is possible that a lookup join would improve the performance of this kind of query. We just (https://github.com/cockroachdb/cockroach/pull/22674) added support for those in the most recent alpha (https://www.cockroachlabs.com/docs/releases/v2.0-beta.20180305.html). You can enable them by running set experimental_force_lookup_join = true in whatever SQL session you wish to enable this on. In a future release we will have an optimizer that does this automatically. If you are able to test on that alpha release we’d be interested in hearing if that improves the performance.

Thank you Matt! I will test it with the new beta and I’ll write a feedback here with the results!