How do Cockroach Db secondary indexes perform better than of Cassandra

Hi Team,

It will be a long post :frowning:. We are starting with a project where we need to migrate from SQL DB to distributed Db which is easy to scale. We were looking at Cassandra or Cockroach Db. We have a few tables which are being queried using almost 15 other columns apart from the primary key (We still need to figure out if we can reduce the number of columns). Since it is SQL if there are no indexes, still it is still possible. To do it in a distributed database, the first thing comes to mind putting secondary indexes for all these columns.

We got to know that Cassandra is not so great with secondary indexes. Reading through docs we figured out it is because for every secondary index it keeps a reverse map at every node. On query, it needs to go to all the nodes to find the reverse map entries where that particular secondary index value is used as a key.

We were reading Cockroach Db articles Mapping Table Data to Key-Value Storage & Index selection in CockroachDB to know how the indexes work. It turns out that it is better because the Indexes are just sorted keys which can be scanned directly.

BTW both of the articles never talk about what will happen in case of when that secondary index values are stored across nodes.
Suppose this is the state of tables. ID is the primary Index and we have StrVal column as secondary Index:

Id | floatVal | StrVal
10 | 4.5 | foo    #stored on node1, Seconday Index will be on node 1
120 | 10 | foo   #stored on node2, Seconday Index will be on node 2
130 |  11 | bar  #stored on node2, Seconday Index will be on node 2
240 | 12 | foo   #stored on node3, Seconday Index will be on node 3

If we search for a primary key e.g. 10 it can easily figure out because of ordered ranges. The first range will point to the first node and it will go to that node and fetch it.

If I search with the secondary index e.g. where StrVal='foo', Doesn’t it need to go to every node and do a Songs@StrValIndex/"foo"-/"foo\x00" and get the primary keys for all. It will be a bit faster than Cassandra since it needs to scan just the chunk where it matches the foo since it is sorted. Apart from that, it is the same. It needs to go to all number of nodes and scan.

  • Is there anything We are missing in terms of how scanning of secondary indexes works?
  • Is there a limitation in terms of a number of columns we should be creating a secondary index on?

Please let us know.
Thanks

Hi @mdubey,

Thanks for the question.

  • Is there anything We are missing in terms of how scanning of secondary indexes works?

I think your confusion lies with the idea that secondary indexes for a row need to be stored on the same node as that row’s primary index. That is not the case. Instead, secondary indexes are stored just like primary indexes and are addressed in the same way. For instance, in your example, all rows with the foo StrVal would likely have their secondary index rows stored on the same node. This means that a search on the secondary index can also easily locate the data because of the ordered ranges.

The tradeoff here is that a secondary index lookup may need to reference the primary key if the secondary index alone doesn’t have all of the information. Because the secondary index rows are not always stored on the same node as the corresponding primary index rows, this may require cross-node coordination. Again though, the ordered ranges mean that we know exactly which node to look at.

You can ensure that a secondary index will have enough information to satisfy a query using STORING columns (see https://www.cockroachlabs.com/docs/v2.0/indexes.html#storing-columns).

  • Is there a limitation in terms of a number of columns we should be creating a secondary index on?

There is no hard limit, although we recommend keeping the total size of the secondary index key below 128 KB for optimal performance.

Can you explain your data model and query patterns ?

Hi @ddorian43,

We are yet to go through all the data models. We were told the use case and we were assessing the different type of DBs which could fit for best need. Here is an example:-

We have got an orders table:-

order_id int primary key
customer_id int 
product_id int
vendor_id int
notes varchar

On this possible queries could be:

Lookup by primary index: getByOrderId
Lookup by secondary index: getByCustomerIds, getByVendorIds, getByCustomerAndVendorIds
No index : getByProductIds

There will be similar tables with more columns. Let me know If you need more information.