Hi Team,
It will be a long post . We are starting with a project where we need to migrate from SQL DB to distributed Db which is easy to scale. We were looking at Cassandra or Cockroach Db. We have a few tables which are being queried using almost 15 other columns apart from the
primary key
(We still need to figure out if we can reduce the number of columns). Since it is SQL if there are no indexes, still it is still possible. To do it in a distributed database, the first thing comes to mind putting secondary indexes for all these columns.
We got to know that Cassandra is not so great with secondary indexes. Reading through docs we figured out it is because for every secondary index it keeps a reverse map at every node. On query, it needs to go to all the nodes to find the reverse map entries where that particular secondary index value is used as a key.
We were reading Cockroach Db articles Mapping Table Data to Key-Value Storage & Index selection in CockroachDB to know how the indexes work. It turns out that it is better because the Indexes are just sorted keys which can be scanned directly.
BTW both of the articles never talk about what will happen in case of when that secondary index values are stored across nodes.
Suppose this is the state of tables. ID
is the primary Index and we have StrVal
column as secondary Index:
Id | floatVal | StrVal
10 | 4.5 | foo #stored on node1, Seconday Index will be on node 1
120 | 10 | foo #stored on node2, Seconday Index will be on node 2
130 | 11 | bar #stored on node2, Seconday Index will be on node 2
240 | 12 | foo #stored on node3, Seconday Index will be on node 3
If we search for a primary key e.g. 10 it can easily figure out because of ordered ranges. The first range will point to the first node and it will go to that node and fetch it.
If I search with the secondary index e.g. where StrVal='foo'
, Doesn’t it need to go to every node and do a Songs@StrValIndex/"foo"-/"foo\x00"
and get the primary keys for all. It will be a bit faster than Cassandra since it needs to scan just the chunk where it matches the foo
since it is sorted. Apart from that, it is the same. It needs to go to all number of nodes and scan.
- Is there anything We are missing in terms of how scanning of secondary indexes works?
- Is there a limitation in terms of a number of columns we should be creating a secondary index on?
Please let us know.
Thanks