Create inverted index on select subfields only?


(Matt) #1

I’m migrating our telemetry database from RethinkDB to CockroachDB, and am having some performance problems.

Our telemetry data is stored as JSON payloads which do not adhere to a rigid structure. Rethink was perfect for this, because we could create indexes on specific fields within the JSON document, rather than the whole thing. Some JSON documents may be up to 10 MB large, although most are less than 500 KB.

I’m finding that writes larger than about 500 KB are taking minutes, even though only a single column is written – the JSON document – for up to 100 or so rows.

By removing the inverted index on that field, however, writes accelerated to < 50ms.

I only need to index a few fields within the large JSON structure – for queries on the rest, which are very very rare, I can accept a table-wide scan. Is creating such an index possible with Cockroach?


(Jesse) #2

Hi @mholt,

What you’re seeing - writes get slower with an inverted index - is expected, but there is a potential way forward. Instead of putting the inverted index on the JSONB column, you could create a computed column that contains just the fields of the JSONB column that you query often and then create an inverted index on that computed column. Would you be willing to try that out?

Best,
Jesse


(Matt) #3

Sure. I already do that for a couple of STRING fields, I just haven’t tried it for JSON objects yet. I imagine that’ll mean the data is duplicated within the DB then? Or is Cockroach smart enough to de-duplicate it?


(Jesse) #4

Yes, computed columns are “stored” currently, which means the data is duplicated. We may add support for non-stored computed columns in the future, though.


(Ben Darnell) #5

The underlying sstable storage is compressed, so even though the “stored” computed column is duplicated, the marginal storage cost should be small. All indexes involve extra copies of the data, though, so there will be an extra copy of the indexed data (inverted index involve lots of duplication of the whole json objects; with an index on a computed column you only pay for the parts in the computed column)


(Raphael 'kena' Poss) #6

An alternative is to modify the JSON objects upfront (client-side before the writes) to place the indexed fields into separate SQL columns.

Then you can use regular SQL indexes on those columns.

After all you have SQL now. You can use multiple columns and place NULL into the SQL columns when the JSON field is not initialized. CockroachDB uses no storage to store a NULL value.


(Matt) #7

Great – thanks, both, for the answers! That’s basically what I’m doing at this point, no more inverted index.