We’re investigating replacing some of smaller HBase workloads with CockroachDB.
The HBase data model is very similar to the underlying CDB data model, so we’re trying to implement the HBase model on top of CDB with the hopes of minimal impact on performance and the obvious added benefit of things like consistent secondary indexing.
At this point, we’re able to replicate the HBase model using a table structure like the following:
CREATE TABLE kv( key BYTES, qualifier BYTES, timestamp TIMESTAMP, value BYTES, PRIMARY KEY (key,qualifier,timestamp DESC) );
This would map to a single HBase column family.
Different families would map to different tables (this creates some challenges, but would probably be fine for our current use-cases).
HBase allows multiple versions of the same value through the use of different timestamps.
When reading data out of HBase, the client may specify how many versions it wants to read back (usually this is 1, the latest value).
We’ve worked out that the following SQL statement provides those semantics:
SELECT row_number() OVER w AS version, key, qualifier, timestamp, value FROM kv WHERE version <= n AND key = xxx WINDOW w AS (PARTITION BY key,qualifier ORDER BY timestamp DESC)
WINDOW is partitioning over the table’s primary key, so, in theory, this should be a simple scan + filter.
This seems to be confirmed by
EXPLAIN (though the
window node is a bit opaque):
tree | field | description +----------------+-------+---------------+ window | | └── render | | └── scan | | | table | kv@primary | spans | ALL
This seems fine, but for the scanning use case (scanning over a key range), we also need things to be returned in primary key order.
This can be fixed by adding an
ORDER BY clause:
SELECT row_number() OVER w AS version, key, qualifier, timestamp, value FROM kv WHERE version <= n AND key >= xxx AND key < yyy WINDOW w AS (PARTITION BY key,qualifier ORDER BY timestamp DESC) ORDER BY PRIMARY KEY kv
Unfortunately, this introduces a global sort in the plan:
tree | field | description +---------------------+-------+------------------------------+ sort | | │ | order | +key,+qualifier,-"timestamp" └── window | | └── render | | └── scan | | | table | kv@primary | spans | ALL
This global sorting shouldn’t be necessary since all we’re doing is scanning the primary key.
Even in distributed mode, it should be possible to return rows in primary key order without resorting to a global sort since every node should be simply be scanning the primary key.
WINDOW statement preventing the planner to figure this out?
It seems like it should be possible to implement this as a simple scan of the primary key, but the SQL statement is getting in the way somewhat.
Perhaps I’m missing something that is making this hard / impossible?