Column Family: MyRocks (MariaDB) vs. CockroachDB

Hi all
I’m completely new to CockroachDB (I read the docs today for the first time) and to MariaDB+MyRocks (I’ve been trying to use it for a couple of weeks).
I read CockroachDB’s doc about Column Families (“CF”) here, but I’m not sure whether what I understood is correct.

Question
On one hand, in the case of MyRocks/MariaDB, what I understood is that…

  1. …a primary index always stores the data of all table columns (whether or not they are part of the PK - mentioned here) => therefore, any DML will always end up touching/modifying that primary index, and with it all of the data stored in all of the table’s columns, right?
    Drawing a parallel with Oracle, the MyRocks/MariaDB primary index seems similar to an “Index Organized Table”?
  2. …a CF can be defined only per index (mentioned here).

On the other hand, in the case of CockroachDB, I understand from what is written in the docs that the CF of the primary index does not (have to) include all of the non-primary-index columns - only the ones explicitly mentioned. Am I wrong?

Use case
I have a table that stores temporary data.
The table has multiple columns:
the first col uniquely identifies a specific URL; the other cols store the result of the work done at each processing step.
Separate programs (that might or might not run concurrently) are responsible for processing the data for their individual step.
Once all steps for a URL/document have been performed, the row is deleted.
Having separate cols for each processing step simplifies debugging if anything goes wrong (“aha - data looked like this during processing step #1, then it looked like that during procstep #2, then it became like that during procstep #3”).

The table definition in my MariaDB/MyRocks currently basically looks like this:

create table temp_data_tbl
(
url varbinary(1024) not null
,step1 mediumblob not null
,step2 mediumblob default null
,step3 mediumblob default null
,PRIMARY KEY (url)
)
ENGINE=RocksDB
DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;

  • When program1 runs, it will create the row and add data to the cols “url” (max 1KB) and “step1” (max 4 MB).
  • When program2 runs, it will update the table by adding data to the col “step2” (max 4 MB).
  • When program3 runs, it will update the table by adding data to the col “step3” (max 1.5 MB).
  • When program4 runs, it will delete the row (if it has processed it successfully).
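
For concreteness, the per-step DML on the MariaDB side is roughly the following (just a sketch with placeholder parameters, not the exact statements my programs run):

-- program1: create the row with url + step1
INSERT INTO temp_data_tbl (url, step1) VALUES (?, ?);
-- program2: add the step2 result
UPDATE temp_data_tbl SET step2 = ? WHERE url = ?;
-- program3: add the step3 result
UPDATE temp_data_tbl SET step3 = ? WHERE url = ?;
-- program4: remove the row once all steps have succeeded
DELETE FROM temp_data_tbl WHERE url = ?;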

If I understood the CockroachDB documentation correctly, if I were to use…

  • a CF for “url”+“step1”
  • another different CF for “step2”
  • and as well another different CF for “step3”

…then when e.g. “program2” updates the contents of the col “step2”, only the CF of “step2” would be updated (involving a tombstone for the old version of the row and a new version of the row being written - which would then have to go through the compaction phase at some later point), while all other CFs would be left untouched? (And the same would be true for “program3” & the col “step3”.)
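
So on the CockroachDB side I imagine the table definition would look roughly like this (just my guess based on the docs, with “mediumblob” mapped to BYTES - not tested):

CREATE TABLE temp_data_tbl (
    url   BYTES NOT NULL,
    step1 BYTES NOT NULL,
    step2 BYTES,
    step3 BYTES,
    PRIMARY KEY (url),
    FAMILY f1 (url, step1),
    FAMILY f2 (step2),
    FAMILY f3 (step3)
);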

Background info
I’m trying to write an app, and until a few months ago I was happy with MariaDB+TokuDB.
Unfortunately TokuDB will be dropped by MariaDB & PerconaDB => I’m trying to find a replacement:

  • I excluded MariaDB’s InnoDB & Columnstore (for multiple reasons).
  • I’m already using Clickhouse for another part of the workload, but it’s currently not ideal for this kind of table (it feels like “too many hacks” for frequent DML ops; this might change in the future as they might implement kind-of-atomic update/delete ops).
  • I’m struggling with this workload with MyRocks on MariaDB, as compactions - especially when dealing with “step3” of the above example - take a long time (a lot of I/O). My wild guess is that during those compaction phases the DB is dealing with the data stored in all the other columns as well, and not only with the data stored in “step3”.

Thank you :slight_smile:

I think your understanding is correct. However, 4MiB is quite large for a cockroach row. I don’t think anything will explode, and, if it’s usually much smaller than that, I think you’ll be okay. One thing to note is that cockroach cannot split ranges within a single row. If the garbage for a given row ends up being more than 1GiB then you’ll experience backpressure.

I think I’d separate the data out and key it on the hash of the URL in a separate table. Big rows scare me. I’ve left the chunk size as a variable, but there’s probably a sane value to pick; I’d probably do something in the 10s-100s of kilobytes.

Imagine we have the following prepared statements (defined below):

insert_url (url STRING, stage1_data BYTES, chunk_size INT);
upsert_stage (url STRING, stage INT, stage_data BYTES, chunk_size INT);
query_stage (url STRING, stage INT);
query_stages (url STRING);
delete_url (url STRING);

We can interact with them by materializing the data at read time and decomposing it at write time:

demo@127.0.0.1:26257/demo> EXECUTE insert_url ('foo', repeat('a',  10)::BYTES, 4);
INSERT 3

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/demo> EXECUTE query_stages ('foo');
  text | stage |    data
-------+-------+-------------
  foo  |     1 | aaaaaaaaaa
(1 row)

Time: 1ms total (execution 1ms / network 0ms)

demo@127.0.0.1:26257/demo> EXECUTE upsert_stage ('foo', 1, repeat('b', 12)::BYTES, 4);
DELETE 0

demo@127.0.0.1:26257/demo> EXECUTE query_stages ('foo');
  text | stage |     data
-------+-------+---------------
  foo  |     1 | bbbbbbbbbbbb
(1 row)

Time: 1ms total (execution 1ms / network 0ms)

demo@127.0.0.1:26257/demo> EXECUTE upsert_stage ('foo', 2, repeat('c',  12)::BYTES, 4);
DELETE 0

Time: 3ms total (execution 3ms / network 0ms)

demo@127.0.0.1:26257/demo> EXECUTE query_stages ('foo');
  text | stage |     data
-------+-------+---------------
  foo  |     1 | bbbbbbbbbbbb
  foo  |     2 | cccccccccccc
(2 rows)


demo@127.0.0.1:26257/demo> EXECUTE delete_url ('foo');
DELETE 1

Time: 2ms total (execution 2ms / network 0ms)

demo@127.0.0.1:26257/demo> EXECUTE query_stages ('foo');
  text | stage | data
-------+-------+-------
(0 rows)

Time: 1ms total (execution 1ms / network 0ms)

The actual statements and tables:

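-- One row per URL; the key is the sha256 hash of the URL, and the chunked per-stage data lives in a separate table.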
CREATE TABLE temp_data_tbl (
	key STRING AS (sha256(url)) STORED NOT NULL PRIMARY KEY,
	url VARCHAR(1024) NOT NULL
);

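-- Chunked stage data; (key, stage, seq) keeps the chunks of each stage in order for reassembly.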
CREATE TABLE temp_data_stage_chunks (
	key   STRING REFERENCES temp_data_tbl (key) ON DELETE CASCADE,
	stage INT8,
	seq   INT8,
	chunk BYTES,
	PRIMARY KEY (key, stage, seq)
);

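-- insert_url(url, stage1_data, chunk_size): create the URL row and store the stage-1 data as chunk_size-byte pieces.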
PREPARE insert_url
     AS   WITH chunks AS (
						SELECT row_number() OVER (ORDER BY n) AS seq,
						       substring($2:::BYTES, n, $3) AS chunk
						  FROM ROWS FROM (generate_series(1, length($2:::BYTES), $3)) AS n
                      ),
               key AS (INSERT INTO temp_data_tbl (url) VALUES ($1) RETURNING key)
        INSERT
          INTO temp_data_stage_chunks (key, stage, seq, chunk)
        SELECT key, 1, seq, chunk
          FROM key, chunks;

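-- upsert_stage(url, stage, stage_data, chunk_size): overwrite the chunks of one stage and trim any leftover higher-seq chunks from a previously larger version of that stage.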
PREPARE upsert_stage
     AS        WITH chunks AS (
							SELECT row_number() OVER (ORDER BY n) AS seq,
							       substring($3:::BYTES, n, $4) AS chunk
							  FROM ROWS FROM (generate_series(1, length($3:::BYTES), $4)) AS n
                           ),
                    key AS (SELECT sha256($1:::STRING) AS key),
                    upserted AS (
								   UPSERT
								     INTO temp_data_stage_chunks (key, stage, seq, chunk)
								   SELECT sha256($1), $2, seq, chunk
								     FROM chunks
								RETURNING key, seq
                             )
        -- trim chunks left over from a previous, longer version of this stage
        DELETE FROM temp_data_stage_chunks
              WHERE key = (SELECT key FROM key)
                AND stage = $2
                AND seq > (SELECT max(seq) FROM upserted);

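-- query_stages(url): reassemble the data of every stage for the given URL by concatenating its chunks in seq order.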
PREPARE query_stages
     AS   SELECT $1::STRING, stage, concat_agg(chunk ORDER BY seq ASC) AS data
            FROM (SELECT * FROM temp_data_stage_chunks WHERE key = sha256($1:::STRING))
        GROUP BY stage;

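-- query_stage(url, stage): same as query_stages, but for a single stage.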
PREPARE query_stage
     AS   SELECT $1::STRING, stage, concat_agg(chunk ORDER BY seq ASC) AS data
            FROM (
					SELECT *
					  FROM temp_data_stage_chunks
					 WHERE key = sha256($1:::STRING) AND stage = $2
                 )
        GROUP BY stage;

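-- delete_url(url): delete the URL row; ON DELETE CASCADE removes its chunks.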
PREPARE delete_url AS DELETE FROM temp_data_tbl WHERE url = $1;
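
If you want to sanity-check how a given URL’s data got chunked, something along these lines should work against the tables above (purely illustrative):

SELECT stage, count(*) AS chunks, sum(length(chunk)) AS bytes
  FROM temp_data_stage_chunks
 WHERE key = sha256('foo')
 GROUP BY stage;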

Thanks a lot for the reply & the nice code/demo!
Ok, I think I understood what you wrote (about splitting the data up into chunks), but I’m not sure about what you mentioned here:

If the garbage for a given row ends up being more than 1GiB then you’ll experience backpressure.

What do you mean by “for a given row … 1GiB”?
I think that with the table structure I mentioned, the total max size of a row should be ~21.5MiB
(
[1KiB + 4MiB for the initial insert of url&step1 cols] + [1KiB + 4MiB + 4MiB for the update of step2 col] + [1KiB + 4MiB + 4MiB + 1.5MiB for the update of step3 col]
)… .
Therefore, I should still be far away from that 1GiB limit per “row”, or am I missing some important piece of information? Maybe something related to what you wrote “…Cockroach cannot split ranges within a single row”?
What kind of “ranges” are you referring to? :thinking:

Sorry if the questions are stupid - as I said, I went through the docs only once to get a general idea of the DB, so there’s probably still a lot of stuff I don’t know.

Btw, fyi, looking at the contents of my current MyRocks temporary table (it currently holds 1.5 million rows, will grow up to a max of ~2 million, then the “deletes” start and bring it back down towards almost 0, and then the whole thing starts over again), the average sizes of the columns are:

  • url: ~72 bytes
  • step1: ~78 KiB
  • step2: ~145 KiB
  • step3: ~38 KiB

So, it’s not a lot.
It’s not impossible that the averages will be bigger in other runs, but I don’t expect huge differences compared to this.
From time to time there might be hundreds of rows that reach the max size limit of the columns, but the overall average should still be similar.

Thank you :slight_smile:

It sounds like the single table with different column families should work for your use case.

What do you mean with “for a given row … 1GiB”?

You mentioned earlier that you may re-write a column. Each time you re-write a column, the old version becomes garbage, and that garbage is still attributed to the row. Another thing you can do is shorten the GC TTL for this table so that garbage is deleted sooner; that should mitigate any problems with it. The default is 25h. You may want to go down to 5-10 minutes for this use case. That way, even if you end up retrying for a long time, you’re not likely to get stuck behind the garbage.
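
For reference, the GC TTL is a per-table zone configuration. Something like this (using the table name from your example; 600 seconds is just an illustrative value) would set it to 10 minutes:

ALTER TABLE temp_data_tbl CONFIGURE ZONE USING gc.ttlseconds = 600;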

Ok, understood. (Yeah, I’m already using the “ttl” setting for one of my tables in MyRocks.)
So, I’ll give CockroachDB a try - the documentation is nice and the DB seems to have interesting features. Let’s see what happens :stuck_out_tongue:
Thanks a lot for your help!