Operations(Insert/update/delete) on multiple column family in a table

For creating multiple column family on a table to separate frequent and rarely update columns seems interesting. but I cannot find details of operations other than the feature exist. for example, Can I insert/update/delete only a single column family and leave along other column family in the same table? any if syntax exist.

Please share anything that help me understand and put the multiple column family feature in development, and best practices for modeling, table design on the subject.

Thanks,

Update in particular will only write to the column family being updated. There is some nuance with NULL-ability and column families. If the column family does not have at least one not-null column then update operations which normally might not need to read the other column families will. Thus, if you want column families to work well in cockroachdb, it’s best to have at least one NOT NULL column per column family. There is no special syntax, it’s just about the fields implied by the update operation.

To make this more concrete, in YCSB, pretty much every operation operates on different columns. YCSB is also a very contended workload. In order to separate the contention we use the following schema in cockroachdb that puts each column in a different column family. This way an update to one column does not need to contend with updates to another column.

CREATE TABLE usertable (
    ycsb_key VARCHAR(255) PRIMARY KEY NOT NULL,
    FIELD0 TEXT NOT NULL,
    FIELD1 TEXT NOT NULL,
    FIELD2 TEXT NOT NULL,
    FIELD3 TEXT NOT NULL,
    FIELD4 TEXT NOT NULL,
    FIELD5 TEXT NOT NULL,
    FIELD6 TEXT NOT NULL,
    FIELD7 TEXT NOT NULL,
    FIELD8 TEXT NOT NULL,
    FIELD9 TEXT NOT NULL,
    FAMILY (ycsb_key),
    FAMILY (FIELD0),
    FAMILY (FIELD1),
    FAMILY (FIELD2),
    FAMILY (FIELD3),
    FAMILY (FIELD4),
    FAMILY (FIELD5),
    FAMILY (FIELD6),
    FAMILY (FIELD7),
    FAMILY (FIELD8),
    FAMILY (FIELD9)
)

Hope that helps.

Please elaborate on what will happen when updating a column family that violates the rule on “at least one not-null column” and why “update operations which normally might not need to read the other column families will.”

On the example with each column in a column family, it prevents contention on updates of each column. However, inserting a new row on this table will create multiple column family records (key/val) for each column. How much the cost in space and time/delay that may introduce comparing to default single family?

Please elaborate on what will happen when updating a column family that violates the rule on “at least one not-null column” and why “update operations which normally might not need to read the other column families will.”

If you don’t have a not-null value in the column family then an update will need to read the first column family to decide whether the row exists or not. For column families with all null columns we don’t write an entry if all of the values are NULL. Because of the SQL semantics, we need to know whether the row already existed. Just reading the single column family won’t have enough information to know that if the entry could be missing because of the NULLs. In that case, you may then contend with the other column families.

However, inserting a new row on this table will create multiple column family records (key/val) for each column. How much the cost in space and time/delay that may introduce comparing to default single family?

It costs very little in terms of time and space. The storage engine has prefix compression. Given we tag the columns inside the value if each column is in a single family, my guess is that the size overhead difference is near zero. The time/delay should be exactly the same. All of the work is done in the same request and the columns families are collocated. The cost will come when trying to scan all of the column families in the face of a high update rate. In that case you’ll need to skip over all of the garbage of all of the column families whereas if there were just one column family, there’d be no need to read over any garbage. Setting a shorter GC TTL for update-heavy workloads can help.