CockroachDB use in healthcare | Questions with use cases

This topic is about using CockroachDB within the healthcare sector, and I have several questions.

For each question I will try to explain the use case to the best of my ability.

#1 How to query range fragmentation (Fragmentation)
Use Case:
I would like to know how I can query the range fragmentation in relation to an interleaved table configuration.

Setup: Patient (parent) -> child tables -> child tables; examples of child tables: orders, laboratory results, documents, etc.

The current default range size is 64 MB.

I would like to answer the following question:
To what size do I need to increase the range size? I know for a fact that the majority of patients use more than a single range.

Therefore I would like to query the database and get back a report telling me, for a given threshold such as 90%, the range size that those patients' data would fit into. Preferably I would like to be able to supply the threshold within the query. The idea behind this calculation is that it would calculate over the entire interleaved hierarchy for each patient.

Like this:
CALCULATE RANGE_SIZE OVER patient THRESHOLD 90;

Result:
Total rows: 1,859,000

| Threshold | Size     |
|-----------|----------|
| 90        | < 164 MB |

That would tell me that I need to increase my range size to 256 MB.

Question: Is this possible? If not, is there any chance something like this could become a feature?
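For reference, the closest thing I am aware of is listing the existing range boundaries per table; that shows where splits have already happened, but it does not aggregate sizes over the interleaved hierarchy or accept a threshold. A minimal sketch, assuming a version that supports this statement:

SHOW EXPERIMENTAL_RANGES FROM TABLE patient;
SHOW EXPERIMENTAL_RANGES FROM TABLE orders;   -- interleaved child tables can be inspected the same way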

#2 Does increasing the range size also reorder the data?
When I have patient data fragmented over several ranges, does the automatic rebalancing after increasing the range size also reorder the ranges and try to put the patient data back into a single range as much as possible, the way it does when storing patient data while there is still space left in the current range? I could not find any information about this in the docs.

#3 Create a range on new patient INSERT and reserve the range for a single patient

Is it possible to tell the parent interleaved table to create a new range when a new patient is inserted into the top-level patient table? This would include telling CockroachDB to keep that entire range reserved for this patient only, and when the range is full it would simply create a new range also belonging to this patient.

Why:
Imagine the following scenario (this is also linked to #1): 90% of the patient data would fit into 256 MB. By telling the top interleaved table to create a new range for each patient, all data for a patient would be kept in a single range, which would ensure consistently high performance.

The other 10% could be moved to separate nodes with their own replication zone for patients with a large dataset. (It is unclear to me whether this would be possible even with the enterprise license.) I don't want to pin a single row to a specific replication zone, but rather a row with all of its interleaved child data into a range on a set of nodes, where the range size would be about 1 GB for that specific replication zone only. That would allow me to store PET, CT, and MRI scans and still give me a lot of performance.
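To make the replication-zone part concrete, this is roughly the shape of what I have in mind; a hypothetical sketch only, using the zone-configuration syntax of newer releases, where the table name, node attribute, and the 1 GB range size are my own assumptions rather than something I know to be supported for this purpose:

-- Give the large-scan data its own zone with bigger ranges, constrained to
-- nodes started with --attrs=largestore.
ALTER TABLE patient_scans CONFIGURE ZONE USING
    range_max_bytes = 1073741824,     -- roughly 1 GiB ranges, for this zone only
    constraints = '[+largestore]';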

Hopefully you can see that these items are all interconnected; I hope someone can provide some answers or insight.

Thanks in advance.
Gert-Jan Timmer

Hey @GJRTimmer,

Thanks for the feedback and thorough questions, this is super helpful. Right now, we really don’t recommend increasing the range size beyond the default of 64MiB; doing so would substantially increase the risk of running out of memory, and it wouldn’t get you what you’re looking for (single-range storage). We currently don’t have a feature that would guarantee range splits along the patient record. I think that answers your second and third questions as well - we currently don’t have a way of splitting ranges based on the value of the parent in an interleaved table.

We also don’t recommend saving rows > 1 MB, and it sounds like you have rows that are significantly larger than that, so the recommendation would be to move the larger files to external storage and refer to them within CRDB. This will likely get you a much larger boost in performance, though I can also understand why it may seem like a more difficult path.
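As a rough illustration of that pattern (the table and column names are just examples, assuming the parent patient table is keyed on patient_id, not a prescribed schema): keep the scan metadata and a pointer in CRDB, and keep the multi-GB file itself in an external object store.

CREATE TABLE scans (
    patient_id  INT NOT NULL,
    scan_id     INT NOT NULL,
    modality    STRING,             -- e.g. 'MRI', 'CT', 'PET'
    storage_url STRING NOT NULL,    -- pointer to the file in external storage
    sha256      BYTES,              -- optional integrity check for the file
    PRIMARY KEY (patient_id, scan_id)
) INTERLEAVE IN PARENT patient (patient_id);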

Is the question being driven by the desire to avoid tradeoffs like those described in the interleaved table document? Maybe we can help out another way: could you describe the problems that you’re trying to solve here? Have you set up a test cluster with the defaults and run into performance issues? Maybe we can help by looking at a representative query and schema and seeing if there’s another way to approach the problem.

@tim-o

Thank you for your reply, very helpful.

Yes, the questions are based upon the documentation.

Patient data must be stored for several years, depending on national legislation. CockroachDB is one of the very few databases whose design focuses on fault tolerance and resiliency.

The questions come from a performance point of view. I was considering using CockroachDB to also store patient documents and scans such as MRI, PET, and CT scans (scans are commonly at least a few GB each). But I will consider using a separate storage system for them. I really like the fault tolerance and replication of CockroachDB.

As the docs state, if you have an interleaved hierarchy the data is stored as follows:

/patient/1
/patient/1/orders/5
/patient/1/laboratory/results

CockroachDB will try to keep as much of the interleaved data together within a range as possible, so this is where my first and second questions come from.

Could you elaborate a little more on the second item? I'm curious to know: if I have enough memory and decide to increase the range size, will CockroachDB reorder the data as described in #2, or will the data still be fragmented?

A little bit off-topic, but still healthcare:
In case you are interested in an additional feature request for CockroachDB:
I'm a Senior Integration Consultant specializing in healthcare. Within healthcare, one of the common data formats used for integration between systems (which is due to other suppliers, so unfortunately we cannot change the world that fast) is base64. I've noticed that CockroachDB does not have base64 encode and decode functions. Could you consider adding them?

I’ll let Tim follow up on your first question.

I’m happy to say base64 was added in v2.1.0-alpha.20180507 as one of the valid formats for the encode and decode functions.
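For example (the values here are just illustrative):

SELECT encode(b'patient-report', 'base64');       -- returns 'cGF0aWVudC1yZXBvcnQ='
SELECT decode('cGF0aWVudC1yZXBvcnQ=', 'base64');  -- returns the original bytes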

I've seen the docs; really great. Thanks, keep up the good work.

Hey @GJRTimmer,

As the docs state, if you have an interleaved hierarchy the data is stored as follows:

/patient/1
/patient/1/orders/5
/patient/1/laboratory/results

CockroachDB will try to keep as much of the interleaved data together within a range as possible, so this is where my first and second questions come from.

Could you elaborate a little more on the second item? I'm curious to know: if I have enough memory and decide to increase the range size, will CockroachDB reorder the data as described in #2, or will the data still be fragmented?

All data in CRDB is stored in a monolithic sorted map of key-value pairs, regardless of interleaving. Whether or not the data spans multiple ranges, patient records will be contiguous. Say you have an interleaved table of patients and lab results. For the sake of the example, let's say each patient record is relatively tiny, while results records are just under 1 MB. You start with two patients with 25 results apiece.

At the beginning, your key space will look something like:

**Range 1:**
/patient/1
/patient/1/results/1
/patient/1/results/2
…
/patient/1/results/25
/patient/2/results/1
/patient/2/results/2
…
/patient/2/results/25

Say you then insert an extra 25 results records for patient 1, pushing the range up to 75 MB. The range split would end up looking something like:

**Range 1:**
/patient/1
/patient/1/results/1
...
/patient/1/results/25

**Range 2:**
/patient/1/results/26
…
/patient/1/results/50

**Range 3:**
/patient/2/results/1
…
/patient/2/results/25

So the data remains ordered, even if it’s split among multiple ranges. Each node maintains a copy of the metadata describing the location and contents of ranges, so there’s negligible cost to find out which range contains your data. I think you’re assuming that you’ll get a large performance benefit from creating large ranges to keep all data within a single range, but we don’t see that in practice. Given that latency between nodes within a DC is generally sub-millisecond, the latency involved in hopping across ranges or nodes to retrieve data is negligible.

You can also probably intuit this from the examples above, but it’s worth highlighting that we do not offer explicit control of where and when range splits occur, so even if you increase the memory and max range size, we might end up splitting a range between interleaved values.

In practice, we’re able to achieve ~2ms reads and ~4ms writes with the default settings when latency between nodes is 1ms, so even though it seems like the default would be less than ideal, we’re still able to achieve performance on par with RDBMSs that do not make the same consistency and availability guarantees. If you’re finding that your results are way off from there, looking at the schema, hardware, and queries would probably be a good next step.

Hope that helps!

@tim-o This helps greatly.

One final question for now concerning healthcare.
Several pieces of legislation require that there is an audit trail, and for some parts even the changes within the database must be tracked.

The question is: normally this would be solved by database design, and there are several common patterns that could be used. Does CockroachDB have, or will it have in the future, some kind of audit logging available that includes recording data changes?

It would be nice if you could simply activate an option on a table and it would automatically record all changes; at a lower level, maybe even record all changes to a key/value pair.

Any thoughts or ideas about this?

Audit logs are actually in the product already as an experimental feature: https://www.cockroachlabs.com/docs/v2.0/sql-audit-logging.html#main-content

Given that it’s experimental, feedback is always valuable - let us know if it suits your needs.
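Enabling it is a single statement per table, for example (the table name is illustrative):

-- Record both reads and writes against this table in the audit log.
ALTER TABLE patient EXPERIMENTAL_AUDIT SET READ WRITE;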

Hi @GJRTimmer, feel free to reach out to me with any questions or feature requests around the SQL auditing feature. As Tim mentioned, it’s a new, experimental feature, and we’d love to hear any feedback about it that you may have.