Large records, if I care most about READ performance

So CRDB Team, I am looking for some guidance here on the suitability of CRDB for a particular workload. I thank you in advance for sharing your expertise.

My workload involves a one-time ingest of both small structured records (i.e. transaction-type records) and documents. From there, the vast majority of the workload is SELECT statements to find records for review. Small updates occur on record metadata (reviewed: Yes, reviewedBy: Somebody, datestamp) on a per-record basis.
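Roughly, the shape I have in mind looks like this (table and column names are just placeholders to illustrate the access pattern):

```sql
-- Illustrative only: rough shape of the workload described above.
CREATE TABLE records (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    payload     JSONB,        -- small structured / transaction-type data
    reviewed    BOOL DEFAULT false,
    reviewed_by STRING,
    reviewed_at TIMESTAMPTZ
);

-- The bulk of the workload: point reads for review.
SELECT * FROM records WHERE id = $1;

-- Occasional small metadata update, one record at a time.
UPDATE records
   SET reviewed = true, reviewed_by = 'somebody', reviewed_at = now()
 WHERE id = $1;
```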

I understand the write amplification situation and the 2 MB recommended limit on records to maintain performance.

However, and here’s the crux of the question: if I care 95% about SELECTs/reads and can tolerate slower ingest performance (batch loading), can I use CRDB to store documents, some of which are up to, say, 30MB or 50MB per file? I am aware that these large records will slow down INSERTs as they go in. Does putting the document data into a JSONB blob help me? Example: {extension: docx, moddate: ___, content: " … bytes …"}.
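Something like this, purely as an illustration (I’d presumably base64-encode the raw bytes into the content field):

```sql
-- Illustrative: the whole document carried as a single JSONB value.
CREATE TABLE documents_json (
    id  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    doc JSONB
);

INSERT INTO documents_json (doc)
VALUES ('{"extension": "docx", "moddate": "...", "content": "...base64 bytes..."}');
```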

What say you fine CRDB experts?

And thank you for making such a stellar product! (PS. Really really want/need full text search solution! CRDB would be unstoppable with FTS.)

Hey @DaveA,

It’s not just the speed of inserts that’s a concern here - the tl;dr is that in CRDB, if you update a 30MB file, you’ll end up with copies of both the updated and the previous version until the previous version passes the GC threshold (24h by default). We currently restrict range size and transaction size to 64MB, and we can’t guarantee stability if the range size is increased, so you can see how this could quickly lead to trouble.
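As an aside, that GC threshold is configurable per table via zone configs; here’s a rough sketch using the SQL syntax from more recent releases (older versions set zone configs via the `cockroach zone set` CLI), with a made-up table name:

```sql
-- Illustrative: lower the GC window for a hypothetical documents table from
-- the default 24h to 10 minutes. Note this only limits how long old MVCC
-- versions are retained; it does not change the 64MB range/transaction limits.
ALTER TABLE documents CONFIGURE ZONE USING gc.ttlseconds = 600;
```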

For a workload like you’re describing, we’d recommend storing the larger files somewhere external (HDFS/S3/Ceph/etc.) and storing pointers to them within CRDB instead.
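A rough sketch of that pattern (table/column names and the URL scheme are just placeholders):

```sql
-- Small, hot metadata stays in CRDB; the large blob lives in external storage.
CREATE TABLE documents (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    extension      STRING,
    moddate        TIMESTAMPTZ,
    extracted_text STRING,   -- queryable text stays in-row
    blob_url       STRING    -- e.g. 's3://some-bucket/docs/<id>.docx'
);

-- Reads fetch the small row; the application then pulls the blob by URL.
SELECT extension, blob_url FROM documents WHERE id = $1;
```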

Agreed regarding full text search! We actually have a product request for that already - it’s currently not scheduled, so any details you can provide about your use case would be helpful in prioritizing.

Tim:
Thanks much for the reply and helpful info.
Really looking for a way to make this work in CRDB. What if (1) docs never get updated once INSERTed and are only ever read one at a time by ID, say from a Documents table, and (2) the dynamic data/metadata and the extracted text from the documents (always 1-2MB max in total) go into a separate table? Concretely, something along the lines of the sketch below.
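Just a sketch, with made-up names:

```sql
-- Write-once blobs: inserted at ingest, then only ever read by ID.
CREATE TABLE document_blobs (
    id      UUID PRIMARY KEY,
    content BYTES    -- the raw 30-50MB file, never updated
);

-- Dynamic metadata + extracted text, always small (1-2MB total per row).
CREATE TABLE document_meta (
    id             UUID PRIMARY KEY REFERENCES document_blobs (id),
    extension      STRING,
    extracted_text STRING,
    reviewed       BOOL DEFAULT false,
    reviewed_by    STRING,
    reviewed_at    TIMESTAMPTZ
);
```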

Sorry to pester on this topic but it’s an important use case and if there’s a way to do this – separate DB, separate READ ONLY table, etc. – that would be amazing.

Has there been any thought about a FUSE-file-system-type hook in CRDB, like Postgres does with foreign data wrappers? This might strike a nice balance: relevant data in CRDB with great performance and scale, and blob storage managed by CRDB but with the data outside of the storage engine.

And on the FTS, I am well aware of the request, following it closely and hoping it gets on the schedule ASAP.

Thanks again!

Hey @DaveA,

We’re definitely aware of the general need to support the storage of larger files - a separate DB or a read-only file system are both ideas worth exploring further. We’re probably talking several versions out, though, since the work didn’t make it into 2.1 (due in October).

I can’t think of a reason why data that is never updated, and therefore at no risk of exceeding 64MiB, wouldn’t work - it might be worth a test, but proceed with caution since this isn’t something we’ve tested against.

I’m sure we have an issue open for tracking the request to support larger rows - once I dig it up I’ll post here so you can add your feedback.

Turns out it’s our last triple-digit issue. If you’d like to subscribe, you’ll get updates once it’s prioritized for release.