CockroachDB : 19.2
Kubernetes : 1.15.3 (running on Ubuntu 18.04.3 LTS)
I know that CockroachDB is not meant to handle large record sizes, and that storing files inside CockroachDB is frowned upon, but I believe I have come across a bug. As I understand it, large record sizes should only hurt performance, not bring down an entire node. What we have done is build a file store inside CockroachDB by splitting each file into 100 KB chunks and storing each chunk in its own record. Since most of the files we store are smaller than 100 KB, this seems like a workable solution; only every now and then would we store something larger, say 5 - 12 MB. There is only one table, called file, and it looks like so:
CREATE TABLE file (
    account_id INT8 NOT NULL,
    user_id INT8 NOT NULL,
    path STRING NOT NULL,
    "offset" INT8 NOT NULL,
    data BYTES NULL,
    size INT8 NOT NULL,
    CONSTRAINT "primary" PRIMARY KEY (account_id ASC, user_id ASC, path ASC, "offset" ASC),
    FAMILY "primary" (account_id, user_id, path, "offset", data, size)
)
We are using PHP 7's pg_XXX functions to push and pull data in this table, and DBeaver to administer it. The cluster is very active, with roughly 100 records constantly flowing into other tables (NOT this one) every 10 seconds, and everything has been healthy and running well for months now, until we start inserting data into this "file" table. Sometimes it works and sometimes it doesn't. Sometimes it crashes while we are inserting data, and other times while we are reading from the table. Sometimes it's when we interact with it using PHP, and other times it crashes when we use DBeaver. We have also seen that increasing the chunk size, to say 500 KB, makes it crash sooner. At 100 KB we can push and pull 'files' through this table a handful of times (about 10) before it crashes. But lowering the chunk size below 100 KB, to say 10 KB, starts making the whole solution less feasible: storing a 12 MB file in 10 KB chunks would spawn 1,200 records, so managing it becomes clunky.
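For clarity, this is roughly the chunking scheme we use. It is a minimal illustrative sketch in Python, not our actual PHP code, and the names (`CHUNK_SIZE`, `split_into_chunks`, `reassemble`) are made up for this example:

```python
# Sketch of the chunking scheme: each ("offset", chunk) pair maps to one
# row in the `file` table; reassembly concatenates chunks in offset order.
# Names here are hypothetical, not from our real code.

CHUNK_SIZE = 100_000  # 100 KB per record, as described above

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (offset, chunk) pairs, one per row."""
    for offset in range(0, len(data), chunk_size):
        yield offset, data[offset:offset + chunk_size]

def reassemble(rows):
    """Recover the original file by concatenating chunks in offset order."""
    return b"".join(chunk for _, chunk in sorted(rows))

# A 12 MB file at 100 KB per chunk spawns 120 rows;
# at 10 KB per chunk it would spawn 1,200 rows.
blob = bytes(12_000_000)
rows = list(split_into_chunks(blob))
assert len(rows) == 120
assert reassemble(rows) == blob
```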
When it does crash, the node goes offline completely. But it seems to crash ONLY the node we are connecting to (via Kubernetes' external IP), so I am not 100% sure whether this is a Kubernetes bug or a CockroachDB bug. It looks as if the network simply goes away and that node is no longer able to communicate with the rest of the cluster. The only recovery after that is to restart the entire server that node lives on. And there is still 80% of RAM available before, during, and after the crash (±6 GB free). Here are some logs:
Everything is still all fine and dandy over here:
Then this happens:
And then this:
Since the node crashes completely, I am not able to pull log dumps via Kubernetes, which is why I could only get screenshots.
Any idea what could be causing this? Or what other logs I can look at to see if I can find a problem?