How does CockroachDB deal with small disk corruption

I am doing some research into how CockroachDB deals with silent disk corruption, and I cannot seem to get a straight answer on how it is handled.

The only page that seems to talk about disk corruption is the Disaster Recovery page, which requires manual intervention. Not ideal for a high-availability system.

The Production Checklist page mentions RAID, but RAID alone does not prevent issues caused by aging hardware or filesystem bugs.

Does CockroachDB silently fix checksum errors from replicas? Is there some type of “automatic page repair”?

1 Like

No, Cockroach does not currently fix checksum errors silently, but it does scan the data on replicas to detect them. Detection occurs at two levels. At a high level there are range statistics and range-wide checksums; mismatches here generally represent logical corruption. At a lower level, the data is written to disk with inline checksums per block, and those checksums are verified when the data is scanned. If corruption is discovered, then indeed, that node will crash. That alone does not lead to unavailability: Cockroach is able to up-replicate the data which lived on that node.

You are correct that simultaneous corruption on multiple nodes may be problematic. We generally assume that these sorts of bugs or hardware failures (which we honestly almost never see) are rare and uncorrelated. As cluster sizes and data volumes grow larger, this may become more likely. That said, Cockroach almost always runs on SSDs, mostly in the cloud, and generally with storage/vCPU ratios such that it is not too expensive to scrub all data reasonably frequently, so this tends to be less of a problem in practice. One can also increase the replication factor to mitigate the problem; I feel that 5x replication is often a good choice.
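
To make the block-level check concrete, here is a minimal Go sketch of verifying an inline per-block checksum on read. This is not CockroachDB's actual storage-engine code (Pebble has its own block format and checksum handling); the block layout and the `verifyBlock` helper are assumptions for illustration. The point it shows is the behavior described above: a mismatch is treated as fatal rather than silently repaired.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"log"
)

// Hypothetical block layout: payload followed by a 4-byte CRC32 trailer.
// CockroachDB's storage engine uses its own block format; this is only an
// illustration of inline per-block checksums.
func verifyBlock(block []byte) ([]byte, error) {
	if len(block) < 4 {
		return nil, fmt.Errorf("block too small: %d bytes", len(block))
	}
	payload := block[:len(block)-4]
	stored := binary.LittleEndian.Uint32(block[len(block)-4:])
	if computed := crc32.ChecksumIEEE(payload); computed != stored {
		return nil, fmt.Errorf("checksum mismatch: stored %#x, computed %#x", stored, computed)
	}
	return payload, nil
}

func main() {
	// Build a well-formed block, then corrupt one byte to simulate bit rot.
	payload := []byte("some key/value data")
	block := make([]byte, len(payload)+4)
	copy(block, payload)
	binary.LittleEndian.PutUint32(block[len(payload):], crc32.ChecksumIEEE(payload))

	if _, err := verifyBlock(block); err == nil {
		fmt.Println("clean block: checksum OK")
	}

	block[3] ^= 0xFF // silent corruption
	if _, err := verifyBlock(block); err != nil {
		// In the real system the node terminates rather than serving bad data.
		log.Fatalf("corruption detected, crashing node: %v", err)
	}
}
```

As for the replication factor, it can be raised per zone with CockroachDB's zone configurations, e.g. `ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;`.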

It is also the case that one could imagine doing recovery on a per-range basis. I think part of the current logic is that such corruption may indicate faulty hardware, in which case actively up-replicating away from that device would be a good idea. Cockroach is not particularly good at dealing with degraded hardware; it much prefers a crash-failure mode. These are all topics to be addressed as the product matures.

1 Like

Oh, I should note that if corruption is detected at the logical level (either the stats or the checksum for the entire range does not match across replicas), then only the minority side of the mismatch will crash. If, somehow, there are multiple concurrent corruptions, then indeed you will lose quorum.
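
As a rough illustration of the "minority side crashes" behavior, here is a small Go sketch that groups replica-reported range checksums and flags the replicas that disagree with the most common value. This is not CockroachDB's consistency-checker code; the `replicaChecksum` type and the `pickMinority` helper are made up for illustration, and the real implementation differs in how the check is driven.

```go
package main

import "fmt"

// replicaChecksum is a hypothetical record of one replica's full-range checksum.
type replicaChecksum struct {
	NodeID   int
	Checksum string
}

// pickMinority groups replicas by their reported checksum and returns the
// replicas whose checksum disagrees with the most common one. In the behavior
// described above, those minority replicas are the ones that crash, while the
// majority keeps the range available.
func pickMinority(reports []replicaChecksum) []replicaChecksum {
	counts := make(map[string]int)
	for _, r := range reports {
		counts[r.Checksum]++
	}
	majority := ""
	for sum, n := range counts {
		if n > counts[majority] {
			majority = sum
		}
	}
	var minority []replicaChecksum
	for _, r := range reports {
		if r.Checksum != majority {
			minority = append(minority, r)
		}
	}
	return minority
}

func main() {
	// Three-way replicated range where node 3 reports a divergent checksum.
	reports := []replicaChecksum{
		{NodeID: 1, Checksum: "ab12"},
		{NodeID: 2, Checksum: "ab12"},
		{NodeID: 3, Checksum: "ffff"},
	}
	for _, r := range pickMinority(reports) {
		fmt.Printf("node %d holds a divergent range checksum; it would crash\n", r.NodeID)
	}
}
```

If no checksum holds a clear majority, there is no healthy quorum left to recover from, which is the multiple-concurrent-corruption scenario mentioned above.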

1 Like

Thank you for the answer! While data corruption may happen infrequently, it is important for me to make sure CRDB is covering all its bases. Trust, but verify.

This “scrub” functionality does not seem to be mentioned anywhere in the docs. Is the checksum also verified every time a page is read from disk?

When you say “the node will crash”, does that mean the whole data folder for that node will be deleted? Or will the node restart with the corrupted ranges deleted? Or will the node be ejected from the cluster? I would imagine that k8s would not just automatically bring the node back online with corrupt data?

And of course, I do not expect CRDB to be able to recover from a loss of quorum. Nor do I wish to run CRDB on degraded hardware. Nor do I expect multiple simultaneous disk corruptions. But failures can happen anywhere, and at any time, and being prepared is important.

For your consideration, I think it would be good for the documentation to have at least one paragraph about scrubbing. Disk/file corruption is an item pretty high in the minds of sysadmins, and a paragraph like that would help put people like me at ease.

1 Like

There’s likely more coming soon in the way of product enhancements on the on-disk-corruption front, so this answer may well change over time. But currently, the file / store is not deleted if data corruption is observed. The node could theoretically start up again and rejoin the cluster, but any read of files (or blocks of files) with checksum mismatches will cause the node to crash again.

There’s no risk of propagating corrupt data to other nodes, as any operation attempting to do so would crash the node. The hope at that point is that the repeated crashes get the operator’s attention and encourage them to replace the node with a clean one.
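
To connect this back to the Kubernetes question above, here is a hedged sketch of the failure loop being described: the store is not wiped, the process restarts, and the next read that touches a corrupted block turns into another fatal exit. The `readBlock` helper and the `corrupted` flag are made up for illustration; the point is the crash-restart-crash pattern that is meant to surface the problem to an operator.

```go
package main

import (
	"errors"
	"fmt"
)

var errChecksumMismatch = errors.New("block checksum mismatch")

// readBlock stands in for a storage-engine read; the corrupted flag simulates
// a block whose on-disk checksum no longer matches its contents.
func readBlock(corrupted bool) ([]byte, error) {
	if corrupted {
		return nil, errChecksumMismatch
	}
	return []byte("healthy data"), nil
}

func main() {
	// Simulate an orchestrator (e.g. Kubernetes) restarting the node process.
	// The data directory is never deleted, so the corrupted block is still
	// there on every restart, and every attempt to read it is fatal again.
	for attempt := 1; attempt <= 3; attempt++ {
		fmt.Printf("restart %d: node starts with its existing store\n", attempt)
		if _, err := readBlock(true); err != nil {
			// In the real node this would be a fatal exit, not a handled error.
			fmt.Printf("restart %d: read hit corruption (%v); node crashes again\n", attempt, err)
		}
	}
	fmt.Println("repeated crashes should prompt the operator to replace the node")
}
```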

2 Likes