Disk corruptions and read/write error handling in CockroachDB

Hi,

I am running a three-node CockroachDB cluster. We have a small testing framework that can inject read and write errors into applications and emulate disk corruptions to test their behavior. We applied this to a CockroachDB cluster.

Initial state: one database named mydb, one table within it named mytable, and one row in that table with two columns (id int, value varchar). The value column of the row is just an 8 KB string.

  1. On injecting read errors in blocks of the RocksDB log file and performing a read query, we sometimes see the cluster become unavailable, and sometimes the query fails. For example, when there is an error reading the first block of the RocksDB log file on the master, the client gets an error message like the following:

Exception:database “mydb” does not exist.

On retrying, the client would get Exception:could not connect to server: Connection refused. Is the server running on host “localhost” (127.0.0.1) and accepting TCP/IP connections on port 26257?

And then the node crashes. Surprisingly one other node in the cluster also crashes after this, although the fault was not injected in that node; also, the third node just stops responding to queries (I believe it is because there is no majority at this point).

When some other block of the RocksDB log file hits a read error on one node, queries to all nodes result in the following error:

Exception:table “mytable” does not exist.

We could not find a pattern: queries sometimes fail with a ‘table does not exist’ message, sometimes with a ‘db does not exist’ message, and sometimes the node simply crashes, which also renders the cluster unusable.

  2. When emulating disk corruptions, CockroachDB can sometimes detect the problem through a checksum mismatch. Specifically, when certain blocks in the SST file are corrupted, we get the following error:

Exception:Corruption: block checksum mismatch

All nodes throw the same error, although only one node is corrupted. Sometimes, when a block in the log file is corrupted, CockroachDB silently returns no rows. This is data loss: the table contains one row, which is replicated on three nodes and persisted on disk.
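
For readers unfamiliar with this kind of detection, here is a minimal sketch of per-block checksum verification. It is not RocksDB's actual block format or checksum algorithm: the layout (payload followed by a 4-byte CRC32) and the names `seal_block`, `read_block`, and `flip_bit` are assumptions made up for illustration. It only shows why a flipped bit inside a checksummed block surfaces as an error rather than as wrong data.

```python
import zlib


def seal_block(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 of the payload (illustrative layout only)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def read_block(block: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum does not match."""
    if len(block) < 4:
        raise ValueError("Corruption: block too short")
    payload = block[:-4]
    stored = int.from_bytes(block[-4:], "little")
    if zlib.crc32(payload) != stored:
        raise ValueError("Corruption: block checksum mismatch")
    return payload


def flip_bit(block: bytes, byte_index: int) -> bytes:
    """Emulate a disk corruption by flipping the lowest bit of one byte."""
    corrupted = bytearray(block)
    corrupted[byte_index] ^= 0x01
    return bytes(corrupted)


if __name__ == "__main__":
    block = seal_block(b"x" * 8192)  # roughly the 8 KB value from the test setup
    read_block(block)                # an intact block reads back without error
    try:
        read_block(flip_bit(block, 100))
    except ValueError as err:
        print(err)
```

A check like this only protects data that is covered by a checksum and actually verified on read, which may be why corruption in some parts of a file is detected while corruption elsewhere is not.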

The logs for some of the observed problems can be found here: http://pages.cs.wisc.edu/~ra/cockroachbugs.tar.gz

I would be happy to share and discuss the entire testing results.

This is quite interesting. I believe your findings are important for CockroachDB, and I hope the team adds tests for this. We should expect files to be corrupted occasionally, since disk drives don’t last forever. But one broken file (whether it’s one block or multiple) shouldn’t corrupt data in the rest of the cluster.

Another reason this is important is that we should expect the RocksDB files to often be stored on distributed/remote file systems, where the expected probability of errors is higher than on local disks.

There are 3 ways in which we will mitigate out-of-disk-space errors.

  1. Generate a simple metric that will bubble up to the Admin UI (and Prometheus and other reporting systems) when a store is low on disk space (<10% left) or critical (<5% left).
  2. Panic/stop a node when we hit a critical disk space limit (before RocksDB does). This should be around <1% left.
  3. Consider moving back to %capacity left disk usage instead of range counts for rebalancing. This will solve the heterogeneous clusters (clusters with different disk sizes) issue.
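
As a rough illustration of the thresholds in items 1 and 2, the classification could look something like the sketch below. This is not CockroachDB code: the function names, the state labels, and the choice of exactly 1% for the stop limit (the text above only says "around <1%") are assumptions for the example.

```python
import shutil

# Thresholds taken from the list above; STOP_FRACTION is an assumed exact
# value for "around <1% left".
LOW_FRACTION = 0.10
CRITICAL_FRACTION = 0.05
STOP_FRACTION = 0.01


def classify_free_space(free_bytes: int, total_bytes: int) -> str:
    """Map the free fraction of a store to 'ok', 'low', 'critical' or 'stop'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction_free = free_bytes / total_bytes
    if fraction_free < STOP_FRACTION:
        return "stop"      # item 2: stop the node before the storage engine fails
    if fraction_free < CRITICAL_FRACTION:
        return "critical"  # item 1: critical metric
    if fraction_free < LOW_FRACTION:
        return "low"       # item 1: low-disk-space metric
    return "ok"


def classify_store(path: str) -> str:
    """Classify the filesystem that holds a (hypothetical) store directory."""
    usage = shutil.disk_usage(path)
    return classify_free_space(usage.free, usage.total)


if __name__ == "__main__":
    print(classify_store("."))
```

Working with the fraction of capacity left rather than absolute bytes also matches item 3, since the same thresholds then apply to stores with different disk sizes.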