I am running a three-node CockroachDB cluster. We have a small testing framework that injects read and write errors into applications and emulates disk corruption to test their behavior. We applied it to this CockroachDB cluster.
Initial state: one database named mydb, containing one table named mytable, which holds one row with two columns (id int, value varchar). The value column of that row is just an 8 KB string.
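For anyone trying to reproduce this, the initial state can be sketched roughly as below. This is a minimal sketch, not our actual setup script: the helper name `initial_state_statements`, the PRIMARY KEY on id, the fill character, and reading "8 KB" as 8192 bytes are all assumptions made for illustration. It only builds SQL text and does not connect to a cluster.

```python
# Hypothetical helper: builds the SQL statements for the initial state.
# It does not talk to a cluster; run the statements with any SQL client.

VALUE_SIZE = 8 * 1024  # assumption: "8 KB" means 8192 bytes


def initial_state_statements(row_id=1, fill="a"):
    """Return SQL that creates mydb.mytable with a single 8 KB row."""
    value = fill * VALUE_SIZE
    return [
        "CREATE DATABASE mydb",
        # PRIMARY KEY is an assumption; the report only gives (id int, value varchar).
        "CREATE TABLE mydb.mytable (id INT PRIMARY KEY, value VARCHAR)",
        f"INSERT INTO mydb.mytable (id, value) VALUES ({row_id}, '{value}')",
        # The read query whose behavior is observed under fault injection.
        "SELECT id, value FROM mydb.mytable",
    ]
```

The last statement is the kind of read query issued after each fault is injected; with no fault it should return exactly one row.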
- When we inject read errors into blocks of the RocksDB log file and then run a read query, sometimes the query fails and sometimes the cluster becomes unavailable. For example, when a read of the first block of the RocksDB log file fails on the master, the client gets an error message like the following:
Exception:database “mydb” does not exist.
On retrying, the client would get Exception:could not connect to server: Connection refused. Is the server running on host “localhost” (127.0.0.1) and accepting TCP/IP connections on port 26257?
The node then crashes. Surprisingly, one other node in the cluster also crashes after this, although no fault was injected on that node. The third node simply stops responding to queries (I believe because there is no majority at that point).
When some other block of the RocksDB log file hits a read error on one node, queries to all nodes result in the following error:
Exception:table “mytable” does not exist.
We could not find a pattern: queries sometimes fail with a ‘table does not exist’ message, sometimes with a ‘database does not exist’ message, and sometimes a node simply crashes, which also renders the cluster unusable.
- When we emulate disk corruption, CockroachDB can sometimes detect the problem through a checksum mismatch. Specifically, when certain blocks in an SST file are corrupted, we get the following error:
Exception:Corruption: block checksum mismatch
All nodes return the same error, although the data is corrupted on only one node. Sometimes, when a block in the log file is corrupted, CockroachDB silently returns no rows. This is data loss: the table contains one row, which is safely replicated on three nodes and persisted on disk.
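To make the distinction between detected and silent corruption concrete, here is a minimal sketch of per-block checksum verification. It does not reproduce RocksDB's on-disk format or its checksum algorithm: `zlib.crc32` stands in for whatever the storage engine uses, and `BLOCK_SIZE`, `write_blocks`, `read_blocks`, and `corrupt` are hypothetical names chosen for illustration. The error text simply mirrors the message we observed.

```python
import zlib

BLOCK_SIZE = 4096  # assumption: block size chosen for illustration only


def write_blocks(data):
    """Split data into blocks and keep a CRC32 next to each block."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        chunk = data[offset:offset + BLOCK_SIZE]
        blocks.append((chunk, zlib.crc32(chunk)))
    return blocks


def read_blocks(blocks):
    """Reassemble the data, failing loudly if any block fails its checksum."""
    out = bytearray()
    for index, (chunk, stored) in enumerate(blocks):
        if zlib.crc32(chunk) != stored:
            raise IOError(f"Corruption: block checksum mismatch (block {index})")
        out += chunk
    return bytes(out)


def corrupt(blocks, index):
    """Emulate disk corruption: flip the first byte of one block in place."""
    chunk, stored = blocks[index]
    blocks[index] = (bytes([chunk[0] ^ 0xFF]) + chunk[1:], stored)
```

With a verified read path like this, a corrupted block produces an error rather than an empty or wrong result. The silent "no rows" case we saw suggests that some read path for the log file either skips such a check or treats a failed block as the end of the log, though that is only our guess.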
The logs for some of the observed problems can be found here: http://pages.cs.wisc.edu/~ra/cockroachbugs.tar.gz
I would be happy to share and discuss the full test results.