Is it safe to disable fsync?

I’m wondering about the consequences of disabling fsync on CockroachDB nodes.
Could you please shed some light on this issue?
Of course I’m aware that it’s not recommended at all.

So what could happen?

  1. during normal conditions, this should be the same as if fsync were turned on: writes are buffered in RAM for a longer time and reach stable storage at an undetermined point. Assuming the file system’s consistency is not affected, it will be consistent with the file system image last written out.
  2. failure conditions (let’s assume the default replication factor of 3) I can think of:
    2.1. one node crashes/loses power
    2.1.1. the node comes back before all of its ranges could be replicated to other nodes
    2.1.2. the node comes back after all of its ranges have already been replicated to other nodes (not sure whether these two are really different, or the same)
    2.2. two nodes crash/lose power. Besides 2.1.1. and 2.1.2., there are two additional scenarios here:
    2.2.1. the nodes have the same “version” of their ranges after reboot, whether because they crashed at exactly the same time (with exactly the same disk image) or because there were no writes/modifications
    2.2.2. the nodes have different versions after reboot
    2.3. all of the replicas crash/lose power. 2.1 and 2.2 are relevant here, with a multitude of possibilities for when and how the replicas rejoin the cluster

I feel that these problems are very similar to what can happen in a cluster with fsync enabled, so am I right to think that disabling fsync shouldn’t affect consistency, only data durability?
So if it’s guaranteed that without fsync all data is written out every 5 seconds, you can lose up to 5 seconds of data, but the cluster should be able to recover from every situation?

Hi @bra,

It is not safe to disable fsync with CockroachDB. If you attempt to do this, you’ll quickly discover your CockroachDB nodes panicking due to Raft-level invariants being violated (this is pretty easy to test if you’re curious to actually see it happen in practice). The reason you’ll see problems so quickly is that we configure RocksDB to buffer writes in process memory until an fsync is requested. This improves performance, but effectively makes a process crash similar to a machine crash. While we could go back to the old behavior of writing immediately to the filesystem and calling fsync separately for durability, all that does is make the Raft-level invariant failures somewhat rarer (because machine crashes are rarer than process crashes); it does not eliminate them.
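To make the two crash models concrete, here’s a minimal Go sketch contrasting the two buffering modes described above. This is not CockroachDB’s actual code; the file name and entries are made up for illustration:

```go
package main

import (
	"bytes"
	"os"
)

func main() {
	f, err := os.OpenFile("wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Mode A: write() to the OS immediately, fsync() separately. After
	// Write returns, the data sits in the OS page cache, so a *process*
	// crash cannot lose it; only a machine crash before Sync() can.
	f.Write([]byte("entry 1\n"))
	f.Sync() // fsync: now durable against machine crashes too

	// Mode B (what the answer describes CockroachDB configuring RocksDB
	// to do): accumulate writes in process memory and hand them to the OS
	// only when durability is requested. A process crash now loses
	// everything still in buf, which is why a process crash behaves like
	// a machine crash under this scheme.
	var buf bytes.Buffer
	buf.WriteString("entry 2\n")
	buf.WriteString("entry 3\n")
	buf.WriteTo(f) // flush the process-memory buffer to the OS...
	f.Sync()       // ...and fsync, in one step
}
```

In mode A, only a machine crash in the window between Write and Sync loses data; in mode B, a plain process crash loses whatever is still sitting in buf, which is why the invariant violations show up so quickly.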

What are these Raft-level invariants? Raft requires that certain operations, such as voting for the Raft leader and appending entries to the Raft log, must be durable. Our Raft implementation cannot handle a situation in which these invariants are violated. I can imagine a consensus protocol designed around non-durable writes, but such a protocol would not be Raft (and does not yet exist as far as I know).
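As a rough illustration of that invariant, here’s a hedged Go sketch of what a follower conceptually does when appending entries. The types and helpers are hypothetical; they come from neither CockroachDB nor etcd/raft:

```go
package main

import "os"

type Entry struct {
	Term, Index uint64
	Data        []byte
}

// handleAppendEntries is what a follower conceptually does. The ordering
// is the invariant: the entries must be durable BEFORE the ack is sent.
func handleAppendEntries(log *os.File, entries []Entry) error {
	for _, e := range entries {
		if _, err := log.Write(encode(e)); err != nil {
			return err
		}
	}
	// If fsync is a no-op (e.g. ZFS sync=disabled), Sync returns success
	// but the entries may still vanish in a crash. The leader, having
	// seen the ack, may then commit them -- leaving a "committed" entry
	// that no quorum actually stores, a state Raft has no rule for
	// repairing.
	if err := log.Sync(); err != nil {
		return err
	}
	return sendAck() // only acknowledge once the entries are durable
}

func encode(e Entry) []byte { return e.Data } // placeholder serialization
func sendAck() error        { return nil }    // placeholder RPC reply

func main() {}
```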

Hi,

Thanks for the quick answer!
While testing on FreeBSD/ZFS with fsync disabled, I haven’t (yet) seen those crashes, and I don’t really understand why I should see them.
Could you please share some pointers about that RocksDB setting?
Also I don’t really understand this:

we configure RocksDB to buffer writes in process memory until an fsync is requested

Is this about these settings: https://github.com/facebook/rocksdb/wiki/WAL-Performance
Are you describing the async mode, where you explicitly have to call DB::FlushWAL()?

BTW, “disabling fsync” may need some explanation, I guess. On ZFS, you can set the sync property to standard, always or disabled, which mean: only sync on fsync() calls or their equivalent (standard; of course, you can request sync writes in several ways), do a sync after every write, no matter what (always), and do nothing on fsync calls (disabled).
The latter is what I’m referring to when I say disabling fsync.
From the application’s perspective nothing changes; it must see exactly the same behavior (although the fsync calls return immediately). The only difference is that the buffers are not synced to stable storage at the time of the call, but only when various other conditions are met (like the accumulated buffer size or the age of the data in it exceeding a threshold).
So I’m about 100% sure that an application shouldn’t (and I think can’t) crash because of this setting, but I guess the confusion comes from the lack of a proper definition of “disabling fsync” on my part, sorry for that! I’m not talking about removing fsync() (or its equivalent) from the code itself, which really could cause a crash if the application is designed that way.

Trying to understand the above two statements again:

we configure RocksDB to buffer writes in process memory until an fsync is requested
we could go back to the old behavior of writing immediately to the filesystem and calling fsync separately for durability

I can see the difference between the two, that’s OK. What I don’t really understand is the difference it makes from the viewpoint of the last paragraph.
Is it because writing to an OS buffer and fsyncing irregularly makes it more likely that the OS has already flushed part of the data?
But that could also happen if you buffer in process memory, then write to the same OS buffer and call fsync immediately. Depending on the file system, the amount of data, and many other factors, partial writes could occur, no?
The process of writing is the same if you do fsyncs; only the window is smaller (if you accumulate data in process memory).
Of course there could be other factors, like mixing IOs (fsync will write everything for that fd unconditionally, while buffering in process memory lets you make those writes/syncs selective).
Sorry for asking so much, but I would like to understand it. :slight_smile:

BTW, this is what happens if I enable fsync on a live system receiving a constant stream of inserts:

(this is with HDDs, but with a flash-backed write cache in front of them)
Nothing really unexpected; only those spikes look somewhat bad.

What are these Raft-level invariants? Raft requires that certain operations, such as voting for the Raft leader and appending entries to the Raft log, must be durable. Our Raft implementation cannot handle a situation in which these invariants are violated. I can imagine a consensus protocol designed around non-durable writes, but such a protocol would not be Raft (and does not yet exist as far as I know).

So the key here is the writes acknowledged by the other members? I mean, even when fsyncing the log, a machine can crash and lose the data (although then the others won’t receive an ack, which is of course a big difference from the fsync-disabled case), and I guess Raft will be able to handle that.
Won’t the most up-to-date member be the winner in this case?

Thanks for your patience! :slight_smile:

Could you please share some pointers about that RocksDB setting?

It isn’t a RocksDB setting, but a CockroachDB setting.

kv.raft_log.disable_synchronization_unsafe

Note the warning on the package, though: Setting to true risks data loss or data corruption on server crashes. The setting is meant for internal testing only and SHOULD NOT be used in production.
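For completeness, here’s a hypothetical Go snippet showing how such a cluster setting is flipped over SQL on a throwaway test cluster (CockroachDB speaks the Postgres wire protocol, so an ordinary Postgres driver works; the connection string is made up, and, per the warning above, this must never be done in production):

```go
package main

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver; works against CockroachDB
)

func main() {
	// Connection string for a hypothetical local, insecure test node.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// The setting quoted above: true risks data loss or corruption on
	// server crashes. Internal testing only.
	if _, err := db.Exec(
		`SET CLUSTER SETTING kv.raft_log.disable_synchronization_unsafe = true`,
	); err != nil {
		panic(err)
	}
}
```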

Is this about these settings: https://github.com/facebook/rocksdb/wiki/WAL-Performance
Are you describing the async mode, where you explicitly have to call DB::FlushWAL()?

Yes, DB::FlushWAL() is what I was hinting at.

On ZFS, you can set the sync property to standard, always or disabled, which mean: only sync on fsync() calls or their equivalent (standard), do a sync after every write, no matter what (always), and do nothing on fsync calls (disabled).

I wasn’t aware ZFS gave this level of control. If you use disabled, you’ll be violating invariants that CockroachDB and Raft expect. If a node crashes, your cluster can get into a state which it can’t recover from on its own.

So the key here is the writes acknowledged by the other members? I mean, even when fsyncing the log, a machine can crash and lose the data (although then the others won’t receive an ack, which is of course a big difference from the fsync-disabled case), and I guess Raft will be able to handle that.
Won’t the most up-to-date member be the winner in this case?

With fsync disabled, you don’t necessarily lose a prefix of the operations you performed. Rather, you lose something that is often a prefix, but is not guaranteed to be one (a later write might be made durable before an earlier one). Even if Raft could somehow roll back to the most up-to-date member, you also have to consider what happens to distributed transactions that operate on multiple ranges. If one of those ranges rolls back to some earlier state, then a transaction that you thought was committed was actually not durable, and the D in ACID has been tossed out the window.
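A toy Go simulation of that point (purely illustrative; the real flush order depends on the file system and its settings):

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	writes := []string{"w1", "w2", "w3", "w4", "w5"} // issued in this order

	// Without fsync, the OS flushes dirty data on its own schedule, not
	// necessarily in issue order; model that as a random permutation.
	flushOrder := rand.Perm(len(writes))

	// Simulate a crash after only two flushes made it to disk.
	durable := map[string]bool{}
	for _, i := range flushOrder[:2] {
		durable[writes[i]] = true
	}
	fmt.Println("survived the crash:", durable)
	// A possible outcome: map[w1:true w4:true] -- w4 survived while w2
	// and w3 did not, so what remains is not a prefix of the history.
}
```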

I’m a bit distracted right now and can’t provide a step-by-step guide to the badness, but my short summary is: don’t disable fsync(). CockroachDB definitely cannot handle the resulting recovery scenarios.

Thanks a lot, Peter! I think I got the answer I needed: I can tolerate losing some writes if all of the replicas’ OS/HW suddenly die, but I can’t tolerate data corruption or cluster instability/inability to work after such a condition happens.