Independent Garbage Collection?

I have a question about the GC process. It seems the garbage collection for a range is done via Raft:

It collects the keys to be GC’ed and sends requests instructing the replicas to GC the same set of keys.

I’m curious why this is preferred over the (lighter-weight) approach of each replica doing GC independently (but according to the same GC policy). Is it because it’s also GC’ing intents and abort caches, and resolving intents and pushing txns should be done by the leader lease holder only? If so, can the user data (non-intents) be GC’ed independently?

It’s true that the writes performed by garbage collection would be a little cheaper if each replica did its GC independently, but it’s not clear that that would be a net win, because each node would then need to do all the read work associated with garbage collection (essentially a full scan of the data).
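To make the trade-off concrete, here is a toy Go model (not CockroachDB code) that counts how many full scans each approach performs. The names `node`, `findGarbage`, `replicatedGC`, and `independentGC` are hypothetical, and the GC rule is deliberately simplified: any version older than the threshold that is not the newest version of its key counts as garbage.

```go
package main

import "fmt"

// mvccVersion identifies one version of a key (hypothetical, simplified).
type mvccVersion struct {
	Key       string
	Timestamp int64
}

// node is a toy replica that counts how many full scans it has performed.
type node struct {
	data  map[mvccVersion]bool
	scans int
}

// findGarbage does one full scan and returns every version that is older
// than threshold and is not the newest version of its key (toy policy).
func (n *node) findGarbage(threshold int64) []mvccVersion {
	n.scans++
	newest := map[string]int64{}
	for v := range n.data {
		if v.Timestamp > newest[v.Key] {
			newest[v.Key] = v.Timestamp
		}
	}
	var garbage []mvccVersion
	for v := range n.data {
		if v.Timestamp < threshold && v.Timestamp < newest[v.Key] {
			garbage = append(garbage, v)
		}
	}
	return garbage
}

// apply deletes exactly the listed versions; no scan is needed.
func (n *node) apply(garbage []mvccVersion) {
	for _, v := range garbage {
		delete(n.data, v)
	}
}

func newNode() *node {
	return &node{data: map[mvccVersion]bool{
		{"a", 1}: true, {"a", 5}: true, {"b", 3}: true,
	}}
}

// replicatedGC: one node scans, then every node applies the same key list.
func replicatedGC(nodes []*node, threshold int64) {
	garbage := nodes[0].findGarbage(threshold)
	for _, n := range nodes {
		n.apply(garbage)
	}
}

// independentGC: every node scans its own data and deletes what it finds.
func independentGC(nodes []*node, threshold int64) {
	for _, n := range nodes {
		n.apply(n.findGarbage(threshold))
	}
}

func totalScans(nodes []*node) int {
	total := 0
	for _, n := range nodes {
		total += n.scans
	}
	return total
}

func main() {
	a := []*node{newNode(), newNode(), newNode()}
	replicatedGC(a, 10)
	b := []*node{newNode(), newNode(), newNode()}
	independentGC(b, 10)
	fmt.Println("replicated scans:", totalScans(a), "independent scans:", totalScans(b))
}
```

In this model the replicated variant pays for one scan plus a replicated delete, while the independent variant pays for one scan per replica. Which one is cheaper in a real cluster depends on how the cost of the extra writes compares with the cost of the extra scans.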

Additionally, performing garbage collection independently on each replica would complicate other parts of the system, such as the consistency checker, which would need to become aware of garbage collection policies so that it can exclude data that might be eligible for GC on any replica.
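A small Go sketch of why the consistency checker is affected, assuming (for illustration only) that the checker compares a checksum over each replica's full contents. The types and the functions `checksum` and `applyGC` are hypothetical, not the actual checker.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// version identifies one version of a key (hypothetical, simplified).
type version struct {
	key string
	ts  int64
}

// replica maps each stored version to its value.
type replica map[version]string

// checksum hashes the replica's contents in a deterministic order.
func checksum(r replica) [32]byte {
	vs := make([]version, 0, len(r))
	for v := range r {
		vs = append(vs, v)
	}
	sort.Slice(vs, func(i, j int) bool {
		if vs[i].key != vs[j].key {
			return vs[i].key < vs[j].key
		}
		return vs[i].ts < vs[j].ts
	})
	var sb strings.Builder
	for _, v := range vs {
		fmt.Fprintf(&sb, "%s@%d=%s;", v.key, v.ts, r[v])
	}
	return sha256.Sum256([]byte(sb.String()))
}

// applyGC deletes exactly the listed versions.
func applyGC(r replica, garbage []version) {
	for _, v := range garbage {
		delete(r, v)
	}
}

func newReplica() replica {
	return replica{
		version{"a", 1}: "old",
		version{"a", 5}: "new",
		version{"b", 3}: "x",
	}
}

func main() {
	r1, r2 := newReplica(), newReplica()
	garbage := []version{{"a", 1}}

	// Independent GC: r1 has collected, r2 has not yet.
	applyGC(r1, garbage)
	fmt.Println("checksums match mid-GC:", checksum(r1) == checksum(r2))

	// Once r2 applies the same deletions, the replicas agree again.
	applyGC(r2, garbage)
	fmt.Println("checksums match after both GC:", checksum(r1) == checksum(r2))
}
```

When GC is a replicated command, every replica applies the same deletions at the same point in the log, so the window in which the checksums differ does not exist at any given log position. With independent GC, the checker would have to skip or normalize GC-eligible data to avoid false alarms.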

If garbage collection is done by a custom RocksDB compaction filter, it doesn’t incur the additional cost of a full scan on each node, right? It seems to me that’s a better place to drop unwanted old data. Did you see any problems with this approach?

The consistency checker would indeed be more complicated.

Yes, we considered using a compaction filter (and even used one for this a long time ago), but we moved it to a Raft command to make consistency checking (etc.) easier. Also, while a compaction filter is cheaper, there are no guarantees about how often compactions occur, so you may end up keeping garbage data around much longer.
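The timing point can be illustrated with a toy Go model. It does not use the RocksDB API; `sstFile`, `compact`, and `countExpired` are hypothetical stand-ins. A filter only gets to drop data from files that a compaction actually rewrites, so garbage in files that are never picked for compaction stays on disk.

```go
package main

import "fmt"

// entry is one stored record; expired marks it as eligible for GC.
type entry struct {
	key     string
	expired bool
}

// sstFile is a toy stand-in for an on-disk table file.
type sstFile struct {
	entries []entry
}

// compact rewrites one file, keeping only the entries the filter accepts.
func compact(f *sstFile, keep func(entry) bool) {
	kept := f.entries[:0]
	for _, e := range f.entries {
		if keep(e) {
			kept = append(kept, e)
		}
	}
	f.entries = kept
}

// countExpired reports how much garbage is still stored across all files.
func countExpired(files []*sstFile) int {
	n := 0
	for _, f := range files {
		for _, e := range f.entries {
			if e.expired {
				n++
			}
		}
	}
	return n
}

func newFiles() []*sstFile {
	return []*sstFile{
		{entries: []entry{{"a", true}, {"b", false}}},
		{entries: []entry{{"c", true}, {"d", true}}},
	}
}

func main() {
	files := newFiles()
	dropExpired := func(e entry) bool { return !e.expired }

	// Only the first file happens to be compacted; the second is untouched.
	compact(files[0], dropExpired)
	fmt.Println("expired entries still stored:", countExpired(files))
}
```

In this model the second file keeps its expired entries until something triggers a compaction that includes it, which is the "no guarantees" concern described above.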