How can I find out what is reading so much?

I’m running an insert test (described in depth in https://github.com/cockroachdb/cockroach/issues/38778) on a three-machine cluster (two CRDB nodes per physical machine) with hybrid storage: HDD for general storage and SSD as a read cache.
What I observe is that performance deteriorates quickly as the number of committed transactions grows:

and while the 90th percentile latency remains fairly consistent, the 99th percentile degrades significantly:


Disk read bytes show a massive increase after a certain point:



Also, it’s quite imbalanced:

You could say that I’m doing something wrong (stressing a single hot range exclusively), but what’s strange is that the same physical node shows the outsized amount of reads, so I’m not sure.
/_status/hotranges doesn’t show this type of imbalance.
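(For reference, this is roughly how I pull the hot ranges report; it’s only a sketch that assumes an insecure test cluster on the default admin HTTP port 8080, and I just eyeball the JSON rather than rely on specific field names:)

```python
# Sketch: dump one node's hot ranges report.
# Assumptions: insecure test cluster (no auth cookie needed) and the default
# admin HTTP port 8080; adjust the host/port for each of the six nodes.
import json
import urllib.request

HOTRANGES_URL = "http://localhost:8080/_status/hotranges"

req = urllib.request.Request(HOTRANGES_URL, headers={"Accept": "application/json"})
with urllib.request.urlopen(req) as resp:
    report = json.load(resp)

# Pretty-print everything; the exact field layout varies by version, so I just
# scan the per-store/per-range load numbers by eye.
print(json.dumps(report, indent=2))
```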

Maybe it started once the RocksDB cache was exhausted, because the first ~day was quiet in terms of disk reads (while writes were of course happening):

So I would like to know what is happening here, how I can dig deeper to understand why reads grow as the number of inserted rows and the database size increase, and how to handle this more effectively.
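For what it’s worth, this is the quick check I use for the cache theory above; the metric names are my assumption about what `/_status/vars` exposes for the block cache, so treat this as a sketch rather than anything authoritative:

```python
# Sketch: estimate the RocksDB block cache hit ratio on one node.
# Assumptions: insecure test cluster, default admin port 8080, and that
# /_status/vars exposes counters named rocksdb_block_cache_hits and
# rocksdb_block_cache_misses (summed here across the node's stores).
import urllib.request

VARS_URL = "http://localhost:8080/_status/vars"

def scrape_counters(url):
    totals = {}
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#"):
                continue
            parts = line.split()
            if len(parts) != 2:
                continue
            name = parts[0].split("{")[0]  # drop labels such as {store="1"}
            totals[name] = totals.get(name, 0.0) + float(parts[1])
    return totals

counters = scrape_counters(VARS_URL)
hits = counters.get("rocksdb_block_cache_hits", 0.0)
misses = counters.get("rocksdb_block_cache_misses", 0.0)
if hits + misses > 0:
    print(f"block cache hit ratio: {hits / (hits + misses):.2%}")
else:
    print("cache counters not found; check the metric names for this version")
```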

I think what you are observing is the result of RocksDB compacting the SSTables in the background as the data gets written. This activity should increase with both the number of bytes and the number of ranges.

You can confirm this hypothesis by checking whether the reads stabilize and then drop significantly once you stop the write-heavy workload.
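If it helps, a rough way to watch that is to sample the compaction and host disk-read counters from a node’s `/_status/vars` endpoint while you stop the load; the metric names below are from memory and may differ between versions, so verify them against your own output:

```python
# Sketch: print per-minute deltas of compaction and disk-read counters so you
# can see whether read traffic dies down once the write workload stops.
# Assumptions: insecure test cluster, default admin port 8080, and that
# /_status/vars exposes counters named rocksdb_compactions and
# sys_host_disk_read_bytes.
import time
import urllib.request

VARS_URL = "http://localhost:8080/_status/vars"
WATCHED = ("rocksdb_compactions", "sys_host_disk_read_bytes")

def sample(url):
    totals = {name: 0.0 for name in WATCHED}
    with urllib.request.urlopen(url) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("#"):
                continue
            parts = line.split()
            if len(parts) != 2:
                continue
            name = parts[0].split("{")[0]  # drop labels such as {store="1"}
            if name in totals:
                totals[name] += float(parts[1])
    return totals

prev = sample(VARS_URL)
while True:
    time.sleep(60)
    cur = sample(VARS_URL)
    print(time.strftime("%H:%M:%S"),
          {name: cur[name] - prev[name] for name in WATCHED})
    prev = cur
```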

(There will always be a small amount of background read/write traffic due to statistics collection and GC activity.)

A side question: any reason why you’d be running 2 nodes on the same hardware server? This seems a bit unusual (and counter-productive to the availability/performance trade-offs of CRDB).

With Elasticsearch (which is somewhat similar due to how Lucene works) I could save a huge amount of IO (and make it a lot more performant) with late-night manual compactions (called segment merging there).
This works by creating one large segment during the night, which then isn’t touched during the loaded period, so compactions become much smaller because they only operate on the data written throughout the day (the busy period); a sketch of that nightly step is shown below.
Is there anything similar that can be done here? Or any RocksDB options to set or watch?
Also, could an approach like PingCAP’s Titan (or Dgraph’s Badger) help here?
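For concreteness, the nightly Elasticsearch step I mentioned is roughly the following (host and index name are just placeholders for my setup):

```python
# Illustrative only: the late-night "manual compaction" on the Elasticsearch
# side is a force merge down to a single segment per shard. Host and index
# name are placeholders.
import urllib.request

FORCEMERGE_URL = "http://localhost:9200/my-index/_forcemerge?max_num_segments=1"

req = urllib.request.Request(FORCEMERGE_URL, method="POST")
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```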

As for the 2 nodes on the same server: this is just a test setup and I wanted to have more inter-node communication to stress that part as well.
For prod, I would do one node per machine.

I do not believe CockroachDB provides knobs to configure RocksDB compaction, at least not in a way that stops it entirely until you re-enable it manually.

Maybe @peter has a better idea?