How to maximize write throughput

We’re evaluating CockroachDB for our use case, starting by UPSERTing all our existing data (~100M rows, 4KB each, 30 columns, primary key plus one secondary index) into a 71-node (i3.4xlarge, 16-core, 2-SSD) cluster.

The SQL Byte Traffic graph says we’re pushing 150MB/sec cluster-wide, which is OK. Looking at random nodes, CPU utilization (via “htop”) floats between 50% and 100%, and “iostat -dxm 2” says the two SSD drives are writing about 100MB/sec combined, on average (about 20% utilization, FWIW).

Across the whole 71-node cluster, that’s about 7100MB/sec of writes, which is a 47x write amplification. That seems high to me… can you help me understand whether I’m doing something wrong, or whether my mental model of how CockroachDB works is wrong? I can understand a 3x write increase due to replication, and I can imagine the data needs to move from logs to .sst files a few times over the lifetime of a row, but I’m surprised all that adds up to 47x. Also, I’m ignoring compression, which I would expect to give back a factor of about 2x (the data is mostly English text and URLs).
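For reference, the arithmetic behind that 47x figure (all inputs are the numbers quoted above):

```python
# Back-of-the-envelope write amplification, using the numbers from this post:
# 71 nodes, ~100 MB/s of SSD writes per node (iostat, both drives combined),
# and 150 MB/s of cluster-wide SQL byte traffic.
nodes = 71
ssd_write_mb_per_node = 100
sql_traffic_mb = 150

total_disk_writes_mb = nodes * ssd_write_mb_per_node   # 7100 MB/s cluster-wide
write_amplification = total_disk_writes_mb / sql_traffic_mb
print(round(write_amplification, 1))  # 47.3
```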

Similar question for CPU… these are 2.3GHz machines, and by an (admittedly very crude) estimate, 50% CPU utilization cluster-wide (71 nodes x 16 cores) means about 1.3T clock cycles per second running code, or about 8700 cycles per SQL byte written, which seems a little high.
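Spelled out, that crude estimate is:

```python
# Crude cycles-per-SQL-byte estimate, using the figures quoted above:
# 71 nodes x 16 cores at 2.3 GHz, ~50% utilized, 150 MB/s of SQL traffic.
nodes, cores, clock_hz = 71, 16, 2.3e9
utilization = 0.5
sql_bytes_per_sec = 150e6

cycles_per_sec = nodes * cores * clock_hz * utilization   # ~1.3e12 cycles/s
cycles_per_sql_byte = cycles_per_sec / sql_bytes_per_sec
print(round(cycles_per_sql_byte))  # ~8709, i.e. roughly 8700
```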

The rows being written have a pretty random distribution of primary keys (the key is effectively a GUID). If the answer is just that this workload causes far more SSTable churn than I expected, would it make sense to group writes into batches that are close enough in key space to land in the same range? Any other tips for getting large amounts of data into CockroachDB as fast as possible?

Running v1.2-alpha.20171204 with default options.


Hi @dustin,

Thanks for your interest. I’m interested to hear more about your cluster sizing requirements and expected production workload as well, if you are willing to share details about that.

In general, CockroachDB does exhibit a fairly high write amplification for a few reasons. I’ll try to outline them below.

  • 3x for a replication factor of 3.
  • 2x for the Raft log.
  • 2x for RocksDB’s internal write-ahead log.

That’s a 12x write amplification off the bat.

On top of that, you have RocksDB’s internal compaction, which, as you mention, causes SSTable churn. It’s going to be worse when you insert keys in a totally random order, as you describe. The write amplification from compaction is hard to estimate; let’s call it 2x for your loading workload, since the random key distribution triggers a lot of compaction.

Then, there’s an extra factor for the secondary index you’re adding. For every row, CockroachDB also duplicates storage for the secondary index key and the primary index key. The size of this factor depends on how big your secondary and primary index keys are. 2x here is an overestimate, but let’s just pretend it’s 2x for now. If you added stored columns to that secondary index, then perhaps it’s accurate after all.

This gets us up to 48x, which is not quite what you’re seeing but is in the ballpark.
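Multiplying those factors out (the compaction and secondary-index numbers are rough guesses, as noted above):

```python
# The amplification factors outlined above, multiplied together.
# The last two are rough estimates, not measured values.
factors = {
    "replication (factor of 3)": 3,
    "Raft log": 2,
    "RocksDB write-ahead log": 2,
    "compaction (guess, random-key load)": 2,
    "secondary index (rough upper bound)": 2,
}

total = 1
for name, factor in factors.items():
    total *= factor
print(total)  # 48
```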

For this initial data loading step, I recommend you take a look at our IMPORT CSV feature, which is designed for bulk data loading and avoids much of the overhead you’re seeing.

Documentation for IMPORT CSV is available here:


Thanks for the explanation. We don’t have any exact sizing requirements other than the obvious:

  • we want enough nodes that we don’t run out of disk space
  • we want enough nodes so that the system can keep up with the reads and writes.
    We just picked 71 initially as being “big enough” to hold our data. We’re still just getting started. :slight_smile:

For our use case, we actually don’t need strong consistency guarantees – eventual consistency would be good enough. I wonder if changing the isolation level to SNAPSHOT would improve write throughput? We’re going to try that, as well as grouping the writes into keys that are likely to be resident on the same range.
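As a rough sketch of that grouping idea (the batch size and key function here are hypothetical placeholders, not tuned values): sort pending rows by primary key and emit fixed-size batches, so each multi-row UPSERT is likely to touch one range, or a few adjacent ranges, instead of scattering across all 71 nodes.

```python
# Hypothetical sketch: sort rows by primary key, then emit fixed-size
# batches so each multi-row UPSERT touches nearby keys (likely one range).
def batches_by_key(rows, key, batch_size=100):
    ordered = sorted(rows, key=key)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# Usage: each batch would become one multi-row UPSERT statement.
rows = [{"id": "b2"}, {"id": "a1"}, {"id": "a9"}, {"id": "b1"}]
for batch in batches_by_key(rows, key=lambda r: r["id"], batch_size=2):
    print([r["id"] for r in batch])  # ['a1', 'a9'] then ['b1', 'b2']
```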

The IMPORT CSV feature seems to require being the only statement that CREATES the table. For much of our original data, we don’t have a single .csv file containing all 100TB+ of it. Also, I’d be concerned that the single-gateway aspect of the current IMPORT would make it slow.

Got it - thanks for the info. IMPORT CSV will still likely be higher throughput, because it skips a lot of the steps that cause the write amplification you’re seeing and writes more-or-less directly to sstables. Additionally, it doesn’t use a single gateway for the actual importing - just for the kickoff of the command. You would split your CSV files n ways for n-way parallelism, tell the gateway where the files are located in cloud storage, and the gateway would handle informing all of the nodes where to get their chunk of the data.
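A minimal sketch of that n-way split, assuming you can pre-process the CSV locally before uploading the chunks to cloud storage (the function and file-name pattern here are made up for illustration):

```python
import csv

# Hypothetical sketch: shard one large CSV into n files round-robin,
# so each chunk can be listed in the IMPORT statement and processed
# in parallel. Assumes no header row; adapt if your files have one.
def split_csv(src_path, n, dest_pattern="chunk-{}.csv"):
    outs = [open(dest_pattern.format(i), "w", newline="") for i in range(n)]
    writers = [csv.writer(f) for f in outs]
    with open(src_path, newline="") as src:
        for i, row in enumerate(csv.reader(src)):
            writers[i % n].writerow(row)   # round-robin rows across chunks
    for f in outs:
        f.close()
```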

However, I don’t think we’ve tested IMPORT CSV on a dataset of this magnitude. I’m looping in the engineer best suited to help with IMPORT, @mjibson, to provide more information if necessary.

SNAPSHOT isolation may help with latency, depending on how frequently you expect your writes to overlap with each other. Please let us know how that works for you, along with the write grouping strategy.

Currently, IMPORT CSV uses a single node for data processing, and so is limited to CSV files that fit on disk on a single machine. However, I’m actively working on our distributed implementation, which will allow the data transformation to use all of the disks in the cluster for temporary data. This should be done by 2.0 (the next major release).