Performance of Insert statements

Hi, I’ve been evaluating the performance of CockroachDB for a project at my workplace. Our requirement is that a single table needs to store up to 10^12 records, which can be queried by N properties. The number of properties is only determined at application runtime, so we needed a flexible way to store these properties and then quickly retrieve sets of records based on a subset of their values.

After trying various options we settled on using a JSONB typed column to store the properties, so the table schema looks like this:

CREATE TABLE cellevents (
ckm_partition INT8 NOT NULL,
revision_partition INT8 NOT NULL,
revision INT8 NOT NULL,
value DECIMAL(18,8) NOT NULL,
composite_key JSONB NOT NULL,
event_sequence INT8 NOT NULL,
CONSTRAINT cellevents_pk PRIMARY KEY(revision_partition DESC, revision DESC, event_sequence ASC)
);
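
To make the JSONB usage concrete, here’s a rough sketch of the kind of insert and property lookup we do (the property names are made up for illustration; our real keys are generated at runtime):

INSERT INTO cellevents (ckm_partition, revision_partition, revision, value, composite_key, event_sequence)
VALUES (1, 7, 42, 100.50000000, '{"region": "emea", "product": "abc"}', 1);

-- fetch records whose properties contain a given subset of key/value pairs
SELECT revision, event_sequence, value
FROM cellevents
WHERE composite_key @> '{"region": "emea"}'::JSONB;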

I’m running the CockroachDB cluster on GKE with 5 nodes. Each node is an n1-highcpu-32 with local SSDs, running CockroachDB version 2.1.5.

Having run load tests for both read-biased and write-biased loads, I’ve found that reads are fine, with an average read time of ~6ms, which is great. However, writes are not as good. Since rows will only ever be inserted singly, I’m unable to do any batching. In the load test I performed up to 1 million individual row writes into a non-empty table, and the average write time is ~48ms, which works out to ~21 writes/sec per connection. This just isn’t fast enough at all.

Another experiment I’ve tried is to create a very simple table:

CREATE TABLE test (
col1 INT8 NOT NULL,
col2 INT8 NOT NULL,
col3 INT8 NOT NULL,
CONSTRAINT test_pk PRIMARY KEY(col1, col2, col3)
);

Just inserting a single row into the empty table takes ~13ms, which is curiously slow.

So the question is: Am I doing something wrong, and is there some configuration that can explain the slow speed of the inserts?

Thanks

Can you please insert one row of data on each node and compare the latencies? Do you use a load balancer in front of your cluster nodes?

Hi Mark,
why do you worry about the latency of write operations? Usually people worry more about the latency of read operations than that of write operations.

I worry about both reads and writes, as there are going to be multiple writers fed from a stream of commands on Kafka. Each command is a user’s instruction to change something, and we then need to be able to read that change as fast as possible.

OK, so I’ve done a single insert statement on each node, here are the results:

n1: 20.898747ms
n2: 21.912621ms
n3: 17.422221ms
n4: 20.177259ms
n5: 18.930559ms

As for a load balancer, we have a standard GCP IP load balancer in place, which routes connection requests using round robin.

Hey @mark_balmer

Can you provide an output of the network latencies from the AdminUI from n1? Let’s start there and work our way down to find the bottleneck.

Cheers,
Ricardo

So, after much searching it seems that on GCP, local SSDs are actually significantly slower than network-attached SSDs ¯\_(ツ)_/¯

After switching back to network-attached disks, the insert time for a single row has gone down to ~6ms, which is much more like what I was expecting.

The next task is to investigate how fast bulk imports can be made. I’m hoping I can get 1 million rows into the table in around 1 second. I’m going to try multi-threaded batch INSERT statements for this, since using the bulk IMPORT command is not possible unless at some point it can import into an existing table.
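
For the batch inserts I’m planning on multi-row INSERT statements, roughly along these lines (the values here are placeholders; each thread would send batches of a few thousand rows per transaction over its own connection):

INSERT INTO cellevents (ckm_partition, revision_partition, revision, value, composite_key, event_sequence)
VALUES
(1, 7, 42, 100.50000000, '{"region": "emea"}', 1),
(1, 7, 42, 200.25000000, '{"region": "apac"}', 2),
(1, 7, 42, 300.75000000, '{"region": "amer"}', 3);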

Hey @mark_balmer,

That’s interesting to be aware of. Is it possible to share where you got this information about local SSDs vs network-attached SSDs, or was this just from testing?

Let me know how the performance of your batch inserts go.

This was just via testing; it’s very counter-intuitive though. When setting up the local disks it does mention that they’re in beta, but ~7x slower is still surprising to say the least!

@rickrock I do have a couple of questions regarding bulk importing (not using the IMPORT command):

  1. I assume the most efficient method of using multiple write threads would be to have each thread connect to a different cluster node? i.e. 5 nodes, 5 threads, each connecting to a node.
  2. I’ve read some posts about less sequential PKs having a scattering effect on writes. Do you have any more detailed documentation around this? Given point 1, would a single thread’s insert batch of, say, 1000 rows be pushed from the connected node across the other 4 nodes?

Hey @mark_balmer

Those are great questions and I’d be happy to answer them. You are correct that it is always going to be more performant to spread the load across all the nodes, in order to balance the processing done across the entire cluster. This is why we typically suggest putting a load balancer in front of the cluster and using it to distribute the load, discussed here. One thing to bear in mind is that the leaseholder of a given range is the one that receives the write requests, and if it is different from the gateway node that receives the command, then an extra hop needs to be accounted for. We go into more detail about that here.
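
If you want to see where the leaseholders for your table currently live (and therefore whether your gateway node needs that extra hop), one way to check as of 2.1 is something like:

SHOW EXPERIMENTAL_RANGES FROM TABLE cellevents;
-- the lease holder column shows which node currently holds the lease for each range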

To answer your second question, we also have some documentation on our recommendation of using a UUID as the PK in order to help prevent sequential writes. Since the data is laid out in the key-value store lexicographically, sequential PKs are going to be lined up and will most likely land on the same range. When writes are being performed, this can slow performance down because we are not utilizing the full distributed cluster to help share the workload. We go into more detail here; you may have to scroll down a little. So in your example, the single thread would go to a single gateway node, which would process the request on its own. It would have to reach out to each range’s leaseholder, which would then go through the steps of a write and then ack back to the gateway node.
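
As a sketch of what the docs recommend, the distribution key can be generated server-side with a default (the table and column names here are just illustrative):

CREATE TABLE events_uuid_pk (
id UUID NOT NULL DEFAULT gen_random_uuid(),
payload JSONB,
CONSTRAINT events_uuid_pk_pk PRIMARY KEY (id)
);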

Let me know if there are any other questions.

Hi Ricardo,
how would using a UUID as primary key compare with a composite primary key consisting of a thread# and a sequence? Doing so would give each thread its own range to write its data into, and a sequence’s next value is quicker to calculate, so I guess this must be quicker than using UUIDs?
If this is about bulk loading, each process/thread could initialize with the max value that currently exists and simply increment the sequence for itself, since the combination with the thread# would make it unique.
Another advantage, I think, is that it takes less space.
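
Something along these lines is what I mean (names are made up, just to illustrate the idea):

CREATE TABLE cellevents_seq (
loader_id INT8 NOT NULL,
seq INT8 NOT NULL,
value DECIMAL(18,8) NOT NULL,
CONSTRAINT cellevents_seq_pk PRIMARY KEY (loader_id, seq)
);

-- each loader bootstraps its counter once at startup and then increments it in memory
SELECT coalesce(max(seq), 0) FROM cellevents_seq;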

@ik_zelf wouldn’t that solution still, in essence, be a sequence, which would negate the randomness of using a UUID to “spread” the data over nodes rather than having an ordered sequence?

yes @mark_balmer, it is, but I think/hope this will give optimal insert speed for bigger volumes. All threads generate their own unique IDs and don’t interfere with each other. The spreading is done here by using multiple loader threads/processes, one per database node.

Hey @ik_zelf

I suppose the premise is similar to a UUID, just that the data itself would be somewhat sequential. The sequences of your composite PK are still more likely to fall in similar ranges than a UUID would. Though, as you mentioned, the size of each value would most likely be less than a UUID. UUID generation is simply built into CockroachDB to help prevent sequential values being read or written and to optimize performance.

So I’ve been doing some insert performance testing on the previously mentioned cellevents table (defined in the OP), and have got some interesting metrics.

So, I first tried a version of this table but with a UUID as the primary key, and a secondary index:

CREATE TABLE cellevents (
dist_key UUID NOT NULL,
ckm_partition INT8 NOT NULL,
revision_partition INT8 NOT NULL,
revision INT8 NOT NULL,
value DECIMAL(18,8) NOT NULL,
composite_key JSONB NOT NULL,
event_sequence INT8 NOT NULL,
CONSTRAINT cellevents_pk PRIMARY KEY(dist_key),
INDEX(revision_partition DESC, revision DESC, event_sequence ASC)
);

I then ran our load test on this and on the original table schema for various transaction batch sizes (having read that the Postgres JDBC driver prefers 2^n batch sizes):

As you can see, the original outperformed the UUID table in every case, which I wasn’t quite expecting, but I’m ok with that - I can stick with the original table schema. It also seems that for us the optimal transaction batch size is 8192. I’d be interested to know what is happening to the UUID PK table at the 8192 batch size; it seems to fall off a cliff!

I then set out to see how far I could push the transactions/sec, so I modified our load test to use N threads to write the batches, with each thread connecting to a different cluster node via a load balancer (mostly testing around the optimal batch size). The number of threads is set to be the same as the number of cluster nodes:

As you can see, the difference is enormous; it’s also encouraging that the throughput scales with the number of cluster nodes.

I’m going to see how much further I can push it, as I would like to see a throughput of ~120,000 cells/sec if possible.

Hey @mark_balmer

We’ll stand by for the results of your further testing. Let me know if there are any other questions.