Slow local (and remote) CockroachDB vs PostgreSQL

Hello,

After porting our DB schema from PostgreSQL to CockroachDB for testing, we see some very slow load times in the UI even though the database is nearly empty (only the schema and some metadata, 2-3k rows, are inserted).
One UI page loads in 6.5 s, whereas with Postgres the load time is well under 1 s.

So I set out to compare speeds on something that isn’t already constrained by our schema design choices (and possibly faulty queries on a distributed system), starting from the bank workload provided by Cockroach and the jOOQ demo: Build a Java App with CockroachDB and jOOQ | CockroachDB Docs

I modified the example slightly to create 1000 accounts and move an amount from one account to the next (so I call transferFunds a thousand times); basically I’m doing this:

for (int i = 1; i < 1000; i++) {
    long fromAccountId = i;
    long toAccountId = i - 1;

    // Transfer transferAmount from account i to account i-1
    long transferResult = runTransaction(ctx, transferFunds(fromAccountId, toAccountId, transferAmount));
    if (transferResult != -1) {
        // Success!
        System.out.printf("APP: transferFunds(%d, %d, %d) --> %d \n", fromAccountId, toAccountId, transferAmount, transferResult);

        long fromBalanceAfter = runTransaction(ctx, getAccountBalance(fromAccountId));
        long toBalanceAfter = runTransaction(ctx, getAccountBalance(toAccountId));
        if (fromBalanceAfter != -1 && toBalanceAfter != -1) {
            // Success!
            System.out.printf("APP: getAccountBalance(%d) --> %d\n", fromAccountId, fromBalanceAfter);
            System.out.printf("APP: getAccountBalance(%d) --> %d\n", toAccountId, toBalanceAfter);
        }
    }
}

Now, the painful part, run times:
CRDB Remote cluster with 6 (fairly powerful) nodes (num_replicas=5): 7 minutes
CRDB Local cluster with 3 on-the-same-machine nodes (num_replicas=3): 7 minutes
CRDB Local 1 node, freshly started: 2 minutes
PostgreSQL local 1 node: 7 seconds

I must note that on the CRDB tests we’re using SSL for the connection and on the postgres tests we’re not.

Versions:
jOOQ 3.14.8
JDBC PostgreSQL driver 42.2.19
Cockroach 20.2.5 (local) / Cockroach 20.2.6 (remote cluster).
PostgreSQL 11.10.

Can somebody point out an obvious mistake? Why is it so slow? I cannot imagine Cockroach (even with a single node) being 17x slower than Postgres; even accounting for the SSL connection, that’s too much.

Thanks in advance for any insight.

Hi @paullik! These are interesting findings; thanks for sharing. As a general rule, a local PostgreSQL server will certainly perform better than CockroachDB, but you’re right that a 17x difference is worth exploring.

The first thing that stands out to me is the difference in the default transaction isolation levels between PostgreSQL and CockroachDB. PostgreSQL runs at the “read committed” isolation level by default, whereas CockroachDB uses “serializable” isolation. Serializable transactions offer a stronger level of transaction isolation that prevents write-skew anomalies, but at the cost of slightly more coordination (especially in a multi-node cluster). Check out these docs. One test you could try is to run your test against PostgreSQL using serializable isolation.
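For example, you could bump the default isolation for the test session or for the whole database before running the workload (a minimal sketch; the bank database name is taken from your later posts, and how you apply the setting depends on how the app connects):

-- per session, before the test:
SET default_transaction_isolation = 'serializable';

-- or for every new connection to the database:
ALTER DATABASE bank SET default_transaction_isolation = 'serializable';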

Another note is that inserting sequential data into CockroachDB will probably be slower than PostgreSQL. Are you including the time it takes to create the accounts in the time measurement? See the best practices for primary keys for a bit more info.
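For reference, the pattern those docs recommend looks roughly like this (illustrative only, not taken from your schema): instead of a sequential bigint primary key, give the table a UUID key generated on insert, so writes are spread across the keyspace rather than always landing at the end of one range.

CREATE TABLE accounts (
  id UUID NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
  balance INT8 NOT NULL
);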

Finally, have you tried exploring the DB console? It allows you to see the performance and latency of specific queries, and may offer you more information about what is taking a long time. DB Console Overview | CockroachDB Docs

Hi, @rafiss,

Yes, I am including the insert times in the final result, but the insert (all 1k records) is done as a batch, which should already be faster than executing 1k separate inserts. (The Postgres and the CRDB tests differ only in the jOOQ SQLDialect and the connection URI.)

While I agree the primary key is being populated with sequential numbers, there are still only 1k records (each consisting of two bigints = 128 bits). Would this really create such a big hotspot, considering how (very) primitive the test is?

I tried using the DB Console to see which query takes longest (filtered by app), but unfortunately nothing stood out: all latency numbers looked fine from there (really nothing over ~100 ms, as far as I can remember), even though the app had those slow run times. Any specific pointers on which numbers I should look for?

I will try serializable isolation in PostgreSQL.

Hi @rafiss, I also ran the test against PostgreSQL using serializable isolation; the run times have not changed.

I’m back with hard numbers.

This time I tried to use a primary key of type UUID and the table looks like this:

create table accounts (
  pk UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  id bigint not null,
  balance bigint not null
);

The Java code is here (very few modifications from the official example):

The run time was ~8 minutes (measured via time in bash), broken into these parts (measured in the code itself):

init: 1101.101 ms (1.1011009 s)
delete: 319.73267 ms (0.31973267 s)
insert: 1495.7289 ms (1.4957289 s)
select+update: 526293.75 ms (526.29376 s)

The init is just the connection to the DB. The delete part cleans the previous entries in the accounts table, and the insert part batches 1k records and inserts them.

The select+update part is the chunky one: in total 4 selects and 2 batched updates, executed 1k times in transactions (one transaction consisting of 2 selects, and one consisting of 2 selects plus the 2 batched updates).
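Roughly, each of the 1k iterations executes something like this (the actual SQL is generated by jOOQ, so take the exact statements and values below as an approximation of the shape of the work, not a literal trace):

-- transaction 1: the balance checks (2 selects)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
SELECT balance FROM accounts WHERE id = 0;
COMMIT;

-- transaction 2: the transfer (2 selects, then the 2 updates batched)
BEGIN;
SELECT balance FROM accounts WHERE id = 1;
SELECT balance FROM accounts WHERE id = 0;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 0;
COMMIT;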

This test was run on our cluster (v 20.2.6).

For me this test shows that the insert part is insignificant, and so is the UUID vs. serial PK question.
Also, the DB Console was not of much help; it even adds to my confusion: according to the screenshots below, the selects and the updates should have run in only a fraction of the time (let’s say ~1 minute).

The summary for my test app, nothing stands out:

Here’s an Imgur album with the individual statements (as a new forum user I can only upload one media file):

Where is the time lost (~1 minute in the DB Console vs 8 minutes actual run time)?

Any further pointers?

Thanks for putting together this detailed report! My next pointer would be to take a look at the Transactions page in the DB Console. I wonder if the individual statements are relatively fast, but contention of some sort is causing the overall transactions to take longer.

Also, do the metrics charts reveal anything about the SQL latency?

Nothing on the transactions page (filtered by the app) seems out of place.

And the metrics charts stay mostly in the tens/hundreds of milliseconds, apart from Transaction Latency: 99th percentile, which is 2.42 s at the start of the test, but that clearly corresponds to the inserts, so that’s fine. Other than that, again, no obvious explanation for the slow run time.

I uploaded photos of the relevant pieces here (transactions and metrics): Transactions & SQL metrics - CockroachDB slow run times - Album on Imgur

For what it’s worth, I also tried the test using a simple SQL file run via cockroach sql; the same run times show up there as well, so it’s not something in the Java layers, it’s definitely Cockroach.

Hello,

I wonder if this is being investigated or looked into.
Should I move the discussion to a more suitable place (github, email)?
Where should I post so that we involve the right people to look into this?

The Java code appears to log when retries are occurring. Do you see evidence of retries in the application logs? Also, could you share a screenshot of the hardware metrics page?

As for your UUID changes, I might suggest a different approach. While we do want primary keys to be UUIDs, which you have done, the intent was to make the “id” column the UUID. In CockroachDB, primary key columns are automatically indexed, which leads to fast lookups. While the new UUID “pk” column likely solved the hotspotting issue, all of your selects are now full scans. This may not be problematic at such a small scale, but it’s worth noting for the future. You might try adding an index to the “id” column to see if that helps.
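Something along these lines, based on the table definition you posted (the index name is just an example), should do it, and EXPLAIN can confirm the lookups stop being full scans:

CREATE INDEX accounts_id_idx ON accounts (id);

-- verify that a point lookup now uses the index instead of a full scan:
EXPLAIN SELECT balance FROM accounts WHERE id = 1;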

Hello, thanks for your response.

No retries are being triggered.
I was also able to reproduce this by running:
time cockroach sql --certs-dir=.../certs --host=name.host -d bank < db-full-test.sql

Where the db-full-test.sql file is: CRDB Minimal Complete Verifiable Example · GitHub

With this I see similar run-times as in Java.

The HW metrics screenshot is (the test was started at 6:12):

No particularly high metric shows up; the servers are not the bottleneck.

The results are similar if I create an index on the ID column (as the above SQL file shows).

Running these tests on the CockroachCloud Free (beta) tier you provide, the results are twice as bad: run times of around 18 minutes (understandably, since there is only 1 vCPU, but still, nearly 20 minutes for this test case is bad).

Paul, I’m having trouble recreating your results. I took your jOOQ code and ran it against a 3-node Docker cluster running locally. You said you saw this take 7 minutes; I’m seeing it take 21 seconds, which is in line with what I’d expect given that you said a single-instance Postgres took 7 seconds. I have all my code in this GitHub repo: GitHub - timveil/crdb-jooq-bank. I don’t have a README written, but it should be trivial for you to run… simply run ./up.sh in the docker folder to start a 3-node secure cluster. Then you can run the Sample application from either the command line or an IDE.

I did make a few changes to the code…

  1. I added proper logging instead of System.out
  2. the accounts table has an id/primary key as UUID instead of a long
  3. I regenerated the jOOQ code based on this new schema

You should see output like this when it’s done…

[main] INFO com.cockroachlabs.example.Sample - inserted 1000 accounts in  197 ms
[main] INFO com.cockroachlabs.example.Sample - completed 1000 business transactions in 21907 ms or 21 s; avg per transaction = 21 ms

The UUID id/pk combination is a time saver, but it’s not the culprit. I saw a 2-3 minute reduction on our cluster, yielding a final time of 5 minutes.

I was indeed able to reproduce your results (20-30 seconds) on my machine with the 3 docker containers provided by your compose file.

I ran your application against our cluster and the run-times jump back to 5 minutes. [1]
Your application also runs on a local cluster (3 cockroach processes on the local PC, no docker) in 5 minutes. [2]

Now, both the servers in the CRDB cluster and my machine are running Ubuntu 20.04 LTS. I suspect some settings may be off here, since your Cockroach images are based on ubi8 from Red Hat as far as I can see.

I also tried running the 3 node cluster in docker with your app on one of the cloud servers and the results are not great there either: 3 minutes. [3]

Hardware-wise, the server I tried running the 3-node Docker test on is a dedicated machine with a 4-core Intel i5 CPU and 16 GB of RAM; my PC has an 8-core i7 CPU and 16 GB of RAM.

So as far as I can tell from [1], [2] and [3], it’s either something in the OS or a HW limitation (or a combination of the two).
Should I try another Linux distribution?
What settings might be so different that we see this massive time reduction in the docker cluster?

I ran your application against our cluster and the run-times jump back to 5 minutes. [1]

It sounds to me like networking. With the application and the nodes all running locally, network latencies between the app and the nodes are removed. When you run the app locally against nodes in a remote cluster, you are injecting network latency into every statement. Where is the cluster located relative to the application at this point? Can you try deploying the application into the same location as the cluster and testing performance that way? You always want your app as close as possible to the nodes, with a very fast network between them.
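To put rough numbers on it (purely illustrative; I don’t know your actual round-trip time): every statement in the loop is a separate round trip, so at roughly 8 statements per iteration and a hypothetical 30 ms round trip, 1,000 iterations spend about 1,000 × 8 × 30 ms = 240 s, i.e. 4 minutes, waiting on the network alone.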

Hardware-wise, the server I tried running the 3-node docker test on is a dedicated 4 core Intel i5 CPU with 16GB of RAM, my PC has an i7 CPU with 8 cores and 16GB RAM.

What kind of hard disk do you have? This workload is fairly IO-bound. Running on my Mac with NVMe storage is fast. Do these machines have spinning disks? What IOPS can they achieve? We have a good reference for optimal hardware configuration here: Production Checklist | CockroachDB Docs

And one more very important thing… this is an incredibly small workload, meaning that all the data fits into a single range. Given our architecture, this means that a single node (the leaseholder) in the 3-node cluster is essentially acting as the bottleneck, doing all the read and write operations for the range. One thing we always look for is hotspots, but given the size of the data a hotspot here is unavoidable (unless of course we shrank the range size). This is a long way of saying CockroachDB really shines when the data is large enough that the work can be shared across the cluster.
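If you want to confirm this on your cluster, you can check how many ranges the table occupies and which node holds the lease (a quick check, assuming the accounts table from earlier in the thread):

-- with only ~1k tiny rows this should return a single range; its lease_holder
-- is the node doing all the reads and writes for this test.
SHOW RANGES FROM TABLE accounts;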

I moved it just now and indeed if I place the app on the load balancer node and run it from there the run time drops to 40 seconds.
It’s a very big improvement considering the amount of data transferred is not that high.

This was my initial idea as well, but I tried running the app with 3 Cockroach processes on my PC only, no Docker, and it still ran for 5 minutes. I wonder why running the cluster in Docker on the local PC is faster; this is the main point of confusion for me.
Also, seeing that your HAProxy configuration is similar to the default one (generated via the CLI), together with the fact that we’re not dealing with a huge amount of data, quickly prompted me to dismiss the idea of networking problems.

Spinning disks; not sure about IOPS (how would you recommend I measure this?).
But then again, why would it run faster on a local Docker cluster than on a local cluster formed by spawning 3 processes by hand, when both are backed by the same HDD?

I fully agree with this.

So I re-ran the test on one of the servers, completely clean this time:
The app, connecting locally to the docker CRDB cluster, took 4 minutes 22 seconds to run (4 minutes being used by the 1k transactions).

This result together with the slow run times of the 3 node cluster spawned manually on my PC, contradicts the network hypothesis.

What could be the reason I see fast run times in some configurations (my PC + Docker + local app; CRDB on-prem cluster + local app) while in others (server + Docker + local app; manually spawned processes on my PC + local app; CRDB on-prem cluster + remote app) I see slow run times?

I’m not really sure, @paullik; it really sounds like something environmental or related to hardware… perhaps disks. In my GitHub example I increased the transaction count to 10k and ran it in about 2 minutes. Do you have a production-grade workload and server configuration you are considering?