Performance Bottleneck

Hi

We’re currently spiking CockroachDB as a DB solution and trying to understand what performance we can expect for a typical OLTP workload. In testing with the tpcc loadgen tool from here, we seem to be hitting a performance bottleneck and can’t work out why, or whether the results we’re getting are expected.

We started by running a 3-node cluster in GCP behind an internal network load balancer. Here are the details of our deployment:

GCP VM Setup

Cluster Node Count : 3 (1 per zone in europe-west2)
Machine Type       : n1-standard-4 (4 vCPUs, 15GB Memory)
Image              : debian-9
Boot Disk          : SSD persistent disk (10GB)
Additional Disk    : local-ssd (500GB)
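For anyone trying to reproduce this setup, an equivalent instance can be created with something along these lines (a sketch only; the instance name, zone and image project are illustrative, and GCP local SSDs come in fixed sizes, so the disk flags may need adjusting):

gcloud compute instances create crdb-node-1 \
  --zone=europe-west2-a \
  --machine-type=n1-standard-4 \
  --image-family=debian-9 --image-project=debian-cloud \
  --boot-disk-size=10GB --boot-disk-type=pd-ssd \
  --local-ssd=interface=SCSI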

Load Balancing

Load Balancer Type   : Network (TCP)
Load Balancer Scheme : INTERNAL
Session Affinity     : None
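Roughly, the internal load balancer was put together along these lines (sketch only; the health check, backend service, instance group and forwarding rule names are placeholders for our actual resources):

gcloud compute health-checks create tcp crdb-hc --port=26257
gcloud compute backend-services create crdb-lb \
  --load-balancing-scheme=INTERNAL --protocol=TCP \
  --region=europe-west2 --health-checks=crdb-hc
gcloud compute backend-services add-backend crdb-lb \
  --region=europe-west2 \
  --instance-group=crdb-ig --instance-group-zone=europe-west2-a
gcloud compute forwarding-rules create crdb-fr \
  --load-balancing-scheme=INTERNAL --region=europe-west2 \
  --backend-service=crdb-lb --ports=26257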

Storage

Disk          : local-ssd (500GB)
Filesystem    : ext4
Mount Type    : SCSI
Mount Options : discard,defaults,nofail,nobarrier
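The disk was formatted and mounted roughly as follows (the device path is illustrative; yours may differ):

sudo mkfs.ext4 -F /dev/sdb
sudo mkdir -p /mnt/disks/ssd
sudo mount -o discard,defaults,nofail,nobarrier /dev/sdb /mnt/disks/ssd
# or persistently, via /etc/fstab:
# /dev/sdb  /mnt/disks/ssd  ext4  discard,defaults,nofail,nobarrier  0  2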

File Descriptor Limits

System-wide (/proc/sys/fs/file-max) : 1534232
Cockroach Process (systemd)         : unlimited
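Those figures were taken with commands along these lines (the pgrep pattern is just an example):

cat /proc/sys/fs/file-max
cat /proc/$(pgrep -x cockroach)/limits | grep 'open files'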

Cockroach Systemd Unit

[Unit]
Description=CockroachDB
Requires=mnt-disks-ssd.mount

[Service]
User=cockroach
PermissionsStartOnly=true
LimitNOFILE=infinity
ExecStartPre=/bin/bash -c "/bin/systemctl set-environment CLUSTER_JOIN=$(/usr/local/bin/get-cluster-join)"
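# Note: systemd expands %% to a literal %, so the cockroach binary sees --cache=50% and --max-sql-memory=50%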
ExecStart=/bin/bash -c "/usr/local/bin/cockroach start --insecure --store /mnt/disks/ssd --join $CLUSTER_JOIN --cache=50%% --max-sql-memory=50%%"

[Install]
WantedBy=multi-user.target
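The mnt-disks-ssd.mount unit it depends on looks roughly like this (the What= device path is a placeholder for the local SSD device on our nodes):

[Unit]
Description=Local SSD mount for CockroachDB

[Mount]
What=/dev/disk/by-id/google-local-ssd-0
Where=/mnt/disks/ssd
Type=ext4
Options=discard,defaults,nofail,nobarrier

[Install]
WantedBy=multi-user.target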

We’re running CockroachDB in insecure mode for now, as a systemd service. In testing with tpcc against the 3-node cluster, we can’t push queries per second past around 700, and transactions per second top out at around 60-70. We then doubled the cluster to 6 nodes and, if anything, performance got worse. Throughout testing we’ve been monitoring CPU, memory and disk I/O, and utilisation is very low across the board, i.e. there’s tons of headroom.

Can you see anything in our cluster setup that could be problematic? And based on our cluster specs, what tpcc parameters do you think would be sensible for pushing the cluster?

We’d really appreciate the help/advice.

Thanks!

Hi @anton, you’ve caught us right as we’re making a lot of changes to our tpcc load generator, and before we’ve published instructions on best practices. We’ll soon have a guide explaining how to run the load generator. Apologies that you’ve been caught in the middle.

Briefly, we’re testing with n1-highcpu-16 machines, as n1-standard-4 machines are underpowered for this load profile. Our SQL processing is CPU intensive, and TPC-C involves fairly complex SQL queries, as opposed to simpler KV-style loads like YCSB, so n1-highcpu-16 machines are more appropriate. But the fact that you’re not seeing high CPU utilization (or high resource utilization at all) leads me to believe there’s a misconfiguration in how you’re running the load generator. Can you share the command you’re using to run it?

For reference, we’re able to hit the maximum throughput level (the TPC-C spec caps throughput at ~12.8 tpmC/warehouse) with ./tpcc --scatter --split --warehouses=1000 on just 3 n1-highcpu-16 machines.

Hi Arjun,
I noticed there were a few recent changes in the repo for the tpcc tool, so I appreciate it may not be ready for general use. We just want to get a rough idea of the performance we can expect.
We’ve tried running the tpcc tool in quite a few different configurations, but to be honest I didn’t really know what values to use. The last test I did was with warehouses set to 10 and --scatter, but without the --split option. I’ve also played around with --concurrency and --no-wait, but neither had a huge impact on our results. I’ve also tried running multiple instances of tpcc against the database, which probably got the best results, although I’m not sure that’s a suitable way to run it.
Based on the spec I posted, what tpcc params would you suggest we try? And for a fixed set of params, should we expect roughly double the throughput if we double the number of nodes?
Thanks!

We don’t recommend running with --no-wait or --concurrency, as that’s “not to spec”. It generates lots of contention without testing the storage scalability of the database (an interesting test in itself, but not the test the TPC-C spec is after).

You should simply scale up your warehouses and see how far you can push it. The TPC-C spec is designed such that all the parameters (queries/sec, storage, etc.) are pinned to a multiple per warehouse. A database performing at max capacity should get 12.8*warehouses tpmC[1], which is the spec’s limit on throughput; if you want more throughput, you need to increase the warehouse count. My rough memory of testing on n1-standard-4 is that your 3-node cluster will max out before 100 warehouses, but it’s been a while since we tested on those machines. A 3-node n1-highcpu-16 cluster will get to 1000 warehouses.
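Concretely, for your current cluster a reasonable starting point would be something like the following, pointed at your load balancer (the warehouse count is just the rough n1-standard-4 ceiling I mentioned; keep raising it until throughput stops scaling):

./tpcc --scatter --split --warehouses=100

At 100 warehouses the spec ceiling works out to 12.8 * 100 = 1,280 tpmC.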

As you scale nodes further, you should be able to push well beyond a thousand warehouses, but we’re still fine-tuning performance in that range ahead of our upcoming 2.0 release. Speaking of which, are you trying the 2.0-beta? Its performance characteristics are a lot better than 1.1’s with regard to TPC-C.

[1]: A “tpmC”, or “new order transactions per minute”, is a blended metric that is roughly a large multiple of QPS, but with stricter requirements: the spec requires that you do a certain percentage of transaction rollbacks, etc., which wouldn’t be counted in a raw “QPS” metric. You should also see the total number of all transactions that ran in the final output.

Thanks @arjun, really appreciate the help. We’ll try increasing the number of warehouses to close to 100 and compare those results with what we get on the 2.0-beta. It’ll also be interesting to see how the instance types compare. I’ll post whatever we find back here for others who come across this discussion. Thanks!
