Is this insert performance normal?

I tested CRDB insert performance with sysbench on a 1-node cluster, a 3-node cluster (3 replicas), and a 5-node cluster (3 replicas). Every node has one store on an SSD and runs on a separate machine. I ran sysbench on every machine at the same time.

Here are my results:
tps in 1 node cluster: 4500, p95 latency: 11 ms
tps in 3 node cluster: 5200 (1700 per node), p95 latency: 28 ms
tps in 5 node cluster: 7700 (1500 per node), p95 latency: 34 ms

Hardware of my cluster:
CPU: Intel® Xeon® CPU E5-2630 v4 @ 2.20GHz (36 cores)
RAM: 64 GB
Disks: 1 TB HDD + 800 GB SSD (data stored on the SSD)

Schema of my insert test:

CREATE TABLE sbtest1(
  id UUID DEFAULT gen_random_uuid(),
  k INTEGER DEFAULT '0' NOT NULL,
  c CHAR(120) DEFAULT '' NOT NULL,
  pad CHAR(60) DEFAULT '' NOT NULL,
  PRIMARY KEY (id)
)

Here is my sysbench command:

./sysbench --db-driver=pgsql --pgsql-host= --pgsql-port=26257 --pgsql-user=root --pgsql-password= --time=120 --report-interval=1 --threads=32 /home/fee/sysbench/share/sysbench/oltp_insert.lua --auto-inc=off --tables=1 --table-size=10000000 prepare

So the test runs against a single table prepared with 10,000,000 rows, and the insert SQL is simple:

INSERT INTO sbtest1 (k, c, pad) VALUES (%d, '%s', '%s')

I want to know:

  1. Is this performance normal for CRDB? When I tested TiDB, its 1-node performance was about the same as CRDB's, but its 5-node insert performance was about 15,000+ QPS.

  2. What's more, I tried 2 stores (on different disks) on one node, but it seems one of the two disks isn't used at all. Can CRDB scale by adding stores (on different disks) on one node?

  3. If I use an auto-increment primary key instead of a random one, TPS drops from 4500 to 1300 (on a one-node cluster). Is this normal too?

Thanks for your help!

Hi @floatsliang,

Taking the questions in reverse:

If I use an auto-increment primary key instead of a random one, TPS drops from 4500 to 1300 (on a one-node cluster). Is this normal too?

This is expected - if you're auto-incrementing, then you'll be sending nearly all inserts to the leaseholder of a single range, and transactions can only commit as fast as that leaseholder can process them. If the key is random, you'll be running inserts against multiple ranges in parallel.
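
For illustration, here's a minimal sketch of the two kinds of schema being compared. The sequence-backed variant is only an assumption about how the auto-increment key might be defined; any monotonically increasing default behaves the same way:

-- Random primary key: inserts scatter across many ranges and leaseholders.
CREATE TABLE sbtest_random (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  k INTEGER NOT NULL DEFAULT 0
);

-- Monotonically increasing primary key: every insert lands at the tail of the
-- key space, so a single range/leaseholder becomes the bottleneck.
CREATE SEQUENCE sbtest_seq;
CREATE TABLE sbtest_sequential (
  id INT DEFAULT nextval('sbtest_seq') PRIMARY KEY,
  k INTEGER NOT NULL DEFAULT 0
);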

What's more, I tried 2 stores (on different disks) on one node, but it seems one of the two disks isn't used at all. Can CRDB scale by adding stores (on different disks) on one node?

Can you send over the cockroach start command you’re using to start the node? You can have multiple stores on each node, and that should help some with scaling though it’s not clear whether it will drastically improve performance in this case.

Is this performance normal for CRDB? When I tested TiDB, its 1-node performance was about the same as CRDB's, but its 5-node insert performance was about 15,000+ QPS.

It’s hard to make general pronouncements about normal, but it’s surprising that there’s such a large difference in performance between CRDB and TiDB. Generally, performance is going to be a function of memory, CPU, network, and disk throughput so it’d be good to know what’s constraining your machines when under load. I’d recommend installing sysstat and setting it to log once per minute, and then sending over the results of sar -n ALL, sar -d, sar -u, and sar -m. It’d be helpful to see how TiDB and CRDB compare on those metrics when you’re running sysbench.
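
For reference, here's a rough sketch of that sysstat setup on a Debian/Ubuntu-style machine (the package name and config paths are assumptions; adjust for your distro):

sudo apt-get install sysstat
# Turn on background collection:
sudo sed -i 's/ENABLED="false"/ENABLED="true"/' /etc/default/sysstat
# Lower the collection interval from the default 10 minutes to 1 minute by
# editing the schedule in /etc/cron.d/sysstat, then after a sysbench run:
sar -n ALL
sar -d
sar -u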

You can also try increasing the number of threads in the sysbench command to see if that improves throughput. If you double the number of threads, are you able to achieve faster throughput in CRDB?
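
For example, the run phase of your workload with the thread count doubled would look roughly like this (everything except --threads and the final run argument is copied from your prepare command):

./sysbench --db-driver=pgsql --pgsql-host= --pgsql-port=26257 --pgsql-user=root --pgsql-password= --time=120 --report-interval=1 --threads=64 /home/fee/sysbench/share/sysbench/oltp_insert.lua --auto-inc=off --tables=1 --table-size=10000000 run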

Hi @tim-o,

Thanks for your help!

For the auto_increment question: two insert statements have different primary keys and don't seem to cause any contention with each other, so why aren't they processed in parallel?

For the store question, here is my cockroach start command:
./cockroach start --insecure --host= --port=26257 --http-port=28080 --store=/ssd/fee/crdb/sdata --store=/home/fee/sdata --log-dir=/ssd/fee/crdb/slog --max-offset=500ms --max-sql-memory=.25 --cache=.25 &

For the scaling question, I used iostat and top to monitor my machines.
When I test the one-node cluster, its IO util% is 80% (it seems IO limits the performance), IOPS is about 8500, write throughput is about 100 MB/s, and CPU usage is about 1100%.
When I test the 5-node cluster, its metrics are almost the same as the one-node cluster's.
So scaling from 4500 to 7700 TPS (1 node to 5 nodes) seems 'normal' given the 3 replicas, since the total write work stays about the same (4500 × 5 ≈ 7700 × 3), but TiDB scaling from 4500 to 15000 seems 'abnormal' by comparison.
Maybe this difference comes from when Raft followers apply their logs (a follower node could see less IO pressure by delaying log application)?

Thanks again for your reply!

Hey @floatsliang,

For the auto_increment question: two insert statements have different primary keys and don't seem to cause any contention with each other, so why aren't they processed in parallel?

The contention is not on the key but on the range. All transactions in CRDB are committed by the leaseholder of the range that holds the referenced keys, and ranges split roughly every 64 MB by default. So if you're auto-incrementing, your insert speed is limited by how fast a single leaseholder and Raft group can complete the requested inserts.
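
If you want to see this directly, here's a quick sketch; the exact statement depends on your version (2.x uses the EXPERIMENTAL form, newer releases accept plain SHOW RANGES):

-- Shows how the table's key space is split into ranges and where each
-- leaseholder lives; with an auto-increment key, new writes all hit the last range.
SHOW EXPERIMENTAL_RANGES FROM TABLE sbtest1;
-- or, on newer versions:
SHOW RANGES FROM TABLE sbtest1;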

Regarding TiDB / CRDB performance: after doing a bit more research, it seems that TiDB doesn't actually support serializable isolation, while all transactions in CRDB are serializable by default. So it's a bit of an apples-to-oranges comparison: TiDB allows write skew; CRDB does not.
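
You can confirm the default from any SQL session (this is standard Postgres-compatible syntax that CRDB accepts):

SHOW TRANSACTION ISOLATION LEVEL;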

That said, using 12 sysbench processes each at 96 threads, we were able to get a test up to 11k TPS on a 3 node cluster.

Regarding the issue with stores: I’m not able to reproduce. This command worked fine for me locally:

 cockroach start --insecure --host= --port=26271 --http-port=28080 --store=Users/tim-o/test1 --store=Users/tim-o/test2 --log-dir=Users/tim-o --max-offset=500ms --max-sql-memory=.25 --cache=.25 &
[1] 1425
MacBook-Pro:~ tim-o$ *
...
CockroachDB node starting at 2018-07-24 18:20:54.6362998 +0000 UTC (took 0.3s)
build:               CCL v2.1-alpha.20180507-478-gcc4511e1b @ 2018/05/21 17:54:48 (go1.10.1)
admin:               http://MacBook-Pro:28080
sql:                 postgresql://root@MacBook-Pro:26271?sslmode=disable
logs:                /Users/tim-o/Users/tim-o
temp dir:            /Users/tim-o/Users/tim-o/test1/cockroach-temp094238707
external I/O path:   /Users/tim-o/Users/tim-o/test1/extern
store[0]:            path=/Users/tim-o/Users/tim-o/test1
store[1]:            path=/Users/tim-o/Users/tim-o/test2
status:              initialized new cluster
clusterID:           a855bacc-b4ed-4a59-86e9-d9f5b5751afe
nodeID:              1

Are you getting a specific error message back when you try to start the node?

Thanks for your help @tim-o,

Increasing the thread count really helps! I increased the number of threads of each sysbench from 32 to 256; with 3 sysbench processes each at 256 threads, I get about 6000 auto-increment insert TPS (previously 1700 TPS) on a 3-node cluster.

But there is another question :sweat_smile: : why can't 32 threads per sysbench make full use of CRDB? When I run sysbench with 32 threads, the IO util% is merely 17% and CPU usage is about 800%.

As for the isolation question, I think it matters for reads but won't help much for inserts.

For the store question, here are my log lines and admin UI metrics:

 I180725 08:02:47.614281 100 server/node.go:507  [n1] initialized store [n1,s1]: disk (capacity=733 GiB, available=685 GiB, used=17 KiB, logicalBytes=4.9 KiB), ranges=1, leases=0, writes=0.00, bytesPerReplica={p10=5061.00 p25=5061.00 p50=5061.00 p75=5061.00 p90=5061.00 pMax=5061.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}
 I180725 08:02:47.615956 100 server/node.go:507  [n1] initialized store [n1,s2]: disk (capacity=1.1 TiB, available=1.0 TiB, used=4.7 KiB, logicalBytes=0 B), ranges=0, leases=0, writes=0.00, bytesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}, writesPerReplica={p10=0.00 p25=0.00 p50=0.00 p75=0.00 p90=0.00 pMax=0.00}

It seems both stores initialized successfully, but when I bulk insert some rows into this node, one disk still isn't used:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     5.00    0.00    8.33     0.00    53.33    12.80     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00   113.00    0.00  283.33     0.00 53445.33   377.26     0.40    1.40    0.00    1.40   0.50  14.27

sdb is the SSD (store[0]) and sda is the HDD (store[1]); sda's write throughput is merely 53 kB/s.

I really appreciate your detailed reply!

Great to hear the increase in # of threads resulted in better throughput!

Also, I learned something today. Again taking the two questions in reverse:

It seems both stores initialized successfully, but when I bulk insert some rows into this node, one disk still isn't used:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     5.00    0.00    8.33     0.00    53.33    12.80     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00   113.00    0.00  283.33     0.00 53445.33   377.26     0.40    1.40    0.00    1.40   0.50  14.27

sdb is the SSD (store[0]) and sda is the HDD (store[1]); sda's write throughput is merely 53 kB/s.

So I just spoke with some folks internally - we have a constraint that we will not put more than one replica of a range on a given node, even when more than one store is present. Otherwise, in a three node cluster, we could accidentally put two replicas on one node, and if that node dropped, those ranges would become unavailable.

To get around this, you could either use LVM or a RAID 0 config to treat the two drives as a single store, or scale the cluster up past three nodes, at which point you'd begin to see the second store utilized.
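
If you go the LVM route, here's a minimal sketch; /dev/sdX and /dev/sdY are placeholder device names, and the -i 2 flag stripes across both devices for RAID 0-like behavior:

# Combine two block devices into one logical volume and point a single store at it.
pvcreate /dev/sdX /dev/sdY
vgcreate crdb_vg /dev/sdX /dev/sdY
lvcreate -i 2 -l 100%FREE -n crdb_lv crdb_vg
mkfs.ext4 /dev/crdb_vg/crdb_lv
mount /dev/crdb_vg/crdb_lv /mnt/crdb
./cockroach start --insecure --store=/mnt/crdb ...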

Why can't 32 threads per sysbench make full use of CRDB? When I run sysbench with 32 threads, the IO util% is merely 17% and CPU usage is about 800%.

I suspect each thread in sysbench is operating in parallel, but waiting for confirmation that its transaction completes before issuing the next transaction. So each thread waits for the nodes to get consensus on a write before attempting another write on a different range, when in reality CRDB can manage writes on multiple ranges in parallel.
The current limitation in the database is when writes pile up within a single range.
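
A rough back-of-the-envelope illustrates this (Little's law; the ~18 ms figure is only an estimate of average commit latency inferred from your earlier numbers): 3 clients × 32 threads = 96 statements in flight, and 96 / 0.018 s ≈ 5,300 inserts per second, which is roughly where your 3-node result landed. To drive a cluster capable of 10k+ inserts per second, you need proportionally more concurrent statements, which is why raising the thread count helps.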

Not sure how you’re getting 800% CPU either. I assume that’s through sysbench and not a system monitoring tool?

Regarding insert performance, there are a couple of other things you should try:

  1. Add multiple rows at a time per INSERT statement, as documented here: https://www.cockroachlabs.com/docs/v2.0/insert#insert-multiple-rows-into-an-existing-table and https://www.cockroachlabs.com/blog/multi-row-dml/ – this will have the most impact on performance (see the sketch after this list).

  2. In some cases, you can gain a little additional throughput using parallel execution: https://www.cockroachlabs.com/docs/v2.0/parallel-statement-execution.html – this applies if you are adding rows to multiple separate tables and the rows must be added transactionally: the technique speeds up the INSERTs inside a single transaction.
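
A minimal sketch of the multi-row form against your table (the values are placeholders):

-- One statement, many rows: amortizes per-statement and per-commit overhead.
INSERT INTO sbtest1 (k, c, pad) VALUES
  (1, 'c-1', 'pad-1'),
  (2, 'c-2', 'pad-2'),
  (3, 'c-3', 'pad-3');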

Yes, it works!

I started a 5-node cluster and attached two stores to one of the nodes, and the data now spreads evenly across all 6 stores!