Client error rate

I’m trying to figure out how to reduce the number of client errors that appear while running pgbench performance tests.

Testing environment:

  • CockroachDB cluster v21.1.5 on 8 bare-metal servers, each with 16 CPUs, 64 GB RAM, Ubuntu 20.04 and NVMe disks.
  • Mean network latency between nodes is 6ms.
  • pgbench v10.17 runs on a separate server.
  • pgbench connects to only one node of the CockroachDB cluster. There is no load balancer.

Below is a snippet of my bash script. It runs the benchmark in a loop because not every single run of pgbench returns errors.

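# Repeat the benchmark, since not every run produces errors.
for i in {1..10}; do
    # Start each iteration from a freshly created and initialized database.
    psql "${CONNECTION_STRING}" -c "CREATE DATABASE pgbench"
    pgbench --init --scale 1 --no-vacuum "${CONNECTION_STRING}"
    # 128 clients on 16 threads run the built-in simple-update script for 60 seconds.
    pgbench -r --jobs=16 --client=128 --protocol=simple --time=60 -b simple-update --scale=1 --no-vacuum "${CONNECTION_STRING}"
    psql "${CONNECTION_STRING}" -c "DROP DATABASE pgbench"
done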

During that benchmark test, about 600 errors appear (I guess the exact number depends on cluster settings). There are two types of errors:

  • RETRY_ASYNC_WRITE_FAILURE
  • ABORT_REASON_NEW_LEASE_PREVENTS_TXN
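To compare how often each of the two error types shows up from run to run, the counts can be tallied from saved pgbench output. The following is only a minimal sketch: it assumes the stderr of each pgbench run was redirected to a file (the name `pgbench_run.log` in the usage note is hypothetical) and that every error line contains the reason name verbatim. The function names are made up for illustration.

```shell
# Count the lines of a saved pgbench log that mention a given retry reason.
# Assumption: the reason name (e.g. RETRY_ASYNC_WRITE_FAILURE) appears
# verbatim in each error line of the log.
count_retry_reason() {
    local log_file="$1" reason="$2"
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c -F -- "$reason" "$log_file" || true
}

# Print one "REASON COUNT" line for each of the two error types seen so far.
summarize_retry_errors() {
    local log_file="$1" reason
    for reason in RETRY_ASYNC_WRITE_FAILURE ABORT_REASON_NEW_LEASE_PREVENTS_TXN; do
        printf '%s %s\n' "$reason" "$(count_retry_reason "$log_file" "$reason")"
    done
}
```

For example, after running pgbench with `2> pgbench_run.log` appended, `summarize_retry_errors pgbench_run.log` would print the two counts for that run.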

Interestingly, on a 3-node Docker cluster only ABORT_REASON_NEW_LEASE_PREVENTS_TXN errors appear, while on a 1-node Docker cluster neither of these errors occurs.

I guess the error rate somehow depends on the cluster settings. Is there any cluster or replication parameter that makes the above-mentioned errors less likely?

I would appreciate any suggestions.

Hello @tjel,

Our docs on Transaction Retry Errors may help explain what kinds of errors you are seeing. It seems like these errors might occur when the nodes try to replicate (which may be why you are not seeing errors when there is only one node). I’m afraid I don’t know much about pgbench, so I can’t say whether these kinds of errors, or this volume of them, are expected.
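Since these are retry errors, the usual client-side response is to run the failed transaction again. Below is a rough sketch of such a retry loop in bash, not a recommended implementation. It assumes the retryable error reaches the client as text containing "restart transaction" on stderr. The function name `retry_on_restart` and the `RETRY_DELAY_SECONDS` variable are hypothetical.

```shell
# Re-run a command while it fails with a retryable transaction error.
# Usage: retry_on_restart MAX_ATTEMPTS COMMAND [ARGS...]
# Assumption: a retryable failure writes "restart transaction" to stderr.
retry_on_restart() {
    local max_attempts="$1"; shift
    local delay="${RETRY_DELAY_SECONDS:-1}"   # hypothetical tuning knob
    local attempt=1 status err_file
    err_file="$(mktemp)"
    while true; do
        status=0
        # Capture stderr so the error text can be inspected after a failure.
        "$@" 2>"$err_file" || status=$?
        if [ "$status" -eq 0 ]; then
            rm -f "$err_file"
            return 0
        fi
        # Give up on non-retryable errors or once the attempt budget is spent.
        if ! grep -q -F "restart transaction" "$err_file" || [ "$attempt" -ge "$max_attempts" ]; then
            cat "$err_file" >&2
            rm -f "$err_file"
            return "$status"
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```

A call might look like `retry_on_restart 5 psql "${CONNECTION_STRING}" -v ON_ERROR_STOP=1 -c "..."`. Wrapping the pgbench command itself this way would rerun the whole benchmark rather than a single transaction, so the sketch only fits statements issued by a custom client.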