Problem with one of my tests!

I'm doing a PoC in the following environment:
K8s = 1 master and 2 workers [all Ubuntu 18.04 Server, 8 CPU and 8 GB RAM per VM, on VMware; all new, clean VMs]
I installed CockroachDB following these instructions [Deploy CockroachDB in a Single Kubernetes Cluster | CockroachDB Docs].
I started the benchmarking [Performance Benchmarking with TPC-C | CockroachDB Docs].
In my environment I have HPA [Horizontal Pod Autoscaling] with the metrics server [k8s metrics-server].
But when I deleted one pod the connection was
Everything is working, but when I delete one pod, the connection is reset and the test run is lost; I have to launch the script again to run the tests [cockroach workload run tpcc --warehouses=10 --ramp=3m --duration=10m 'postgresql://root@localhost:26257?sslmode=disable'].
This is not good, because if my environment were in production, what would happen to the data?
Any ideas or suggestions?
Please, and thanks for reading about my problem.

Sorry, just clarifying a few things: was the sentence above cut short, or should it be removed?

Is your goal to kill off a pod while having the test continue?

Also, can you clarify more about your setup? Do you have one CRDB node on each VM?
Which node are you running the test on / connecting to?

Hi Richard.

First of all, thanks for your answer.

It is a failure simulation: by killing a pod, I am simulating a failure in my cluster. It doesn't matter whether I remove it or kill it; my intention is to simulate a failure in my working environment.

Is it possible that the workload you’re running is directly connecting to the node that you’re killing?

I suggest trying this PoC with a load balancer.
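
For example, if you're using the manifest from the docs, the client-facing Service is typically named cockroachdb-public. A minimal sketch (the Service name and insecure mode are assumptions about your deployment) would be to expose it through the load balancer and point the workload at the external IP instead of localhost:

```
# Assumption: the docs manifest created a ClusterIP Service named cockroachdb-public.
# Expose it through the cluster's load balancer implementation:
kubectl patch service cockroachdb-public -p '{"spec": {"type": "LoadBalancer"}}'

# Find the external IP that gets assigned:
kubectl get service cockroachdb-public

# Then run the workload against that IP so no single pod is pinned:
cockroach workload run tpcc --warehouses=10 --ramp=3m --duration=10m \
  'postgresql://root@<EXTERNAL-IP>:26257?sslmode=disable'
```

That way the client connects through the load-balanced Service rather than directly to one pod.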

Hi Richard!

I have one: I use MetalLB, and in theory MetalLB should be balancing the traffic. It does, but CockroachDB resets the connection, and that is the one thing we have found that we do NOT want to happen.

Is there any other possible configuration that avoids this situation? We need CockroachDB to keep writing data without resetting the connection, even if it is left with a single worker.

We need to move this into pre-production, but with this problem we are not entirely sure that it is stable, resilient, and highly available.

Kind regards

Emilio

@emipy I'm not sure if this is what you are asking, but by default the workload tool will quit if it encounters an error. If you'd like to do failure testing, you can use the --tolerate-errors flag when running the test.
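
For example, your earlier command with the flag added (a sketch; keep whatever connection string you are actually using):

```
cockroach workload run tpcc \
  --warehouses=10 --ramp=3m --duration=10m \
  --tolerate-errors \
  'postgresql://root@localhost:26257?sslmode=disable'
```

With that flag the tool should keep running through the connection error instead of exiting.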

It makes sense that you would see at least one “connection is closed” error, since you are terminating a node ungracefully while it is being used.

Could you clarify a few things?

  • Where are you running the workload program?
  • Can you share the connection string that your workload is using?
  • If you view the DB Console, can you verify that all the nodes in the cluster are receiving traffic before you kill one pod?
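
If it helps, with the pod names from the docs manifest (an assumption about your setup) you can reach the DB Console with a port-forward:

```
# Assumption: pods are named cockroachdb-0, cockroachdb-1, ... as in the docs manifest.
kubectl port-forward cockroachdb-0 8080
# Then open http://localhost:8080 and check per-node SQL traffic before deleting a pod.
```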