Expected Production Metrics

deployment

#1

While adding the suggested monitors for a production deployment of Cockroach, I’ve noticed some oddities in the metrics being reported.

It’s very possible something is off with our deployment or database configuration, but first I wanted to get a sense of what numbers we should be targeting.

  • cockroachdb.exec.error - This is reporting 1000+ errors per minute (vs 5M successes). I don’t see any errors being logged from CDB, so it seems off for this to be so high.

  • cockroachdb.liveness.heartbeatfailures - Seeing about 30 heartbeat failures per minute. I’d imagine that’s higher than expected?

  • cockroachdb.sql.txn.rollback.count - 100+ per minute. Normal?

Thanks!


(Ron Arévalo) #2

Hey @algerithm,

Expected values can vary depending on a few different factors.

Could you let us know the following:

  • How many nodes are in your cluster?
  • Is your cluster running on a single machine in a single datacenter, or is it geo-distributed?
  • Can you share any other specific details about how you set up your cluster?
  • Can you share your workload and schema?

Regarding the heartbeat failures, they seem to be in line with what we’d expect, especially for distributed systems; we see similar values in healthy clusters. For the rollbacks, we use rollbacks to handle client-side retry errors, which can be caused by contention in a workload, so learning more about your workload might shed some light on why those rollbacks are happening.
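
Roughly, the client-side retry protocol that drives those rollbacks looks like this (the table and values below are just placeholders):

    BEGIN;
    SAVEPOINT cockroach_restart;

    -- the transaction's statements, e.g.:
    INSERT INTO accounts (id, balance) VALUES (1, 100);

    -- if a statement fails with a retryable error (SQLSTATE 40001), the client
    -- issues ROLLBACK TO SAVEPOINT cockroach_restart and retries the statements;
    -- otherwise it releases the savepoint and commits:
    RELEASE SAVEPOINT cockroach_restart;
    COMMIT;

Each retry in that loop involves a rollback, which is where a lot of that counter can come from under contention.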

The exec.error rate also doesn’t seem out of the norm compared to what we see on our production clusters, but I’ll need to check with some of the engineers on our core team to get more clarity on that.

Thanks,

Ron


#3

Thanks for the quick reply, @ronarev!

We’re using 3 nodes in our cluster, which are hosted on DS13_v2 VMs on Azure AKS in a single datacenter.

For the Kubernetes setup, we have no resource limits set and a memory request of 6GB for each pod.

Our workload is pretty light at the moment. We serve about 20 requests per minute, with occasional spikes to 400 requests. Each request usually results in a couple of INSERT statements, but according to our Admin Dashboard, none of those statements appear to take more than 100 milliseconds.

Let us know if there’s any other setup that would be useful to know.


(Ron Arévalo) #4

Hey @algerithm,

So, after speaking with our core team engineers: we don’t have any solid target numbers for this, because these errors depend largely on how much contention exists in the workload. For example, we run a production cluster with a high-contention workload that sees about 300 exec.errors/sec against roughly 2,000 successes. Also, if you’re seeing 5M successes while running only about 400 queries, it sounds like you may have very large queries, and generally speaking, the larger the queries, the more contention there is. If you’re running larger queries, you can try using AS OF SYSTEM TIME in them. If you could share a few sample queries for us to look at, that would help us determine where the contention is coming from.
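
As a rough example (the table and interval below are just illustrative), a large read can be pointed at slightly stale data so it doesn’t contend with in-flight writes:

    -- read data as of 10 seconds ago instead of the current timestamp,
    -- so the scan doesn't conflict with writes happening right now
    SELECT customer_id, count(*)
      FROM orders
      AS OF SYSTEM TIME '-10s'
      GROUP BY customer_id;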

Thanks,

Ron


(Ron Arévalo) #5

@algerithm,

Just to follow up with a few more resources, this guide also has some more info on how to manage long-running queries more efficiently.
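
As a rough sketch of the kind of thing that guide covers, you can find and cancel a long-running query from the SQL shell (the query ID below is a placeholder):

    -- list currently running queries and how long they've been running
    SHOW QUERIES;

    -- cancel a specific query using the query_id from the output above
    CANCEL QUERY '<query_id>';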


#6

Thanks for the link, @ronarev.

I thought there might be something in our database schema that increases contention, but I compared these numbers to our dev environment and am seeing the same ratio of successes to failures there (9M successes vs. 5,000 failures).

This is particularly striking because we haven’t run any statements against our development database in over a day (at least according to the “Statements” page in the Admin Console).

Perhaps it’s not related to the efficiency/contention of our queries, but something else?


(Ron Arévalo) #7

Hey @algerithm,

Could you get a screenshot of the counts for txn.autoretries and exec.errors? You can get these by going to http://<HOST>/#/debug/chart, specifically around the time you saw those metrics.
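
If it’s easier than grabbing a screenshot, I believe you can also pull the current counter values straight from SQL, something along these lines (assuming the internal metric names match the Datadog ones without the cockroachdb. prefix):

    -- cumulative counter values as seen by the node you're connected to
    SELECT name, value
      FROM crdb_internal.node_metrics
      WHERE name IN ('txn.autoretries', 'exec.error', 'sql.txn.rollback.count');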

Thanks,

Ron


#8


(Ron Arévalo) #9

Hey @algerithm,

We dug into the code a bit on our end. What we can see is that this error is thrown for a few different reasons: sometimes it’s due to range splits, and sometimes it’s due to SQL transactions trying to insert rows and hitting conflicts. However, at the moment we don’t have a clear way of pinpointing what is causing these errors for your specific cluster.

Thanks,

Ron