While adding the suggested monitors for a production deployment of CockroachDB, I’ve noticed some oddities in the metrics being reported.
It’s very possible something is off with our deployment or database configuration, but first I wanted to get a sense of what numbers we should be targeting.
cockroachdb.exec.error - This is reporting 1,000+ errors per minute (vs 5M successes). I don’t see any errors being logged by CockroachDB, so this seems surprisingly high.
cockroachdb.liveness.heartbeatfailures - I’m seeing about 30 heartbeat failures per minute. Is that higher than expected?
cockroachdb.sql.txn.rollback.count - 100+ rollbacks per minute. Is that normal?
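
For a sense of scale on the exec.error numbers, here is a rough sketch of the ratio. It assumes the 5M successes cover the same one-minute window as the 1,000 errors, which I haven't confirmed, and `error_ratio` is just a helper name I made up, not anything from CockroachDB.

```python
def error_ratio(errors: int, successes: int) -> float:
    """Return the fraction of operations that failed (0.0 if there were none)."""
    total = errors + successes
    if total == 0:
        return 0.0
    return errors / total


# Assumption: both counts are per minute, taken from the figures quoted above.
ratio = error_ratio(1_000, 5_000_000)
print(f"exec error ratio: {ratio:.4%}")  # roughly 0.02% of operations
```

So even at 1,000+ per minute the error share is small, but I'd still like to know what is producing them.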