How to monitor cockroach process

We need to monitor CRDB with the external monitoring system that our NOC watches for alerts. It is primarily SNMP-based, with support for many other protocols (HTTP, TCP, DNS, etc.), but we are not sure how to use it to monitor CRDB for uptime, whether the process is running, storage metrics, and so on. In the past we have written custom bash scripts that collect information and relay it via custom SNMP OIDs, but it would be nice if CRDB already implemented some API for collecting monitoring data.

CockroachDB has a big write-up about how to monitor the database.

@doane I have already read this. However, we do not want to install an entirely new monitoring infrastructure; we would like to figure out how to use our existing tools to monitor CRDB.

Do you want to use HTTPS, or something from the command line?

https://[admin UI]:8080/_status/vars

Or, from the command line, check out "cockroach node status".
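
For the command-line route, a monitoring script could shell out to the CLI and parse what it prints. The sketch below is only an illustration, not an official integration: it assumes a cockroach binary on the PATH and a version whose CLI accepts --format=csv (older releases may not), and the function names and the extra_args parameter are made up for this example. Column names vary by version, so the parser simply keeps whatever header the command prints.

```python
import csv
import io
import subprocess


def parse_node_status_csv(output):
    """Turn CSV text (header row plus one row per node) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(output))
    return [dict(row) for row in reader]


def node_status(extra_args=()):
    """Run `cockroach node status` and return one dict per node.

    extra_args is a hypothetical hook for connection flags such as
    --host or --insecure; adjust it to match your cluster.
    """
    # Assumption: this CLI version supports --format=csv.
    cmd = ["cockroach", "node", "status", "--format=csv", *extra_args]
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=30
    )
    return parse_node_status_csv(result.stdout)
```

With check=True, a non-zero exit code raises subprocess.CalledProcessError, which a wrapper script could translate into a "down" status for the monitoring system.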

Hi @somecallmemike, as @doane mentioned, even if you can’t use Prometheus, the HTTP endpoint it uses, <hostname>:8080/_status/vars, gives you time series metrics in an easy-to-parse format, so you might consider using that and massaging the data to work with your monitoring system.
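
To give an idea of what "massaging the data" could look like, here is a minimal Python sketch that fetches /_status/vars and flattens the Prometheus-style text into a dictionary. It is a rough starting point under a few assumptions: the function names are invented for this example, the base URL is whatever your node's HTTP address is, and the parser handles only plain "name{labels} value" sample lines, skipping anything it does not recognize.

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_metrics(text):
    """Map 'name' or 'name{labels}' to a float for every sample line."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        try:
            metrics[name + (labels or "")] = float(value)
        except ValueError:
            continue  # value was not a number; ignore the line
    return metrics


def fetch_metrics(base_url, timeout=5):
    """Fetch <base_url>/_status/vars and parse it (base_url is an assumption)."""
    url = base_url.rstrip("/") + "/_status/vars"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Something like fetch_metrics("http://<hostname>:8080") would then return a dictionary that a custom sensor script can pick individual values from. The exact metric names depend on the CockroachDB version, so check the endpoint's output for the ones you care about.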

There are also a number of raw JSON endpoints listed at <hostname>:8080/#/debug/. That and other debug pages are still under construction, though, and are not exposed by default. To access them, you must first run SET CLUSTER SETTING server.remote_debugging.mode = 'any'; in the built-in SQL shell.

@doane and @jesse awesome thank you! Those endpoints are definitely more useful for our existing system as it can connect via http(s) and parse the output appropriately.

Great to hear this works! Do you mind me asking what external tool you use? We are starting to think about which output formats to support. Do you have a preference?

By the way, you can also hit the /health endpoint for a basic health check.
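
As a rough illustration, an "is it alive" probe against /health could be as small as the Python sketch below. The function names and the base URL are assumptions for this example, and treating HTTP 200 as healthy is a simplification; a secure cluster with self-signed certificates would also need extra TLS configuration that is not shown here.

```python
import urllib.request


def health_url(base_url):
    """Build the health-check URL from the node's HTTP address."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url, timeout=5):
    """Return True only if the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False
```

A wrapper script could exit with 0 or 1 based on the return value, which is usually enough for a custom sensor in a monitoring tool.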

@dianasaur323 We use PRTG and SolarWinds, as we also monitor a vast network infrastructure. SNMP is still a very useful protocol that is widely adopted across many industries, so I would say that providing some basic information like “I’m alive” or connection counts via SNMP would be super useful for monitoring. Otherwise, JSON is the standard we’ve adopted for any kind of intrasystem communication, as it’s parsable by everything under the sun.

It may not be exactly what you want, but there are tools out there for scraping Prometheus metrics as JSON, e.g.
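
If no ready-made tool fits, the conversion is also small enough to do by hand. The Python sketch below is not any particular existing tool: it is a simplified, hypothetical converter (the function name and the output layout are my own choices) that turns Prometheus-style text into a JSON array with one object per sample. It does not unescape label values, and it writes NaN and infinite values as null because standard JSON cannot represent them.

```python
import json
import math
import re

# Metric name, optional {labels}, then the sample value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One key="value" pair inside the braces (escapes are kept as-is).
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def prometheus_to_json(text):
    """Convert Prometheus text exposition lines into a JSON array string."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        try:
            number = float(value)
        except ValueError:
            continue
        if not math.isfinite(number):
            number = None  # becomes null in the JSON output
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append({"name": name, "labels": labels, "value": number})
    return json.dumps(samples, indent=2, sort_keys=True)
```

Keeping labels as a nested object rather than folding them into the metric name makes it easier for a JSON-aware sensor to filter on, say, a particular store.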

@somecallmemike perfect! thank you for the feedback. I’ll add it to our notes to help us formulate what this feature would look like going forward.