We have a 15-node CockroachDB cluster running on GKE. Replicas are distributed evenly across the nodes.
In our deployment, we set --cache 25% --max-sql-memory 25% for memory allocation, but some nodes keep running into OOM, meaning they hit the memory limits defined in the StatefulSet YAML file.
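To make the budget concrete, here is a rough sketch of how those two flags would split a pod's memory. It assumes the percentages resolve against the container's memory limit (which we have not verified), and the 8 GiB limit and the memory_budget helper are hypothetical, not our actual settings.

```python
GIB = 1024**3

def memory_budget(limit_bytes, cache_fraction=0.25, sql_fraction=0.25):
    """Split a memory limit into cache, SQL memory, and remaining headroom (bytes)."""
    cache = limit_bytes * cache_fraction
    sql = limit_bytes * sql_fraction
    return {"cache": cache, "sql": sql, "headroom": limit_bytes - cache - sql}

# Hypothetical 8 GiB pod limit; substitute the limit from the StatefulSet.
budget = memory_budget(8 * GIB)
for name, size in budget.items():
    print(f"{name:9s} {size / GIB:.1f} GiB")
```

Under that assumption, half of the limit would be left for everything else the process needs.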
We are scraping the Prometheus endpoint on each CockroachDB node. Which metrics would be a good set to collect to understand a node's memory usage?
In the logs, we also see some memory figures in the periodic runtime stats line:
[n1] runtime stats: 5.7 GiB RSS, 512 goroutines, 301 MiB/271 MiB/768 MiB GO alloc/idle/total, 4.0 GiB/5.1 GiB CGO alloc/total, 720.5 CGO/sec, 16.1/4.8 %(u/s)time, 0.0 %gc (0x), 1.5 MiB/1.3 MiB (r/w)net
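For anyone who wants to compare these figures across nodes, here is a minimal sketch that pulls the memory fields out of that line and converts them to bytes. The regular expression is only fitted to the line above, and the names parse_runtime_stats and to_bytes are our own helpers, not part of CockroachDB.

```python
import re

# Unit multipliers for the sizes printed in the runtime stats line.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}
GIB = 1024**3

LINE = ("[n1] runtime stats: 5.7 GiB RSS, 512 goroutines, "
        "301 MiB/271 MiB/768 MiB GO alloc/idle/total, "
        "4.0 GiB/5.1 GiB CGO alloc/total, 720.5 CGO/sec, "
        "16.1/4.8 %(u/s)time, 0.0 %gc (0x), 1.5 MiB/1.3 MiB (r/w)net")

# A size is a number, a space, and a unit such as B, KiB, MiB, or GiB.
SIZE = r"[\d.]+ (?:[KMG]i)?B"
PATTERN = re.compile(
    rf"(?P<rss>{SIZE}) RSS.*?"
    rf"(?P<go_alloc>{SIZE})/(?P<go_idle>{SIZE})/(?P<go_total>{SIZE}) GO alloc/idle/total, "
    rf"(?P<cgo_alloc>{SIZE})/(?P<cgo_total>{SIZE}) CGO alloc/total"
)

def to_bytes(text):
    """Convert a size such as '5.7 GiB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]

def parse_runtime_stats(line):
    """Return the memory fields of a runtime stats line, in bytes."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a runtime stats line")
    return {name: to_bytes(size) for name, size in match.groupdict().items()}

stats = parse_runtime_stats(LINE)
for name, value in stats.items():
    print(f"{name:10s} {value / GIB:6.2f} GiB")
```

On this line, GO total plus CGO total comes to about 5.85 GiB against 5.7 GiB RSS, and CGO alloc alone is roughly 70% of RSS.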
Thanks for any recommendations you may have!