Reducing memory usage for table with millions of rows

I’m running a single-node CockroachDB cluster to power Caddy’s telemetry data. We’re on a tight budget since we don’t make any money from telemetry, so I’m trying to keep the machine as lean as possible.

The vast majority of the telemetry workload is inserts, interspersed with a few very small (one-row) selects. Originally, CPU use was off the charts until I started batching the inserts. The inserts now happen on the order of ~10-2000 rows at a time (2000 small rows, or at most about 100 larger rows, no more than 1 MB total). Although CPU usage is more manageable now, I was surprised to see that memory usage continues to skyrocket.
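For context, the batching looks roughly like this (a sketch with made-up table and column names; I collapse each batch into a single multi-row INSERT with numbered placeholders):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert turns a batch of rows into a single multi-row INSERT
// statement with numbered placeholders ($1, $2, ...), so up to ~2000 rows
// go to the database as one statement instead of 2000.
func buildBatchInsert(table string, cols []string, rows [][]interface{}) (string, []interface{}) {
	var (
		groups []string
		args   []interface{}
		n      = 1
	)
	for _, row := range rows {
		ph := make([]string, len(row))
		for i, v := range row {
			ph[i] = fmt.Sprintf("$%d", n)
			n++
			args = append(args, v)
		}
		groups = append(groups, "("+strings.Join(ph, ", ")+")")
	}
	query := fmt.Sprintf("INSERT INTO %s (%s) VALUES %s",
		table, strings.Join(cols, ", "), strings.Join(groups, ", "))
	return query, args
}

func main() {
	q, args := buildBatchInsert("metrics", []string{"id", "value"},
		[][]interface{}{{1, "a"}, {2, "b"}})
	fmt.Println(q)    // INSERT INTO metrics (id, value) VALUES ($1, $2), ($3, $4)
	fmt.Println(len(args)) // 4
}
```

The query and args then go to a single `db.Exec` call per batch.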

I’ve followed the recommendations for running in production by setting the cache size to .25 and applying the other settings from the deployment guides – great docs, by the way, thanks for those – but no luck.
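Concretely, a start invocation following the guide looks something like this (a sketch; the store path and values are my own, and the flag names are as spelled in the deployment docs):

```shell
# Flags from the production deployment guide (values/paths are mine).
# --cache and --max-sql-memory each cap a major memory consumer,
# expressed here as a fraction of total system RAM.
cockroach start \
  --cache=.25 \
  --max-sql-memory=.25 \
  --store=/mnt/cockroach-data \
  --background
```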

Just to see how high it could get, I tried increasing swap space to like 6 GB and it still fills it all up. So I’m not sure that adding more memory will even solve the problem.

I did come across this article which says:

Finally, note how the formula above has the number of nodes as divisor: adding more nodes, keeping all parameters equal, will decrease memory usage per node. This property can be used advantageously when running off cloud resources, where memory is often not priced linearly: adding 10 new nodes with 2GB of RAM each is usually cheaper than switching five existing nodes from 2GB to 6GB each. Scale can lower your costs!

Can anyone here independently verify this? I’d be willing to double our spending on telemetry for now if I can know it will work. :slight_smile: (Also I can’t afford to 10x our costs, so adding 10 nodes is out of the question. I admit it is a fairly small node. Minimum recommended.)

Any other recommendations? Specifically, any quick wins to reduce memory use? Slightly slower performance is OK. I am happy to provide more information, just tell me what to do; a bit new at this still.

Hi Matt

Thanks for your interest in CockroachDB and for pushing us to improve our resource usage story.

To be clear (disclaimer): even if we can answer your question (I will attempt to do so below), there are still several known and unknown resource usage bugs in CockroachDB (memory leaks, bad accounting, lack of enforcement of limits, and so on), so any answer needs to be taken with a grain of salt. We cannot promise with 100% confidence that “it will work” at this time.

The flip side is that if you find anything that doesn’t match expectations (or our explanations) we would be keen to work with you and investigate more precisely what is going on, and, hopefully, improve CockroachDB as a result.

Now, on to your question.

Since the blog post on memory usage was published, we have expanded CockroachDB’s usage of memory to track more things on each node:

  • in-RAM statistics about which queries were executed. We try to group the data together so that we use only one unit of storage if the same query is issued twice, but it’s possible that your particular use case does not let this deduplication kick in efficiently.

  • there are various in-RAM queues that grow with the total number of ranges. At a fixed number of ranges, these should become smaller if you increase the node count, but if you keep adding more data they will grow nonetheless.

  • distributed query execution, combined with multiple client queries, can make memory usage “balloon up” in an unexpected way. If you have N nodes, and each node accepts M queries, and all these queries are distributed to all N nodes, then each node has to support the execution of M*N queries simultaneously. In that way, adding more nodes (while keeping data very well distributed) may increase (not decrease) per-node RAM usage.

  • we have increased our usage of in-RAM data structures to optimize transaction conflict resolution on each node. With more concurrent transactions, depending on how (and whether) they attempt to touch or write the same ranges, we may be using more of these data structures.

And perhaps most importantly:

  • since CockroachDB 2.0 we have added a new feature called the “heap profiler”. This enables you to collect (in a file) a snapshot of which objects in CockroachDB are causing RAM usage. We can help you analyze the contents of this file to determine how your RAM is being occupied.
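To make the distributed-execution bullet above concrete, the amplification is just multiplication (a back-of-envelope sketch, not a measurement):

```go
package main

import "fmt"

// concurrentPerNode: with N nodes each accepting M client queries,
// and every query distributed across all N nodes, each node must
// support M*N concurrent query executions.
func concurrentPerNode(nodes, queriesPerNode int) int {
	return nodes * queriesPerNode
}

func main() {
	// Holding M fixed at 10 queries per node, per-node load grows with N:
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("N=%d, M=10 -> %d concurrent executions per node\n",
			n, concurrentPerNode(n, 10))
	}
}
```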

As to quick wins, unfortunately I do not have direct advice to offer (other than reducing the number of concurrent queries per node), but I will ask around the team for whoever else understands current memory usage (and how to analyze the heap profiles).

I hope this helps, and stay tuned for more.


I am being told that you can use this on a running CockroachDB instance:

go tool pprof https://<addr>:8080/debug/pprof/heap?debug=1

Hope this helps


More details from @bdarnell:

If the server’s running, you can just go to /debug/pprof/ui/heap/1/flamegraph on the HTTP port in your browser.
You need go tool pprof and the cockroach binary to use the files that are auto-written to heap_profile.
Also, at least on GCP, memory is priced linearly these days, so adding memory to your existing nodes is cheaper than adding new nodes with both memory and CPU.


Great, thank you for the answers! I’m glad this is something the team is aware of, and I hope you’ll find ways to resolve it, or at least make memory usage more tunable, in the future.

I’ll play around with the pprof stuff (I’m also a Go programmer, so I’m at least mildly familiar with pprof, though definitely not an expert in its tooling) – that should give me a little more insight.

Nice tip about the linear pricing, that’s good to know. (Although, as I said, even if I quadruple the RAM it will still run out. Sigh…)
