High memory usage

I am seeing significantly more memory usage than I expected. If it’s normal, that might be fine, but then I need to know how to predict it. The database is only a couple of GB, and each node has 8 GB of RAM, yet cockroach is using over 11 GB of RAM (and swapping).

12-node cluster (plus 2 dead nodes, which there seems to be no way to remove).

The problem doesn’t appear on all the nodes, and the cluster isn’t even being tested heavily yet. I recently increased the replica count from the default of 3 to 5, and started stopping and restarting some of the nodes with a locality set. That might have triggered the increased memory usage on some nodes, but I’m not certain it’s related, as I wasn’t expecting memory requirements to be this high from what little I read on sizing.
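For context, the replica-count change described above would have been made with something like the following. This is a sketch: the `.default` zone name and `-f -` (read YAML from stdin) reflect the v1.0-era `cockroach zone` CLI as I recall it, and the hostname is an example, not one confirmed from this thread.

```shell
# Raise the default replication factor from 3 to 5 by updating the
# cluster-wide .default zone config (v1.0-era syntax; point --host at
# any live node in the cluster):
echo 'num_replicas: 5' | cockroach zone set .default --host=crdb1a -f -
```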

Hi @jlauro!

Sorry for the slow reply to this. We’ve all been heads down trying to get the release candidate out the door.

So for the dead nodes: we keep them around in the status output, but once they are dead, all of their replicas should be re-replicated onto other nodes.

So a few basic questions to start:
What version of cockroach are you running?
Do you have some load on the cluster, or is it idle?
If you do have load, is it all going to a single node or is it distributed among the nodes? And what type of load?

Also, can you explain what your locality settings are?

Version: binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)

The cluster should be idle, but a lot of the nodes are showing load… It’s not busy from clients, but it never seems to go idle, even after leaving it inactive for days (though it’s only been hours on rc.1).

$ dogroup "grep -F '[config]' /data/crdb/logs/cockroach.log ; uptime" crdb+

Processing crdb1a
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] running on machine: crdb1a
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.159014 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:13, 5 users, load average: 1.18, 1.28, 1.31
Connection to crdb1a closed.

Processing crdb2a
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] running on machine: crdb2a
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.477202 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:15, 5 users, load average: 0.23, 0.14, 0.14
Connection to crdb2a closed.

Processing crdb3a
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:25:06
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] running on machine: crdb3a
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:25:06.941979 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 14 days, 7:15, 12 users, load average: 0.94, 0.62, 0.54
Connection to crdb3a closed.

Processing crdb4a
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] file created at: 2017/05/03 22:17:01
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] running on machine: crdb4a
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 22:17:01.035850 69 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 8 days, 4:20, 1 user, load average: 2.46, 2.56, 2.53
Connection to crdb4a closed.

Processing crdb5a
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:09
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] running on machine: crdb5a
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:09.701236 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ohq]
01:41:36 up 8 days, 4:19, 2 users, load average: 1.21, 1.28, 1.40
Connection to crdb5a closed.

Processing crdb1c
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] file created at: 2017/05/04 01:35:23
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] running on machine: crdb1c
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170504 01:35:23.501695 2577759 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=ods]
01:41:37 up 8 days, 1:36, 1 user, load average: 0.76, 0.49, 0.47
Connection to crdb1c closed.

Processing crdb1d
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:57:57
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] running on machine: crdb1d
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:57:57.642682 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=dfw]
01:41:37 up 2:09, 2 users, load average: 1.68, 1.20, 1.01
Connection to crdb1d closed.

Processing crdb1e
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] file created at: 2017/05/03 23:36:48
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] running on machine: crdb1e
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 23:36:48.702122 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=nyc]
01:41:38 up 2:07, 1 user, load average: 0.33, 0.26, 0.24
Connection to crdb1e closed.

Processing crdb1f
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] file created at: 2017/05/04 01:40:56
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] running on machine: crdb1f
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170504 01:40:56.496240 138 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=atl]
01:41:38 up 8 days, 4:20, 1 user, load average: 1.26, 1.55, 1.77
Connection to crdb1f closed.

Processing crdb1h
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:12
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] running on machine: crdb1h
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:12.067698 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:39 up 8 days, 4:20, 1 user, load average: 0.06, 0.09, 0.13
Connection to crdb1h closed.

Processing crdb2h
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] file created at: 2017/05/02 23:48:12
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] running on machine: crdb2h
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170502 23:48:12.919771 1 util/log/clog.go:990 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:41 up 8 days, 1:02, 1 user, load average: 0.50, 0.27, 0.31
Connection to crdb2h closed.

Processing crdb3h
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] file created at: 2017/05/03 00:40:25
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] running on machine: crdb3h
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] binary: CockroachDB CCL v1.0-rc.1-dirty (linux amd64, built 2017/05/01 18:33:34, go1.8.1)
I170503 00:40:25.760618 92 util/log/clog.go:887 [config] arguments: [cockroach start --store=path=/data/crdb --locality=datacenter=slc]
01:41:42 up 8 days, 4:19, 2 users, load average: 1.18, 1.27, 1.30
Connection to crdb3h closed.

$ dogroup "cockroach node status" crdb1a

Processing crdb1a
+----+--------------------------+----------------+--------------------+--------------------+-----------+----------+------------+-------------+-------------+-----------------+----------------------+-------+-------------------+-----------------------+
| id | address | build | updated_at | started_at | live_bytes | key_bytes | value_bytes | intent_bytes | system_bytes | replicas_leaders | replicas_leaseholders | ranges | ranges_unavailable | ranges_underreplicated |
+----+--------------------------+----------------+--------------------+--------------------+-----------+----------+------------+-------------+-------------+-----------------+----------------------+-------+-------------------+-----------------------+
| 1 | crdb1a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:12 | 2017-05-02 23:48:32 | 1159428044 | 211872481 | 1185351247 | 128670 | 2261085 | 121 | 119 | 137 | 0 | 0 |
| 2 | crdb2a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:13 | 2017-05-02 23:48:32 | 1555609906 | 132124895 | 1584360723 | 0 | 1403991 | 30 | 30 | 30 | 0 | 0 |
| 3 | crdb3a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:08 | 2017-05-03 23:25:07 | 489122968 | 147093469 | 527490655 | 0 | 7143859 | 41 | 41 | 70 | 0 | 0 |
| 4 | crdb4a.ces.cvnt.net:26257 | beta-20170420 | 2017-04-26 00:42:15 | 2017-04-26 00:32:54 | 414197579 | 263132629 | 443540902 | 56 | 322549 | 65 | 65 | 80 | 0 | 0 |
| 5 | crdb5a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:32 | 1254336608 | 154373631 | 1288792963 | 0 | 1483634 | 110 | 109 | 110 | 0 | 0 |
| 6 | crdb1c.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:11 | 1258763227 | 236980813 | 1302208729 | 7200 | 1623267 | 31 | 30 | 31 | 0 | 0 |
| 7 | crdb1d.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:43:49 | 2017-05-03 23:57:59 | 853076729 | 112511399 | 858024729 | 0 | 1068483 | 30 | 23 | 44 | 0 | 0 |
| 8 | crdb1f.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:56 | 1343106828 | 204336461 | 1358733187 | 0 | 7534479 | 90 | 89 | 90 | 0 | 0 |
| 9 | crdb1h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:07 | 2017-05-02 23:48:32 | 1045219625 | 170285764 | 1108588917 | 0 | 1492912 | 47 | 47 | 47 | 0 | 0 |
| 10 | crdb4a.ces.cvnt.net:26257 | beta-20170420 | 2017-04-26 00:37:06 | 2017-04-26 00:16:06 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | crdb3h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:15 | 1697832015 | 269914723 | 1681637990 | 0 | 1783742 | 88 | 86 | 89 | 0 | 1 |
| 12 | crdb1e.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-03 23:36:50 | 837064696 | 58057801 | 818900087 | 0 | 1015956 | 28 | 27 | 50 | 0 | 0 |
| 13 | crdb4a.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:11 | 691725075 | 220581452 | 677119482 | 135870 | 2472971 | 104 | 103 | 104 | 0 | 0 |
| 14 | crdb2h.ces.cvnt.net:26257 | v1.0-rc.1-dirty | 2017-05-04 01:44:10 | 2017-05-02 23:48:13 | 893348467 | 144365663 | 911573214 | 135870 | 7976365 | 40 | 16 | 40 | 0 | 0 |
+----+--------------------------+----------------+--------------------+--------------------+-----------+----------+------------+-------------+-------------+-----------------+----------------------+-------+-------------------+-----------------------+
(14 rows)
Connection to crdb1a closed.

Cool. I’m going to try to recreate your scenario and see if I get the same load issues.

For localities, you have
5 nodes at ohq
1 node at ods
1 node at dfw
1 node at nyc
1 node at atl
3 nodes at slc

My hunch is that there’s some thrashing due to rebalancing of either the replicas or the leases, since your setup isn’t very symmetrical. Something like 4 at ohq, 4 at slc, and 4 at the other locations might be a bit better.

Also, is there some large latency between these nodes at different locations?

Depends on your definition of large… some are at least medium, going cross-country. Our international locations would have larger latencies. All latencies are typically under 60 ms RTT for the current test nodes. If we added the international locations, some of the largest node-to-node latencies would be over 160 ms.

So I set up a cluster locally to see if I can reproduce the memory issues you’re seeing. Can you get the heap profiles from each node? Would you be OK if we moved this into a GitHub issue and out of the forum? It’s a lot better for tracking: https://github.com/cockroachdb/cockroach/issues/15702

http(s)://<hostname>:<admin-ui-port>/debug/pprof/heap
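That endpoint returns a binary pprof profile. A sketch of collecting and reading one, assuming the default admin UI port of 8080 and an example hostname (`go tool pprof` ships with the Go toolchain):

```shell
# Grab a heap profile from one node (binary pprof format):
curl -s -o crdb1a.heap http://crdb1a:8080/debug/pprof/heap

# Inspect it with Go's pprof tool; passing the matching cockroach
# binary lets addresses resolve to function names:
go tool pprof -top ./cockroach crdb1a.heap
```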

Moving it is fine with me. I collected the heap info (it’s all binary, is there a way to read it?) and uploaded the file to the GitHub issue. I only have a few hundred MB of data in the test database, but the store location is taking from 382 MB to 2.8 GB (and memory usage is over 8 GB…).

Cool. Got it. Will take a look shortly.