Imbalanced replicas and node OOMs on Kubernetes deployment

We're observing some strange issues with our CockroachDB installation running on Kubernetes. We have a GCP Kubernetes pool of 15 nodes (4 CPUs, 15 GB RAM each), with a 1 TB SSD disk attached to each node.
The cluster was recently created from scratch, and I've started a data migration process, which simply copies data from our current storage system.

In terms of data layout, there's just one database with a couple of thousand tables, each containing anywhere from thousands to hundreds of millions of records.

After some time I noticed that there's a significant replica-count imbalance between nodes:

Not all nodes are visible in the screenshot, so I'll list all of them with their corresponding replica counts:

n1 - 12914
n2 - 4173
n3 - 4170
n4 - 4173
n5 - 4172
n6 - 4177
n7 - 5288
n8 - 5289
n9 - 4173
n10 - 5290
n11 - 12789
n12 - 13109
n13 - 4178
n14 - 2660
n15 - 4175

There seem to be three nodes (n1, n11, and n12) with a significantly higher number of replicas.
I tried setting:

SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '8MiB';
SET CLUSTER SETTING kv.snapshot_recovery.max_rate = '32MiB';

but there's no visible change afterwards.
Is there a way to force a replica rebalance somehow?
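As a quick sanity check, the settings above can be read back to confirm they took effect (setting names as in recent CockroachDB versions; check `SHOW ALL CLUSTER SETTINGS` if they differ on your release):

```sql
SHOW CLUSTER SETTING kv.snapshot_rebalance.max_rate;
SHOW CLUSTER SETTING kv.snapshot_recovery.max_rate;
```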

The replication factor is 3, and I've increased range_max_bytes to 256 MB.
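For reference, a range_max_bytes change like this is typically applied through a zone configuration. A sketch assuming the default range zone was the one modified (CONFIGURE ZONE syntax as of v19.1+; older releases use EXPERIMENTAL CONFIGURE ZONE):

```sql
-- 256 MiB = 268435456 bytes; shown for the default zone here,
-- but adjust to whichever zone was actually changed.
ALTER RANGE default CONFIGURE ZONE USING range_max_bytes = 268435456;
SHOW ZONE CONFIGURATION FOR RANGE default;
```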

Another issue (which might be related to the imbalance) is that certain nodes are sporadically being restarted by k8s with an OOM reason: n1, n2, n11, and n14 were restarted in the last couple of hours.
n1 and n11 are among those with a high number of replicas, but n2 and n14 are not, so I'm confused.

The cockroach process is started with --cache 25% --max-sql-memory 25%, and there's a dedicated node for every cockroach StatefulSet instance. Is there a way to debug these issues somehow?
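A back-of-the-envelope memory budget may help here: with both flags at 25% on a 15 GB node, half the RAM is committed up front, and everything else (Go runtime, goroutine stacks, memtables, intermediate allocations, plus whatever else runs on the node) must fit in the remainder. This is illustrative arithmetic based on the figures in the post, not a diagnosis of the OOMs:

```python
# Memory budget for one 15 GB node started with
# --cache 25% --max-sql-memory 25% (figures from the post above).
GIB = 1024 ** 3
node_ram = 15 * GIB

cache = 0.25 * node_ram      # storage-engine block cache
max_sql = 0.25 * node_ram    # SQL memory budget

reserved = cache + max_sql
headroom = node_ram - reserved
print(f"reserved by flags: {reserved / GIB:.2f} GiB")  # reserved by flags: 7.50 GiB
print(f"headroom: {headroom / GIB:.2f} GiB")           # headroom: 7.50 GiB
```

Note that --max-sql-memory is a budget rather than a preallocation, so actual usage varies with workload; spiky ingestion can push the process past what the kubelet allows even when the flag totals look safe.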

Thank you!

Hey @syhpoon,

How is load balanced across the cluster? Have you set up any localities or zone configurations? The symptoms suggest a constraint keeping data from balancing as expected.

It'd be helpful to see the output of SHOW ALL ZONE CONFIGURATIONS, as well as the graph at this URL: https://<youradminurl:port>/#/debug/chart?charts=%5B%7B"metrics"%3A%5B%7B"downsampler"%3A1%2C"aggregator"%3A2%2C"derivative"%3A0%2C"perNode"%3Atrue%2C"source"%3A""%2C"metric"%3A"cr.node.sql.conns"%7D%5D%2C"axisUnits"%3A0%7D%5D.
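In case it helps, that zone dump is a single statement from any SQL session; constraints or lease preferences in its output would explain replicas piling up on a subset of nodes:

```sql
-- Look for per-database/table constraints or lease_preferences
-- that could pin replicas to specific nodes:
SHOW ALL ZONE CONFIGURATIONS;
```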

If you'd prefer a private thread to provide the zone configs, you can submit a support issue and I'll pick it up there.


Oh, also, the output of the network latency report (<adminurl>/#/reports/network) would be helpful as well.

Hi Tim,

Actually, after a couple of days I can now see that replicas are slowly being rebalanced. Maybe it's related to the decreased data-ingestion rate, giving CockroachDB time to clean things up.

Anyway, let’s consider this issue resolved for now.
Thank you for your help!

Glad to hear ranges rebalanced as expected!