Minikube Node Restarts


(Jesse Ezell) #1

I’m giving a CockroachDB cluster a try on Minikube. I set it up with the stable CockroachDB Helm chart. It generally works, but the containers in the cluster restart every few minutes, which then causes connection errors. Is there a special trick for running in minikube?
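For reference, the install was roughly along these lines (Helm 2 syntax; the release name is just a placeholder):

helm install --name my-release stable/cockroachdb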

Describing the pod lists this:

Warning Unhealthy 10m (x233 over 1d) kubelet, minikube Readiness probe failed: Get http://172.17.0.6:8080/health?ready=1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 2m (x109 over 1d) kubelet, minikube Liveness probe failed: Get http://172.17.0.6:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)


(Jesse Ezell) #2

It looks like the issue may be memory pressure adding latency to the health checks and causing pod recycling. I’m doubling the RAM devoted to minikube to see if that solves it.
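For anyone following along, changing minikube’s resources means recreating the VM; roughly (the values here are just what I’m trying, not a recommendation):

minikube delete                         # the memory setting only applies to a newly created VM
minikube start --memory=8192 --cpus=4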


(Jesse) #3

Hi @jezell,

We haven’t tested minikube with helm very much, but we do have a minikube tutorial. Would you be willing to give that a try?

Best,
Jesse


(Jesse Ezell) #4

Thanks, I’ll check this out. Is this configuration different from what’s in the stable/cockroachdb Helm chart?


(Tim O'Brien) #5

@jezell - not appreciably, and neither config on our side limits the amount of RAM available to each pod. Were you specifying the resources available to minikube while the issue was happening?
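If you’re not sure what the VM actually ended up with, something like this should show it (assuming the default node name of minikube):

minikube config view             # any memory/cpus overrides you have set
kubectl describe node minikube   # the node's allocatable CPU and memory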


(Jesse Ezell) #6

Even after bumping minikube’s RAM up to 16 GB and switching to the chart linked above, I’m still getting fairly regular liveness and readiness check failures that cause the pods to be killed by Kubernetes.

2m 3m 3 cockroachdb-1.1557c9ea3fd02b0b Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Liveness probe failed: Get http://172.17.0.7:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m 3m 3 cockroachdb-2.1557c9ea4258e827 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Liveness probe failed: Get http://172.17.0.8:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m 3m 4 cockroachdb-0.1557c9dd4cc46100 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Liveness probe failed: Get http://172.17.0.6:8080/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2m 24m 4 cockroachdb-2.1557c8bf1eef17c8 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: HTTP probe failed with statuscode: 503
2m 24m 4 cockroachdb-0.1557c8bef764dbd0 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: HTTP probe failed with statuscode: 503
1m 24m 5 cockroachdb-1.1557c8bf7462a335 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: HTTP probe failed with statuscode: 503
1m 3m 12 cockroachdb-2.1557c9ea8a5b9556 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: Get http://172.17.0.8:8080/health?ready=1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
1m 3m 13 cockroachdb-0.1557c9e938c2a775 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: Get http://172.17.0.6:8080/health?ready=1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
1m 3m 13 cockroachdb-1.1557c9de115915a4 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: Get http://172.17.0.7:8080/health?ready=1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
1m 1m 1 cockroachdb-0.1557c9fb9d863c6e Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: Get http://172.17.0.6:8080/health?ready=1: dial tcp 172.17.0.6:8080: getsockopt: connection refused
1m 1m 1 cockroachdb-2.1557c9fbc515e3a1 Pod spec.containers{cockroachdb} Warning Unhealthy kubelet, minikube Readiness probe failed: Get http://172.17.0.8:8080/health?ready=1: dial tcp 172.17.0.8:8080: getsockopt: connection refused
1m 1m 1 cockroachdb-2.1557c9fbf45f7a19 Pod spec.containers{cockroachdb} Normal Pulled kubelet, minikube Container image "cockroachdb/cockroach:v2.0.5" already present on machine
1m 1m 1 cockroachdb-1.1557c9fbef3ae08e Pod spec.containers{cockroachdb} Normal Pulled kubelet, minikube Container image "cockroachdb/cockroach:v2.0.5" already present on machine
1m 1m 1 cockroachdb-1.1557c9fbef26e35a Pod spec.containers{cockroachdb} Normal Killing kubelet, minikube Killing container with id docker://cockroachdb:Container failed liveness probe… Container will be killed and recreated.
1m 1m 1 cockroachdb-0.1557c9fbec464740 Pod spec.containers{cockroachdb} Normal Killing kubelet, minikube Killing container with id docker://cockroachdb:Container failed liveness probe… Container will be killed and recreated.
1m 1m 1 cockroachdb-2.1557c9fbf444ca6f Pod spec.containers{cockroachdb} Normal Killing kubelet, minikube Killing container with id docker://cockroachdb:Container failed liveness probe… Container will be killed and recreated.
1m 24m 2 cockroachdb-2.1557c8bc6f840dfa Pod spec.containers{cockroachdb} Normal Created kubelet, minikube


(Tim O'Brien) #7

Hey @jezell,

What do the node logs say? Are there any warnings or errors thrown when minikube starts to see timeouts?
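Something like the following should pull them, assuming the default StatefulSet pod naming:

kubectl logs cockroachdb-0
kubectl logs --previous cockroachdb-0   # output from the container instance that was just killed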


(Jesse Ezell) #8

estimated_pending_compaction_bytes: 0 B
W180929 07:53:42.777245 38380 storage/node_liveness.go:504 [n1,s1,r602/1:/System/tsd/cr.store.{re…-va…}] slow heartbeat took 37.6s
W180929 07:53:42.931409 38441 storage/node_liveness.go:504 [n1,s1,r621/1:/System/tsd/cr.node.sql.mem.…] slow heartbeat took 37.4s
W180929 07:53:42.969618 38440 vendor/google.golang.org/grpc/server.go:961 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
W180929 07:53:42.971897 369 storage/raft_transport.go:465 [n1] raft transport stream to node 2 failed: EOF
W180929 07:53:43.112798 38377 vendor/google.golang.org/grpc/server.go:961 grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"
I180929 07:53:43.313318 38475 cli/start.go:730 15 running tasks
I180929 07:53:44.556683 38378 storage/node_liveness.go:716 [n1,s1,r601/1:/System/ts{d/cr.st…-e}] retrying liveness update after storage.errRetryLiveness: result is ambiguous (error=context canceled [exhausted])
W180929 07:53:46.103982 197 storage/engine/rocksdb.go:1805 batch [1/51/0] commit took 1.415596536s (>500ms):
goroutine 197 [running]:
runtime/debug.Stack(0x3a43cbd0, 0xed3412687, 0x0)
/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc428f91200, 0xc425db6800, 0xc, 0x400)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1806 +0x128
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc428f91200, 0xed3412601, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1724 +0x7c0
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc42078ee00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, …)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3623 +0x5ae
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x27840c0, 0xc425b7ec30, 0xc42078ee00, 0x27840c0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3868 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4204a2800, 0x27840c0, 0xc425b7ec30, 0xc425826888, 0xc427afded0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3190 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4204a2800, 0x27840c0, 0xc4207855f0, 0x2a9)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3856 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420a6e000, 0x27840c0, 0xc4207855f0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:226 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x27840c0, 0xc4207855f0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc420522960, 0xc4201d6a20, 0xc420522950)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xad
E180929 07:53:46.224798 38380 storage/replica_range_lease.go:282 [n1,s1,r602/1:/System/tsd/cr.store.{re…-va…}] context canceled
E180929 07:53:47.333065 38441 storage/replica_range_lease.go:282 [n1,s1,r621/1:/System/tsd/cr.node.sql.mem.…] context canceled
W180929 07:53:48.155558 265 storage/node_liveness.go:504 [n1,hb] slow heartbeat took 4.4s
I180929 07:53:48.158036 80 server/status/runtime.go:219 [n1] runtime stats: 955 MiB RSS, 227 goroutines, 122 MiB/35 MiB/185 MiB GO alloc/idle/total, 746 MiB/887 MiB CGO alloc/total, 9.11cgo/sec, 0.00/0.16 %(u/s)time, 0.00 %gc (0x)
I180929 07:53:49.411696 38475 cli/start.go:730 14 running tasks
W180929 07:53:49.799874 38378 storage/node_liveness.go:504 [n1,s1,r601/1:/System/ts{d/cr.st…-e}] slow heartbeat took 44.7s
W180929 07:53:51.146397 199 storage/store.go:3926 [n1,s1] handle raft ready: 12.1s [processed=0]
W180929 07:53:51.209300 178 storage/store.go:3926 [n1,s1] handle raft ready: 10.9s [processed=0]
W180929 07:53:51.209599 201 storage/store.go:3926 [n1,s1] handle raft ready: 10.4s [processed=0]
W180929 07:53:51.238634 228 storage/store.go:3926 [n1,s1] handle raft ready: 10.4s [processed=0]
W180929 07:53:51.327258 185 storage/store.go:3926 [n1,s1] handle raft ready: 12.2s [processed=0]
W180929 07:53:51.327352 211 storage/store.go:3926 [n1,s1] handle raft ready: 11.8s [processed=0]
W180929 07:53:51.522320 196 storage/store.go:3926 [n1,s1] handle raft ready: 12.4s [processed=0]
I180929 07:53:52.818008 38475 cli/start.go:730 12 running tasks
W180929 07:53:52.837556 208 storage/store.go:3926 [n1,s1] handle raft ready: 11.3s [processed=0]
W180929 07:53:52.846110 221 storage/store.go:3926 [n1,s1] handle raft ready: 11.1s [processed=0]
W180929 07:53:52.846403 205 storage/store.go:3926 [n1,s1] handle raft ready: 19.0s [processed=0]
W180929 07:53:52.864946 195 storage/store.go:3926 [n1,s1] handle raft ready: 13.8s [processed=0]
W180929 07:53:52.877117 265 storage/node_liveness.go:441 [n1,hb] failed node liveness heartbeat: context deadline exceeded
W180929 07:53:52.878366 183 storage/store.go:3926 [n1,s1] handle raft ready: 2.0s [processed=0]
E180929 07:53:53.311032 38378 storage/replica_range_lease.go:282 [n1,s1,r601/1:/System/ts{d/cr.st…-e}] context canceled
I180929 07:53:53.854333 38508 storage/raft_transport.go:459 [n1] raft transport stream to node 2 established
W180929 07:53:54.483819 212 storage/engine/rocksdb.go:1805 batch [1/61/0] commit took 1.01623909s (>500ms):
goroutine 212 [running]:
runtime/debug.Stack(0x165edc18, 0xed3412691, 0x0)
/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).commitInternal(0xc42614e480, 0x0, 0x1, 0xc42520e6f8)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1806 +0x128
github.com/cockroachdb/cockroach/pkg/storage/engine.(*rocksDBBatch).Commit(0xc42614e480, 0xed3412601, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/engine/rocksdb.go:1724 +0x7c0
github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked(0xc420905500, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, …)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/replica.go:3623 +0x5ae
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue.func1(0x27840c0, 0xc42508c480, 0xc420905500, 0x27840c0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3868 +0x109
github.com/cockroachdb/cockroach/pkg/storage.(*Store).withReplicaForRequest(0xc4204a2800, 0x27840c0, 0xc42508c480, 0xc4297f24e0, 0xc425a9ded0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3190 +0x135
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRequestQueue(0xc4204a2800, 0x27840c0, 0xc420a87dd0, 0x259)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:3856 +0x229
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).worker(0xc420a6e000, 0x27840c0, 0xc420a87dd0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:226 +0x21b
github.com/cockroachdb/cockroach/pkg/storage.(*raftScheduler).Start.func2(0x27840c0, 0xc420a87dd0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc420522b40, 0xc4201d6a20, 0xc420522b30)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:196 +0xe9
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:189 +0xad
I180929 07:53:55.502223 80 server/status/runtime.go:219 [n1] runtime stats: 955 MiB RSS, 228 goroutines, 123 MiB/34 MiB/185 MiB GO alloc/idle/total, 747 MiB/887 MiB CGO alloc/total, 34.80cgo/sec, 0.00/0.13 %(u/s)time, 0.00 %gc (0x)
W180929 07:53:56.536483 38394 kv/dist_sender.go:1305 [ts-poll,n1] have been waiting 1m0s sending RPC to r604 (currently pending: [(n2,s2):3]) for batch: Merge [/System/tsd/cr.node.txn.aborts/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.commits/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.commits1PC/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.autoretries/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.abandons/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-max/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p99.999/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p99.99/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p99.9/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p99/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p90/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p75/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.durations-p50/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-max/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p99.999/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p99.99/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p99.9/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p99/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p90/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.node.txn.restarts-p75/1/10s/2018-09-29T07:00:00Z,/Min), … 31 skipped …, Merge [/System/tsd/cr.store.compactor.suggestionbytes.skipped/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.store.compactor.suggestionbytes.compacted/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.store.compactor.compactions.success/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.store.compactor.compactions.failure/1/10s/2018-09-29T07:00:00Z,/Min), Merge [/System/tsd/cr.store.compactor.compactingnanos/1/10s/2018-09-29T07:00:00Z,/Min)
I180929 07:53:57.452103 38475 cli/start.go:730 13 running tasks
W180929 07:54:00.390559 266 sql/jobs/registry.go:300 canceling all jobs due to liveness failure
W180929 07:54:03.233120 259 kv/dist_sender.go:1305 [txn=658f8a8c,n1] have been waiting 1m0s sending RPC to r7 (currently pending: [(n3,s3):2]) for batch: [txn: 658f8a8c], Get [/Table/3/1/108/2/1,/Min)


(Tim O'Brien) #9

Hey @jezell, those logs indicate that the node is struggling and is starved of either CPU or disk I/O; for example, they show commits and small updates taking 30s+. It’s likely getting to a point where it’s unresponsive and can’t catch up. Given that you’re running on minikube, you might have better results by scaling out beyond a single machine. You’d never run multiple nodes on a single machine in production, so it’s not really a good way to test CRDB’s performance. We have a tutorial on deploying a single cluster using Kubernetes here.


(Jesse Ezell) #10

I’m not really trying to test performance, just to integrate it into a local development environment in a stable way. Would it be better to run a single-node CRDB cluster for local development to reduce resource usage (and if so, is there a sample config or guide for that somewhere)? Would it be safe to increase the readiness/liveness check timeouts a bit so it doesn’t just keep getting recycled?

Cockroach seems to restart quite frequently on minikube even when it’s not really under any load. Maybe the defaults aren’t ideal for a local development cluster? It’s certainly possible that our setup is packed a bit tight and that’s causing trouble, but we’ve been using this setup successfully for years with other databases, so ideally we wouldn’t have to throw out our local environments just to put in Cockroach.


(Jesse Ezell) #11

Attempting to scale down the cluster seems to break it… is scaling down the StatefulSet supported?

kubectl scale statefulset cockroachdb --replicas=2

I waited for the scale-down to complete, then scaled down to 1 with:

kubectl scale statefulset cockroachdb --replicas=1

This resulted in an endless stream of errors from the remaining pod:

W181001 22:59:35.602072 279777 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial cockroachdb-2.cockroachdb.default.svc.cluster.local:26257: context canceled; please retry.
I181001 22:59:36.740974 311 server/status/runtime.go:219 [n1] runtime stats: 1.1 GiB RSS, 193 goroutines, 141 MiB/78 MiB/268 MiB GO alloc/idle/total, 797 MiB/941 MiB CGO alloc/total, 78.80cgo/sec, 0.02/0.01 %(u/s)time, 0.00 %gc (0x)
W181001 22:59:37.246860 320 storage/node_liveness.go:504 [n1,hb] slow heartbeat took 4.5s
W181001 22:59:37.247132 320 storage/node_liveness.go:441 [n1,hb] failed node liveness heartbeat: context deadline exceeded
W181001 22:59:37.404121 279837 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {cockroachdb-2.cockroachdb.default.svc.cluster.local:26257 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-2.cockroachdb.default.svc.cluster.local: no such host". Reconnecting…
W181001 22:59:37.454698 279666 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {cockroachdb-1.cockroachdb.default.svc.cluster.local:26257 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 172.17.0.7:26257: i/o timeout". Reconnecting…
W181001 22:59:37.454888 279666 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {cockroachdb-1.cockroachdb.default.svc.cluster.local:26257 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting…
W181001 22:59:37.454965 279666 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial cockroachdb-1.cockroachdb.default.svc.cluster.local:26257: context canceled; please retry.
W181001 22:59:38.402102 279837 vendor/google.golang.org/grpc/clientconn.go:1158 grpc: addrConn.createTransport failed to connect to {cockroachdb-2.cockroachdb.default.svc.cluster.local:26257 0 }. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting…
W181001 22:59:38.402166 279837 vendor/google.golang.org/grpc/clientconn.go:830 Failed to dial cockroachdb-2.cockroachdb.default.svc.cluster.local:26257: context canceled; please retry.


(bram@cockroachlabs.com) #12

Hi @jezell,

You can scale a cluster down, but it has to be done slowly. And more than that, you can’t scale down lower than the quorum needed based on the replication factor as defined in your zone configs.
See https://www.cockroachlabs.com/docs/stable/configure-replication-zones.html#replication-zone-format for more details.

Since your cluster probably has the default replication factor of 3, you can scale down to 2 (though that would be risky), but not lower, or the cluster won’t be able to make progress. Consistency can’t be guaranteed unless more than 50% of the replicas are available.

I think you can set your zone configs to use 1 replica, but that would have to be done on the system zones as well. I haven’t tested this in a while, but it should work. To remove a node, though, you would have to decommission it so that it first moves off any replicas that are still needed: once you’re at a replication factor of 1 there is only a single copy of each key/value pair, and just pulling a node would result in data loss.
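A rough sketch of that path, with the node ID and flags purely illustrative (and again, I haven’t re-tested this recently):

echo 'num_replicas: 1' | cockroach zone set .default --insecure -f -
cockroach zone ls --insecure                # repeat the zone change for the system zones listed here
cockroach node status --insecure            # find the ID of the node you want to remove
cockroach node decommission 3 --insecure    # drains its replicas before you shut it down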

Did that make sense?

I’d recommend starting a fresh single node cluster as it is a much faster procedure.


(Jesse Ezell) #13

Thanks, I’ll give that a shot. It would be really nice to be able to run a one-node cluster to save some resources in a local setup.


(bram@cockroachlabs.com) #14

Oh, you can run a one-node cluster locally; it’s how we do our own testing and what we expect most developers will do.

The problem arises when you scale it up: the moment you do, the cluster expects that larger minimum size from then on. So start a fresh single-node cluster and you should be good to go.
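In minikube terms that would look something like the following; the label selector and manifest name are assumptions based on the standard configs, so adjust them for your chart:

kubectl delete statefulset cockroachdb
kubectl delete pvc -l app=cockroachdb            # wipe the old volumes so the new cluster really starts fresh
# set replicas: 1 in the StatefulSet manifest, then:
kubectl create -f cockroachdb-statefulset.yaml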


(Jesse Ezell) #15

Awesome. If I set replicas to 1 in the standard YAML file and run kubectl create, it still tries to connect to nodes 2-3. Is there a sample YAML file somewhere for starting up with that config in Kubernetes?


(bram@cockroachlabs.com) #16

Don’t worry about the zone configs if you start a fresh cluster. It will try for a bit to find a 2nd and 3rd node, but will give up. You can ignore the warnings.


(Jesse) #17

Piggybacking on Bram’s responses: if you do want to stop those replication warnings in the log, you can either update the .default replication zone to have 1 replica, or run cockroach zone set .default --insecure --disable-replication. See these docs for more details.
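On Kubernetes you can run that from one of the pods, e.g. (assuming the default pod name and the binary location used in the official image):

kubectl exec -it cockroachdb-0 -- ./cockroach zone set .default --insecure --disable-replication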


(Jesse Ezell) #18

Awesome! Thanks so much for the help.


(Jesse Ezell) #19

Just an update: switching to one replica and making the readiness/liveness checks less aggressive has solved the local/minikube stability issues so far. It’s much more stable that way.
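For anyone else who hits this, the probe change amounted to something like the following patch to the StatefulSet; the exact thresholds here are illustrative rather than recommended values:

kubectl patch statefulset cockroachdb -p '
spec:
  template:
    spec:
      containers:
      - name: cockroachdb
        # existing httpGet handlers are kept; only the timing fields change
        livenessProbe:
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
        readinessProbe:
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 6
'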