Node cannot rejoin the cluster after failure

I am using CockroachDB v21.1.5.
I have a 3-node deployment spread across 3 separate k8s clusters.

My pods are launched with the following options:

/cockroach/cockroach start --join=mynode1,localhost,mynode3 --cluster-name=split-db --disable-cluster-name-verification --logtostderr=INFO --certs-dir=/cockroach/cockroach-certs/ --http-port=8080 --port=26257 --cache=25% --max-sql-memory=25% --advertise-host=mynode2 --advertise-addr=mynode2:26257

My setup runs fine, but when I simulate a node 2 failure by scaling down its StatefulSet, I cannot make the node rejoin the cluster. On restart, the pod outputs the following error:

I211007 07:56:59.298879 13 server/node.go:388 ⋮ [n2] 37 initialized store s2
I211007 07:56:59.298984 13 kv/kvserver/stores.go:250 ⋮ [n2] 38 read 2 node addresses from persistent storage
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 can’t determine lease status of (n1,s1):1 due to node liveness error: liveness record not found in cache
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 +(1) attached stack trace
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + – stack trace:
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness.init
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/liveness/liveness.go:52
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5652
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5647
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5647
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5647
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5647
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.doInit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:5647
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.main
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/proc.go:191
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | runtime.goexit
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 + | /usr/local/go/src/runtime/asm_amd64.s:1374
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 +Wraps: (2) liveness record not found in cache
W211007 07:56:59.300478 182 kv/kvserver/replica_range_lease.go:611 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 39 +Error types: (1) *withstack.withStack (2) *errutil.leafError
W211007 07:56:59.300817 182 kv/kvserver/store.go:1690 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 40 could not gossip system config: ‹[NotLeaseHolderError] lease state couldn’t be determined; r6: replica (n2,s2):15 not lease holder; lease holder unknown›
W211007 07:56:59.300817 182 kv/kvserver/store.go:1690 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 40 +(1) ‹[NotLeaseHolderError] lease state couldn’t be determined; r6: replica (n2,s2):15 not lease holder; lease holder unknown›
W211007 07:56:59.300817 182 kv/kvserver/store.go:1690 ⋮ [n2,s2,r6/15:‹/Table/{SystemCon…-11}›] 40 +Error types: (1) *roachpb.NotLeaseHolderError
I211007 07:56:59.301717 196 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 41 circuitbreaker: ‹rpc [::]:26257 [n3]› tripped: failed to resolve n3: unable to look up descriptor for n3
I211007 07:56:59.301828 196 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447 ⋮ [n2] 42 circuitbreaker: ‹rpc [::]:26257 [n3]› event: ‹BreakerTripped›
I211007 07:56:59.304829 124 1@server/server.go:1555 ⋮ [n2] 43 node connected via gossip
I211007 07:56:59.304957 195 1@vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322 ⋮ [n2] 44 circuitbreaker: ‹rpc [::]:26257 [n1]› tripped: failed to resolve n1: unable to look up descriptor for n1
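In case it helps reproduce, the failure was simulated roughly like this (the namespace and StatefulSet names below are placeholders for my actual resources):

```shell
# Simulate a node 2 failure: scale its StatefulSet to zero, then back up
kubectl -n <namespace> scale statefulset <node2-statefulset> --replicas=0
# ... wait for the pod to terminate ...
kubectl -n <namespace> scale statefulset <node2-statefulset> --replicas=1
```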

Any thoughts?

Hi,

Can you share your StatefulSet configuration so I can take a look? Are you deploying this across 3 distinct Kubernetes clusters, or 3 nodes within a single Kubernetes cluster?

Hi,

I am deploying 3 nodes total, spread across 3 distinct Kubernetes clusters.

Here is the StatefulSet config of node 2:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: split-db-eu2
  namespace: cockroachdb-eu2
  uid: 3021b763-95b3-41ef-8e25-e46ac989004a
  resourceVersion: '97180018'
  generation: 7
  creationTimestamp: '2021-10-07T07:52:49Z'
  labels:
    app(dot)kubernetes(dot)io/component: cockroachdb
    app(dot)kubernetes(dot)io/instance: split-db-eu2
    app(dot)kubernetes(dot)io/managed-by: Helm
    app(dot)kubernetes(dot)io/name: eu2
    argocd.argoproj.io/instance: split-db-eu2
    helm.sh/chart: cockroachdb-6.0.6-div1
  managedFields:
    - manager: argocd-application-controller
      operation: Update
      apiVersion: apps/v1
      time: '2021-10-07T07:52:49Z'
      fieldsType: FieldsV1
status:
  observedGeneration: 7
  replicas: 1
  readyReplicas: 1
  currentReplicas: 1
  updatedReplicas: 1
  currentRevision: split-db-eu2-684949d85c
  updateRevision: split-db-eu2-684949d85c
  collisionCount: 0
spec:
  replicas: 1
  selector:
    matchLabels:
      app(dot)kubernetes(dot)io/component: cockroachdb
      app(dot)kubernetes(dot)io/instance: split-db-eu2
      app(dot)kubernetes(dot)io/name: eu2
  template:
    metadata:
      creationTimestamp: null
      labels:
        app(dot)kubernetes(dot)io/component: cockroachdb
        app(dot)kubernetes(dot)io/instance: split-db-eu2
        app(dot)kubernetes(dot)io/name: eu2
    spec:
      volumes:
        - name: datadir
          persistentVolumeClaim:
            claimName: datadir
        - name: certs
          emptyDir: {}
        - name: certs-secret
          secret:
            secretName: cockroachdb-node
            defaultMode: 256
      initContainers:
        - name: copy-certs
          image: busybox
          command:
            - /bin/sh
            - '-c'
            - >-
              cp -f /certs/* /cockroach-certs/; chmod 0400
              /cockroach-certs/*.key
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          resources: {}
          volumeMounts:
            - name: certs
              mountPath: /cockroach-certs/
            - name: certs-secret
              mountPath: /certs/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      containers:
        - name: db
          image: 'cockroachdb/cockroach:v21.1.5'
          args:
            - shell
            - '-ecx'
            - >-
              exec /cockroach/cockroach start
              --join=mynode1,localhost,mynode3
              --cluster-name=split-db --disable-cluster-name-verification
              --advertise-host=$(hostname).${STATEFULSET_FQDN}
              --logtostderr=INFO --certs-dir=/cockroach/cockroach-certs/
              --http-port=8080 --port=26257 --cache=25% --max-sql-memory=25%
              --advertise-host=mynode2
              --advertise-addr=mynode2:26257
          ports:
            - name: grpc
              containerPort: 26257
              protocol: TCP
            - name: http
              containerPort: 8080
              protocol: TCP
          env:
            - name: STATEFULSET_NAME
              value: split-db-eu2
            - name: STATEFULSET_FQDN
              value: split-db-eu2(dot)cockroachdb-eu2(dot)svc(dot)cluster(dot)local
            - name: COCKROACH_CHANNEL
              value: kubernetes-helm
          resources:
            limits:
              cpu: '1'
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 512Mi
          volumeMounts:
            - name: datadir
              mountPath: /cockroach/cockroach-data/
            - name: certs
              mountPath: /cockroach/cockroach-certs/
            - name: certs-secret
              mountPath: /cockroach/certs/
            - name: backup
              mountPath: /backups
          livenessProbe:
            httpGet:
              path: /health
              port: http
              scheme: HTTPS
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health?ready=1
              port: http
              scheme: HTTPS
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 2
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 60
      dnsPolicy: ClusterFirst
      serviceAccountName: split-db-eu2
      serviceAccount: split-db-eu2
      securityContext: {}
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app(dot)kubernetes(dot)io/component: cockroachdb
                    app(dot)kubernetes(dot)io/instance: split-db-eu2
                    app(dot)kubernetes(dot)io/name: eu2
                topologyKey: kubernetes(dot)io/hostname
      schedulerName: default-scheduler
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology(dot)kubernetes(dot)io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app(dot)kubernetes(dot)io/component: cockroachdb
              app(dot)kubernetes(dot)io/instance: split-db-eu2
              app(dot)kubernetes(dot)io/name: eu2
  volumeClaimTemplates:
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: datadir
        creationTimestamp: null
        labels:
          app(dot)kubernetes(dot)io/instance: split-db-eu2
          app(dot)kubernetes(dot)io/name: eu2
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 8Gi
        volumeMode: Filesystem
      status:
        phase: Pending
    - kind: PersistentVolumeClaim
      apiVersion: v1
      metadata:
        name: backup
        creationTimestamp: null
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
        volumeMode: Filesystem
      status:
        phase: Pending
  serviceName: split-db-eu2
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
  revisionHistoryLimit: 10

Note that:

  • the nodes are exposed publicly on the hosts mynode1, mynode2, and mynode3;
  • I had to replace "." in URLs with (dot) because of the forum policy.

Thanks

So mynode1 and mynode2 are the IPs of the nodes you're trying to connect to? And have you verified that it is picking up the same PVC when you scale the StatefulSet back up?

Sorry for the late reply.

Actually, this is node #2, so mynode1 and mynode3 are the hostnames I am trying to connect to.
And yes, it is picking up the same PVC.
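I checked the PVC with something like the following; the claim name is inferred from the volumeClaimTemplates (template name + pod name), so treat it as an assumption:

```shell
# Assumed claim name: <template name>-<pod name> = datadir-split-db-eu2-0
# The VOLUME column (the bound PV) should be identical before and after scaling
kubectl -n cockroachdb-eu2 get pvc datadir-split-db-eu2-0
```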

The only way I have found to make the node rejoin the cluster is to delete the PVC and restart the pod; the node then joins the cluster as "node #4". This is not a solution, as it requires manual action and the data has to be resynced from node #1 and node #3.