I have created a cluster with 6 nodes, 3 in each datacenter (I’m using version v2.0):
n1 <=== D1
n2 <=== D1
n3 <=== D1
n4 <=== D2
n5 <=== D2
n6 <=== D2
I started an application writing to n1. So far so good. The database I have created has a replication factor of 6.
I shut down cockroach on n5 and n6 (pkill cockroach), and so far so good. Then the disaster:
When I shut down n4 as well, the application got stuck (I expected this), but from that point on I was not able to recover the cluster by re-adding the last removed node: after restarting n4, the cluster was still down. Connecting to the cockroach console on n1 gave me: “Connection to CockroachDB node lost”. At some point my client (C++ using pqxx) terminated with:
“terminate called after throwing an instance of ‘pqxx::sql_error’
what(): ERROR: waiting on split that failed: split at key /Table/52/1/335888366539702275 failed: context deadline exceeded”.
Eventually, to recover the cluster, I had to restart n5 and n6 as well.
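The majority arithmetic behind this sequence can be sketched as follows. This is only a minimal illustration, assuming every range has one replica on each of the six nodes and that a strict majority of replicas must be live for a range to be available; the helper names `quorum` and `has_quorum` are hypothetical and not part of CockroachDB.

```python
def quorum(replicas: int) -> int:
    """Smallest strict majority of `replicas` voters."""
    return replicas // 2 + 1


def has_quorum(live: int, replicas: int) -> bool:
    """True if `live` replicas are enough to form a majority."""
    return live >= quorum(replicas)


# Assumption: one replica per node, so live nodes == live replicas.
REPLICATION_FACTOR = 6

steps = [
    ("all nodes up", 6),
    ("n5 and n6 stopped", 4),
    ("n4 stopped too", 3),
    ("n4 restarted", 4),
]

for label, live in steps:
    status = "majority" if has_quorum(live, REPLICATION_FACTOR) else "no majority"
    print(f"{label}: {live}/{REPLICATION_FACTOR} replicas live, {status}")
```

Under these assumptions, 4 live replicas out of 6 is exactly a majority and 3 is not, which is why I expected the cluster to come back once n4 was restarted.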
Is this expected?