Cold starts of large clusters may require manual intervention

I was trying to do a rolling upgrade of my database from 19.1.5 to 19.2.0 but failed.

I stopped my first server and replaced the binary with the new one; however, the whole cluster stopped responding.
I then rolled the binary back to 19.1.5, but it did not help and the cluster still did not respond.
I stopped every node and restarted the cluster, which also failed.
I replaced all binaries with 19.2.0 and restarted; it failed again.
I then set these two environment variables according to the known limitation, and the cluster restarted without issue:
```
COCKROACH_SCAN_INTERVAL=60m
COCKROACH_SCAN_MIN_IDLE_TIME=1s
```
I finalized the upgrade and let the cluster run for a few hours; everything looked good.
I removed the above environment variables and tried a rolling restart of each node, but the cluster stopped responding again.
I had to set the environment variables and restart the cluster once more before it worked.
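For reference, the workaround boiled down to exporting the two variables in each node's environment before starting the process (a sketch with the values from my cluster; the actual `cockroach start` flags are elided and would be whatever you normally use):

```shell
# On each node, export the scan-related workaround values from the
# known limitation before starting the process:
export COCKROACH_SCAN_INTERVAL=60m
export COCKROACH_SCAN_MIN_IDLE_TIME=1s

# Then start the node as usual, e.g.:
# cockroach start ... (your usual flags)
```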

My question:
I don't want to change any default settings (e.g., change range_max_bytes to 128 MB). Am I safe to keep the above environment variables in production? Because if any of the servers gets rebooted, it may take down the whole cluster.

Hey @leokwan

As the documentation states, you need to allow > 90% of the replicas to become quiescent before doing the rolling restart. Please monitor the quiescence of the replicas in the AdminUI, and once it's above 90%, feel free to try removing the environment variables and restarting.
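If it helps to check this outside the AdminUI, here is one rough way to compute the quiescent fraction from the Prometheus-style metrics each node serves (a sketch; it assumes the node's HTTP port is 8080 and that the metric names `replicas` and `replicas_quiescent` are exposed at `/_status/vars` in your version):

```shell
# quiesced_pct reads Prometheus-style metrics on stdin and prints the
# percentage of replicas that are quiescent.
quiesced_pct() {
  awk '$1 == "replicas_quiescent" {q = $2}
       $1 == "replicas"           {r = $2}
       END {printf "%.1f\n", 100 * q / r}'
}

# Against a live node (HTTP port 8080 assumed):
#   curl -s http://localhost:8080/_status/vars | quiesced_pct

# Demo on a captured sample:
printf 'replicas 200\nreplicas_quiescent 186\n' | quiesced_pct
```

Once that number is comfortably over 90, a rolling restart without the environment variables should be safe to attempt.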

Bear in mind that we are tracking this issue in GitHub, which you can find here. Let me know how it works out for you.

Cheers,
Ricardo