Rolling restart of nodes for a Kubernetes multi-cluster deployment

(Rolland M.) #1

What’s the best way to do a rolling restart of nodes in a multi-cluster deployment with Kubernetes?

I have automation in place to install and run a multi-cluster deployment across multiple AWS regions, but when I updated the statefulset in the primary region first and then in the second AWS region, the nodes in the first region were not able to rejoin the cluster. Any idea or pointer on what could have caused this?
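
For context, the update step in my automation is roughly equivalent to the following (the statefulset name, namespace, and kubectl contexts here are placeholders, not my actual setup):

```
# Region 1: apply the updated StatefulSet spec, then wait for pods to cycle back
kubectl --context=region-1 -n cockroachdb apply -f cockroachdb-statefulset.yaml
kubectl --context=region-1 -n cockroachdb rollout status statefulset/cockroachdb

# Region 2: repeat the same update
kubectl --context=region-2 -n cockroachdb apply -f cockroachdb-statefulset.yaml
kubectl --context=region-2 -n cockroachdb rollout status statefulset/cockroachdb
```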

Thanks,
Rolland.

(Ron Arévalo) #2

Hey @rmewanou,

Did you receive any errors? Were you able to access the Admin UI from the first-region nodes? If so, were all nodes listed on the overview page?

Also, this doc might be helpful.

Thanks,

Ron

(Rolland M.) #3

Thanks for the reply, Ron. I was able to reach the Admin UI in the second region once the second-region nodes came back up, but that was all. Most Admin UI pages were not working, and the overview page showed all the nodes from the primary region as dead.

I was not able to see the logs, and since this was a test environment and we needed to resume testing quickly, I deleted the cluster and restarted from scratch. I am going to re-test the upgrade process following the link you provided, but apart from not setting cluster.preserve_downgrade_option as that page suggests (sketched below), I don’t see anything different from what I did in the following steps:
1- Update the statefulset in the first region, wait for the changes to be applied and the pods to come back online.
2- Move to the second region and repeat the same steps.
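
For reference, my understanding is that the setting that page suggests would be applied like this before starting the roll (the pod name, namespace, security flags, and version string are all placeholders for whatever the cluster actually runs):

```
# Pin the cluster version before the rolling upgrade so a downgrade stays possible;
# adjust the pod name, flags, and version to the actual deployment.
kubectl --context=region-1 -n cockroachdb exec -it cockroachdb-0 -- \
  ./cockroach sql --insecure \
  --execute="SET CLUSTER SETTING cluster.preserve_downgrade_option = '20.1';"
```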

Is there any other documentation on additional required steps? Is the cluster setting change needed if I am only adding more metadata to the Kubernetes pods? I will retry the same steps and update this question if I get more details.

Again, thanks for your help and support.
Rolland.

(Rolland M.) #4

For those interested in this topic, I applied the patches again and this time the restarts succeeded without problems. The main difference is that I waited longer between the patch/restart on each cluster, and there were no ongoing transactions. I’ll post an update if I hit any other issues.
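
In case it helps anyone reproducing this, the extra waiting I added amounts to gating on rollout completion and node liveness before touching the next region, rather than using a fixed sleep. A rough sketch (names, namespace, and flags are placeholders):

```
# Wait until region 1's rollout fully completes (fails loudly on timeout)
kubectl --context=region-1 -n cockroachdb rollout status statefulset/cockroachdb --timeout=15m

# Confirm every node reports is_live = true before starting on region 2
kubectl --context=region-1 -n cockroachdb exec cockroachdb-0 -- \
  ./cockroach node status --insecure
```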