Maintenance of hosts in a Kubernetes cluster

Sometimes you need to restart hosts, install updates, or upgrade the operating system on the worker nodes and controllers of the Kubernetes cluster. The following sections of this article describe how to take hosts out for maintenance while minimizing the downtime of the KUMA Core in a high-availability configuration.

Before performing any manipulations with hosts, you must back up the KUMA Core.

Controller maintenance

Cluster controllers must be taken out of service strictly one at a time. You do not need to perform any preliminary steps before taking a controller out for maintenance. After performing maintenance or an upgrade, you must make sure that the controller service has started successfully by checking its status with the following commands on the controller:

sudo systemctl status k0scontroller

sudo k0s status
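
If the cluster uses the default etcd datastore of k0s (an assumption that holds for a typical multi-controller deployment), you can additionally confirm that the controller has rejoined the cluster by listing the etcd members on any controller:

sudo k0s etcd member-list

The serviced controller must be present in the list of members.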

After the controller is back in service, you can take the next controller out for maintenance. The availability of the KUMA Core is not interrupted as long as the controllers are taken out for maintenance one at a time.

Maintenance of worker nodes

Worker nodes of the cluster must be taken out of service strictly one at a time.

To perform maintenance on worker nodes:

  1. Get the names and current status of worker nodes by running the following command on any controller:

    sudo k0s kubectl get nodes

    All worker nodes must have the Ready status; otherwise, taking more nodes out of service may lead to the complete unavailability of the KUMA Core.

  2. Before performing maintenance on a worker node, you must prevent new pods from being scheduled on that node. To do so, run the following command on any controller:

    sudo k0s kubectl cordon <worker_node_name>

    After that, the output of the sudo k0s kubectl get nodes command shows the node with the Ready,SchedulingDisabled status.

  3. To get the name of the worker node on which the KUMA Core is running, run the following command on any controller:

    sudo k0s kubectl get pods -n kuma -o wide

  4. If the KUMA Core is running on this worker node, you must restart the KUMA Core deployment to move the KUMA Core to another worker node, and then wait until the KUMA Core is successfully started on that node (a command that you can use to track the rollout is given after this procedure). To restart the KUMA Core deployment, run the following command on any controller:

    sudo k0s kubectl rollout restart deployment core-deployment -n kuma

    While the KUMA Core is being moved to another worker node, access to the KUMA Core is suspended for approximately 10 minutes.

  5. To check the status of the KUMA Core, run the following command again: 

    sudo k0s kubectl get pods -n kuma -o wide

    The pod must have the Running status, and the worker node name in the NODE column must have changed.

  6. Stop the k0sworker service on the worker node that you are taking out for maintenance by running the following command on that node:

    sudo k0s stop

  7. Upgrade the operating system on the worker node (an example is given after this procedure).
  8. Start the k0s service on the worker node with the following command:

    sudo k0s start

  9. Make sure that the k0s service has started successfully by running the following commands on the worker node:

    sudo systemctl status k0sworker

    sudo k0s status

  10. Allow pods to be run on the updated worker node by running the following command on any controller:

    sudo k0s kubectl uncordon <worker_node_name>

  11. Make sure that the worker node in the cluster is available by running the following command on any controller:

    sudo k0s kubectl get nodes

    The updated worker node must have the Ready status.

  12. After returning the worker node to the cluster, you must make sure that all volume replicas are restored and available. To view the status of a volume, run the following command on any controller:

    sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

    The status should be healthy; if the status is degraded, one of the replicas is unavailable or is being rebuilt.

  13. To monitor the progress of volume rebuilding, run the following command: 

    sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

  14. To view the current status of replicas, run the following command:

    sudo k0s kubectl get replicas -n longhorn-system

    All replicas must have the running status.

Worker node maintenance is complete. If the serviced worker node is ready and the KUMA Core volume has the 'healthy' status, you can proceed to perform maintenance on the next worker node.
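
A possible way to wait for the KUMA Core rollout started in step 4 to complete is the kubectl rollout status command. It relies only on the core-deployment name used in the procedure above and exits when the new KUMA Core pod is available:

sudo k0s kubectl rollout status deployment core-deployment -n kuma --timeout=15m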
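
The operating system upgrade in step 7 depends on the distribution installed on the worker node. For example, on a Debian-based system (an assumption about your environment; use the package manager of your distribution), the upgrade might look like this:

sudo apt update && sudo apt upgrade -y
sudo reboot    # only if the installed updates require a restart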

Maintenance of the traffic balancer

Taking the traffic balancer out for maintenance always makes the KUMA Core temporarily unavailable, both for users and for KUMA services. While the balancer is unavailable, a KUMA Core pod cannot be moved from one worker node to another.

If you are planning a long downtime of the main balancer or substantial upgrades, we recommend the following:

  1. Prepare a backup balancer host, for example, a clone of the balancer virtual machine, configured in the same way as the main balancer.
  2. Switch the KUMA Core traffic to the backup balancer, for example, by moving the IP address or FQDN of the balancer to the backup host.
  3. Perform maintenance on or upgrade the main balancer.
  4. Switch the traffic back to the main balancer.

The last step may not be necessary if you want to discard the old main balancer, that is, if you are permanently replacing the balancer with an updated host.

The resource requirements of the traffic balancer are minimal; therefore, we recommend keeping a backup clone of the balancer virtual machine on hand so that you can quickly restore the availability of the KUMA Core if there are problems with the main virtual machine.
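
For example, if nginx with the stream module is used as the traffic balancer (an assumption about your deployment; the worker node addresses and the port are placeholders, 7220 being the default port of the KUMA Core web interface), a minimal configuration that can be kept ready on the backup virtual machine might look like this:

stream {
    upstream kuma_core {
        # Placeholder addresses; replace with the FQDNs or IP addresses of your worker nodes.
        server kuma-worker-1.example.com:7220;
        server kuma-worker-2.example.com:7220;
    }

    server {
        # The port on which the balancer accepts connections to the KUMA Core.
        listen 7220;
        proxy_pass kuma_core;
    }
}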
