Maintenance of hosts in a Kubernetes cluster

Sometimes you need to restart hosts, install updates, or update the operating system on the worker nodes and controllers of the Kubernetes cluster. This article describes a host maintenance procedure that minimizes the downtime of the KUMA Core in a high-availability configuration.

Before taking hosts out for maintenance, create a backup copy of the KUMA Core.

Controller maintenance

Cluster controllers must undergo maintenance one at a time. No additional steps are required before performing maintenance on a controller. After maintenance or an upgrade, make sure that the controller service has started successfully by checking its status with the following commands:

sudo systemctl status k0scontroller

sudo k0s status

After the controller is back up, you can proceed to the next controller. As long as you service the controllers one at a time, the KUMA Core remains available.
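
If the cluster has several controllers, the checks can be run from a single workstation over SSH. The sketch below is a minimal example, not part of the product tooling: the controller host names kuma-ctrl-1 through kuma-ctrl-3 are assumptions, and SSH access with sudo privileges on each controller is assumed.

# Minimal sketch: check the k0scontroller service on each controller host after maintenance.
# The host names are examples; replace them with the names of your controllers.
for host in kuma-ctrl-1 kuma-ctrl-2 kuma-ctrl-3; do
  echo "=== ${host} ==="
  ssh "${host}" 'sudo systemctl is-active k0scontroller && sudo k0s status'
done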

Maintenance of worker nodes

Worker nodes of the cluster must be taken out for maintenance strictly one at a time.

To perform maintenance on worker nodes:

  1. Get the names and current status of worker nodes by running the following command on any controller:

    sudo k0s kubectl get nodes

    All worker nodes must have the Ready status; otherwise, taking additional nodes out for maintenance may make the KUMA Core unavailable.

  2. Before taking a worker node out for maintenance, prevent pods from being scheduled on that node. To do so, run the following command on any controller:

    sudo k0s kubectl cordon <worker_node_name>

    After that, the output of the sudo k0s kubectl get nodes command shows the node with the Ready,SchedulingDisabled status.

  3. Get the name of the worker node on which the KUMA Core is running by running the following command on any controller:

    sudo k0s kubectl get pods -n kuma -o wide

  4. If the KUMA Core is running on the worker node that is undergoing maintenance, restart the KUMA Core deployment to move the KUMA Core to another worker node, and wait until the KUMA Core has started successfully on that node (a consolidated sketch of steps 2-5, including a command that waits for the rollout to complete, is provided after this procedure). To restart the KUMA Core deployment, run the following command:

    sudo k0s kubectl rollout restart deployment core-deployment -n kuma

    While it is being moved to another worker node, the KUMA Core is unavailable for approximately 10 minutes.

  5. Check the status of the KUMA Core by running the following command again:

    sudo k0s kubectl get pods -n kuma -o wide

    The pod must have the Running status, and the name of the worker node in the NODE column must have changed.

  6. Stop the k0sworker service on the node that you are taking out for maintenance by running the following command on that node:

    sudo k0s stop

  7. Restart the host, install updates, or update the operating system, as required.
  8. Start the k0s service on the worker node with the following command:

    sudo k0s start

  9. Make sure that the k0s service has started successfully by running the following commands on the worker node:

    sudo systemctl status k0sworker

    sudo k0s status

    The k0sworker.service must be in the active (running) state.

    The k0s status command must return Status: Running.

  10. Allow pods to be run on the updated worker node by running the following command on any controller:

    sudo k0s kubectl uncordon <worker_node_name>

  11. Make sure that the worker node in the cluster is available by running the following command on any controller:

    sudo k0s kubectl get nodes

    The updated worker node must have the Ready status.

  12. After returning the worker node to the cluster, make sure that all volume replicas are restored and available. To view the status of a volume, run the following command (a check that covers all volumes and replicas at once is sketched after this procedure):

    sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

    The status must be healthy. If the status is degraded, then one of the replicas is unavailable or is being rebuilt.

  13. If you want to monitor the volume rebuilding process, run the following command:

    sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

  14. If you want to view the current status of replicas, run the following command:

    sudo k0s kubectl get replicas -n longhorn-system

    All replicas must have the running status.
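
The commands in steps 12-14 inspect the first volume in the list (.items[0]). If the cluster contains more than one Longhorn volume, all volumes and replicas can be checked at once. The following is a minimal sketch, assuming that jq is installed on the controller; the fully qualified Longhorn resource names are used to avoid ambiguity:

# Print the robustness of every Longhorn volume; each value must be healthy.
sudo k0s kubectl get volumes.longhorn.io -n longhorn-system -o json | jq -r '.items[] | "\(.metadata.name): \(.status.robustness)"'

# Print the state of every Longhorn replica; each value should be running.
sudo k0s kubectl get replicas.longhorn.io -n longhorn-system -o json | jq -r '.items[] | "\(.metadata.name): \(.status.currentState)"'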

Worker node maintenance is complete. If the serviced worker node has the Ready status and the KUMA Core volume has the healthy status, you can proceed to perform maintenance on the next worker node.
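
For reference, the KUMA Core relocation described in steps 2-5 can be combined into a short command sequence. This is a sketch rather than a prescribed procedure: <worker_node_name> is the node being taken out for maintenance, and the kubectl rollout status command is used here only as one possible way to wait for the new pod instead of polling manually.

# Prevent new pods from being scheduled on the node that is going out for maintenance.
sudo k0s kubectl cordon <worker_node_name>

# Check which worker node currently runs the KUMA Core pod.
sudo k0s kubectl get pods -n kuma -o wide

# Restart the deployment so that the KUMA Core pod is rescheduled on another node.
sudo k0s kubectl rollout restart deployment core-deployment -n kuma

# Wait until the rollout completes; the relocation takes approximately 10 minutes.
sudo k0s kubectl rollout status deployment core-deployment -n kuma --timeout=15m

# Confirm that the pod is Running and that the NODE column shows a different worker node.
sudo k0s kubectl get pods -n kuma -o wide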

Maintenance of the traffic balancer

Maintenance of the traffic balancer always results in the KUMA Core becoming temporarily unavailable, both for users and for KUMA services. While the balancer is unavailable, the KUMA Core pod cannot be moved from one worker node to another.

If you are planning a long downtime of the main balancer or substantial upgrades, we recommend the following:

  1. Prepare a backup balancer with all necessary updates and duplicate the configuration of nginx, or of whatever traffic balancer you are using, from the main balancer to the backup balancer (a sketch of copying and verifying the configuration is shown after this list). For nginx configured by the KUMA installer, the relevant files are /etc/nginx/nginx.conf and /etc/nginx/kuma_nginx_lb.conf. The FQDN of the backup balancer must be the same as the FQDN of the main balancer. Make sure that the balancer service has started successfully and that the firewall rules applied to the main balancer are also applied to the backup balancer.
  2. Switch traffic over to the backup balancer. To avoid an IP address conflict, first change the IP address of the main balancer to a different address, and then assign the original IP address of the main balancer to the backup balancer. Make sure that the KUMA Core is accessible through the backup balancer and working properly.

    When you switch over the traffic, current sessions may be terminated. If any problems occur, you can revert to the old IP addresses and continue using the main balancer in its previous configuration.

  3. Perform maintenance on the main balancer as necessary.
  4. Change the IP addresses on the main and backup balancers to their old values.
  5. Make sure that the KUMA Core on the updated main balancer is working properly.
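
One possible way to duplicate the nginx configuration and switch the IP addresses is sketched below. This is a minimal example, not a definitive procedure: the host name kuma-lb-backup, the NetworkManager connection name ens192, and the IP addresses are placeholders that must be replaced with values from your environment.

# Copy the nginx configuration from the main balancer to the backup balancer.
# Run on the main balancer as a user that can write to /etc/nginx on the backup balancer.
scp /etc/nginx/nginx.conf /etc/nginx/kuma_nginx_lb.conf kuma-lb-backup:/etc/nginx/

# On the backup balancer: validate the copied configuration and restart nginx.
sudo nginx -t
sudo systemctl restart nginx
sudo systemctl status nginx

# Example of switching IP addresses with NetworkManager (step 2 of this list).
# On the main balancer, move it to a spare address first to avoid a conflict:
sudo nmcli connection modify ens192 ipv4.addresses 192.0.2.20/24 && sudo nmcli connection up ens192
# On the backup balancer, take over the original address of the main balancer:
sudo nmcli connection modify ens192 ipv4.addresses 192.0.2.10/24 && sudo nmcli connection up ens192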

The last step may not be necessary if you want to discard the old main balancer, that is, if you are permanently replacing the balancer with an updated host.

The resource requirements of the traffic balancer are minimal, so we recommend keeping a backup clone of the balancer virtual machine on hand to quickly restore the availability of the KUMA Core in case of problems with the main virtual machine.
