Maintenance of hosts in a Kubernetes cluster

Sometimes you need to restart hosts, install updates, or upgrade the operating system on the worker nodes and controllers of the Kubernetes cluster. The following sections of this article describe how to take hosts out for maintenance while minimizing the downtime of the KUMA Core in a high-availability configuration.

Before performing any manipulations with hosts, you must back up the KUMA Core.

Controller maintenance

Cluster controllers must be taken out of service strictly one at a time. You do not need to perform any preliminary steps before taking a controller out for maintenance. After performing maintenance or an upgrade, you must make sure that the controller service has started successfully by checking its status with the following commands on the controller:

sudo systemctl status k0scontroller

sudo k0s status
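
If the cluster uses the default etcd datastore of k0s (an assumption that holds for a typical multi-controller deployment), you can additionally confirm that the controller has rejoined the cluster by listing the etcd members on any controller:

sudo k0s etcd member-list

The serviced controller must be present in the list of members.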

After the controller is back in service, you can take the next controller out for maintenance. The availability of the KUMA Core is not interrupted as long as the controllers are taken out for maintenance one at a time.

Maintenance of worker nodes

Worker nodes of the cluster must be taken out of service strictly one at a time.

To perform maintenance on worker nodes:

  1. Get the names and current status of worker nodes by running the following command on any controller:

    sudo k0s kubectl get nodes

    All worker nodes must have the Ready status; otherwise, taking more nodes out of service may lead to the complete unavailability of the KUMA Core.

  2. Before performing maintenance on a worker node, you must prevent new pods from being scheduled on that node. To do so, run the following command on any controller:

    sudo k0s kubectl cordon <worker_node_name>

    After that, the output of the sudo k0s kubectl get nodes command shows the node with the Ready,SchedulingDisabled status.

  3. To get the name of the worker node on which the KUMA Core is running, run the following command on any controller:

    sudo k0s kubectl get pods -n kuma -o wide

  4. If the KUMA Core is running on this worker node, you must restart the KUMA Core deployment to move the KUMA Core to another worker node, and then wait until the KUMA Core is successfully started on that node (a command that you can use to track the rollout is given after this procedure). To restart the KUMA Core deployment, run the following command on any controller:

    sudo k0s kubectl rollout restart deployment core-deployment -n kuma

    While the KUMA Core is being moved to another worker node, access to the KUMA Core is suspended for approximately 10 minutes.

  5. To check the status of the KUMA Core, run the following command again: 

    sudo k0s kubectl get pods -n kuma -o wide

    The pod must have the Running status, and the worker node name in the NODE column must have changed.

  6. Stop the k0sworker service on the worker node that you are taking out for maintenance by running the following command on that node:

    sudo k0s stop

  7. Upgrade the operating system on the worker node (an example is given after this procedure).
  8. Start the k0s service on the worker node with the following command:

    sudo k0s start

  9. Make sure that the k0s service has started successfully by running the following commands on the worker node:

    sudo systemctl status k0sworker

    sudo k0s status

  10. Allow pods to be run on the updated worker node by running the following command on any controller:

    sudo k0s kubectl uncordon <worker_node_name>

  11. Make sure that the worker node in the cluster is available by running the following command on any controller:

    sudo k0s kubectl get nodes

    The updated worker node must have the Ready status.

  12. After returning the worker node to the cluster, you must make sure that all volume replicas are restored and available. To view the status of a volume, run the following command on any controller:

    sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

    The status should be healthy; if the status is degraded, one of the replicas is unavailable or is being rebuilt.

  13. To monitor the progress of volume rebuilding, run the following command: 

    sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

  14. To view the current status of replicas, run the following command:

    sudo k0s kubectl get replicas -n longhorn-system

    All replicas must have the running status.

Worker node maintenance is complete. If the serviced worker node is ready and the KUMA Core volume has the 'healthy' status, you can proceed to perform maintenance on the next worker node.
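
A possible way to wait for the KUMA Core rollout started in step 4 to complete is the kubectl rollout status command. It relies only on the core-deployment name used in the procedure above and exits when the new KUMA Core pod is available:

sudo k0s kubectl rollout status deployment core-deployment -n kuma --timeout=15m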
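
The operating system upgrade in step 7 depends on the distribution installed on the worker node. For example, on a Debian-based system (an assumption about your environment; use the package manager of your distribution), the upgrade might look like this:

sudo apt update && sudo apt upgrade -y
sudo reboot    # only if the installed updates require a restart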

Maintenance of the traffic balancer

Taking the traffic balancer out for maintenance always makes the KUMA Core temporarily unavailable, both for users and for KUMA services. While the balancer is unavailable, a KUMA Core pod cannot be moved from one worker node to another.

If you are planning a long downtime of the main balancer or substantial upgrades, we recommend the following:

  1. Prepare a backup balancer host, for example, a clone of the balancer virtual machine, configured in the same way as the main balancer.
  2. Switch the KUMA Core traffic to the backup balancer, for example, by moving the IP address or FQDN of the balancer to the backup host.
  3. Perform maintenance on or upgrade the main balancer.
  4. Switch the traffic back to the main balancer.

The last step may not be necessary if you want to discard the old main balancer, that is, if you are permanently replacing the balancer with an updated host.

The resource requirements of the traffic balancer are minimal; therefore, we recommend keeping a backup clone of the balancer virtual machine on hand so that you can quickly restore the availability of the KUMA Core if there are problems with the main virtual machine.
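
For example, if nginx with the stream module is used as the traffic balancer (an assumption about your deployment; the worker node addresses and the port are placeholders, 7220 being the default port of the KUMA Core web interface), a minimal configuration that can be kept ready on the backup virtual machine might look like this:

stream {
    upstream kuma_core {
        # Placeholder addresses; replace with the FQDNs or IP addresses of your worker nodes.
        server kuma-worker-1.example.com:7220;
        server kuma-worker-2.example.com:7220;
    }

    server {
        # The port on which the balancer accepts connections to the KUMA Core.
        listen 7220;
        proxy_pass kuma_core;
    }
}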
