Failure scenarios

Temporary unavailability of individual cluster components

If individual cluster components are temporarily unavailable, for example, due to a brief server power outage, a loss of network connectivity, or a failure that required restarting the server or virtual machine, recovery does not require creating virtual machines from scratch or replacing servers. In this case, KUMA Core availability is determined by the set of components that remain in operation. This scenario also covers failures on a server or virtual machine that do not require reinstalling the operating system and that can be quickly remedied by replacing individual hardware components or changing the software configuration.

After the availability of all components is restored, the health of the cluster is restored automatically. However, transient operations, such as the synchronization of volume replicas, take some time to complete, and the cluster remains vulnerable to new failures of other components until they finish. For example, synchronizing the replicas of large volumes can take several hours. Take this recovery time into account when planning an exercise that involves the deliberate shutdown of worker nodes.
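The wait for replica synchronization can be scripted. A minimal polling sketch: the k0s command in the comment is the real robustness check used later in this document, while the CHECK_CMD default is a harmless stub, an assumption made here only so that the sketch is self-contained and runnable anywhere:

```shell
# On a live cluster, point CHECK_CMD at the real robustness query, e.g.:
#   CHECK_CMD="sudo k0s kubectl get volume -n longhorn-system -o json | jq -r '.items[0].status.robustness'"
# The default below is a stub so this sketch is self-contained.
CHECK_CMD=${CHECK_CMD:-"echo healthy"}

status=unknown
for attempt in 1 2 3 4 5; do
  status=$(eval "$CHECK_CMD")
  if [ "$status" = "healthy" ]; then
    break
  fi
  sleep 1   # use a much longer interval (for example, 60 s) for real replica syncs
done
echo "last observed robustness: $status"
```

Only proceed with a planned worker-node shutdown once the loop reports healthy.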

Complete failure of cluster components while the availability of the KUMA Core is maintained

In this case, the cluster keeps the KUMA Core available for some time, until the next component fails. This lets you choose an appropriate time window for recovery and make an up-to-date backup copy of the KUMA Core before you begin.

To restore a cluster:

  1. Prepare new virtual machines or servers to replace failed cluster components in accordance with the KUMA installation requirements.

    At this step, you can use snapshots of virtual machines taken before installing KUMA.

  2. Review the k0s.inventory.yml inventory file and update it if necessary. If you have many services and need to minimize installation time, you can leave one host in the kuma_collector and kuma_correlator sections of the inventory file, and one storage cluster in the kuma_storage section. If a host from the kuma_control_plane_master section has failed, swap it in the k0s.inventory.yml inventory file with a working cluster controller from the kuma_control_plane section.
  3. Install the current KUMA version using the install.sh script and the prepared k0s.inventory.yml inventory file:

    sudo ./install.sh k0s.inventory.yml

  4. Make sure that all cluster components are working properly and that high availability is restored:
    1. All k0s services are running:

      sudo systemctl status <k0sworker/k0scontroller>

      sudo k0s status

    2. Information about the pods and all worker nodes is available:
      • To view the status of a volume, run the following command:

        sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

        The status must be healthy. If the status is degraded, then one of the replicas is unavailable or is being restored.

      • To monitor the progress of volume rebuilding, run the following command:

        sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

        If the cluster is working correctly and no rebuild is in progress, the command returns nothing. If the command returns a rebuild status, some replicas are still being rebuilt. We recommend not making any changes to the cluster until the rebuild is complete.
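The jq filters in step 4 inspect only the first item (.items[0]). When a cluster has several volumes, it can be convenient to check all of them at once. A self-contained sketch of such a check, using a hypothetical sample of the JSON that the volume query returns (volume names and statuses are invented for illustration; on a live cluster you would pipe the real command output instead of the sample):

```shell
# Hypothetical sample of `k0s kubectl get volume -n longhorn-system -o json`
# output; names and statuses are placeholders for illustration only.
sample='{"items":[
  {"metadata":{"name":"pvc-core"},"status":{"robustness":"healthy"}},
  {"metadata":{"name":"pvc-data"},"status":{"robustness":"degraded"}}
]}'

# List every volume with its robustness, not just .items[0]:
echo "$sample" | jq -r '.items[] | "\(.metadata.name): \(.status.robustness)"'

# Summarize: "healthy" only if every volume is healthy.
verdict=$(echo "$sample" | jq -r \
  'if all(.items[]; .status.robustness == "healthy") then "healthy" else "degraded" end')
echo "overall: $verdict"
```

On a real cluster, replace `echo "$sample"` with `sudo k0s kubectl get volume -n longhorn-system -o json`.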

The cluster is restored.
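Step 2 above mentions trimming the inventory file to minimize installation time. An illustrative trimmed fragment is shown below; the host names and the shard/replica/keeper values are placeholders, and the structure and variables of your original k0s.inventory.yml must be kept:

```yaml
# Illustrative fragment only: one collector, one correlator, and one
# storage cluster kept to minimize installation time.
kuma_collector:
  hosts:
    kuma-collector-1.example.com:
kuma_correlator:
  hosts:
    kuma-correlator-1.example.com:
kuma_storage:
  hosts:
    kuma-storage-1.example.com:
      shard: 1
      replica: 1
      keeper: 1
```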

Complete failure of cluster components with the KUMA Core unavailable

You must have a backup of the KUMA Core on hand.

To restore the cluster, you must first remove the existing, failed cluster.

To restore a cluster:

  1. Prepare new virtual machines or servers to replace failed cluster components in accordance with the KUMA installation requirements.

    At this step, you can use snapshots of virtual machines taken before installing KUMA.

  2. Prepare a separate k0s.inventory.yml inventory file for cluster deletion. In this inventory file, remove all hosts from the kuma_collector, kuma_correlator, and kuma_storage sections to avoid having to uninstall and then reinstall services.
  3. Remove the failed cluster:
    1. Run the uninstall.sh script with the k0s.inventory.yml inventory file prepared at step 2:

      sudo ./uninstall.sh k0s.inventory.yml

    2. Restart all hosts from the kuma_worker* and kuma_control_plane* sections of the inventory file.
    3. After starting the hosts from the kuma_worker* and kuma_control_plane* sections of the inventory file, run the uninstall.sh script again with the k0s.inventory.yml inventory file:

      sudo ./uninstall.sh k0s.inventory.yml

  4. Prepare the KUMA inventory file for cluster recovery. Use your current inventory file as the basis. If you have many external services and you need to minimize installation time, you can leave one host in the kuma_collector and kuma_correlator sections of the inventory file, and leave one storage cluster in the kuma_storage section. If you do not need to minimize the installation time and restarting of external KUMA services is permissible, then you can use the inventory file without modifications.
  5. Install the current KUMA version using the install.sh script and the prepared k0s.inventory.yml inventory file to restore the cluster:

    sudo ./install.sh k0s.inventory.yml

  6. Restore the KUMA Core from the backup.
  7. Make sure that the KUMA Core and other KUMA services are working properly. To do this, go to the Resources → Active services section. All services must have the green status.
  8. Make sure that all cluster components are working properly and that high availability is restored:
    1. All k0s services are running:

      sudo systemctl status <k0sworker/k0scontroller>

      sudo k0s status

    2. Information about the pods and all worker nodes is available:
      • To view the status of a volume, run the following command:

        sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

        The status must be healthy. If the status is degraded, then one of the replicas is unavailable or is being restored.

      • To monitor the progress of volume rebuilding, run the following command:

        sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

        If the cluster is working correctly and no rebuild is in progress, the command returns nothing. If the command returns a rebuild status, some replicas are still being rebuilt. We recommend not making any changes to the cluster until the rebuild is complete.

The cluster is restored.

Failure of the traffic balancer

If the traffic balancer fails, the KUMA Core becomes unavailable. However, restoring the balancer does not require deleting or changing the existing cluster or KUMA services. If you have a snapshot of the balancer virtual machine taken after KUMA was installed, you can use that snapshot. If you do not have a snapshot, or if you want to replace the failed server with a new one, install the balancer on the new host and configure it using the previously saved configuration.
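For orientation, a TCP balancer configuration of this kind typically has the following shape. A minimal sketch, assuming nginx with the stream module serves as the balancer; the host names and port are illustrative placeholders, and your previously saved configuration is the authoritative source:

```
stream {
    upstream kuma_core {
        server kuma-worker-1.example.com:7220;
        server kuma-worker-2.example.com:7220;
    }
    server {
        listen 7220;
        proxy_pass kuma_core;
    }
}
```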

After the traffic balancer is recovered, access to the cluster and the KUMA Core is restored.
