Temporary unavailability of individual cluster components
If individual cluster components are temporarily unavailable, for example, due to a temporary server power outage, lack of network connectivity, or a failure that required restarting the server or virtual machine, recovery does not require creating virtual machines from scratch or replacing servers. KUMA Core availability in this case is determined by the set of components that remain in operation. This scenario also covers failures on a server or virtual machine that do not require reinstalling the operating system and that can be quickly remedied by replacing individual hardware components or changing the software configuration.
After the availability of all components is restored, the health of the cluster is restored automatically. However, transient operations such as the synchronization of volume replicas take some time to complete, and during that time the cluster remains vulnerable to new failures of other components. For example, synchronizing the replicas of large volumes can take several hours. This recovery time must be taken into account when planning an exercise that involves deliberately shutting down worker nodes.
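You can track the progress of replica synchronization by querying the state of the Longhorn volume, using the same query that appears in the recovery procedures below (the index of the volume in the query is a placeholder; adjust it to the volume you are checking):
sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'
While synchronization is in progress, the status is degraded; when it returns healthy, the replicas are synchronized.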
Complete failure of cluster components while the availability of the KUMA Core is maintained
In this case, the cluster lets you continue using the KUMA Core for some time (until the next component fails), choose an appropriate time window for the recovery, and make an up-to-date backup copy of the KUMA Core.
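If the KUMA Core is still reachable, you can make the backup copy, for example, over the REST API. The following is only a sketch: the endpoint path, API version, port, token, and host name shown here are assumptions that depend on your KUMA version and configuration, so check them against the documentation for your release:
curl -k -H 'Authorization: Bearer <token>' -o core-backup.tar.gz https://<kuma-core-fqdn>:7223/api/v1/system/backup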
To restore a cluster:
1. Prepare replacement servers or virtual machines for the failed hosts. At this step, you can use snapshots of virtual machines taken before installing KUMA.
2. Comment out the hosts in the kuma_collector and kuma_correlator sections of the inventory file, and leave one storage cluster in the kuma_storage section (a sample inventory fragment is shown after this procedure). If a host from the kuma_control_plane_master section has failed, then in the k0s.inventory.yml inventory file, swap it with another cluster controller from the kuma_control_plane section.
3. Run the installer with the inventory file:
sudo ./install.sh k0s.inventory.yml
4. Check the state of the k0s services on the restored hosts (k0sworker on worker nodes, k0scontroller on controllers) and the state of the cluster:
sudo systemctl status <k0sworker/k0scontroller>
sudo k0s status
5. Check the state of the Longhorn volume:
sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'
The status must be healthy. If the status is degraded, one of the replicas is unavailable or is being restored (to check all volumes at once, see the variation after this procedure).
6. Check whether volume replicas are being rebuilt:
sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'
If the cluster is working correctly, no recovery is in progress and the command returns nothing. If the command returns a rebuilding status, some replicas are in the process of being rebuilt; we recommend not making any changes to the cluster until the rebuilding is complete (to monitor the rebuild, see the example after this procedure).
The cluster is restored.
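In step 2, the commented-out hosts in the inventory file might look as follows. This is a hypothetical fragment: the host names are placeholders, and the exact structure of your k0s.inventory.yml file may differ:
kuma_collector:
  hosts:
    # kuma-collector-1.example.com:
kuma_correlator:
  hosts:
    # kuma-correlator-1.example.com: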
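The query in step 5 checks only the first volume. If the cluster contains more than one Longhorn volume, you can list the robustness of every volume by iterating over all items, as in this variation of the same command:
sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[].status.robustness'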
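To follow the rebuild from step 6 until it finishes, you can rerun the query periodically, for example with watch (the 60-second interval here is an arbitrary choice):
watch -n 60 "sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'"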
Complete failure of cluster components with the KUMA Core unavailable
You must have a backup of the KUMA Core on hand.
To restore a cluster, you must first delete the cluster that is in use.
To restore a cluster:
1. Prepare replacement servers or virtual machines for the failed hosts. At this step, you can use snapshots of virtual machines taken before installing KUMA.
2. Comment out the hosts in the kuma_collector, kuma_correlator, and kuma_storage sections of the inventory file to avoid having to uninstall and then reinstall the services.
3. Delete the existing cluster by running the uninstaller with the inventory file:
sudo ./uninstall.sh k0s.inventory.yml
4. Replace the failed hosts with the new ones in the kuma_worker* and kuma_control_plane* sections of the inventory file, and then run the uninstall.sh script again with the k0s.inventory.yml inventory file:
sudo ./uninstall.sh k0s.inventory.yml
5. Comment out the hosts in the kuma_collector and kuma_correlator sections of the inventory file, and leave one storage cluster in the kuma_storage section. If you do not need to minimize the installation time and restarting external KUMA services is permissible, you can use the inventory file without modifications.
6. Run the installer with the inventory file:
sudo ./install.sh k0s.inventory.yml
7. Restore the KUMA Core from the backup copy.
8. Check the state of the k0s services on the restored hosts (k0sworker on worker nodes, k0scontroller on controllers) and the state of the cluster:
sudo systemctl status <k0sworker/k0scontroller>
sudo k0s status
9. Check the state of the Longhorn volume:
sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'
The status must be healthy. If the status is degraded, one of the replicas is unavailable or is being restored.
10. Check whether volume replicas are being rebuilt:
sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'
If the cluster is working correctly, no recovery is in progress and the command returns nothing. If the command returns a rebuilding status, some replicas are in the process of being rebuilt; we recommend not making any changes to the cluster until the rebuilding is complete.
The cluster is restored.
Failure of the traffic balancer
If the traffic balancer fails, the KUMA Core becomes unavailable. However, restoring the balancer does not require deleting or changing the existing cluster or KUMA services. If you have a snapshot of the balancer virtual machine taken after installing KUMA, you can restore the balancer from that snapshot. If you do not have a snapshot, or if you want to replace the failed server with a new one, install the balancer on the new host and configure it using the previously saved configuration.
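For example, if your balancer is an nginx server, restoring it comes down to reinstalling the package and putting the saved configuration back in place. This is a sketch under the assumption of an nginx-based balancer on a Debian-like system with the configuration backed up to /backup/nginx.conf; adjust the package manager, paths, and service name to your environment:
sudo apt install nginx
sudo cp /backup/nginx.conf /etc/nginx/nginx.conf
sudo nginx -t
sudo systemctl restart nginx
The nginx -t command validates the restored configuration before the service is restarted.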
After the traffic balancer is recovered, access to the cluster and the KUMA Core is restored.