KUMA fault tolerance is ensured by deploying the KUMA Core in a Kubernetes cluster that is deployed by the KUMA installer, and by using an external TCP traffic balancer.
Nodes in Kubernetes can have one of 2 possible roles: controller or worker node.
Learn more about the requirements for cluster nodes.
For production installations of the KUMA Core in Kubernetes, it is critically important to allocate 3 separate nodes that have only the controller role. This provides fault tolerance for the Kubernetes cluster and ensures that the workload (KUMA processes and others) cannot affect the tasks associated with managing the Kubernetes cluster. If you are using virtualization tools, make sure that these nodes reside on different physical servers and that there are no worker nodes on the same physical servers.
In cases where KUMA is installed for demo purposes, nodes that combine the roles of a controller and worker node are allowed. However, if you are expanding an installation to a distributed installation, you must reinstall the entire Kubernetes cluster while allocating 3 separate nodes with the controller role and at least 2 nodes with the worker node role. KUMA cannot be upgraded to later versions if there are nodes that combine the roles of a controller and worker node.
You can combine different roles on the same cluster node only for demo deployment of the application.
KUMA Core availability under various scenarios:
Access to the KUMA web interface is lost. After 6 minutes, Kubernetes initiates migration of the Core pod to an operational node of the cluster. After deployment is complete, which takes less than one minute, the KUMA web interface becomes available again via URLs that use the FQDN of the load balancer. To determine which host the Core is running on, run the following command in the terminal of one of the controllers:
k0s kubectl get pod -n kuma -o wide
When the malfunctioning worker node is restored, or access to it is restored, the Core pod is not migrated back from its current worker node. The restored node can participate in replication of the disk volume of the Core service.
Access to the KUMA web interface via URLs that use the FQDN of the load balancer is not lost. The network storage creates a replica of the running Core disk volume on other running nodes. Access to KUMA via a URL that uses the FQDN of a running node is also not disrupted.
Worker nodes operate in normal mode. Access to KUMA is not disrupted. A failure of cluster controllers extensive enough to break quorum leads to the loss of control over the cluster.
Correspondence of the number of machines in use to ensure fault tolerance
Number of controllers when installing a cluster | Minimum number of controllers required for the operation of the cluster (quorum) | Admissible number of failed controllers |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
8 | 5 | 3 |
9 | 5 | 4 |
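
The values in the table match a simple majority rule: for N controllers, quorum is N / 2 rounded down, plus 1, and the admissible number of failed controllers is N minus quorum. The following is a minimal sketch of that arithmetic, assuming the majority rule holds for the cluster. The function names `quorum` and `tolerated_failures` are hypothetical helpers for illustration and are not part of KUMA or k0s.

```shell
#!/bin/sh
# Hypothetical helpers (not part of KUMA or k0s): majority-quorum arithmetic.

# quorum N: smallest number of controllers that forms a majority of N.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# tolerated_failures N: controllers that can fail while quorum is still kept.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the same three columns as the table for 1 to 9 controllers.
n=1
while [ "$n" -le 9 ]; do
    echo "$n $(quorum "$n") $(tolerated_failures "$n")"
    n=$((n + 1))
done
```

Under this rule, an even number of controllers tolerates no more failures than the odd number below it (4 controllers tolerate 1 failure, the same as 3), which is consistent with the recommendation to allocate 3 controllers.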
The cluster cannot be managed and therefore will have impaired performance.
Access to the KUMA web interface is lost. If all replicas are lost, information will be lost.