Kaspersky Unified Monitoring and Analysis Platform

About KUMA fault tolerance

December 4, 2023

ID 244722

KUMA fault tolerance is ensured by deploying the KUMA Core in the Kubernetes cluster created by the KUMA installer, and by using an external TCP traffic balancer.
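
For example, once the cluster and the external balancer are deployed, you can verify that the KUMA web interface answers through the balancer. The following is only a sketch: the FQDN is a placeholder, and port 7220 is assumed to be the KUMA web interface port; substitute the values used in your deployment.

    # Check that the KUMA web interface responds through the external TCP balancer.
    # kuma-balancer.example.com and port 7220 are placeholders for your deployment.
    curl -k -o /dev/null -w "%{http_code}\n" https://kuma-balancer.example.com:7220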

There are 2 possible roles for nodes in Kubernetes:

  • Controllers (control-plane)—nodes with this role manage the cluster, store metadata, and distribute the workload.
  • Workers—nodes with this role bear the workload by hosting KUMA processes.

Learn more about the requirements for cluster nodes.
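
As a quick check of the cluster layout, you can list the nodes registered in Kubernetes from the terminal of one of the controllers. Note that dedicated controllers that do not also carry the worker role may not appear in this list, because they do not run the Kubernetes worker components.

    # List cluster nodes, their roles, and the hosts they run on.
    k0s kubectl get nodes -o wide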

For production installations of the KUMA Core in Kubernetes, it is critically important to allocate 3 dedicated nodes that have only the controller role. This provides fault tolerance for the Kubernetes cluster and ensures that the workload (KUMA processes and others) cannot affect the tasks associated with managing the cluster. If you are using virtualization tools, make sure that these nodes reside on different physical servers and that no worker nodes are hosted on the same physical servers.

If KUMA is installed for demo purposes, nodes that combine the controller and worker roles are allowed. However, if you later expand a demo installation to a distributed installation, you must reinstall the entire Kubernetes cluster, allocating 3 dedicated nodes with the controller role and at least 2 nodes with the worker role. KUMA cannot be upgraded to later versions if any nodes combine the controller and worker roles.

You can combine different roles on the same cluster node only in a demo deployment of the application.

KUMA Core availability under various scenarios:

  • Malfunction or network disconnection of the worker node where the KUMA Core service is deployed.

    Access to the KUMA web interface is lost. After 6 minutes, Kubernetes initiates migration of the Core pod to an operational node of the cluster. After deployment is complete, which takes less than one minute, the KUMA web interface becomes available again via URLs that use the FQDN of the load balancer. To determine which host the Core is currently running on, run the following command in the terminal of one of the controllers:

    k0s kubectl get pod -n kuma -o wide

    When the malfunctioning worker node or access to it is restored, the Core pod is not migrated back from its current worker node. A restored node can participate in replication of the disk volume of the Core service.

  • Malfunction or network disconnection of a worker node that holds a replica of the KUMA Core disk volume but does not currently host the Core service.

    Access to the KUMA web interface via URLs that use the FQDN of the load balancer is not lost. The network storage creates a new replica of the running Core disk volume on other operational nodes. Access to KUMA via a URL with the FQDN of an operational node is also not disrupted.

  • Loss of availability of one or more cluster controllers when quorum is maintained.

    Worker nodes operate in normal mode, and access to KUMA is not disrupted. A failure of cluster controllers extensive enough to break quorum leads to the loss of control over the cluster. The table below shows how many failed controllers a cluster can tolerate; the quorum arithmetic behind these figures is illustrated in the sketch after this list.

    Correspondence of the number of machines in use to ensure fault tolerance:

    Number of controllers when installing a cluster | Minimum number of controllers required for cluster operation (quorum) | Admissible number of failed controllers
    1 | 1 | 0
    2 | 2 | 0
    3 | 2 | 1
    4 | 3 | 1
    5 | 3 | 2
    6 | 4 | 2
    7 | 4 | 3
    8 | 5 | 3
    9 | 5 | 4

  • Simultaneous failure of all Kubernetes cluster controllers.

    The cluster cannot be managed, and its operation is therefore impaired.

  • Simultaneous loss of availability of all worker nodes that host replicas of the Core disk volume and the Core pod.

    Access to the KUMA web interface is lost. If all replicas are lost, the Core data is lost as well.
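
The quorum values in the table above follow the standard majority rule: a cluster with N controllers remains manageable as long as more than half of them, that is floor(N/2) + 1, are available. A minimal shell sketch that reproduces the table values:

    # Sketch: reproduce the fault tolerance table above for 1 to 9 controllers.
    for n in $(seq 1 9); do
        quorum=$(( n / 2 + 1 ))        # minimum number of controllers required (majority)
        tolerated=$(( n - quorum ))    # admissible number of failed controllers
        echo "controllers: $n, quorum: $quorum, tolerated failures: $tolerated"
    done

To see which controllers are currently members of the cluster's etcd, you can run the k0s etcd member-list command in the terminal of one of the controllers.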
