Redundancy of central components of the solution

April 9, 2024

ID 239027

Kaspersky SD-WAN supports two component deployment schemes: N+1 and 2N+1.

The N+1 deployment scheme means that one backup component is deployed alongside an active component. If the active component fails, the backup component instantly takes its place, ensuring continuity of operation.

The 2N+1 deployment scheme extends N+1 with an additional level of redundancy. In this scheme, two synchronized instances of the active component are deployed, and either can take the place of the other if a malfunction occurs; one extra backup component is also deployed. This redundancy scheme keeps components operational even when multiple failures occur in a row.
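
To illustrate the difference, the sketch below simply counts the instances each scheme deploys and the consecutive failures it can absorb. This is an illustration only, not product code, and it assumes that a single surviving instance is enough to keep the component serving requests.

    # Illustrative only: instances deployed per scheme and the number of
    # consecutive failures each scheme absorbs, assuming one surviving
    # instance keeps the component operational.
    INSTANCES = {
        "N+1": 1 + 1,   # one active instance plus one backup
        "2N+1": 2 + 1,  # two synchronized instances plus one backup
    }

    for scheme, count in INSTANCES.items():
        print(f"{scheme}: {count} instances, survives {count - 1} failure(s) in a row")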

The table below shows the redundancy schemes and protocols that are used for different components of the solution.

Redundancy schemes for components of the solution

Component                            Redundancy scheme   Protocol used
-----------------------------------  ------------------  ---------------
Orchestrator                         N+1                 REST
Orchestrator web interface           N+1                 REST
Orchestrator database                2N+1                MongoDB
SD-WAN Controller and its database   2N+1                OpenFlow (TLS)
SD-WAN Gateway                       N+1                 Geneve

An example of locating solution components in geographically dispersed data centers is shown in the figure below. All subsequent figures use the same symbols:

  • orchestrator — orc
  • orchestrator web interface — www
  • orchestrator database — orc-dbs
  • SD-WAN Controller and its database — ctl
  • SD-WAN gateway — GW

For components of the solution that are N+1 redundant, two nodes are deployed in separate data centers. Each of the nodes is in the active state. You can use a virtual IP address or DNS service to select the node to which requests are directed.
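
For example, a client-side selector in front of two active N+1 nodes might probe each node and direct requests to the first one that answers, which is roughly the job a virtual IP address or DNS service performs here. The sketch below is a minimal illustration; the host names are placeholders, not addresses from the product.

    import socket

    # Placeholder addresses for the two active N+1 nodes.
    NODES = [("orc1.example.com", 443), ("orc2.example.com", 443)]

    def pick_active_node(nodes, timeout=2.0):
        """Return the first node that accepts a TCP connection."""
        for host, port in nodes:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    return host, port
            except OSError:
                continue  # node unreachable or timed out; try the next one
        raise RuntimeError("no node is reachable")

    host, port = pick_active_node(NODES)
    print(f"directing requests to {host}:{port}")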

The diagram shows three interconnected locations with solution components.

Placing solution components in geographically dispersed data centers

Components that are 2N+1 redundant form a cluster. This cluster contains one primary node and two nodes that provide redundancy. You can designate one of these nodes as an arbiter to save resources and reduce the requirements for the communication links.

If a cluster node is designated as an arbiter, it does not contain a database and you cannot make it the primary node. The arbiter node takes part in voting when the primary node is selected and exchanges periodic service packets (heartbeats) with other nodes.
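
Because the table above lists MongoDB as the protocol of the orchestrator database, such a cluster can be pictured as a MongoDB replica set whose third member is an arbiter. The sketch below, using the pymongo driver, shows what initiating such a replica set could look like; the host names, port, and replica set name are placeholder assumptions, not values from the product.

    from pymongo import MongoClient

    # Connect directly to the node that will initiate the replica set.
    # All host names below are placeholders.
    client = MongoClient("orc-dbs-1", 27017, directConnection=True)

    config = {
        "_id": "rs0",  # placeholder replica set name
        "members": [
            {"_id": 0, "host": "orc-dbs-1:27017"},                       # location 1
            {"_id": 1, "host": "orc-dbs-2:27017"},                       # location 2
            {"_id": 2, "host": "orc-dbs-3:27017", "arbiterOnly": True},  # arbiter, location 3
        ],
    }

    # An arbiterOnly member stores no data and can never become primary,
    # but it votes in elections and exchanges heartbeats with other members.
    client.admin.command("replSetInitiate", config)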

The figure below shows an example of a failure at one of the locations and how the solution responds to it. In this example, the nodes of the solution component cluster at location 1 fail.

The diagram shows three interconnected locations. An accident causes location 1 to fail.

Accident at location 1

If nodes of the solution component cluster at location 1 fail, the following events occur:

  1. Node orc-dbs 2 and arbiter node orc-dbs 3 lose contact with node orc-dbs 1 and subsequently vote for a new primary node.
  2. Arbiter node orc-dbs 3 cannot become the primary node, so node orc-dbs 2 becomes the primary node and informs the orchestrator of its role.
  3. Node ctl 2 and arbiter node ctl 3 lose contact with node ctl 1 and subsequently vote for a new primary node.
  4. Arbiter node ctl 3 cannot become the primary node, so node ctl 2 becomes the primary node and informs the orchestrator of its role, as sketched below.
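
All four failure scenarios in this section follow one rule: a node holds the primary role only while a majority of the three voting cluster nodes can reach it, and the arbiter is never an eligible candidate. The sketch below is a toy model of that rule, not the cluster's actual election code; the node names are generic stand-ins for the orc-dbs or ctl nodes.

    NODES = ["node 1", "node 2", "node 3"]  # node 3 is the arbiter
    ARBITERS = {"node 3"}
    MAJORITY = len(NODES) // 2 + 1          # 2 of 3 votes

    def elect_primary(links, current_primary):
        """Return the primary after a failure, or None if no node has a majority.

        links is the set of node pairs that can still exchange heartbeats.
        """
        def votes_for(node):
            # A node is backed by itself plus every node it can still reach.
            return sum(1 for n in NODES
                       if n == node or frozenset({node, n}) in links)

        # The sitting primary keeps its role if it still holds a majority;
        # otherwise the other data-bearing node wins if it holds one.
        # The arbiter is never a candidate.
        candidates = [current_primary] + [n for n in NODES
                                          if n != current_primary and n not in ARBITERS]
        for node in candidates:
            if votes_for(node) >= MAJORITY:
                return node
        return None

    # Location 1 fails: every link to node 1 is lost.
    surviving = {frozenset({"node 2", "node 3"})}
    print(elect_primary(surviving, "node 1"))  # -> node 2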

The figure below shows an accident in which the nodes of the solution component cluster fail at location 2.

The diagram shows three interconnected locations. An accident causes location 2 to fail.

Accident at location 2

If nodes of the solution component cluster at location 2 fail, the following events occur:

  1. Node orc-dbs 1 and arbiter node orc-dbs 3 lose contact with node orc-dbs 2, after which node orc-dbs 1 remains the primary node.
  2. Node ctl 1 and arbiter node ctl 3 lose contact with node ctl 2, after which node ctl 1 remains the primary node.
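
Reusing elect_primary from the toy model above shows why no failover happens in this case: the sitting primary still reaches the arbiter, keeping its two-of-three majority.

    # Location 2 fails: only the node 1 - node 3 (arbiter) link survives.
    surviving = {frozenset({"node 1", "node 3"})}
    print(elect_primary(surviving, "node 1"))  # -> node 1 (retained)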

The figure below shows an example of an accident in which the connection between location 1 and location 2 is severed.

The diagram shows three interconnected locations. An accident causes the connection between location 1 and location 2 to fail.

Connection failure between location 1 and location 2

If cluster nodes of solution components at location 1 and location 2 cannot connect to each other, the following events occur:

  1. Node orc-dbs 1 loses contact with node orc-dbs 2.
  2. Node orc-dbs 1 remains the primary node because arbiter node orc-dbs 3 observes both locations operating normally.
  3. Node ctl 1 loses contact with node ctl 2.
  4. Node ctl 1 remains the primary node because arbiter node ctl 3 observes both locations operating normally.
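
In the toy model above, this partition leaves both locations' links to the arbiter intact, so the sitting primary keeps its majority and no failover occurs.

    # The node 1 - node 2 link is cut, but the arbiter still reaches both.
    surviving = {frozenset({"node 1", "node 3"}),
                 frozenset({"node 2", "node 3"})}
    print(elect_primary(surviving, "node 1"))  # -> node 1 (retained)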

The figure below shows an example of an accident in which the connection between location 1 and other locations is severed.

The diagram shows three interconnected locations. An accident causes the connections between location 1 and location 2, as well as between location 1 and location 3, to fail.

Failure of connections between location 1 and other locations

If cluster nodes of solution components at location 1 cannot connect to other locations, the following events occur:

  1. Node orc-dbs 1 loses contact with node orc-dbs 2.
  2. Node orc-dbs 2 becomes the primary node and informs the orchestrator of its role because arbiter node orc-dbs 3 observes that location 1 is unavailable.
  3. Node ctl 1 loses contact with node ctl 2.
  4. Node ctl 2 becomes the primary node and informs the orchestrator of its role because arbiter node ctl 3 observes that location 1 is unavailable.
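
In the toy model above, this case is indistinguishable from the outright failure of location 1 in the first scenario: the same single link survives, so the primary role moves the same way.

    # Location 1 is isolated: only the node 2 - node 3 link survives.
    surviving = {frozenset({"node 2", "node 3"})}
    print(elect_primary(surviving, "node 1"))  # -> node 2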
