Failure scenarios

Temporary unavailability of individual cluster components

Unavailability that can be recovered from without recreating VMs or replacing servers. Examples: a temporary power outage of a server, a loss of network connectivity, or a failure that required restarting the server or VM. In this case, the availability of the KUMA Core is determined by the set of components that remain in operation. This scenario also covers failures of a server or VM that can be quickly remedied by reconfiguring the software or replacing individual hardware parts, without completely reinstalling the operating system.

After all components become available again, the health of the cluster is restored automatically. However, transient operations such as the synchronization of volume replicas take some time to complete, and until they finish the cluster remains vulnerable to new failures of other components. For example, synchronizing the replicas of large volumes can take several hours. Take this recovery time into account when planning exercises that involve deliberately shutting down worker nodes.
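
If you need to check whether replica synchronization is still in progress, for example before such an exercise, you can query the Longhorn volumes through the cluster API. These are the same checks that are used in the verification procedures below; the robustness value should be healthy and no rebuild should be reported:

  sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

  sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'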

Complete failure of cluster components while the availability of the KUMA Core is maintained

In this case, the cluster keeps the KUMA Core available for some time (until another component fails), which lets you choose a suitable time window for recovery and make a fresh backup copy of the KUMA Core.

To restore a cluster:

  1. Prepare new VMs or servers to replace failed cluster components in accordance with the KUMA installation requirements.

    At this stage, you may use VM snapshots taken before the installation of KUMA.

  2. Review the k0s.inventory.yml inventory file and update it if necessary. If you have many services and need to minimize installation time, you can leave one host in each of the kuma_collector and kuma_correlator inventory groups, and one storage cluster in the kuma_storage group. If a host from the kuma_control_plane_master group has failed, swap it in the k0s.inventory.yml inventory file with another cluster controller from the kuma_control_plane group.
  3. Perform the installation using the current version of the KUMA installer with the install.sh script and the prepared k0s.inventory.yml inventory file:

    sudo ./install.sh k0s.inventory.yml

  4. Make sure that all cluster components are working properly and that high availability is restored:
    1. All k0s services are running:

      sudo systemctl status <k0sworker/k0scontroller>

      sudo k0s status

    2. Information about the pods and all worker nodes is available (an additional command-line check is shown after this procedure):
      • To view the status of a volume, run the following command:

        sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

        The status must be healthy. If the status is degraded, then one of the replicas is unavailable or is being restored.

      • To monitor the progress of volume restoration, run the following command:

        sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

        Normally, no restoration should be in progress. If a rebuild is in progress, some of the replicas are being rebuilt; ideally, do not modify the cluster until the rebuild is completed.
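
To additionally confirm from the command line that all worker nodes and pods are available (step 4.2 above), you can use the standard kubectl commands through k0s. This is a minimal sketch; all nodes should report the Ready status, and all pods should be in the Running or Completed state:

  sudo k0s kubectl get nodes

  sudo k0s kubectl get pods --all-namespaces

The volume check above inspects only the first volume (.items[0]). To review the robustness of every Longhorn volume at once, you can iterate over all items, for example:

  sudo k0s kubectl get volume -n longhorn-system -o json | jq -r '.items[] | "\(.metadata.name): \(.status.robustness)"'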

The cluster is restored.

Complete failure of cluster components with the KUMA Core unavailable

You must have a backup of the KUMA Core on hand.

To restore the cluster, you must first delete the old cluster.

To restore a cluster:

  1. Prepare new VMs or servers to replace failed cluster components in accordance with the KUMA installation requirements.

    At this stage, you may use VM snapshots taken before the installation of KUMA.

  2. Prepare a separate k0s.inventory.yml inventory file for cluster deletion. In this inventory file, remove all hosts from the kuma_collector, kuma_correlator, and kuma_storage groups to avoid having to uninstall and then reinstall services.
  3. Delete the failed cluster:
    1. Run the uninstall.sh script with the k0s.inventory.yml inventory file prepared at step 2:

      sudo ./uninstall.sh k0s.inventory.yml

    2. Restart all hosts from the kuma_worker* and kuma_control_plane* inventory file groups (a sketch of restarting the hosts remotely is provided after this procedure).
    3. After the hosts from the kuma_worker* and kuma_control_plane* groups have started, run uninstall.sh again with the same k0s.inventory.yml inventory file:

      sudo ./uninstall.sh k0s.inventory.yml

  4. Prepare the KUMA inventory file for cluster recovery. Use your current inventory file as the basis. If you have many external services and you need to minimize installation time, you can leave one host in the kuma_collector and kuma_correlator inventory groups, and leave one storage cluster in the kuma_storage group. If you do not need to minimize the installation time and restarting of external KUMA services is permissible, then you do not need to modify the inventory file.
  5. Perform the installation using the current version of the KUMA installer with the install.sh script and the k0s.inventory.yml inventory file prepared for cluster restoration:

    sudo ./install.sh k0s.inventory.yml

  6. Restore the KUMA Core from the backup.
  7. Make sure that the KUMA Core and other KUMA services are working properly. To do this, go to the Resources → Active services section. All services must have the green status.
  8. Make sure that all cluster components are working properly and that high availability is restored:
    1. All k0s services are running:

      sudo systemctl status <k0sworker/k0scontroller>

      sudo k0s status

    2. Information about the pods and all worker nodes is available:
      • To view the status of a volume, run the following command:

        sudo k0s kubectl get volume -n longhorn-system -o json | jq '.items[0].status.robustness'

        The status must be healthy. If the status is degraded, then one of the replicas is unavailable or is being restored.

      • To monitor the progress of volume restoration, run the following command:

        sudo k0s kubectl get engine -n longhorn-system -o json | jq '.items[0].status.rebuildStatus'

        Normally, no restoration should be in progress. If a rebuild is in progress, some of the replicas are being rebuilt; ideally, do not modify the cluster until the rebuild is completed.
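
For step 3.2, any restart method is suitable. The following is a minimal sketch of restarting the hosts remotely over SSH; it assumes root access over SSH, and the host names are placeholders:

  for host in kuma-worker-1.example.com kuma-worker-2.example.com kuma-cp-1.example.com; do
    ssh root@"$host" reboot
  done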

The cluster is restored.

Failure of the traffic balancer

Complete failure of the traffic balancer is a separate case. Although its failure makes the cluster and the KUMA Core unavailable, the balancer can be restored without removing or modifying the cluster in any way or affecting KUMA services. If you have a snapshot of the balancer VM created after the installation of KUMA, you can simply restore the VM from this snapshot. If you do not have a VM snapshot or if you have replaced a failed server, install the balancer again and apply a previously saved balancer configuration.
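
For example, if nginx is used as the traffic balancer, reinstalling it and applying the saved configuration might look like the following minimal sketch; the package manager and the path of the saved configuration file are placeholders that depend on your environment:

  sudo apt-get install -y nginx

  sudo cp /backup/nginx/nginx.conf /etc/nginx/nginx.conf

  sudo nginx -t && sudo systemctl restart nginx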

After the traffic balancer is recovered, access to the cluster and the KUMA Core is restored.
