Advanced self-healing for cloud applications

When running distributed systems, especially in the cloud, network partitions are inevitable. A partition can leave a cluster in the state commonly referred to as split brain syndrome.

Split brain syndrome, in a clustering context, is a state in which a cluster of nodes gets divided (or partitioned) into smaller clusters, each of which believes it is the only active cluster. Believing the other clusters are dead, each cluster may simultaneously access the same application data or disks, which can lead to data corruption.

Production Suite enhances Akka Cluster resilience and prevents data loss with predefined resolution strategies for recovering unreachable nodes during network partitions.

Features and Capabilities

Reduce overhead

Because preconfigured resolution strategies are applied automatically, recovering failed nodes no longer requires manual intervention by operations staff, who are often on 24-hour watch to keep mission-critical applications resilient.

Apply the best strategy

Because there is no “one size fits all” solution to this challenge, multiple strategies are offered to best fit the characteristics of the system: Static Quorum, Keep Majority, Keep Oldest, and Keep Referee.

Static Quorum

This strategy is a good choice when there is a fixed number of nodes in the cluster, or when a fixed number of nodes with a certain role can be defined.
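As a rough illustration of the idea, the following sketch models the decision each side of a partition would make under a static quorum. It is not the Akka API: the function name `static_quorum_decision`, the `quorum_size` parameter, and the node names are hypothetical, and the rule shown (a side survives only if it can still see at least the configured number of nodes) is the commonly documented behaviour, simplified.

```python
def static_quorum_decision(reachable, quorum_size):
    """Return "keep" if this side of the partition may stay up, else "down".

    reachable   -- set of node names this side can still see (including itself)
    quorum_size -- fixed number of nodes required to keep running (assumed name)
    """
    return "keep" if len(reachable) >= quorum_size else "down"


# Hypothetical 5-node cluster with a quorum of 3, split into 3 and 2 nodes.
side_a = {"n1", "n2", "n3"}
side_b = {"n4", "n5"}

decision_a = static_quorum_decision(side_a, quorum_size=3)
decision_b = static_quorum_decision(side_b, quorum_size=3)
```

For this to leave at most one surviving side, the quorum has to be more than half of the fixed node count. With a quorum of 3, for example, the cluster should not grow beyond 5 nodes.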

Keep Majority

This strategy is a good choice when the number of nodes in the cluster changes dynamically and Static Quorum therefore cannot be used.
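The majority rule can be sketched in a few lines. This is a simplified model, not the Akka implementation. The function name and the use of plain strings as node addresses are assumptions made for illustration, as is the tie-break, which here lets the side holding the lowest address survive when both sides are the same size.

```python
def keep_majority_decision(reachable, unreachable):
    """Return "keep" if this side holds the majority, else "down".

    reachable   -- set of node addresses this side can still see (including itself)
    unreachable -- set of node addresses this side has lost contact with
    """
    mine, theirs = len(reachable), len(unreachable)
    if mine != theirs:
        return "keep" if mine > theirs else "down"
    # Equal halves: assume the side that contains the lowest address survives.
    lowest = min(reachable | unreachable)
    return "keep" if lowest in reachable else "down"


# Hypothetical 5-node cluster split into 3 and 2 nodes.
majority_side = keep_majority_decision({"n1", "n2", "n3"}, {"n4", "n5"})
minority_side = keep_majority_decision({"n4", "n5"}, {"n1", "n2", "n3"})

# Hypothetical 4-node cluster split evenly; "n1" is the lowest address.
tie_with_lowest = keep_majority_decision({"n1", "n2"}, {"n3", "n4"})
tie_without_lowest = keep_majority_decision({"n3", "n4"}, {"n1", "n2"})
```

Because the decision depends only on the current membership, the same rule keeps working as nodes join and leave.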

Keep Oldest

This strategy works well with Cluster Singleton, which runs on the oldest member of the cluster. If the oldest node crashes, a new singleton instance will be started on the next oldest node.
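A minimal sketch of this rule follows. It assumes each side of a partition knows the join order of all members. The names `keep_oldest_decision`, `join_order`, and `down_if_alone` are hypothetical. The `down_if_alone` flag models the assumption that an oldest node cut off from everyone else is taken down so that the remaining nodes can continue. This is an illustrative model, not the Akka API.

```python
def keep_oldest_decision(reachable, unreachable, join_order, down_if_alone=True):
    """Return "keep" if this side of the partition may stay up, else "down".

    reachable     -- set of nodes this side can still see (including itself)
    unreachable   -- set of nodes this side has lost contact with
    join_order    -- dict mapping node name to join sequence number (lower = older)
    down_if_alone -- assumed option: sacrifice an oldest node that is isolated
    """
    members = reachable | unreachable
    oldest = min(members, key=lambda node: join_order[node])
    oldest_is_here = oldest in reachable
    oldest_side = reachable if oldest_is_here else unreachable
    other_side = unreachable if oldest_is_here else reachable

    # An isolated oldest node gives way when more than one node is on the other side.
    if down_if_alone and len(oldest_side) == 1 and len(other_side) > 1:
        return "down" if oldest_is_here else "keep"
    return "keep" if oldest_is_here else "down"


# Hypothetical 3-node cluster; n1 joined first and is therefore the oldest.
order = {"n1": 0, "n2": 1, "n3": 2}

# The oldest node is cut off (or has crashed): the others keep running.
rest_of_cluster = keep_oldest_decision({"n2", "n3"}, {"n1"}, order)
isolated_oldest = keep_oldest_decision({"n1"}, {"n2", "n3"}, order)

# The oldest node stays with n2: the side without the oldest node goes down.
side_with_oldest = keep_oldest_decision({"n1", "n2"}, {"n3"}, order)
side_without_oldest = keep_oldest_decision({"n3"}, {"n1", "n2"}, order)
```

In the first scenario the surviving side contains `n2`, which is now the oldest member, so a singleton would move there.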

Keep Referee

This strategy is a good choice when one node hosts a critical resource that the system cannot run without.
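The referee rule might be modelled as follows. This is a sketch under stated assumptions, not the Akka API: `keep_referee_decision`, `referee`, and `min_nodes` are hypothetical names, and `min_nodes` stands in for an assumed threshold below which even the referee's side shuts down.

```python
def keep_referee_decision(reachable, referee, min_nodes=1):
    """Return "keep" if this side of the partition may stay up, else "down".

    reachable -- set of nodes this side can still see (including itself)
    referee   -- name of the designated node hosting the critical resource
    min_nodes -- assumed minimum side size required even when the referee is present
    """
    if referee not in reachable:
        return "down"
    return "keep" if len(reachable) >= min_nodes else "down"


# Hypothetical 4-node cluster where "db-gateway" hosts the critical resource.
with_referee = keep_referee_decision({"db-gateway", "n2"}, referee="db-gateway")
without_referee = keep_referee_decision({"n3", "n4"}, referee="db-gateway")

# The referee is reachable but the side is smaller than the required minimum.
too_small = keep_referee_decision({"db-gateway"}, referee="db-gateway", min_nodes=2)
```

The trade-off is that the referee becomes a single point of failure: if it is lost, no side can see it and the whole cluster goes down.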