How to handle network partitions decisively with Akka Split Brain Resolver
Introduction to Akka Split Brain Resolver (SBR)
Akka Split Brain Resolver (Akka SBR) is one of several new commercial features available to Typesafe Reactive Platform (RP) users. RP is free for development use and requires a subscription for production use. Akka SBR expands the actions available to your system when nodes in your cluster become unreachable, reacting to the nature of the failure and selecting an appropriate recovery path.
When Akka SBR recognizes that nodes in your cluster have become unreachable, it assesses the scenario and decides how to proceed based on the cluster's configuration, which you choose according to the consistency and availability needs of your application. The available strategies are Static Quorum, Keep Majority, Keep Oldest, and Keep Referee.
Historically, this job has been handled manually by operations staff, often on a 24-hour watch to ensure resiliency in mission-critical applications. To add to your system's resiliency, Akka SBR can automate this process by applying a preconfigured resolution strategy that takes measures to restore the remaining cluster nodes to fully operational condition where possible.
Note: no silver bullet is included with any strategy. Akka SBR is a tool designed to prevent you from making the wrong decision, not one that magically eradicates network partitions and crashes.
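For orientation, here is a minimal sketch of how a resolution strategy is typically enabled in the cluster configuration. Treat the property names and values as illustrative (they follow the SBR documentation, but defaults may differ by Reactive Platform version):

```hocon
akka.cluster {
  # SBR replaces timeout-based auto-downing, so leave it off
  auto-down-unreachable-after = off

  split-brain-resolver {
    # one of: static-quorum, keep-majority, keep-oldest, keep-referee
    active-strategy = keep-majority

    # how long the unreachability observation must remain stable
    # before the strategy makes its decision (illustrative value)
    stable-after = 20s
  }
}
```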
I'm new to distributed systems: what's the problem with network partitions?
Distributed, Reactive applications are exactly that: resilient, distributed instances of applications or microservices running in many places across a network. When a traditional system experiences a failure, a system-wide crash can occur, but Reactive systems have mechanisms in place to isolate the offending application or cluster node and restart a new instance somewhere else. Ideally, users notice nothing.
However, there is a fundamental problem in distributed systems: an observer cannot tell whether an unresponsive node is the result of a partition in the network (what we call a "split brain" scenario) or of an actual machine crash. These are indistinguishable: a node may see that there is a problem with another node, but it cannot tell whether that node has crashed and will never be available again, or whether there is a network issue that may or may not heal after some time. A third indistinguishable case is a process that has become unresponsive for other reasons (e.g. overload, CPU starvation, or long garbage collection pauses).
All we have is a heartbeat signal that simply says "no reply in the allotted time", meaning we have a single symptom for various issues whose causes are all mysterious to the observer. The nodes on different sides of a network partition cannot communicate with each other, yet it's important that both sides have resolution strategies at their disposal. For example, after a machine crash you want to remove the node immediately, but during a network partition it is better to wait and hope the network heals on its own. These are totally opposite strategies. There are other issues to get into here as well, which we look at further on.
What does Akka SBR do for network partitions?
As we noted above, there is no "one size fits all" solution to this problem. Akka SBR offers multiple strategies so you can pick the one that fits the characteristics of your system. If you missed the video link above, you can go directly to Konrad Malawski presenting Akka SBR on YouTube.
The strategy named static-quorum will down the unreachable nodes if the number of remaining nodes is greater than or equal to a configured quorum-size. Otherwise it will down the reachable nodes, i.e. it will shut down its own side of the partition. In other words, the quorum-size defines the minimum number of nodes that the cluster must have to be operational.
When to use it: This strategy is a good choice when you have a fixed number of nodes in the cluster, or when you can define a fixed number of nodes with a certain role.
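A configuration sketch for static-quorum, assuming a cluster planned for 9 nodes (the node count and role are assumptions for illustration). Note that quorum-size must be greater than half the total number of nodes, otherwise both sides of a partition could survive as separate clusters:

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = static-quorum
  static-quorum {
    # minimum number of reachable nodes required to stay up;
    # with 9 planned nodes, 5 is a majority
    quorum-size = 5
    # optionally count only nodes carrying a certain role
    # role = "backend"
  }
}
```

With this configuration, a 5/4 partition keeps the 5-node side running and downs the other 4 nodes.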
The strategy named keep-majority will down the unreachable nodes if the current node is in the majority part, based on the last known membership information. Otherwise it will down the reachable nodes, i.e. its own part. If the parts are of equal size, the part containing the node with the lowest address is kept.
When to use it: This strategy is a good choice when the number of nodes in the cluster changes dynamically and you therefore cannot use static-quorum.
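A configuration sketch for keep-majority (property names as in the SBR documentation; the role setting is optional and the role name is a placeholder):

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = keep-majority
  keep-majority {
    # if set, the majority is computed only among nodes with this role
    # role = "backend"
  }
}
```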
The strategy named keep-oldest will down the part that does not contain the oldest member. The oldest member is interesting because the active Cluster Singleton instance runs on the oldest member. There is one exception to this rule: if down-if-alone is configured to on and the oldest node has been partitioned from all other nodes, the oldest node will down itself and keep all other nodes running. The strategy will never down the single oldest node when it is the only remaining node in the cluster.

Note that if the oldest node crashes, the others will remove it from the cluster when down-if-alone is on; otherwise they will down themselves when the oldest node crashes, i.e. shut down the whole cluster together with the oldest node.
When to use it: This strategy is good to use if you use Cluster Singleton and do not want to shut down the node where the singleton instance runs. If the oldest node crashes, a new singleton instance will be started on the next oldest node. The drawback is that the strategy may keep only a few nodes in a large cluster. For example, if the part containing the oldest node consists of 2 nodes and the other part consists of 98 nodes, it will keep the 2 nodes and shut down the 98.
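A configuration sketch for keep-oldest, with the down-if-alone exception discussed above switched on (illustrative; verify the defaults for your version):

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = keep-oldest
  keep-oldest {
    # if the oldest node is alone on its side of the partition,
    # down the oldest node itself and keep the other nodes running
    down-if-alone = on
    # optionally determine the oldest only among nodes with this role
    # role = "backend"
  }
}
```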
The strategy named keep-referee will down the part that does not contain the given referee node. If the remaining number of nodes is less than the configured down-all-if-less-than-nodes, all nodes will be downed. If the referee node itself is removed, all nodes will be downed.

When to use it: This strategy is good if you have one node that hosts some critical resource and the system cannot run without it. The drawback is that the referee node is a single point of failure, by design. As a consequence, keep-referee will never result in two separate clusters.
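A configuration sketch for keep-referee; the actor system name, host, and port in the address are placeholders for your own referee node:

```hocon
akka.cluster.split-brain-resolver {
  active-strategy = keep-referee
  keep-referee {
    # address of the designated referee node (placeholder values)
    address = "akka.tcp://MySystem@referee-host:2552"
    # if fewer nodes than this remain, down all nodes
    down-all-if-less-than-nodes = 1
  }
}
```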
Maintaining persistence in your Akka cluster
There are some things to know about Akka Cluster Singleton and Cluster Sharding.
If you use the timeout-based auto-down feature in combination with Cluster Singleton or Cluster Sharding, a split brain could mean that two singleton instances or two sharded entities with the same identifier end up running, one in each separated cluster. For example, when used together with Akka Persistence, that could result in two instances of a persistent actor with the same persistenceId running and writing concurrently to the same stream of persistent events, which will have fatal consequences when replaying those events.
If the unreachable nodes are not downed at all, they will still be part of the cluster membership, meaning that Cluster Singleton and Cluster Sharding will not fail over to another node. While there are unreachable nodes, new nodes joining the cluster will not be promoted to full members (with status Up). Similarly, leaving members will not be removed until all unreachable nodes have been resolved. In other words, keeping unreachable members for an unbounded time is undesirable.
How to increase resilience with additional self-healing features
As part of the Reactive Platform subscription, Typesafe ConductR is an additional tool for Operations that plays a valuable role alongside Akka SBR. ConductR extends Akka's native resilience by orchestrating the lifecycles of nodes and bundles (e.g. app instances / microservices), including automated recovery for failed app bundles, failed nodes and, with Akka SBR, network partitions as well. In addition, ConductR automates Akka cluster setup across your system, removing the tedious need for manual configuration. To get started with Akka SBR and learn more about ConductR, just check out Reactive Platform.