How Akka Works: 'Exactly Once' Message Delivery
About This Series
This article is the last in our "How Akka Works" series that dives into some of the interesting aspects of messaging within distributed systems environments. In this part, we take a look at exactly-once messaging. To understand the mechanics of exactly-once messaging you need to have a reasonably good grasp of the fundamentals of messaging in a distributed systems environment. Please see Part 1: At Most Once and Part 2: At Least Once of this series for a review of at-most-once and at-least-once messaging. Also see Message Delivery Reliability provided in the Akka documentation.
The Basic Mechanics of Exactly-Once Messaging
Let's start with an example scenario as a way to understand the mechanics of exactly-once messaging in a distributed environment. In this example scenario, we will walk through a conversation between you and me. Let's say we both are responsible for handling orders. You are responsible for taking orders, and I am responsible for shipping the orders. We communicate with each other via text messages. Think of this as a design exercise where you and I are walking through the design of two services, the order and shipping services.
When you receive a request for a new order, you send me a text message about that order, shown in Figure 1. When I receive a new order text message, I need to send you a text message that acknowledges that I have received your text message, shown in Figure 2. Since we are focusing on just the mechanics of the messages here, we will ignore most of the internal details of handling and shipping orders.
In the normal sequence of processing an order, you send me a text message for each new order. When I receive these new order text messages, I send you an acknowledgment reply text message. When each order is shipped, I send you an order shipped text message, shown in Figure 3.
It is essential that each message from you to me and from me to you is delivered because each message results in a state change in the receiver. When I receive a new order message from you, this triggers the order shipping process. If any of these new order messages are not delivered to me, those orders are not shipped, which of course is unacceptable. Therefore you and I need to work out a protocol that ensures that all messages are delivered.
Now let's look at what happens when the normal processing flow does not occur for some reason starting with the initial new order message from you to me. Multiple conditions will arise where a given new order text message will not be delivered to me. My phone could stop working. I could stop working, shown in Figure 4. There could be a problem with the network.
You become aware of failures to deliver a given message to me either via some form of a message delivery error or after some timeout period expires. In either case, an error or timeout, you at some point know that a given message was not handled as expected.
It is worth taking a closer look at what your options are when new order text messages are not acknowledged. First, it does not matter if you received an error or a timeout. The simple fact of the matter is is that there is no way for you to know if I received the text message or not. I may have received the message, and then the network failed immediately after the message arrived. I may have received the message, performed some state change operations, and before I could acknowledge it my phone failed or I was unable to send a reply for some other reason.
We both know that it is essential that every new order is processed and shipped. Therefore it is essential that each new order text message is delivered to me. We also both know that the only way for you to know that I have received a new order text message is for me to send you an acknowledgment text message.
So what do you do when you do not receive an acknowledgment message from me? The only real option is for you to try to send me the message again and again until the message is finally acknowledged, shown in Figure 6. Or you try for a while and if the message cannot be delivered you give up and tell the customer sorry, we are experiencing technical difficulties, please try again later, which is a completely unacceptable response these days, shown in Figure 7.
Before we continue into the details for handling message delivery failures, let's take a moment to consider exactly-once delivery. Is there an exactly-once solution that will fix this problem for us? The simple answer is no. Implementing an exactly-once end-to-end messaging system between two separated parties is impossible. The simple fact of the matter is that this is at least a five-step process.
- State change that triggers the decision to send a message
- Send the request message
- Receive and process the message, which results in a state change reaction
- Send a response message reply to the sender
- Receive and process the reply, which results in a state change reaction
It does not matter if you use a synchronous or asynchronous messaging flow. As shown in Figure 8, the reality is that there are five distinct steps and failures may occur at any one of the steps. But fear not, there are ways to implement effectively-once message delivery mechanisms, and we will look at some of the ways this is done in the following sections.
The Two Generals Problem
An excellent demonstration of the challenges of distributed messaging is a thought experiment known as the Two Generals Problem. What is shown in this analysis is that there is no way to guarantee state consistency between two endpoints when any form of two way communication is used where message delivery failures may occur.
In this article, we have been using an order processing example where you and I are the endpoints. You handle orders, and I am responsible for order shipping. We each maintain state for each order that we are processing.
In the two generals scenario let's pretend that you and I are the two generals. We are planning on conducting a coordinated attack on a single enemy. As it happens, your army is located in one valley, the enemy is in the next valley, and my army is located in a third valley over a ridge from the enemy.
It is essential that we both attack the enemy at the same time. Jointly we have sufficient numbers of soldiers and resources required to defeat the enemy. However, if only one of our armies attacks the enemy alone, we will be defeated.
The challenge is that we have not yet agreed on a specific time to attack. We must communicate with each other via messengers to decide on when to attack. The dilemma is that the messengers must pass through enemy territory to deliver a message. Obviously, there is no guarantee that a given message will be delivered.
Say you decide that the time to attack is tomorrow at 8 am. This is essentially a state change. You are in the “let's attack at eight tomorrow morning” state. You then dispatch a messenger with this information - “we attack tomorrow at 8 am”. At this point, you are waiting for my response. Without an acknowledgment from me that I agree with your plan you cannot proceed.
In our five-stage journey, you have completed the first stage, your state change decision on when to attack. Stage two involves the messenger delivering the message from you to me. Following the happy path first, I do receive the message. In stage three I make the state change decision to agree or not agree to your request to attack. In stage four I dispatch a messenger to deliver my reply message to you. Finally, you get my reply from me. In my reply, I've either agreed or rejected your proposed attack time, which completes stage 5. Here is the five-stage messaging journey:
- You have completed the first stage, your state change decision on when to attack
- You dispatch a messenger to deliver your message to me
- I receive the message from you and make the decision to agree with your proposal
- I dispatch a messenger to respond back to you that I have agreed to your request
- You receive my reply and now know that we both agree on the time to attack
In this happy path example, we already have a serious problem. I have no way of knowing if you received my reply. How can I attack when I am unsure if you know that I have agreed to your proposed time?
There are at least two possibilities here. One possibility is that you did get my reply and of course the other possibility is that the reply messenger was captured or worse and the message was never delivered. In either case, I have no way of knowing what happened.
There is another more sinister possibility. The enemy captured the messenger. Then the enemy alters the message, say my reply was “I agree, we attack at 8 am”. But the message is altered to “8 am tomorrow is too soon, what about the next day?” Then the messenger is forced or bribed or replaced, and the altered message is delivered to you.
The point is that many things can go wrong just with my reply to you.
What do you do when you do not get a reply from me? In this case, we have another serious dilemma. You do not know if I have received your message or not. There are at least three possibilities here. One is that your messenger was captured and your message was not delivered to me. The second possibility is that I did receive the message, but for some reason, I was unable to send a reply. Finally, the third possibility is that I did send a reply, but my messenger was captured.
Is there a way to fix this communication problem?
One possible approach is that we require that each messenger delivers a message and then returns to the sender to verify that the message was delivered. When a message is dispatched, we wait for a finite period for the messenger to return. If the messenger does not return before the return wait time has expired, we send another messenger. We repeat this process over and over until we finally get a successful reply.
Will this modified message delivery approach work?
The short answer is no. The problem is that with this approach the message sender can know when a message was sent because the message delivery has been acknowledged. However, the message receiver does not know if a message was acknowledged.
Consider this scenario. You send a message to me, I get the message, and the messenger returns to you. The first problem is that I do not know if the messenger returned to you or not. What this means is that I can expect to see the same message from you more than once. In this scenario that is not a problem if I receive the same message multiple times.
The problem is with my response message back to you. You do not know if my messenger returned to me. You can expect that I may send my reply to you more than once because I'm using the technique of sending a messenger and waiting for a timeout period before sending another messenger. However, you cannot know if I have ever received an acknowledgement.
What this means is that we continue to have a significant communications problem. If my reply messenger does not return to me, I cannot attack as planned. At the same time, you are never sure if I will attack because you do not know if my messenger has returned to me with an acknowledgement.
Why Exactly-Once is Impossible
After walking through the Two Generals Problem, you can see that reliable message delivery is challenging. We tried to solve the problems between the two generals, the two communication endpoints, using at-least-once message delivery techniques, and we still were unable to come up with a thoroughly reliable and workable solution.
The reality is that when message producers push messages to message consumers, there are unsolvable failure scenarios that cannot be resolved. When you send/push a message to me, you have no way of knowing if I received the message or not. When I do receive your message, I have no way of knowing if you received my reply or not.
You the message sender and I the message receiver can know that there is a problem, but we cannot know in all failure conditions what happened on the other side of the wire. This is a fundamental law of the physics of distributed message communication that cannot be solved.
Often Perseverance Pays Off
It would be wonderful if there were a workable exactly-once messaging solution. Ideally, we would like to exchange messages in the same way we invoke a method or function. Just give us a reliable remote procedure call, and we will be happy. What can be so hard about that?
As is often the case, there are many ways to solve software problems. Sometimes what we need to do is step back and evaluate what we are trying to accomplish. Our order processing scenario is not like the two generals' problem. With the two generals, both parties need to coordinate their actions. In our order processing example, we merely have to perform a series of steps one after the other. Our only coordination requirement is that all of the required steps must be eventually completed.
In the case of our order and shipping scenario what we need is to exchange messages between the order and shipping services. An essential requirement is that no messages can be lost. It would be nice to have an exactly-once solution available, but it is not an absolute requirement.
Our intuition drove us towards a push approach. You send me a text message when new orders are created. I send you a text message when I've started the packing process and another message when each order is shipped.
As we have learned so far in Part 2 of this series, there are a lot of reliability problems with this push approach. The most basic problem is that sometimes messages are delivered, and sometimes messages are not delivered. Also, there is the uncertainty of not knowing what happened on the other side of the wire.
But we can make the push approach work - with some terms and conditions. First, you must implement a message retry approach. You keep trying to send me each message until you receive a reply from me. The Ts&Cs here is that you need to harden the retry process to the point that failures and restarts on your end do not result in your losing any messages. To do this, you will need some form of resilience on your delivered messages list, as shown above in Figure 11. These are all solvable problems, but it does add a level of complexity to your message sending processing.
On my end, I have to handle potentially receiving the same message more than once. As we have discussed, when using the push/retry approach this results in the receiver receiving some messages one or more times. Handling the same message multiple times is also a solvable problem. Again, this takes some additional work on my end to handle this.
So the message push/retry and message receive one or more times is doable but it is more complex than your typical HTTP REST implementation.
What about Pull vs. Push Messaging?
Ok, so the push messaging approach is solvable but somewhat complex when it comes to reliable messaging. What about the pull approach? The pull approach is slightly counter-intuitive, but it is typically less complicated to implement. Both the push and pull approaches were covered in detail in part 2 of this series so please refer to that document for more details.
The push and pull approaches provide ways for implementing at-least-once delivery while the commonly used synchronous HTTP REST approach without retry offers at-most-once delivery, as discussed in part 1.
What about exactly-once delivery? As already stated an end-to-end general purpose exactly-once message delivery process is physically impossible to implement. However, it is possible to achieve what appears to be exactly-once messaging with techniques that are referred to as essentially-once.
The essentially-once message approach is a matter of perspective. On the receiving end what can be done is that the message receiver does not see duplicate messages, which effectively simulates exactly-once message delivery from, again only from the perspective of the message receiver. However, in between the message sender and the message receiver, we are going to have to implement some “magic” to make this happen.
First, let's set the playing field in our order and shipping example scenario. On your order processing end, you store the state of orders in a local persistence store. On my end, I've got another local to me persistence store for maintaining the state of the order shipping processes. In between, we have a message bus, such as Kafka, Pulsar, ActiveMQ, and many other pub-sub and queue brokers. To be clear, we each have our independent persistence stores, and we cannot perform any single transactions that spans our two persistence stores.