
Deep-dive of Event-Driven communication between Edge and Cloud

Patrik Nordwall, Akka Tech Lead, Lightbend, Inc.

After extending Akka in 23.05 to easily connect multiple Akka clusters, we took on the quest to stretch it even further and take Akka to the edge of the cloud. Let me go through some of the technical challenges we faced and how we complemented the existing Akka features to push the boundaries of what Akka can solve.

The backbone of our solution is Akka Projections and its brokerless service-to-service communication, which delivers events reliably from the edge to the cloud and in the other direction from cloud to edge.

In these systems, event-sourced actors are the source of truth. Each such actor, often called an entity, has an identity and is a transactional boundary. All state changes of the entity are stored as events in an event journal that can be replayed to recover the entity's state. However, events provide us with much more than the entity's state. They also enable event-driven communication between services, independent of where they run.

Akka gives every event a monotonically increasing sequence number without a gap to the previous event of the same entity. This property of the sequence numbers is something we use in several places to keep track of which events have been processed.

A projection reads events from the database and, after processing, stores the event's offset. After a failure or other reason for restarting the projection, it can resume from a previously stored offset. The offset consists of three things: the entity ID, the timestamp of the event, and the sequence number. Tracking offsets in this detail makes it possible to use the stored offsets for deduplication and validation of event ordering.

For example, this makes it possible to not only read the events from the database but also, as a low latency path, send the events directly to the event consumers after writing to the journal. Sending events directly like this is not fully reliable but gives very low latency for normal cases. In case of abnormalities, the events are reliably delivered by the second stream of events from the database. Duplicate or unexpected sequence numbers are discarded by comparing them with previously processed and stored sequence numbers.
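The sequence-number check described above can be sketched in a few lines. This is an illustration of the idea, not Akka's actual implementation; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of sequence-number based deduplication, as done with stored
// offsets: accept the next expected number, discard duplicates, and
// defer on gaps until the reliable journal stream fills them in.
class SeqNrDeduplicator {
  enum Decision { ACCEPT, DUPLICATE, GAP }

  private final Map<String, Long> lastSeqNrByEntity = new HashMap<>();

  Decision offer(String entityId, long seqNr) {
    long last = lastSeqNrByEntity.getOrDefault(entityId, 0L);
    if (seqNr <= last) return Decision.DUPLICATE; // already processed: discard
    if (seqNr == last + 1) {                      // next expected: accept and store
      lastSeqNrByEntity.put(entityId, seqNr);
      return Decision.ACCEPT;
    }
    return Decision.GAP; // unexpected jump: wait for the event stream from the database
  }
}
```

An event arriving over the low-latency direct path and again from the database query hits the `DUPLICATE` branch the second time and is simply dropped.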

This works in the same way for both local and remote projections. For the remote projections, the events are delivered over an internal gRPC protocol between the event producer and event consumer. That’s great; we can use those projections over gRPC for the Edge.

Well, it’s not that simple. We had to solve some challenges first.

  • Network connections can only be established from the Edge.
  • Edge services may be disconnected for a long while.
  • Edge services may only have access to transient storage.
  • Scaling to many edge services.

Connecting from the edge

Due to network topology, firewall rules, NAT, and such, we can only assume that connections can be established from the edge to the central cloud and not the opposite. When the edge is the consumer of events, that works well with Akka Distributed Cluster, but when the edge is the producer, we need another solution.

The connection from the edge is routed via a load balancer to one of the backend nodes in the cloud. We can’t control which node it ends up on. Starting the consumer side of the projection where the connection arrives will not work because, in error scenarios, it would require difficult coordination of the processing that might already be in flight on another node.

Instead, we chose a two-step approach. First, the events are pushed over gRPC from the edge to the cloud and stored in the event journal on the cloud consumer side; there is no real consumer-side processing of the events yet, aside from filtering and transformations. Second, an ordinary local projection on the consumer side processes the events. We were able to optimize the writing of the events by implementing the event journal protocol instead of delegating to event-sourced actors.
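The two-step approach can be sketched as follows. This is a simplified illustration with hypothetical names; in the real implementation, step one writes through the event journal protocol and step two is an ordinary Akka Projection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Sketch of the consumer side of producer push: step 1 stores pushed
// events in the cloud-side journal (after filtering/transformation),
// step 2 is a local projection over that journal.
class ProducerPushPipeline {
  private final List<String> journal = new ArrayList<>();   // step 1 output
  private final List<String> projected = new ArrayList<>(); // step 2 output
  private final Function<String, Optional<String>> filterTransform;

  ProducerPushPipeline(Function<String, Optional<String>> filterTransform) {
    this.filterTransform = filterTransform;
  }

  // Step 1: invoked when an event is pushed from the edge over gRPC.
  // Filtered-out events are never written to the journal.
  void onPush(String event) {
    filterTransform.apply(event).ifPresent(journal::add);
  }

  // Step 2: the local projection processes events stored since last run.
  void runProjection() {
    projected.addAll(journal.subList(projected.size(), journal.size()));
  }

  List<String> projected() { return projected; }
}
```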

This approach also has the advantage that the cloud doesn’t have to know about edge locations since it’s always the edge that connects to the cloud.

Question: Can Akka Edge services talk to other Edge services?

Answer: Yes, if your network topology allows that. It might be difficult to allow incoming connections to an edge service, and that is why we assume that the edge service is the one that establishes the connection to another service. The actual events can still flow in either direction. Another solution could be if you have a VPN connection that allows you to connect to edge services. Then, you can skip the producer push step and let the consumer connect to the producer.

Question: Can I use Replicated Event Sourcing with Akka Edge?

Answer: Replicated Event Sourcing requires gRPC connectivity in both directions and is therefore unsupported for Akka Edge. An additional filtering mechanism for Replicated Event Sourcing with many edge replicas is also needed. We will look into these problems and might support it in the future.

Speed up startup time with snapshots

Edge services may only need part of the history of events when starting up for the first time or after being disconnected from the cloud for a while. As an optimization, to support this, we have added the possibility of using snapshots as starting points for projections. This works for both local and remote projections. The event-sourced actor may store snapshots at regular intervals or application-specific points “in time,” e.g., after a certain event. When starting a projection, those snapshots are read and transformed into an event type. After the snapshot event, subsequent events are delivered as usual.
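The snapshot-as-starting-point idea can be sketched like this, with hypothetical names: the snapshot is transformed into an application-defined event type, and only the events with higher sequence numbers follow it.

```java
import java.util.ArrayList;
import java.util.List;

// A stored event with its per-entity sequence number.
record Event(long seqNr, String payload) {}

// Sketch of "snapshot as starting point": instead of replaying the full
// history, the projection starts from a snapshot transformed into an
// event, followed by the events after the snapshot.
class SnapshotStartedSource {
  static List<Event> eventsFrom(String snapshotState, long snapshotSeqNr, List<Event> journal) {
    List<Event> out = new ArrayList<>();
    // The snapshot is transformed into an application-specific event type.
    out.add(new Event(snapshotSeqNr, "SnapshotOf(" + snapshotState + ")"));
    for (Event e : journal) {
      if (e.seqNr() > snapshotSeqNr) out.add(e); // only events after the snapshot
    }
    return out;
  }
}
```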

Question: You mentioned that the edge can be disconnected for a long while. What happens with the events when it’s disconnected?

Answer: The edge service can continue to store more events in its journal. When the connection can be established again, the events will be delivered to consumers in the cloud. Snapshots can be used in the edge service, too, so it will start from the snapshot and push the snapshot event and subsequent events. When the edge is a consumer, it will retrieve the latest snapshots and events when the connection is established again.

Question: When using dynamic filters, and a new entity is included, there can be historical events that are retrieved on demand. How does that work with snapshots?

Answer: Those requests will also start from the previous snapshot and then replay events after the snapshot.

Lightweight storage

Some edge environments may not have the resource capabilities to run a Postgres database, or you may want to reduce operational complexity to a minimum at the edge. We have added support for H2 in the R2DBC plugin for Akka Persistence and Projections. H2 is a database that runs embedded in the JVM. It is fast and has low overhead since there is no network involved. You can use this as an event journal when producing events at the edge or as an offset store and application-specific database when consuming events at the edge.
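Enabling H2 is mostly a matter of configuration. The sketch below shows the general shape of such a setup; the exact keys may differ between Akka versions, and the path is hypothetical, so consult the akka-persistence-r2dbc documentation for your release.

```hocon
# Sketch: point the R2DBC plugin at the bundled H2 defaults
# (key names may vary by Akka version; check the plugin docs).
akka.persistence.r2dbc.connection-factory = ${akka.persistence.r2dbc.h2}
akka.persistence.r2dbc.connection-factory {
  protocol = "file"                     # store on disk, not purely in memory
  database = "/var/lib/my-edge-service/h2-db"  # hypothetical path
}
```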

Question: Won’t H2 use much memory over time when using event sourcing?

Answer: H2 is often thought of as an in-memory database, but to reduce memory footprint, we recommend using it with the file storage mode.

Question: Durable State can be a good way to store current value at the edge. Can the H2 dialect be used for that?

Answer: Yes, it has full feature parity with the Postgres dialect.

Scaling to many edge consumers

In edge systems, there are typically many edge services connected to the same cloud service. They might be interested in the same events or a filtered subset of the same events. Each projection will result in a query to the database that is repeated once every few seconds to retrieve new events. With many consumers, the database will become a bottleneck, and much of the same work will be repeated for each consumer. As an optimization, we implemented a “firehose query.” The consumers will attach to the shared firehose query, which will read the events from the database only once and then fan out to the consumers.

A consumer is started from an offset, which may be reading from an earlier point in time than the current firehose events, and therefore, it must first catch up and then switch over to the firehose. Deciding on the exact point of switchover involves race conditions between the two streams, which we solved by using events from both catchup and firehose streams to make sure that no events are missed. Once again, the sequence numbers came to the rescue for deduplication.
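The switchover can be sketched as follows: events from both the catch-up stream and the firehose are run through the same per-entity sequence-number check, so the overlap between the two streams is discarded and no event is missed. Names here are hypothetical, not Akka's internals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// An event envelope carrying the entity ID and sequence number.
record Envelope(String entityId, long seqNr) {}

// Sketch of catch-up-then-firehose switchover: deduplicate by sequence
// number across both streams so the overlap is delivered only once.
class SwitchoverMerge {
  static List<Envelope> merge(List<Envelope> catchup, List<Envelope> firehose) {
    Map<String, Long> lastSeqNr = new HashMap<>();
    List<Envelope> delivered = new ArrayList<>();
    List<Envelope> both = new ArrayList<>(catchup);
    both.addAll(firehose); // the firehose may overlap the catch-up tail
    for (Envelope e : both) {
      long prev = lastSeqNr.getOrDefault(e.entityId(), 0L);
      if (e.seqNr() > prev) {          // first time seen: deliver
        delivered.add(e);
        lastSeqNr.put(e.entityId(), e.seqNr());
      }                                // otherwise a duplicate from the overlap
    }
    return delivered;
  }
}
```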

Another challenge is that not all consumers are equally fast when consuming from the firehose stream. To accommodate some variation, we allow buffering of events between the fastest and slowest consumer. However, since buffering capacity is limited, a consistently slower consumer would slow down all the others. That is typically not desired, so we detect such situations and kick out the slow consumer, which then has to catch up again and possibly stay as a standalone stream without switching over to the firehose.
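Slow-consumer detection on a bounded shared buffer can be sketched like this; the names and the lag-based policy are illustrative assumptions, not Akka's exact mechanism.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of slow-consumer detection on a shared firehose with a bounded
// buffer: a consumer lagging behind the newest event by more than the
// buffer capacity is detached and must catch up on its own.
class FirehoseBuffer {
  private final int capacity;
  private long head = 0; // index of the newest published event
  private final Map<String, Long> consumedUpTo = new HashMap<>();

  FirehoseBuffer(int capacity) { this.capacity = capacity; }

  void publish() { head++; }

  void register(String consumer) { consumedUpTo.put(consumer, head); }

  void consume(String consumer) {
    consumedUpTo.computeIfPresent(consumer, (c, n) -> n + 1);
  }

  // Consumers whose lag exceeds what the buffer can hold are kicked out.
  List<String> detectKickedOut() {
    List<String> out = new ArrayList<>();
    for (var entry : consumedUpTo.entrySet()) {
      if (head - entry.getValue() > capacity) out.add(entry.getKey());
    }
    return out;
  }
}
```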

Question: I understand that this reduces the load of the database, but what other mechanisms are used to scale the cloud service horizontally to handle many edge services?

Answer:

  • Ordinary load balancer services can be used in front of the cloud service.
  • The cloud service is an Akka Cluster that can be scaled horizontally with Cluster Sharding for scaling to many entities and Sharded Daemon Process to distribute the projections in the cluster.
  • The event streams can be split up into slices, which are like partitions that can be processed in parallel. It’s possible to re-arrange the slice distribution at runtime.
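The slice idea in the last bullet can be sketched as follows: each entity hashes to one of a fixed number of slices, and contiguous slice ranges are assigned to projection instances so they can run in parallel. Akka uses 1024 slices; the hash function here is illustrative, not Akka's exact one.

```java
// Sketch of slice-based partitioning of the event streams.
class Slices {
  static final int NUMBER_OF_SLICES = 1024;

  // Every event of an entity lands in the same slice.
  static int sliceFor(String entityId) {
    return Math.abs(entityId.hashCode() % NUMBER_OF_SLICES);
  }

  // Split the slice space into equal contiguous ranges, one per
  // projection instance; [start, end] inclusive.
  static int[][] ranges(int parallelism) {
    int per = NUMBER_OF_SLICES / parallelism;
    int[][] out = new int[parallelism][2];
    for (int i = 0; i < parallelism; i++) {
      out[i][0] = i * per;
      out[i][1] = (i == parallelism - 1) ? NUMBER_OF_SLICES - 1 : (i + 1) * per - 1;
    }
    return out;
  }
}
```

Re-arranging the slice distribution at runtime then means handing some ranges to different projection instances, without changing which slice an entity belongs to.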

Question: Can I use Akka Cluster at the edge?

Answer: Yes, no limitations. However, the H2 database can only be used with single-node services.

Conclusion

We have tried to solve some hard problems by using Akka at the edge. That was more detail than you need to use Akka Edge, but sometimes it can be good to have a deeper understanding of the underlying solution. Some of these mechanisms may need additional fine-tuning, and we look forward to your input on what should be improved. Please reach out if you have any questions or feedback.

If you haven’t already, watch the webinar and study the sample application in the Akka Edge guide.

Learn how to use the specific features described in this article:

