Today we are excited to share a technical article by our friends at GearPump (https://github.com/intel-hadoop/gearpump), a high-performance, lightweight, real-time streaming engine built on top of Akka. Sean Zhong, Kam Kasravi, Huafeng Wang, Manu Zhang and Weihua Jiang have written a quick summary below, linking to the full paper.
GearPump: A Real-time Streaming Engine Using Akka
Big data streaming frameworks today must process immense amounts of data from an expanding set of disparate data sources. Beyond the requirements of fault tolerance, scalability and performance, compelling streaming engines should provide a programming model where computations can be easily expressed and deployed in a distributed manner. Akka is a natural fit to meet these requirements.
GearPump is built on top of Akka. It uses an actor tree to model the whole streaming system (Figure 1). With this hierarchy, Akka features such as error supervision, isolation, and concurrency are leveraged to build a resilient, scalable, and performant system.
Figure 1: GearPump Actor Hierarchy
Contributions of this framework include:
- “One at a Time” Transactional Message Processing: Every message is processed immediately rather than in batches, so there is no batching-induced fixed latency. At the same time, every message is processed exactly once.
- High Performance: With 4 nodes, overall throughput can reach 11 million messages per second (100 bytes per message) with an average latency of 17 ms.
Figure 2: Performance evaluation
- Lightweight HA (High Availability) Implementation: We use Akka clustering and CRDTs (conflict-free replicated data types) to support HA and avoid any SPOF (single point of failure). The implementation is very lightweight; there is no dependency on a third-party service such as ZooKeeper.
- Flexible Code Provisioning and Deployment: The computation can be flexibly deployed to remote nodes, which is especially useful for Internet of Things use cases.
- Dynamic Computation DAG (Directed Acyclic Graph): GearPump uses a path-centric DSL to express the computation DAG, which is flexible and intuitive. This allows GearPump to add, remove, or replace a subgraph at runtime.
- Application Clock and Out-of-Order Messages: Messages may not arrive in timestamp order, so it is hard to know when to close a time window for calculation. GearPump supports this model by tracking the low watermark of the application clock.
- Flexible Programming API: GearPump supports multiple layers of API (ongoing). At the lowest layer, users can express their computation as a DAG, with each DAG node implemented as an actor (done). We are planning to add higher-level expressions such as streaming SQL on top of this.
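To make the path-centric idea concrete, here is a minimal Scala sketch of what such a DSL could look like. All names here (`Processor`, `Graph`, `~>`) are illustrative assumptions for this post, not GearPump's actual API:

```scala
// Hypothetical sketch of a path-centric DAG DSL (not GearPump's actual API).
// Each processor is a node; ~> chains nodes along a path, and the DAG is
// simply the list of edges declared by the paths.
case class Processor(name: String)

case class Graph(edges: List[(Processor, Processor)]) {
  def nodes: Set[Processor] = edges.flatMap(e => List(e._1, e._2)).toSet
}

object PathDsl {
  implicit class PathOps(val from: Processor) extends AnyVal {
    // Start a path: processor ~> processor yields a one-edge path.
    def ~>(to: Processor): List[(Processor, Processor)] = List((from, to))
  }
  implicit class EdgeListOps(val path: List[(Processor, Processor)]) extends AnyVal {
    // Extend a path: append an edge from the path's last node.
    def ~>(to: Processor): List[(Processor, Processor)] = path :+ ((path.last._2, to))
  }

  def main(args: Array[String]): Unit = {
    val split = Processor("split")
    val sum   = Processor("sum")
    val sink  = Processor("sink")
    // One path declares the whole DAG: split ~> sum ~> sink
    val dag = Graph(split ~> sum ~> sink)
    assert(dag.nodes.size == 3)
    assert(dag.edges == List((split, sum), (sum, sink)))
  }
}
```

Because the DAG is just data here (a list of edges), replacing a subgraph at runtime amounts to swapping part of the edge list and redeploying the affected processors.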
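As a rough illustration of the low-watermark idea: if each task reports the oldest timestamp it may still emit, the application clock can be taken as the minimum of those per-task clocks, and a time window ending at time t is safe to close once the watermark passes t. The sketch below is a hypothetical illustration, not GearPump's actual implementation:

```scala
// Hypothetical sketch of low-watermark tracking (not GearPump's actual code).
object LowWatermark {
  // Each task reports the oldest timestamp it may still emit;
  // the application clock is the minimum ("low watermark") of these.
  def lowWatermark(taskClocks: Map[String, Long]): Long =
    taskClocks.values.min

  def main(args: Array[String]): Unit = {
    val clocks = Map("source-1" -> 105L, "source-2" -> 98L, "sum-1" -> 101L)
    val wm = lowWatermark(clocks)
    // Windows ending at or before the watermark are safe to close:
    // no task can still emit a message with an older timestamp.
    assert(wm == 98L)
  }
}
```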
In this technical paper we share our experience, pains and gains, using Akka to build this simple yet powerful streaming engine.