This is the second of six posts in our cloud-native series, which focuses on streaming data applications. In the first post, we summarized common requirements, architecture considerations, and design patterns. We explored two specific tools, Apache Spark and Apache Flink. Both are engines, the term we use for systems designed for large-scale data processing, typically run as standalone services to which work is submitted and then subdivided across a cluster. (There are various ways to run these jobs…)
This post explores an alternative class of compute tools: libraries that provide streaming semantics, but which you embed directly in your microservices.
For a more detailed examination of these topics, see my free O’Reilly report, Fast Data Architectures for Streaming Applications. The last several posts in this series, authored by Boris Lublinsky, will discuss serving machine learning and artificial intelligence models in streaming applications. For an in-depth analysis of that topic, see his free O’Reilly report, Serving Machine Learning Models.
For comparison, here is the architecture diagram I shared in the first post:
Figure 1 is taken from my Fast Data Architectures for Streaming Applications report. The numbers correspond to notes in the report. I won’t discuss all the details here, but note a few things I also highlighted in the previous post. Kafka plays the central role of integrating services, both data processing and microservices, and of capturing incoming data. Spark and Flink, which I discussed in the previous post, are popular tools for processing this data. This post discusses Akka Streams and Kafka Streams.
The previous post discussed characteristics of engines like Spark and Flink. In contrast, the libraries all share a few common characteristics:
They provide high-level abstractions for data transformations, but they don’t abstract away the partitioning that a large data set may require. You have to partition the data as you see fit, e.g., into Kafka topic partitions, and you have to explicitly launch instances of your application targeted at one or more partitions, as appropriate.
They provide tremendous flexibility in how you integrate streaming into the rest of the application logic.
They can run with very low resource overhead, including low per-record processing latency, depending on the application.
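For example, when Kafka is the source, the usual way to spread partitions across the application instances you launch is a consumer group: every instance started with the same group id is assigned a disjoint subset of the topic’s partitions. Here is a minimal sketch using the plain Kafka consumer API; the topic name, group id, and broker address are hypothetical placeholders:

```scala
import java.time.Duration
import java.util.Properties

import scala.jdk.CollectionConverters._ // Scala 2.13+; use scala.collection.JavaConverters in 2.12

import org.apache.kafka.clients.consumer.KafkaConsumer

object PartitionedWorker {
  // Build consumer settings; all instances sharing a group id split the partitions.
  def consumerProps(groupId: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // adjust for your cluster
    props.put("group.id", groupId)
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }

  def main(args: Array[String]): Unit = {
    // Launch N copies of this process; Kafka assigns each one a subset of the
    // "events" topic's partitions (instances beyond the partition count sit idle).
    val consumer = new KafkaConsumer[String, String](consumerProps("event-workers"))
    consumer.subscribe(List("events").asJava)
    while (true) {
      consumer.poll(Duration.ofMillis(500)).asScala.foreach { record =>
        println(s"partition ${record.partition}: ${record.value}")
      }
    }
  }
}
```

Scaling up means launching another instance with the same group id; Kafka rebalances the partitions automatically, but you trigger and manage those launches yourself.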
Specifically, for Akka Streams vs. Kafka Streams:
Kafka Streams supports event-time windowing, processing triggers, and SQL over streams (through the related KSQL project, which is built on Kafka Streams).
Akka Streams supports fine-grained manipulation of data flows and record processing.
Kafka Streams reads and writes Kafka topic partitions. Akka Streams integrates with a wide variety of sources and sinks, through the Alpakka library.
Both libraries provide durability and resiliency mechanisms that enable effectively-once processing of records and preservation of evolving application state, even in the presence of failures.
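To make the Kafka Streams points above concrete, here is a minimal sketch of event-time windowed counting using the Kafka 2.x Scala DSL, assuming a stream of clicks keyed by user. The topic names, application id, and broker address are hypothetical placeholders:

```scala
import java.time.Duration
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig, Topology}
import org.apache.kafka.streams.kstream.TimeWindows
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object ClickCounts {
  // Count records per key in one-minute event-time windows.
  def buildTopology(): Topology = {
    val builder = new StreamsBuilder
    builder
      .stream[String, String]("clicks")                       // hypothetical input topic
      .groupByKey
      .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))      // event-time windows
      .count()
      .toStream
      .map((windowedKey, count) =>                            // flatten the windowed key
        (s"${windowedKey.key}@${windowedKey.window.start}", count.toString))
      .to("click-counts")                                     // hypothetical output topic
    builder.build()
  }

  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counts-app")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // adjust for your cluster
    val streams = new KafkaStreams(buildTopology(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```

Note how the partitioning point above applies here too: you scale this application by launching more instances with the same application id, and Kafka Streams divides the input topic’s partitions among them.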
Engines like Spark and Flink come with their own trade-offs; the strengths and weaknesses of the libraries are complementary:
Integrating microservices with the engines usually requires running the engines separately from the microservices and exchanging data through Kafka topics or other means. This adds some latency and more running applications to manage at the system level, although it also helps prevent monolithic applications. In contrast, the libraries are embedded in the microservices themselves, so data access costs only a function call.
Compared to the engines, libraries enable very flexible choices for how data is processed at runtime and how the applications are deployed, monitored, and managed, including scaling. However, you have to implement many of these capabilities yourself.
Library-based applications are often easier to tune, because they are fundamentally simpler systems. However, the automatic sizing on start-up provided by Spark and Flink, and their ability to scale up and down in some contexts, are capabilities you have to implement yourself in a library-based approach.
The low overhead of these libraries makes them ideal for smaller data streams, especially when partitioning is not required.
So, which should you choose: Apache Spark, Apache Flink, Akka Streams, or Kafka Streams? Here are some basic rule-of-thumb guidelines. You could easily use more than one option, depending on the particular problem:
Pick a streaming engine when your data streams are usually large enough to require partitioning and you want to minimize manual handling of this task:
Pick Spark if you already use it for batch processing or you want lots of options for out-of-the-box integration with machine learning systems.
Pick Flink if you aren’t using Spark for batch processing or you need the most state-of-the-art streaming semantics.
Pick a streaming library when partitioning is less often required or you prefer to work with the same microservices tools and processes you already know:
Pick Akka Streams when you need very fine-grained control over the data flow processing and you want the full suite of Lightbend Platform tools for your microservices (e.g., the rest of Akka!).
Pick Kafka Streams if all your streaming data is stored in Kafka, so your streaming jobs only need to read from and write to Kafka topics, or you come from a data background where SQL-like operations and windowing semantics are important to you.
Note that you can still complement Kafka Streams with the Lightbend Platform for the rest of your microservice needs.
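To illustrate the fine-grained data flow control mentioned above for Akka Streams, here is a minimal sketch assuming Akka 2.6+ (where the ActorSystem provides the default materializer): a reusable Flow stage composed into a small, bounded stream. The object and stage names are my own, not from the tutorials:

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

object WordCount {
  implicit val system: ActorSystem = ActorSystem("word-count")

  // A reusable Flow stage: split lines into lowercase words, dropping empties.
  // Stages like this can be composed, tested, and shared independently.
  val tokenize: Flow[String, String, NotUsed] =
    Flow[String]
      .mapConcat(_.toLowerCase.split("""\W+""").toList)
      .filter(_.nonEmpty)

  // Run a bounded stream through the flow and fold the words into counts.
  def countWords(lines: Seq[String]): Map[String, Int] = {
    val result = Source(lines.toList)
      .via(tokenize)
      .runWith(Sink.fold(Map.empty[String, Int]) { (counts, word) =>
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      })
    Await.result(result, 5.seconds)
  }

  def main(args: Array[String]): Unit = {
    println(countWords(Seq("Akka Streams example", "streams of data")))
    system.terminate()
  }
}
```

In a real microservice you would replace the in-memory Source and Sink with Alpakka connectors (e.g., Kafka topics), while the intermediate stages stay unchanged.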
There’s really a lot more to it. See my O’Reilly report listed above for more details on these choices and how to select which ones are best for your needs. Also, my colleague Boris Lublinsky and I developed several tutorials that illustrate using these options. They are freely available on GitHub:
Kafka-with-akka-streams-kafka-streams-tutorial - Scala and Java examples of stream processing with Akka Streams and Kafka Streams. The sample application serves machine learning models (i.e., scores data records with them), including the ability to dynamically update the models in the running applications.
Model-serving-tutorial - An update to the previous tutorial that focuses more closely on the model serving problem. It adds examples using Spark, Flink, and TensorFlow Serving. The code examples are all in Scala.
The next post in this series explores the important problem of managing state in a streaming application and how to access that state from outside the application, e.g., using queryable state.
Finally, if you'd like to learn how to make using Akka Streams and Spark simpler with Kafka and Kubernetes, check out Lightbend Pipelines, the newest module in Lightbend Platform. You can watch a 2-min intro video, or read more about it here: