Design Techniques for Building Streaming Data, Cloud-Native Applications: Part 1 - Spark, Flink, and Friends
Dean Wampler, Ph.D.VP of Fast Data Engineering
A Look at Streaming Engines and Fast Data Architectures
This is the first of six posts in our cloud-native series that focuses on streaming data applications. We’ll discuss requirements, architecture considerations, design patterns, specific tools, etc.
For a more detailed examination of these topics, see my free O’Reilly report, Fast Data Architectures for Streaming Applications. The last several posts in this series on streaming will discuss serving machine learning and artificial intelligence models in streaming applications. My colleague Boris Lublinsky will write those posts. For an in-depth analysis of this topic, see his free O’Reilly report, Serving Machine Learning Models.
In this first post, I’ll discuss some requirements to consider, a few architecture points, and then discuss streaming data engines, which are standalone services that do a lot of heavy lifting for streaming data pipelines.
The next post will look at an alternative approach, using streaming data libraries to embed streaming semantics in your microservices. This approach provides greater flexibility, but requires you to implement some of the capabilities you get “for free” from the engines.
Requirements for Streaming Data Applications
In brief, these forces drive requirements for streaming (“fast data”) architectures, where I’ll use the term batch processing as a catch-all term for data warehousing and other “off-line” forms of data analytics:
It’s a competitive advantage to extract useful information from data as quickly as possible.
Many applications need data faster than batch processing, like ad serving and mobile apps; however, some data analytics can be done using batch processing ahead of time).
Because of the time sensitivity, streaming analytics need to be integrated with other processing systems in the environment, more than is typical for batch processing.
Stream processing applications are “always on,” which means they require greater resiliency, availability, and dynamic scalability than their batch-oriented predecessors.
Characteristics of Streaming Data Architectures
While the requirements in the previous section can be implemented many ways, the following characteristics are widely used in implementation architectures. See also Figure 1 below:
A stream-oriented data backplane is required for capturing incoming data and serving it to consuming services, which then need to use the same backplane for results needed by downstream services. Today, Apache Kafka is the most popular choice for this data backplane. As an interservice integration tool, Kafka helps support heterogeneous environments, since everything needs to be closely linked, but not too closely linked.
The microservices community has developed mature techniques for meeting the requirements for resiliency, availability, and scalability. Hence, streaming systems need to adopt these techniques, making them work more like conventional microservices compared to batch systems.
If we extract and exploit information more quickly, leading to a more integrated environment between our microservices and stream processors, then our streaming architectures must be flexible enough to support heterogeneous workloads. This dovetails with the parallel trend toward large, heterogeneous clusters manage with Kubernetes or similar resource managers.
Figure 1 is taken from my Fast Data Architectures for Streaming Applications report. The numbers correspond to notes in the report. I won’t discuss all the details here, but note a few things. Kafka plays the central role of integrating services, both data processing and microservices, and capturing incoming data. Apache Spark and Apache Flink are popular tools for processing data. (We’ll discuss Akka Streams and Kafka Streams in the next post.) A wide range of persistence options are possible, depending on the requirements.
Characteristics of a Streaming Engine
If you know the Hadoop ecosystem, you know that the three core components of Hadoop are the following:
Hadoop Distributed File System (HDFS) for data storage
MapReduce, Spark, or similar for processing the data (“compute”)
YARN for managing resources and application instances (e..g, jobs and tasks in Hadoop parlance)
In the streaming world, the analogs are these:
Kafka for storage of “in-flight, i.e., streaming data
Spark, Flink, Akka Streams, Kafka Streams, or similar for compute
Kubernetes or similar for managing resources and application instances
Akka Streams and Kafka Streams are examples streaming libraries, as discussed above. In this post, we’ll focus on Spark and Flink as the two most popular streaming engines, representative of many alternatives available, both commercial and open-source.
Spark and Flink share a few common characteristics:
They provide high-level abstractions for working with data as a whole, such as SQL with streaming extensions and APIs that support definitions of “data flows”.
They support large datasets and high volumes of data per unit time by partitioning the data and distributing it across a cluster.
The mechanics of partitioning large data sets across a cluster and managing multiple services and application instances for processing the partitions is largely hidden from the user.
The levels of abstraction between high-level constructs, like SQL, and low-level runtime implementation details allows these tools to support several useful features:
Sophisticated streaming semantics, like event-time windowing, processing triggers, SQL over streams, etc.
Durability and resiliency mechanisms to enable effectively once processing of records and preservation of evolving application state, even in the presence of failures.
All the work these engines do for you is a great labor-saving benefit, but they have a few drawbacks.
Integration with other microservices usually requires that you run the engines separately from the microservices and exchange data through Kafka topics or other means. This adds some latency and more running applications at the system level, but also helps prevent monolithic applications.
If your application doesn’t fit the runtime model of the engine, you have to use something else.
These systems can be difficult to tune and manage.
The overhead of these systems make them less ideal for smaller data streams, e.g., those that don’t require partitioning.
In the next post, we’ll explore the characteristics of library-based approaches, such as Akka Streams and Kafka Streams. We’ll discuss their advantages and disadvantages, especially compared to engines like Spark and Flink. If you want to learn more about Fast Data architectures, check out this O'Reilly eBook:
Dean Wampler is the Vice President of Fast Data Engineering at Lightbend, where he leads the efforts around streaming data, machine learning, and real-time analytics features of Lightbend Platform. Dean is the author of Programming Scala and Functional Programming for Java Developers and the coauthor of Programming Hive, all from O’Reilly. He is a contributor to several open source projects and he is the co-organizer of several conferences around the world and several user groups in Chicago.