This article is featured in the DZone Guide to Big Data, Business Intelligence, and Analytics – 2015 Edition.
One of the most noteworthy trends for Big Data developers today is the growing importance of speed and flexibility for data pipelines in the enterprise.
Big Data got its start in the late 1990s when the largest Internet companies were forced to invent new ways to manage data of unprecedented volumes. Today, when most people think of Big Data, they think of Hadoop or NoSQL databases. However, the original core components of Hadoop—HDFS (Hadoop Distributed File System for storage), MapReduce (the compute engine), and the resource manager now called YARN (Yet Another Resource Negotiator)—were, until recently, rooted in the “batch-mode” or “offline” processing that was commonplace at the time. Data was captured to storage and then processed periodically with batch jobs. Most search engines worked this way in the beginning; the data gathered by web crawlers was periodically processed into updated search results.
The first generation of Big Data was primarily focused on data capture and offline batch-mode analysis. But the new “Fast Data” trend has concentrated attention on narrowing the time gap between data arriving and value being extracted from that data.
The opposite end of the spectrum from batch data is real-time event processing, where individual events are processed as soon as they arrive with tight time constraints, often microseconds to milliseconds. High-frequency trading systems are one example, where market prices move quickly, and real-time adjustments control who wins and who loses. Between the extremes of batch and real-time are more general stream processing models with less stringent responsiveness guarantees. A popular example is the mini-batch model, where data is captured in short time intervals and then processed as small batches, usually within time frames of seconds to minutes.
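To make the mini-batch model concrete, here is a minimal sketch in plain Python (not Spark; the event data and the two-second interval are illustrative assumptions) that groups a stream of timestamped events into fixed time intervals and processes each interval as a small batch:

```python
from collections import defaultdict

def mini_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time intervals.

    Each yielded batch is processed as a unit, mirroring the mini-batch
    streaming model: data is buffered for `interval` seconds, then handed
    to ordinary batch logic.
    """
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    for window in sorted(batches):
        yield window * interval, batches[window]

# Illustrative stream of (timestamp-in-seconds, event) pairs.
events = [(0.5, "a"), (1.2, "b"), (2.1, "c"), (3.9, "d"), (4.0, "e")]

for start, batch in mini_batches(events, interval=2):
    print(f"[{start}s, {start + 2}s): {batch}")
# [0s, 2s): ['a', 'b']
# [2s, 4s): ['c', 'd']
# [4s, 6s): ['e']
```

In a real system the batches would be cut by a clock as data arrives rather than computed after the fact, but the essential idea is the same: responsiveness is traded for the simplicity and throughput of batch processing.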
The phrase “fast data” captures this range of new systems and approaches, which balance various tradeoffs to deliver timely, cost-efficient data processing, as well as higher developer productivity. Let’s begin by discussing an emerging architecture for fast data.
What high-level requirements must a Fast Data architecture satisfy? They form a triad: reliable ingestion of streaming data, flexible and timely processing of that data, and scalable, resilient storage of the results.
The components that meet these requirements must also be reactive (meaning they scale up and down with demand); resilient against failures that are inevitable in large distributed systems; responsive to service requests (even if failures limit the ability to deliver services); and driven by events from the world around them.
For applications where real-time, per-event processing is not needed—where mini-batch streaming is all that’s required—research shows that the following core combination of tools is emerging as very popular: Spark Streaming, Kafka, and Cassandra. According to a recent Typesafe survey, 65 percent of respondents use or plan to use Spark Streaming, 40 percent use Kafka, and over 20 percent use Cassandra.
Spark Streaming ingests data from Kafka, databases, and sometimes directly from incoming streams and file systems. The data is captured in mini-batches, which have fixed time intervals on the order of seconds to minutes. At the end of each interval, the data is processed with Spark’s full suite of APIs, from simple ETL to sophisticated queries, even machine learning algorithms. These APIs are essentially batch APIs, but the mini-batch model allows them to be used in a streaming context.
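The key point, that ordinary batch logic applies unchanged to each mini-batch, can be sketched in plain Python (this is not the Spark API; the word-count transform and sample data are illustrative):

```python
def word_count(batch):
    """Batch-style word count: works unchanged on a full dataset or a mini-batch."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# The same batch logic applied to each mini-batch of a stream.
stream = [
    ["fast data", "big data"],             # mini-batch for interval 1
    ["data pipelines", "fast pipelines"],  # mini-batch for interval 2
]
for batch in stream:
    print(word_count(batch))
```

In Spark Streaming the equivalent transformation would be expressed once and applied automatically to every interval; the sketch simply shows why no separate "streaming version" of the logic is needed.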
This architecture makes Spark a great tool for implementing the Lambda Architecture, where separate batch and streaming pipelines are used. The batch pipeline processes historical data periodically, while the streaming pipeline processes incoming events. The result sets are integrated in a view that provides an up-to-date picture of the data. A common problem with this architecture is that domain logic is implemented twice, once for the streaming pipeline and once for the batch pipeline. But code written with Spark for the batch pipeline can also be used in the streaming pipeline, using Spark Streaming, thereby eliminating the duplication.
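The duplication problem and its resolution can be illustrated with a small Python sketch (the `enrich` transform and record fields are hypothetical; real pipelines would run this logic on Spark):

```python
def enrich(record):
    """Domain logic shared by both pipelines (illustrative transform)."""
    return {**record, "total": record["price"] * record["qty"]}

def batch_pipeline(historical):
    # Periodic batch job over the full historical dataset.
    return [enrich(r) for r in historical]

def streaming_pipeline(mini_batch):
    # The same logic, applied to each incoming mini-batch.
    return [enrich(r) for r in mini_batch]

def merged_view(batch_results, streaming_results):
    # An up-to-date picture: batch results plus the latest events.
    return batch_results + streaming_results

historical = [{"price": 10, "qty": 2}]
incoming = [{"price": 5, "qty": 3}]
view = merged_view(batch_pipeline(historical), streaming_pipeline(incoming))
print(view)
```

Because `enrich` is written once, a fix or a new business rule lands in both pipelines simultaneously, which is precisely the benefit of sharing Spark code between the batch and streaming layers.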
Kafka provides very scalable and reliable ingestion of streaming data organized into user-defined topics. By focusing on a relatively narrow range of capabilities, it does what it does very well. Hence, it makes a great buffer between downstream tools like Spark and upstream sources of data, especially those sources that can’t be queried again in the event that data is, for some reason, lost downstream.
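The buffering role described above comes from one property of Kafka's design: records are retained in an append-only log after delivery, so consumers track their own read positions and can rewind. A toy, in-memory stand-in (this is not the Kafka API; class and method names are invented for illustration) captures the idea:

```python
class TopicBuffer:
    """A toy, in-memory stand-in for a Kafka topic.

    The log retains records after delivery, so a downstream consumer
    that loses data can rewind to an earlier offset and re-read. This is
    the buffering role Kafka plays between data sources and tools like
    Spark.
    """
    def __init__(self):
        self.log = []

    def produce(self, record):
        self.log.append(record)

    def consume(self, offset, max_records=10):
        """Read from `offset`; returns (records, next_offset)."""
        records = self.log[offset:offset + max_records]
        return records, offset + len(records)

topic = TopicBuffer()
for r in ["evt-1", "evt-2", "evt-3"]:
    topic.produce(r)

records, next_off = topic.consume(0)   # first read
replayed, _ = topic.consume(1)         # rewind and re-read after a failure
print(records, replayed)
```

An upstream source that cannot be re-queried, say a sensor feed or a third-party webhook, only has to deliver each event once; replay after a downstream failure comes from the log.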
Finally, records can be written to a scalable, resilient database, like Cassandra, Riak, or HBase; or to a distributed filesystem, like HDFS or S3. Kafka might also be used as a temporary store of processed data, depending on the downstream access requirements.
Most enterprises today are really wrangling data sets in the multi-terabyte size range, rather than the petabyte size range typical of the large, well-known Internet companies. They want to manipulate and integrate different data sources in a wide variety of formats, a strength of Big Data technologies. Overnight batch processing of large datasets was the start, but it only touched a subset of the market requirements for processing data.
Now, speed is a strong driver for an even broader range of use cases. For most enterprises, that generally means reducing the time between receiving data and turning it into usable information. This can include traditional techniques like joining different datasets, and creating visualizations for users, but in a more “real-time” setting. Spark excels at working with data from a wide variety of sources and in a wide variety of formats, from small to large sizes, and with great efficiency at every scale.
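A dataset join of the kind mentioned above reduces to a simple idea, matching records on a shared key, sketched here in plain Python (Spark would distribute this across a cluster; the field names and sample records are illustrative):

```python
def inner_join(left, right, key):
    """Join two lists of dicts on a shared key (a simple hash join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Two hypothetical sources: a user table and a clickstream.
users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Lin"}]
clicks = [{"id": 1, "page": "/home"}, {"id": 1, "page": "/docs"}]
print(inner_join(users, clicks, "id"))
```

The fast-data twist is simply that the right-hand side arrives continuously, so the join runs per mini-batch against reference data instead of once per night.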
You tend to think of open source disrupting existing markets, but this streaming data/fast data movement was really born outside of commercial, closed-source data tools. First you had large Internet companies like Google solving the fast data problem at a scale never seen before, and now you have open-source projects meeting the same needs for a larger community. For developers diving into Fast Data, the Spark/Kafka/Cassandra “stack” is the best place to start.