In my previous post Why Enterprises of different sizes are adopting ‘Fast Data’ with Apache Spark, I gave a quick introduction to how massive petabyte data sets proved to be unmanageable in a cost-effective way with traditional tools, which paved the way for Hadoop and NoSQL databases. Hadoop has traditionally been an environment for batch processing, while NoSQL databases provided some subset of record-oriented CRUD operations. More recently, the need to process event streams has become more important. My Typesafe colleague Jonas Bonér calls this “Fast Data”.
Another trend is the broader adoption of Big Data tools by mid-size organizations. They want the flexibility and cost advantages of tools like Hadoop, but they don’t have the same massive scale requirements. In other words, they need tools that scale down, as well as up. Hadoop services, especially MapReduce, have relatively large overhead, which isn’t amortized as well over smaller computations.
Hadoop MapReduce cannot be used for streaming data. This is one of the factors driving interest in Apache Spark. Spark not only supports large-scale batch processing, it also offers a streaming module for Fast Data needs. For more about Apache Spark adoption and usage, please download our latest publication “Apache Spark: Preparing for the Next Wave of Reactive Big Data” (see the Slideshare preview below).
Let’s review some representative examples of what enterprises would like to do with event stream processing:
Ingest the Twitter firehose or other social media feed, to see trending topics, especially when they relate to your business.
Aggregate and analyze log and telemetry data coming from servers and remote devices (think Internet of Things - IoT). Besides supporting location-aware services, such as finding the nearest restuarants, this data can be used to predict demand to drive dynamic scaling of services, predict imminent device failures (e.g., medical devices and network appliances), etc.
Analyze click-stream data to discover usability issues. Are users struggling with parts of your website or service? What parts do they use most often?
Capture financial transactions and look for patterns indicative of fraudulent activity.
Train spam filters on the fly, as new emails arrive, so these filters reflect up-to-date, not stale conditions.
This application type is problematic for MapReduce, whose batch-mode orientation for these needs is awkward at best and unviable at worst. All of these scenarios and more require real-time or at least frequent analysis of this data. The incoming stream could be small or very large; regardless, waiting many minutes or hours for the next MapReduce batch analysis is no longer adequate for many scenarios.
MapReduce uses coarse-grained tasks to do its work, which are too heavyweight for iterative algorithms. Another problem is that MapReduce has no awareness of the total pipeline of Map plus Reduce steps, so it can’t cache intermediate data in memory for faster performance. Instead, it flushes intermediate data to disk between each step. Combined, these sources of overhead make algorithms requiring many fast steps unacceptably slow.
For example, many machine learning algorithms work iteratively. Training a recommendation engine or neural networks and finding natural clusters in data work intuitively like this: you “guess” an answer to a set of equations representing your model, you measure the error, and then adjust the trial answer in a way that reduces the error. Wash, rinse, repeat until your guess produces small enough errors that your answer is good enough.
Another use for iterative algorithms is graph analysis. For example, Twitter and other social networks have data about the links between you, your connections, their connections, and so forth. That data is nicely modeled as a graph. Most graph algorithms traverse the graph one connection per iteration.
Obviously, you want the steps in these kinds of algorithms to be as fast and lightweight as possible.
The problem is escalated by the growing need to get answers faster, where previously a batch approach was considered good enough, but now a streaming approach that extracts information from data more quickly is a competitive advantage. Updating search engine data was traditionally done with periodic batch runs. It used to be fine if you updated your web pages today and those changes showed up in search engines tomorrow or so. Today, the competitive advantage lies in search engines that reflect the changing Internet as quickly as those changes are detected by web crawlers. For example, if a user searches for information about a news event, you would like your search engine to take that user to the latest information available.
Until now, developers uses various MapReduce hacks or alternative tools to support these needs, but clearly there is a need for a better computation engine that supports these algorithms directly, while continuing to support more traditional batch processing of large datasets.
Some of the reasons enterprises have begun adopting Spark heavily is that it’s more flexible and composable, with an API that is more concise and powerful that MapReduce.
Spark is built on a powerful core of fine-grained, lightweight, composable operations. So, you the developer get many of the operations you previously had to write yourself. Also, because they are lightweight, it’s easy to build iterative algorithms with excellent performance, at scale!
The flexibility and support for iteration also allows Spark to handle event stream processing in a clever way. Interestingly, Spark was designed to be a batch mode tool, like MapReduce, but it’s fine-grained nature means that it can process very small batches of data. Hence, Spark’s approach to streaming is to capture data in short time windows (down to one second or so) and do computations over each “mini-batch”.
This doesn’t meet all streaming needs. Sometimes you need to process each event, one at a time, but Spark provides a great “90% solution” for streaming needs. This stream capability combined with support for iterative algorithms, as well as Spark’s ability to run efficiently in smaller configurations, makes Spark applicable to a wider class of problems than MapReduce.
Developers who encounter the Spark API for the first time, whether it’s the Scala, Java, or Python API for Spark, are amazed at how concise it is compared to the MapReduce API. Developers become much more productive … and happy, I believe. There’s a subtle aspect of this, too. When an API is much more concise and much easier to use, it transforms the development process. Bugs appear less frequently. It becomes possible to write, deploy, tweak, then re-deploy Spark applications, much more so than MapReduce applications. The essential agility needed by data teams is easier to realize. Spark’s better performance makes it easier to build smaller, more focused applications that require fewer resources in production, too.
Finally, the concept of Elasticity comes into play here, starting with two aspects of scaling down. First, it’s easier to process data in smaller increments - getting closer to real time event processing, as we discussed. It’s also easier to use Spark on smaller datasets because there is less overhead to amortize. For example, you can easily run Spark on a laptop, whereas it’s much harder to do that with MapReduce. Both aspects broaden the spectrum of problems you can address.
Put another way, there are many data problems where the size of the data is too large to work with on a single machine, but a small cluster is perfect. The overhead of running MapReduce on these datasets is noticeable, especially if you are doing interactive queries of the data. Spark scales down nicely to such small clusters and it’s fast performance makes interactive analysis possible.
There is a widespread myth that Spark is an “in-memory” tool, meaning it only works if your data fits in cluster memory. This is false. Spark also scales up to very large data sets, where it’s performance will still be better than the performance of MapReduce in almost all cases. There is some ongoing work to improve Spark’s usage of resources at extreme scales, where MapReduce has had more years to mature, but Spark is rapidly catching up.
Enterprises are adopting Spark for a variety of reasons, such as the ability to support new iterative enterprise processing needs (like iterative event streaming). Spark can efficiently scale up and down using minimal resources, and developers enjoy a more concise API, which helps them be more productive. In summary:
Enterprises need the ability to process event streams in real-time for more reactive decision making. MapReduce’s powerful-but-slow batch processing is not able to meet these expectations for applications analyzing the firehose of social media feeds or incoming data from future IoT devices, and so on.
Iterative processing capabilities for things like machine learning algorithms and graphical analysis are provided by Spark, yet architecturally impossible for MapReduce. The market for “Fast Data” relies on instant, iterative processing to gain competitive advantage.
Spark’s simple & powerful API, greater flexibility & composability and the ability to elastically scale up & down in a resource efficient way is making it an attractive technology both for developers and organizations alike. Unlike the huge processing power needed to run MapReduce, Spark can be run locally on a single laptop.
Thanks for reading, and feel free to leave comments below or ping me on Twitter at @deanwampler.
p.s. Are you interested in learning more about Apache Spark from the experts firsthand? Sign up for Scala Days - San Francisco, March 16-18. There will be 3 technical sessions plus one full-day training session (just a few spaces left!)
For you attentive readers, the first 5 people who request a coupon code from info@scaladays.org (referencing this blog URL) will receive a non-trivial discount on tickets!