A couple of weeks ago, Typesafe launched the results of a survey in which over 2000 people were asked about the explosive adoption of Apache Spark. In the Slideshare presentation embedded above, you can see a sneak preview of some of the results of Apache Spark: Preparing for the Next Wave of Reactive Big Data, but the full version has a lot more to offer. The Scala community is showing intense interest in Apache Spark as well (according to the report, 88% of Spark users are working in Scala, 44% in Java, 22% in Python). So as resident “Apache Spark guy”, I thought it would be nice to put the popularity of Apache Spark in context, looking at what led us here, how enterprises are reacting, and what the needs of the mid-market really are.
It’s easy to blame our problems on the Internet, but in this case we wouldn’t be far off. The scope of the challenge we have now is essentially rooted in the fact that the Internet created enormous petabyte data sets that no one knew how to manage.
When the biggest Internet companies started accumulating data sets of unprecedented size, we saw an emergence of alternative technologies with better scalability at much lower cost and better support for “always-on” reliability. Hadoop, NoSQL databases and massive-scale virtual file systems were outgrowths of this trend.
The biggest problem of the last few years has been the need to get answers faster. It’s less acceptable to wait hours or days to extract important information from new data via batch processing. Historically, Hadoop has been focused on batch processing of data. Now, organizations don’t want to wait. Ideally, they want useful information extracted from data as soon as it arrives.
Hadoop is ideal for truly massive data sets, like those that Twitter, Facebook and others have accumulated. However, enormous, all-encompassing installations don’t always get to the heart of the matter, so there is a pressing need for more flexible, lightweight and resource-efficient tools for smaller data sets. Not everyone can “throw hardware at the problem.” So, there is a large market of mid-size, mainstream data sets that need servicing. We can call them “Medium Data” needs. The trouble is, where does the expertise lie in this area?
A challenge for organizations that need to process small or medium-sized data sets is that they tend to be short of engineers experienced in using Big Data tools this way. This has created an opportunity for the Hadoop vendors to provide that expertise when organizations don’t have the people or development culture in place to know what to do.
The Big Data world does not share the “developer culture” that many of us are used to; here, we need to bend the traditional development process that enterprise application developers are accustomed to. Big Data applications tend to be small, and releasing them iteratively is essential. Thus, many enterprise development processes and tools geared towards longer-term projects don’t fit well here. What is required is true agility and development tooling that fits that idea.
Big Data affects the operations side, too. For example, when I did Hadoop consulting a few years ago, I found that many operations teams weren’t capable of operating clusters of servers. They were used to running a few big servers, not clusters of smaller ones.
Hadoop is a great tool for batch-mode analysis, but MapReduce has limitations, namely performance issues for complex jobs, a difficult programming model and a troublesome API. Hadoop MapReduce was never designed for rapid analysis of smaller data sets. Its core architecture makes it essentially impossible to adapt MapReduce for this purpose.
MapReduce runs coarse-grained processes, with a lot of overhead. That overhead is amortized over very large data sets that get processed in batch runs lasting many minutes to hours (or longer), but it doesn’t make as much sense for jobs running small or medium data sets.
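To make the amortization argument concrete, here is a back-of-the-envelope sketch in Python. The 30-second fixed overhead and the 0.5 GB/s throughput are invented numbers chosen only to show the shape of the curve; they are not measurements of Hadoop or any real cluster, and the function name is hypothetical.

```python
def overhead_fraction(data_gb, fixed_overhead_s=30.0, throughput_gb_per_s=0.5):
    """Fraction of total job time spent on fixed startup overhead.

    All default values are hypothetical, for illustration only.
    """
    processing_s = data_gb / throughput_gb_per_s
    return fixed_overhead_s / (fixed_overhead_s + processing_s)


# Compare a small, a medium and a very large data set.
for size_gb in (1, 100, 10_000):
    share = overhead_fraction(size_gb)
    print(f"{size_gb:>6} GB: {share:.1%} of run time is fixed overhead")
```

With these made-up figures, the fixed cost dominates the 1 GB job and all but disappears for the 10 TB job, which is the sense in which the overhead only "pays off" on very large batch runs.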
Today, more and more organizations want to analyze data as it comes in, so-called event-stream processing. Examples include ingesting the Twitter “firehose”, server log data, device telemetry, real-time traffic, etc. MapReduce requires you to capture that data to a file system and then periodically sweep through it for analysis. Instead, that data needs to be analyzed as it arrives, to extract useful information, often with strict time constraints to react to it.
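The difference between the two styles can be sketched with a toy, in-memory Python example. This is only an illustration of the idea under simplifying assumptions: the function names and the `status` field are hypothetical, and real stream processors add windowing, fault tolerance and distribution, none of which appears here.

```python
from collections import Counter


def batch_counts(stored_events):
    """Batch style: the data has already been captured; sweep it, report once."""
    return Counter(event["status"] for event in stored_events)


def streaming_counts(event_source):
    """Streaming style: update a running result and emit it after every event."""
    running = Counter()
    for event in event_source:
        running[event["status"]] += 1
        yield dict(running)  # a fresh snapshot is available immediately


# Hypothetical server-log events, reduced to an HTTP status code each.
events = [{"status": 200}, {"status": 500}, {"status": 200}]

snapshots = list(streaming_counts(iter(events)))
final_batch = batch_counts(events)
```

Both approaches end with the same totals, but the streaming version has an answer after the very first event, whereas the batch version has nothing to say until the whole sweep has run.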
Another concern for developers is the MapReduce Java API itself. It is very low level and quite tedious to use. It doesn’t provide many of the common operations you do over and over again, like joins and sorting. You have to implement these operations yourself. Developers need a concise, intuitive API that provides these operations and other widely needed features, while still supporting the kinds of custom algorithms that only a “Turing-complete” language can provide. This is probably why we’ve seen such a strong response from the Scala community when it comes to faster, more powerful and easy-to-use data computation tools like Apache Spark (written in Scala). I suggest you find out more about this in Apache Spark: Preparing for the Next Wave of Reactive Big Data.
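To give a feel for the tedium of writing a join yourself, here is a small Python sketch of a join written by hand in map/shuffle/reduce style. It is a conceptual illustration, not the Hadoop Java API: every function name and the sample `users` and `orders` data are hypothetical, and the `shuffle` step stands in for what the framework would do between the two phases.

```python
from collections import defaultdict
from itertools import product


def map_phase(users, orders):
    """Emit (key, (tag, value)) pairs, tagging each record with its source."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group tagged values by key (the framework's job between map and reduce)."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """Inner join: pair every user record with every order record per key."""
    for key, tagged_values in groups.items():
        names = [value for tag, value in tagged_values if tag == "user"]
        amounts = [value for tag, value in tagged_values if tag == "order"]
        for name, amount in product(names, amounts):
            yield key, name, amount


users = [(1, "ann"), (2, "bob")]
orders = [(1, 30), (2, 5), (1, 12)]

# Join the two data sets, then sort the joined rows.
joined = sorted(reduce_phase(shuffle(map_phase(users, orders))))
print(joined)
```

That is roughly thirty lines for an inner join and a sort over toy data. In a higher-level API the same result is typically a call or two; Spark's pair RDDs, for instance, offer `join` and `sortByKey` operations.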
On this note, I’ll stop here to keep things a bit short for you busy engineers out there. Here is the Too Long; Didn’t Read (TL;DR) version of my main points in this article:
Evolving Internet use created petabyte data sets that no one could manage with traditional tools in a cost-effective way; tools like Hadoop and NoSQL databases emerged to handle Big Data.
In recent years, Fast Data has become more important than Big Data for reacting to real-time events and making decisions; slower batch processing of large data sets with Hadoop is not always as important for most SMEs.
Hadoop MapReduce cannot be adapted to rapidly analyze small data sets or to do event-stream processing. MapReduce’s architecture, API and programming model are being put aside in favor of Apache Spark, which better serves the Fast Data needs of the larger mid-market segment of companies, while still supporting large-scale batch processing.
Apache Spark is seeing exponential growth in adoption and awareness; more about this can be viewed in the recent Typesafe survey mentioned above of over 2000 respondents.
For now, check out some recent talks I’ve given: Why Scala is Taking Over the Big Data World (Skills Matter) and Why Spark is the Next Top (Compute) Model (InfoQ). Please leave comments below, or follow my Big/Fast Data musings at @deanwampler on Twitter.