I started my first Hadoop-based Big Data projects four years ago. As an experienced Enterprise and Internet developer with a love of Scala, I found it frustrating that writing Hadoop jobs meant working in the tedious, low-level MapReduce API. The powerful, functional operations we know and love in Scala's collections were absent. The object orientation of Java at the time (before Java 8 introduced lambdas) was a poor fit for the dataflow model that is natural for analytics, where a series of transformations is routine. Functional programming, as expressed in Scala's collections, for example, is a much better fit. It was also clear that MapReduce's performance was poor and that it couldn't be used for "real-time" event stream processing.
There were a few tools that layered higher-level abstractions and easier APIs on top of MapReduce. If your problem fit a query model, you could do SQL queries with Hive. For some common dataflow problems, you could use Pig, but it was quirky and not "Turing-complete". Some third-party APIs emerged with better abstractions, such as Cascading (Java), Scalding (Scala), and Scoobi (also Scala), but they couldn't fix the inherent limitations of MapReduce as a compute engine.
But interest in an alternative compute engine, Apache Spark, was growing. Today, Spark has emerged as the next-generation platform for writing Big Data applications.
Spark addresses the limitations of MapReduce in the following ways.
It's true that a wide range of algorithms can be implemented in MapReduce, but its limited set of operations makes the effort difficult, requiring special expertise and "clever hacks". In contrast, the Spark API offers the same powerful operations that Scala's collections offer, only at much larger scale. These operations allow arbitrarily sophisticated algorithms to be implemented with relatively little code and few programming hacks. They are available in Spark's Scala, Java, and Python APIs. Writing Spark applications feels very similar to writing applications with Scala's collections.
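To make that analogy concrete, here is a minimal word-count sketch written against plain Scala collections only (no Spark dependency); the comments show the near-identical RDD calls you would write in Spark. The object and method names here are mine, chosen for illustration.

```scala
// Word count in plain Scala collections. Each step has a near-identical
// counterpart in Spark's RDD API, shown in the trailing comments.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("""\W+"""))      // rdd.flatMap(_.toLowerCase.split("""\W+"""))
      .filter(_.nonEmpty)                           // rdd.filter(_.nonEmpty)
      .map(word => word -> 1)                       // rdd.map(word => (word, 1))
      .groupBy(_._1)                                // ~ rdd.reduceByKey(_ + _)
      .map { case (word, pairs) => word -> pairs.size }
}
```

In Spark, the same pipeline runs distributed across a cluster, but the code you write is essentially what you see above; `reduceByKey` replaces the `groupBy`/`map` pair because RDDs of key-value pairs get dedicated aggregation operations.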
Spark's Resilient Distributed Datasets (RDDs) are fault-tolerant, distributed collections of data with in-memory caching to avoid unnecessary round trips to disk. This feature alone has been shown to improve performance over MapReduce by 10x to 100x.
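The benefit of caching can be sketched without Spark at all, using a Scala collections analogy: a lazy view re-runs its chain of transformations on every traversal, much as an uncached RDD recomputes its lineage, while materializing the result once plays the role of `rdd.cache()`. This is only an illustration of the idea, not Spark code.

```scala
// Analogy for RDD caching: a lazy view recomputes on every traversal
// (like an uncached RDD's lineage); materializing it once avoids that
// (like rdd.cache()).
object CachingSketch {
  // Returns (computations after two view traversals, computations at the end).
  def demo(): (Int, Int) = {
    var computations = 0
    val data = (1 to 5).view.map { n => computations += 1; n * n } // lazy "lineage"

    data.sum                           // first traversal: 5 computations
    data.sum                           // second traversal: 5 more
    val afterViews = computations      // 10 so far: the work was done twice

    val cached = data.toVector         // materialize once, akin to cache(): 5 more
    cached.sum                         // no recomputation
    cached.sum                         // no recomputation
    (afterViews, computations)
  }
}
```

In Spark the savings are far larger, because each recomputation of an uncached RDD can mean re-reading data from disk and re-running every upstream transformation across the cluster.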
Part of Spark's success is due to the foundation it is built upon, components of the Typesafe Reactive Platform.
First, there's Scala. Scala is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages, enabling Java and other programmers to be more productive.
Why was Spark itself written in Scala? Quite a few people ask this question, and in the words of Spark's creators, the answer is pretty simple: "When we started Spark, we had two goals: we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft's DryadLINQ (the first language-integrated big data framework I know of, that begat things like FlumeJava and Crunch). On the JVM, the only language that would offer that kind of API was Scala, due to its ability to capture functions and ship them across the network. Scala's static typing also made it much easier to control performance compared to, say, Jython or Groovy."
Akka is the second Typesafe component used within Spark. Akka is a toolkit for building highly concurrent, distributed, and fault-tolerant, event-driven applications. With its Actor model of computation and "let it crash" approach to resiliency, Akka provides the essential tools for building reactive JVM applications: distributed, fault-tolerant, and event-based systems that can scale up to large clusters or down to small processes, as needed.
The combination of Apache Spark and the Typesafe Reactive Platform, including Scala, Akka, Play, and Slick, gives Enterprise developers a comprehensive suite of tools for building Certified on Spark applications with minimal effort.
Typesafe will continue to build technologies and functionality that make Spark even greater, including improvements to Scala and the Scala library. We'll also work to make the developer experience as seamless as possible across our tools.
Are you interested in trying Spark? I encourage you to check out our growing collection of Typesafe Activator templates for Spark, especially my introductory Spark Workshop, which is our first Certified on Spark application. More templates are coming soon.