People love Apache Spark. Typesafe, together with Databricks, Mesosphere and IBM, is an official support partner for Apache Spark. If you’re a Fast Data visionary, you’re looking for a modern solution to enable your streaming data applications.
In today’s DevOps world, you’re also looking for a more accessible way to work with Spark jobs in particular. Instead of tinkering directly with a specific Spark installation, you want remote access to much of its functionality via lightweight protocols. This is the gap that Spark Job Server closes.
We took the opportunity to chat with Evan Chan, a well-known name in the Scala and Spark communities and the creator of Spark Job Server. He let us ask him a few questions about this amazing open source project.
Evan: The initial idea of Spark Job Server was to provide a RESTful service covering much of the repeated infrastructure involved in running Spark jobs. In a Spark environment you normally deploy new jobs via the “spark-submit” script, which requires every Spark developer to have the full distribution installed and set up. That setup is fine if a single person implements those jobs and handles all the deploying, scheduling, and automating themselves.
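For context, this is roughly what the conventional route looks like. The `--class`, `--master`, and `--deploy-mode` options are standard `spark-submit` usage, but the class name, master URL, and jar path below are purely illustrative:

```shell
# Conventional deployment: each job author needs a full Spark
# distribution installed locally just to run this.
# (Class name, master URL, and jar path are illustrative.)
$SPARK_HOME/bin/spark-submit \
  --class com.example.WordCountJob \
  --master spark://spark-master:7077 \
  --deploy-mode cluster \
  target/scala-2.11/myapp.jar
```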
But in the real world, Fast Data projects are usually a cross-team effort. The DevOps movement especially requires everybody on those teams to be able to work with Spark jobs in some way.
Evan: Spark’s native job submission interfaces can’t be easily exposed in the same way. Spark Job Server handles this by exposing a REST-based administration interface over HTTP/S, and makes it easy for all team members to access all aspects of Spark jobs “as a Service”.
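As a rough sketch of what “Spark jobs as a Service” looks like in practice, the commands below follow the REST endpoints documented in the Spark Job Server README, assuming a server running on localhost:8090; the jar and class names are the project’s bundled test examples:

```shell
# Upload an application jar under the app name "test"
curl --data-binary @job-server-tests.jar localhost:8090/jars/test

# Submit a job synchronously over HTTP and get the result back as JSON
curl -d 'input.string = a b c a b see' \
  'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&sync=true'

# List recent jobs and their statuses
curl localhost:8090/jobs
```

Because everything goes over HTTP/S, any team member with a URL and credentials can deploy and run jobs, without a local Spark distribution.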
Spark Job Server also integrates nicely with corporate LDAP authentication. Its functionality is specifically designed for low-latency jobs and queries, and it enables enterprises to build public- and enterprise-facing functionality very easily.
Evan: Spark Job Server is not only ideal for managing ad hoc Spark jobs, but it’s especially helpful with sharing Spark Resilient Distributed Datasets (RDDs) in one SparkContext amongst multiple jobs. For example, you can spin up a SparkContext, run a job to load the RDDs, then run multiple low-latency query jobs on the RDDs. This lets you execute sub-second queries on shared RDD data.
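Sketching that pattern with the documented context endpoints — note that the loader and query job class names here are hypothetical placeholders, not part of Spark Job Server itself:

```shell
# Create a long-lived SparkContext whose cached RDDs outlive any one job
curl -X POST 'localhost:8090/contexts/shared-ctx?num-cpu-cores=2&memory-per-node=512m'

# Run a loader job once to build and cache the RDDs in that context
# (com.example.LoadRddJob is a hypothetical job class)
curl -d '' \
  'localhost:8090/jobs?appName=myapp&classPath=com.example.LoadRddJob&context=shared-ctx&sync=true'

# Subsequent low-latency query jobs reuse the already-cached RDDs
# (com.example.QueryJob is likewise hypothetical)
curl -d 'query = "some filter"' \
  'localhost:8090/jobs?appName=myapp&classPath=com.example.QueryJob&context=shared-ctx&sync=true'
```

The key design point is the `context=shared-ctx` parameter: every job submitted with it runs in the same SparkContext, so RDDs cached by one job are visible to the next.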
A typical use case for this kind of feature is the near-real-time integration of big data into time-sensitive client applications, such as mobile phones or embedded devices.
Evan: The use of Scala has been very central for the Spark Job Server project. Spark itself is written in Scala, and Scala provided the easiest integration point. But, it also had to be a Reactive web application. So, we decided to go with Akka and Spray (now Akka HTTP).
There are two separate ActorSystems, or clusters of actors, in the job server architecture (JobServer and JobContext). The Spray-JSON-Shapeless (SJS) integration made it especially easy to provide a Reactive interface on top of Spark, and allows events from Spark jobs to be streamed to the actors.
The transition of the Spark Job Server architecture from a single node to a distributed system was easy thanks to the Actor pattern and the location transparency inherent in Akka. Building Spark Job Server to solve the distribution problem, and to integrate tightly with Spark, would have been much more difficult in other languages. That said, we also integrated preliminary Java support with JavaSparkJob. The JVM is a great runtime, and we want to serve the whole JVM ecosystem as well as we can.
Many thanks to Evan for taking the time to talk to us about Spark Job Server! If you want to learn more, visit the Spark Job Server page on GitHub to join the 60 other committers on the project. If you’re looking to learn more about the role of Fast Data architectures and tools like Apache Spark, Akka, Cassandra, Kafka, Mesos and others, check out Dean Wampler’s recent white paper Fast Data: Big Data Evolved (PDF).