Scala and Spark Notebook: The Next Generation Data Science Toolkit
A Little Background...
Andy Petrella at Data Fellas wants you to be productive with your data. This explains why he labored to create Spark Notebook, a fascinating tool that lets you use Apache Spark in your browser and is designed for creating reproducible analyses with Scala, Apache Spark and other technologies. Spark Notebook offers these capabilities to anybody who needs to play with data, leveraging not only Spark for all data manipulation, but also the Lightbend Reactive Platform, to offer unique power to the user.
On the coattails of our recent whitepaper Fast Data: Big Data Evolved, which goes into Spark, Spark Streaming, Akka, Cassandra, Riak, Kafka and Mesos, we wanted to take the opportunity to sit down and get an update from Andy about his efforts on Spark Notebook and the other projects he and the Data Fellas team are working on.
The following content is authored by Andy Petrella.
Undertaking the challenge of data science
Data science was originally focused on producing static products (reports, models, …) based on samples of the data available at the time of analysis. Nowadays, the results need to react to the data flow, requiring new data types. Also, data volumes can be very large.
Take, for example, mapping Twitter activity based on streaming data. In the video below you can see a notebook that consumes the Twitter stream, filters the tweets that carry geospatial information, and plots them on a map that narrows the view to the minimal bounding box enclosing the last batch's tweets. Alongside this display, we also compute the top hashtags mentioned during the last minute.
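The per-batch logic behind that demo can be sketched in plain Scala. The `Tweet` record and helper functions below are simplified, illustrative stand-ins (the real notebook works on a live Spark Streaming feed of tweets), but they show the three steps: keep the geotagged tweets, compute the minimal bounding box, and rank the hashtags.

```scala
// Hypothetical, simplified tweet record; the real notebook consumes
// full tweet objects from the live Twitter stream.
case class Tweet(text: String, geo: Option[(Double, Double)])

// Keep only the tweets carrying geospatial information, as the
// notebook does before plotting them on the map.
def geoTagged(batch: Seq[Tweet]): Seq[Tweet] = batch.filter(_.geo.isDefined)

// Minimal bounding box enclosing the batch's coordinates:
// ((minLat, minLon), (maxLat, maxLon)), or None for an empty batch.
def boundingBox(tweets: Seq[Tweet]): Option[((Double, Double), (Double, Double))] = {
  val coords = tweets.flatMap(_.geo)
  if (coords.isEmpty) None
  else Some(((coords.map(_._1).min, coords.map(_._2).min),
             (coords.map(_._1).max, coords.map(_._2).max)))
}

// Top-N hashtags mentioned in the batch, most frequent first.
def topHashtags(batch: Seq[Tweet], n: Int): Seq[(String, Int)] =
  batch.flatMap(_.text.split("\\s+"))
       .filter(_.startsWith("#"))
       .groupBy(identity)
       .map { case (tag, occurrences) => (tag, occurrences.size) }
       .toSeq
       .sortBy(-_._2)
       .take(n)
```

In the notebook, the same transformations run on each streaming micro-batch, with the map and the hashtag chart re-rendered as each batch arrives.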
This explains the rise of distributed computing and online analysis, the union of which could be thought of as the Reactive Data Science Pipeline. However, such a pipeline requires many skill sets, including data science, operations, software engineering, domain knowledge, and others.
A direct consequence is the fragmentation of skills across team members, leading to longer times from conception to production and duplication of efforts. Is there a solution to all this horrible pain?
A solution to all that horrible pain!
At Data Fellas, we are building Shar3, the ultimate toolkit that aims to build Reactive data science pipelines by reducing the friction between the different building phases.
Shar3 is composed of notable OSS technologies like Apache Avro, Apache Mesos, Apache Cassandra, Apache Spark, Lightbend Reactive Platform (Scala, Akka, Lagom, Play, and Spark), Spark Notebook and more. These components were chosen with a strong focus on scalability and the capacity to reactively adapt to their ever-changing production environments.
Taking advantage of the integrated and interactive Spark Notebook component, Shar3 enables:
- The construction of models on a full dataset, not just subsets
- The generation of deployable products to Mesos clusters
- The creation of Avro and Play/Akka HTTP powered web services that use the resulting dataset
- The generation of ad hoc visualisations thanks to the definition of types using Avro
- The creation of repositories and indexes of the analyses and services
Let's discuss a little about how Lightbend Reactive Platform plays a role here.
How Scala, Akka and Play fits into the picture
Although technologies change over the years, markets evolve, communities split or converge, and requirements shift or adapt, there is at least one thing that we can consider invariant: the drive for productivity.
Most of the tools Shar3 includes and all components developed by Data Fellas are based on the Lightbend Reactive Platform.
First, Scala is a no-brainer technology choice given its rich feature set. This may seem a bold decision when targeting a community fluent in other languages like Python or R, which offer a huge set of libraries to ease the work of data scientists; scikit-learn and the caret package are good examples. However, these libraries cannot be used in a distributed computing framework like Apache Spark, because their implementations can only execute on a single machine. At best they can use only the available cores on one machine.
Also, though it’s true that bindings exist for other languages, data scientists can only use the features that have been proxied into their language. More importantly, these languages cannot really be used within Apache Spark to implement new models adapted to new use cases (or even just to propose a new tree ensemble or add a new layer type to a neural network), because they lack the proper bindings and will suffer poor performance.
Because Scala is the core language of Spark, the Python, Java and R APIs are wrappers created from the Scala one. Some might think that Java would also be a good way to create new features, since Scala and Java are interoperable. But take a minute to think about that: to create your new feature in Java you’ll end up using either the Scala API directly or a Java wrapper around it.
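To make the point concrete, here is a minimal sketch (the names below are illustrative, not from Shar3 or Spark Notebook) of why working in Spark's native language pays off: a model written as an ordinary Scala function can be applied to a local collection for testing and passed unchanged to `RDD.map` or `Dataset.map` on a cluster, with no binding layer in between.

```scala
// A sample with numeric features.
case class Sample(features: Vector[Double])

// A tiny "model": a dot product against learned weights.
// Because this is just a Scala function, Spark can serialize it
// and execute it on every node of the cluster.
def score(weights: Vector[Double])(s: Sample): Double =
  weights.zip(s.features).map { case (w, x) => w * x }.sum

val weights = Vector(0.5, -1.0)
val data    = Seq(Sample(Vector(2.0, 1.0)), Sample(Vector(4.0, 0.0)))

// Locally, for testing:
val local = data.map(score(weights))
// On a cluster, the same function, unchanged:
//   sparkContext.parallelize(data).map(score(weights))
```

A Python or R user would instead be limited to whatever operations have been proxied through the bindings, and any custom logic would cross a serialization boundary on every record.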
We use Play Framework for all our web-oriented development to provide access to processing, visualisations, and interactivity, including over WebSockets. The founders of Data Fellas are among the earliest adopters of Play, which was chosen for two main reasons: support for asynchronous NIO and high development velocity.
We use Akka and Spray (now Akka HTTP), because these components are well-known tools for concurrent programming and data sharing. For instance, Akka Remoting is used for reactive communication between the Spark Notebook processes and the Spark processes. Spray, in turn, can be used to create lightweight, stateless and scalable Avro-based micro services.
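The stateless style such services follow can be sketched without any framework at all. The types below are simplified stand-ins for a Spray / Akka HTTP route, not the real APIs: the handler is a pure function from request to response, which is what makes these services trivially scalable.

```scala
// Simplified request/response model; real services would carry
// Avro-encoded payloads.
case class Request(path: String, body: String)
sealed trait Response
case class Ok(body: String)       extends Response
case class NotFound(path: String) extends Response

// A route is just a partial function, so composing routes is
// ordinary function composition -- the same shape Spray routes have.
type Route = PartialFunction[Request, Response]

val echo: Route    = { case Request("/echo", body) => Ok(body) }
val version: Route = { case Request("/version", _) => Ok("0.6.1") }

// A fallback turns the composed routes into a total handler.
// No state is captured, so any number of instances can serve requests.
def handle(routes: Route)(req: Request): Response =
  routes.applyOrElse(req, (r: Request) => NotFound(r.path))

val service: Request => Response = handle(echo orElse version)
```

Because the handler holds no mutable state, scaling out is just a matter of running more instances behind a load balancer.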
Testing it all out and getting involved in Open Source
You can try Spark Notebook right away, but Shar3 is actively under development, so we'll soon open the early access program for early adopters who need to strengthen their data science production line. Shar3 enables a fully interactive and Reactive execution environment for your Scala code running in a Spark cluster or locally. This lets you produce a ton of charts on your Scala or Spark types (and not only on SQL results or DataFrames). Here are some more interesting things of note:
- A library of widgets to interactively manipulate your code using forms, text boxes, sliders, and more
- Examples walking you through creating machine learning models, building charts, and adding interactivity to your presentations
- Great documentation and a very enthusiastic community helping out on Gitter
- Stability even in multi-user mode thanks to the Lightbend Reactive Platform
- Separation of concerns by creating a new JVM for each notebook, so you can tune your SparkContext as you like
- Export code as Scala
- Easy packaging system thanks to the Spark Notebook Generator
- Lightbend and Data Fellas experts behind you
To get updates, anybody can register for the Shar3 newsletter on our website, http://data-fellas.guru.
Meanwhile, Spark Notebook is already a great success with ever-growing adoption, so grab your own copy at http://spark-notebook.io or run these commands:
```shell
docker pull andypetrella/spark-notebook:0.6.1-scala-2.10.4-spark-1.5.0-hadoop-2.2.0
docker run -p 9000:9000 -p 4040-4045:4040-4045 andypetrella/spark-notebook:0.6.1-scala-2.10.4-spark-1.5.0-hadoop-2.2.0
open http://localhost:9000
```
As with any open source tool, we’d love to see the community become more and more active on the code base, so anybody can grab a feature request on the GitHub page, https://github.com/andypetrella/spark-notebook/, and submit a PR.
Thanks for reading, and feel free to leave comments below :-)
Expert support from Lightbend
Lightbend is here for your Spark production needs. Lightbend, a partner of Databricks, Mesosphere and IBM, provides developer support, production SLAs and on-site training to ignite your Apache Spark projects:
- Developer support for Spark Core, Spark SQL & Spark Streaming
- Deployment to Standalone, EC2 and Mesos clusters
- Expert support from dedicated Spark team
- Unlimited dev incident reports
- Optional 10-day “getting started” services package