Kalix: Tackling the Cloud to Edge Continuum - with Jones Bonér - Watch On-demand

Using Apache Spark with Intel BigDL on Mesosphere DC/OS

Deep learning is becoming more and more pervasive as a machine learning technique across various domains like healthcare, transportation, communications, manufacturing and many other areas. As part of our work on Lightbend Fast Data Platform, we have been exploring various deep learning libraries. Some of the evaluation criteria that we have used to judge libraries for machine learning in general include the following:

  • Maturity
  • Usability including qualities of APIs offered
  • Integration with popular frameworks for distributed processing
  • Performance
  • Open Source
  • Model interoperability with other existing deep learning frameworks

BigDL is a distributed deep learning library from Intel released and open-sourced in 2016. Besides offering most of the popular neural net topologies out of the box, BigDL boasts of extremely high performance through its usage of Intel MKL library for numerical computation. BigDL supports import/export of networks pre-trained in TensorFlow, Caffe or Torch and there are plans to include interoperability with other libraries in the market. However, the most important feature that made us look into BigDL is its native integration with Apache Spark that enables users to leverage the distributed processing capabilities of Spark infrastructure for training and model serving. For enterprises already using Spark, BigDL works just like any other Spark library that supplements the machine learning capabilities offered by Spark ML.

Lightbend Fast Data Platform (FDP) uses Mesosphere DC/OS as the cluster manager. FDP aims to make it easier for customers to productize the use of BigDL through its offering by integrating Spark as part of the platform. Machine learning applications developed using BigDL and Spark can also take advantage of the best-in-class streaming engines, the Lightbend Reactive Platform and messaging technologies like Kafka that form the complete suite of FDP. In this blogpost, Lightbend’s Fast Data Platform team and Intel’s BigDL team collaborate to describe the experience of implementing and deploying deep learning models on BigDL using Spark on Mesosphere DC/OS.

Mesosphere DC/OS - An Introduction

Mesosphere DC/OS is a distributed operating system based on the Apache Mesos distributed systems kernel. It gives an abstraction of a datacenter to the user, hence it is also called a datacenter operating system. The complete distribution of DC/OS includes a distributed systems kernel, a cluster manager, a container platform and an operating system. As a datacenter operating system, DC/OS offers automation of the following tasks:

  • Resource management
  • Process scheduling
  • Inter process communication
  • Installation and management of distributed services

DC/OS architecture provides a layer of abstractions of the core cloud infrastructure or the “bare-metal”. The platform layer offers the core datacenter operating system support along with container and cluster management services. The software layer provides installation, configuration and management of services like databases (e.g. Cassandra), message queues (e.g. Kafka), distributed processors (e.g. Spark and Flink), notebooks (e.g. Zeppelin) and many others. Users can also install their own services and have scheduling, configuration and management done by DC/OS.

Spark and BigDL on Mesosphere DC/OS

Given the capabilities of Mesos and DC/OS, it’s no surprise that deploying Spark applications on it enjoys some intrinsic advantages with respect to manageability and scalability. DC/OS adds operational simplicity on top of Mesos kernel, which makes managing distributed platforms like Spark and Apache Flink much easier from the operational point of view. You get isolation amongst the various versions of your Spark applications, yet enjoying the benefits of the complete shared infrastructure that DC/OS offers. In summary, there are a lot of potential advantages to running distributed Spark-based applications on your Mesos DC/OS platform.

BigDL is a deep learning library for Spark and hence running BigDL-based applications on DC/OS gives you all the advantages that you get with native Spark applications. BigDL offers rich support for deep learning, can load/save models from frameworks like TensorFlow, Torch and Theano and is high performant via its usage of Intel MKL and multi-threaded programming in each Spark task. DC/OS complements these advantages with easier manageability and optimization of resources through a truly shared infrastructure. Just like any other Spark application, you can specify up-front your application’s potential resource usage and thereby achieve optimization of total cluster resources.

Deploying BigDL Applications on DC/OS

Once you have the Spark service running on Mesos DC/OS, deploying a BigDL application is quite simple. As an example, consider the following screenshot of DC/OS UI that shows the Spark service running:

The diagram shows the core Spark service running with a resource usage of 1 CPU and 1GB memory. Now when you deploy Spark / BigDL applications, they will show up as additional drivers in the UI. As an example, the above screenshot shows an additional Spark driver running with 1 CPU and 8GB of memory. The deployment of the BigDL application to a DC/OS cluster is as simple as executing the following command, where $SPARK_APP_JAR_URL refers to the location from which the application jar can be fetched for loading.

dcos spark run \
  --submit-args=” \
	--conf spark.cores.max=2 \
	--conf spark.executorEnv.OMP_NUM_THREADS=1 \
	--conf spark.executorEnv.KMP_BLOCKTIME=0 \
	--conf OMP_WAIT_POLICY=passive \
	--conf DL_ENGINE_TYPE=mklblas \
	--conf spark.executor.memory=8G \
	--driver-memory 4G \
	--class com.lightbend.fdp.sample.bigdl.TrainVGG $SPARK_APP_JAR_URL \
	-f cifar-10-batches-bin -b 16"

Once the application is deployed, you can navigate to the Mesos console and track the state of the running task along with the resources that have been granted to the process by the cluster manager. Here’s a snapshot of how it looks for a BigDL application that trains on the VGG network:

Note how the left panel also shows the other details of the cluster and the overall resource usage by all applications deployed on the cluster. You can navigate through the Sandbox link corresponding to the task and browse through the various log files that your application generates.

Scaling BigDL applications on Mesos

One of the big advantages of running applications in a Spark cluster is the ability of Spark to dynamically add and remove resources from your application. This is done by Spark’s dynamic resource allocation and a BigDL’s deep learning application can enjoy these advantages as well. In addition, you can use Mesos as the cluster manager that integrates seamlessly with Spark’s dynamic resource allocation. Thus, running multiple BigDL applications on a Mesos cluster offers you cluster optimization as well the ability to resize the number of Spark executors based on the statistics of your application. For more details, please have a look at this document on running Spark on Mesos.

BigDL Machine Learning API

BigDL offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and testing machine learning models. As an example, here’s a snippet that prepares a dataset for training in a VGG network[1] passing it through a pre-processing pipeline.

val trainDataSet = DataSet.array(Utils.loadTrain(param.folder), sc) ->
  BytesToBGRImg() -> BGRImgNormalizer(trainMean, trainStd) ->

The pipeline API makes it explicit that it starts with the raw bytes, then goes through the stage that converts raw bytes to BGR image, then the normalization of the image and finally the batch-based processing.

Similarly, using named parameters features of Scala, you can define the SGD process as follows:

new SGD[Float](learningRate = 0.01, learningRateDecay = 0.0,
  weightDecay = 0.0005, momentum = 0.9, dampening = 0.0, nesterov = false,
  learningRateSchedule = SGD.EpochStep(25, 0.5))

This makes the APIs self-descriptive and the resulting code much more readable without losing an iota of type-safety guaranteed by Scala’s type system.

Finally, when you have all datasets prepared, you can define the Optimizer that will train your model using optimization algorithms.

// define the Optimizer
val optimizer = Optimizer(
  model = model,
  dataset = trainDataSet,
  criterion = new ClassNLLCriterion[Float]()

// let it go
  .setValidation(Trigger.everyEpoch, validateSet, 
	 Array(new Top1Accuracy[Float]))

The idiomatic Scala API for training and testing machine learning models proved to be a big plus for BigDL.

Visualization with TensorBoard

The latest versions of BigDL offer integration with TensorBoard, a big step towards enabling the visualization of the learning process of your machine learning classifier. The BigDL Optimizer can be configured to generate summary information during training and validation steps that can be consumed and displayed by TensorBoard. Here is an example snapshot that displays the Loss and Throughput during the training of one of our classifiers.

Using TensorBoard integration, you also get nice visualizations for weights, gradients and bias - here’s a sample snapshot from a classifier based on VGG network:


This blogpost gives an overview of how you can use BigDL on Spark as a platform for machine learning within the Mesos DC/OS cluster manager. BigDL gives you the ability to implement complex neural net topologies for deep learning, while Mesos and DC/OS give you the power to run multiple applications within the confines of an optimally managed cluster. Lightbend Fast Data Platform makes it easy to use BigDL on Spark along with the other streaming engines, messaging technologies and the reactive platform that it offers.

[1] VGG network is a variant of convolutional neural network developed by the Visual Geometry Group of Oxford.


The Total Economic Impact™
Of Lightbend Akka

  • 139% ROI
  • 50% to 75% faster time-to-market
  • 20x increase in developer throughput
  • <6 months Akka pays for itself