
Rocking out at Datapalooza with Cake: Interview with Jan Machacek

If there’s one thing that’s clear this fall, it’s that conference season is in full swing. With Typesafers attending JavaOne in San Francisco, Spark Summit in Amsterdam, W-JAX in Munich, Devoxx in Belgium, YOW in Australia, Gartner Summit in Las Vegas, Scala eXchange in London, and more popping up on our calendars by the minute, our dance card is pretty full. That said, sometimes an event comes up that catches your eye, and you simply have to make time for it.

One example is next week's Datapalooza (amazing name), hosted by IBM’s Spark Technology Center. The three-day event is a deep dive with industry leaders in building data products, from AMPLab, Galvanize, Typesafe, Cake Solutions, Silicon Valley Data Science, IBM Watson, Spare5, Declara and others. The organizers encourage participants to take their data skills to the next level with hands-on experience and one-on-one coaching, building a data product in only three days.

Typesafe’s Dean Wampler will be co-presenting with our friends at Cake Solutions on Muvr, a demonstration application built with Spark, Cassandra, and Akka. It uses wearable devices (Apple Watch), in combination with a mobile app, to submit physical (e.g. accelerometer, compass) and biological (e.g. heart rate) information to a CQRS/ES cluster for analysis.

We sat down with Jan Machacek, CTO of Cake Solutions, to learn more about Muvr, stream processing, and what attendees can expect to learn at Datapalooza.

TS: Hi Jan! Can you tell us a little more about the origins of Muvr and how it came to be?

JM: Muvr came into existence almost exactly a year ago. I started it as a demo for my Scala eXchange talk in London. I wanted to explore distributed state (using Akka cluster sharding) and distributed computation (using Apache Spark). At the same time, I wanted to explore the IoT / wearables space and implement an entire CQRS/ES system.

TS: Why is stream processing important for wearable devices?

JM: In the early versions of Muvr, all sensor data was streamed from the mobile to the server without any processing. This allowed us to perform all the computation on the server while incurring the smallest possible latency penalty. Unfortunately, Muvr needs even faster processing, and its users are sometimes completely cut off from Internet access. And so, we had to move all the stream processing to the mobile phone. Even though the as-it-happens analytics code is no longer running on the server, its streaming nature remains: the only difference is that it is now implemented in Swift and GCD.

TS: Why does Muvr require services that are Reactive?

JM: It is important to look at all of Muvr’s components: the streaming code that performs the classification on the mobile; the code that deals with non-blocking (and background) networking; the server components, which keep all the confidential data; and finally the machine learning pipeline, which constructs new models for the mobile code to use.

Even though each component in Muvr’s architecture runs on a different computer, on a different network, it is important for each of the components to be resilient, responsive, and as loosely-coupled as the underlying hardware limitations allow.

TS: What are you planning on demonstrating at Datapalooza? What can attendees hope to learn during this 3-part session?

JM: We’ll start with a talk that gives an overview of all of Muvr’s components: the code that runs on the watch and the mobile, the Akka cluster, and finally the ML code, both standalone and running at scale in Spark.

TS: Will we be able to get our hands dirty with some coding?

JM: There will be plenty of coding!

In the first session, we will implement the forward-propagation portion of a multi-layer perceptron using vectorised functions. This will allow the attendees to explore the details of MLPs, and to appreciate the difference between naive implementations and the highly optimised ones in vecLib & BLAS. We’ll end the first session with working iOS and watchOS apps! (Don’t forget to download Xcode and CocoaPods!)
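For readers who have not met forward propagation before, here is a minimal sketch of the naive kind of implementation mentioned above, written in plain Python. It is an illustration only, not Muvr's code: the sigmoid activation, the 3-2-1 layer sizes, and every weight and input value are assumptions chosen for the example.

```python
import math


def sigmoid(x):
    """Logistic activation (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, inputs):
    """Naive forward propagation through a multi-layer perceptron.

    `layers` is a list of (weights, biases) pairs, where weights[j][i]
    connects input i to neuron j of that layer.
    """
    activations = inputs
    for weights, biases in layers:
        # Each neuron: weighted sum of the previous layer plus a bias,
        # passed through the activation function.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


# Hypothetical 3-2-1 network: three sensor readings in, one score out.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0]], [0.0]),
]
output = forward(layers, [0.1, 0.9, -0.3])
```

A vectorised version would replace the inner weighted sums with a single matrix-vector product per layer, which is the kind of operation that BLAS-style libraries optimise.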

Now that we have the apps running, we’ll move on to the next session, where we will learn how to train these models. We will end up with Python code that “discovers” the biases and weights for the MLP using a large set of labelled training data. (Be sure to have a UN*X-like system that can run shell scripts and Python 2.7; we’ll take care of all other dependencies.)
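To give a feel for what "discovering" weights and biases means, the sketch below trains a tiny one-hidden-layer MLP with per-sample gradient descent and backpropagation. It is a toy under stated assumptions, not the session's training code: the labelled data set (a logical OR), the hidden layer size, the learning rate, the epoch count, and the squared-error loss are all hypothetical choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(params, x):
    """Return (hidden activations, output) for one input vector."""
    w1, b1, w2, b2 = params
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    out = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, out


def loss(params, data):
    """Mean squared error over the labelled data set."""
    return sum((predict(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def train(data, hidden_size=3, epochs=2000, rate=0.5, seed=0):
    """Fit weights and biases by per-sample gradient descent."""
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden_size)]
    b1 = [0.0] * hidden_size
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in data:
            hidden, out = predict((w1, b1, w2, b2), x)
            # Error signals for 0.5 * (out - y)^2 with sigmoid units.
            d_out = (out - y) * out * (1 - out)
            d_hidden = [d_out * w2[j] * hidden[j] * (1 - hidden[j])
                        for j in range(hidden_size)]
            for j in range(hidden_size):
                w2[j] -= rate * d_out * hidden[j]
                b1[j] -= rate * d_hidden[j]
                for i in range(n_in):
                    w1[j][i] -= rate * d_hidden[j] * x[i]
            b2 -= rate * d_out
    return w1, b1, w2, b2


# Hypothetical labelled training data: the logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
trained = train(data)
```

Calling `train(data, epochs=0)` returns the untrained starting parameters, so comparing `loss` before and after training shows how much the fit has improved.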

Having an MLP running on one’s laptop is great fun, but it is nowhere near what is needed for a serious, large-scale system. Running the training for one user & one muscle group takes approximately 30 seconds on a powerful computer. Run one after another, that is 300 seconds for 10 users, 3000 seconds for 100 users, and so on. In the last session, we will trivially parallelise the training program, using Spark to distribute the computation. (Don’t leave your JDK 1.8, shell and Internet connection at home!)
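The arithmetic behind those figures, and the best case that distributing the work could achieve, can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: it assumes each per-user training job is fully independent and ignores scheduling, data transfer and any other cluster overhead, and the worker counts are hypothetical.

```python
import math

SECONDS_PER_JOB = 30  # quoted figure: one user, one muscle group


def sequential_seconds(users):
    """Total time when the training jobs run one after another."""
    return users * SECONDS_PER_JOB


def parallel_seconds(users, workers):
    """Idealised time when jobs are spread evenly across workers.

    Assumes independent jobs and zero overhead, so this is a lower bound.
    """
    return math.ceil(users / workers) * SECONDS_PER_JOB


print(sequential_seconds(100))    # 3000
print(parallel_seconds(100, 10))  # 300
print(parallel_seconds(100, 8))   # 390
```

Because the jobs do not depend on each other, the estimate scales almost linearly with the number of workers, which is why this kind of workload is often called trivially parallel.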

TS: Anything else we should know before signing up?

JM: We have worked very hard to design the exercises so that they demonstrate all aspects of a large-scale IoT / wearables system. You will have the opportunity to explore each layer in detail. The gurus from the BigData, ML and IoT team from Cake Solutions, as well as Dr. Wampler from Typesafe, will be there to help you get a truly complete understanding of these complex systems.

And you’ll have a working system, including a watch app, a mobile app and a big data machine learning system, on your computers. In three hours!

--

Thanks! If you’re interested in attending Datapalooza, you can buy tickets here using the promo code PartnerDisc for a 50% discount.
