The explosive interest in building and deploying message-driven, elastic, resilient and responsive Reactive applications in enterprises continues to drive the need for real-time data streaming and instant decision-making. We see tools like Apache Spark, Cassandra, Riak, Kafka, Akka and Slick embracing this trend already. Additionally, the arrival of Reactive Streams 1.0.0 is helping to pave the way for a new generation of somewhat alarming tools in the fields of Machine Learning and Deep Learning, which some believe may presage the SkyNet takeover of Earth...
Nonetheless, there are two emerging projects in our ecosystem–Deeplearning4J and BIDData–that will get architects and developers passionate about data-centric computing. Let's quickly go over what these upcoming areas are in plain language.
First, let’s talk about Machine Learning, and why we should care. Machine Learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence. It explores the construction and study of algorithms that can learn from data and make predictions based on it. That’s right. Think Data from Star Trek, but without the humor (or nearly-indestructible android body).
The alarming part is that you’re basically teaching computers to act without being explicitly programmed. In the past decade, Machine Learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. It’s so pervasive that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.
Deep Learning is a cutting-edge branch of Machine Learning, which deals with a class of algorithms based on neural networks of a certain shape (i.e. deep networks). Its main goal is to help Machine Learning move ever closer to AI. Deep Learning is forward-thinking, and Facebook, Google, Baidu, and Netflix are heavily invested. A small selection of recent breakthroughs using Deep Learning:
One prominent feature to note is that training those systems requires a lot of fairly specialised computing power. One of the key components is the Graphics Processing Unit (GPU), which has been showing robust performance for this type of processing, and could serve as an alternative to CPUs in the future. The whole technology is new, but it is already the source of revolutionary achievements for large segments of Machine Learning problems.
If you’re interested in those new use-cases and algorithms as a developer, the following technologies aim to get you started on it.
Deeplearning4J (DL4J), as the name implies, is Deep Learning on the JVM. As a matter of fact, it’s the only project offering a commercial-grade, open-source, distributed deep-learning library for Java and Scala. DL4J is directed at developers and businesses rather than researchers, so it's integrated with Hadoop and Spark.
Its competitors are PyLearn (Python), Caffe (C++) and Torch (Lua). Just to give it a little more credibility, the main authors, Chris Nicholson & Adam Gibson, are huge Scala fans who are excited about OSS and have built a library of Deep Learning software tools that is freely available to everyone. The slides above are from Adam's talk at Scala Days 2015 Amsterdam.
A project that takes a different direction is BIDData. It is one of a family of projects, alongside BIDMat, BIDMach and Kylix, developed in John Canny's lab at Berkeley. Instead of being a complete library, it is more a set of general-purpose libraries for numerical and Machine Learning algorithms that focus on how best to implement them on GPUs.
The goal is to expose a Scala interface to the GPU, which lets you code for it in Scala without having to deal with how this hardware works. This library is the top performer, bar none, in many contexts: just check out these benchmarks. It beats the performance of both Spark and other competitors, even in the most disadvantageous setups. The results have been covered by NVIDIA's dev blog and elsewhere, but the gist of it is that the author computes the best possible runtimes (in this case, processing the most data per second) and determines the shape of his implementation from that, something called a roofline design. This is an approach shared with the next best competitor, called Vowpal Wabbit (Yahoo Research, then Microsoft Research). Yet BIDMach is still faster.
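To make the roofline idea a little more concrete, here is a minimal sketch of the standard roofline bound: attainable throughput is capped either by the hardware's peak compute rate or by memory bandwidth multiplied by the kernel's arithmetic intensity (operations per byte moved), whichever is lower. The function name and the hardware figures are illustrative assumptions, not numbers from the BIDMach benchmarks.

```python
def roofline_bound(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Upper bound on attainable FLOP/s under the roofline model.

    peak_flops           -- peak compute rate of the device, in FLOP/s
    mem_bandwidth        -- peak memory bandwidth, in bytes/s
    arithmetic_intensity -- FLOPs performed per byte moved by the kernel
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


# Hypothetical device: 4 TFLOP/s peak compute, 200 GB/s memory bandwidth.
PEAK = 4e12
BANDWIDTH = 2e11

# A kernel doing little work per byte is limited by memory bandwidth...
memory_bound = roofline_bound(PEAK, BANDWIDTH, 0.25)

# ...while one doing a lot of work per byte is limited by peak compute.
compute_bound = roofline_bound(PEAK, BANDWIDTH, 100.0)

print(memory_bound, compute_bound)
```

Knowing which side of the roofline a given algorithm sits on tells you whether to spend effort on reducing data movement or on squeezing more arithmetic out of the hardware.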
But parts of the BIDData project go beyond 'running super fast on GPUs'. The concept of Butterfly Mixing, and the team's implementation of it, called Kylix, also solves a fundamental problem with big data engines such as Hadoop or Spark. The communication patterns of those platforms make all machines on a cluster want to use the network at the same time, and therefore overwhelm it. Kylix achieves a better runtime by making those machines alternate, in lockstep, between computing and transferring data over the network.
John Canny, perhaps unsurprisingly, is also a huge Scala fan, and has written much of BIDData in Scala. He’s done amazing work with a small team, including just a few PhD students, one of whom happens to be hearing impaired. You can see John’s recorded talk, Machine Learning at the Limit, given at a Machine Learning Meetup in San Francisco.
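As a rough illustration of the butterfly communication pattern, the sketch below simulates a generic butterfly all-reduce on a single machine: in each round every node exchanges its partial sum with exactly one partner (a transfer phase) and then adds what it received (a compute phase), so after log2(n) rounds every node holds the global sum. This is a textbook simplification under the assumption of a power-of-two node count. It is not Kylix's actual implementation, and the function name is hypothetical.

```python
def butterfly_allreduce(values):
    """Simulate a butterfly all-reduce (sum) over len(values) nodes.

    values[i] is the number initially held by node i. The node count
    must be a power of two. Returns the list of values held by each
    node after the final round.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of nodes must be a power of two")

    current = list(values)
    stride = 1
    while stride < n:
        # Transfer phase: node i receives the partial sum of node i XOR stride.
        received = [current[i ^ stride] for i in range(n)]
        # Compute phase: every node adds the received value to its own.
        current = [current[i] + received[i] for i in range(n)]
        stride *= 2
    return current


# Four nodes holding 1, 2, 3 and 4 all end up with the total.
print(butterfly_allreduce([1, 2, 3, 4]))
```

Each node talks to only one partner per round, so the network never has to carry everyone's traffic to everyone else at once.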
The new discoveries in information through Machine Learning and Deep Learning are considered a key differentiator for data-driven organizations. It’s no surprise that Scala is a great fit for this new field in computing because of its functional capabilities and straightforward programming paradigms, which empower the creators of great frameworks and libraries to help developers use their machine learning algorithms more easily. The above frameworks are just two highlights of a growing number of hot technologies which are built with Scala, so go dig around and discover even more.
But if you want to learn more about these technologies from the experts, check out some of our training courses. Typesafe has the technology, expertise and support for working with technologies that support your Big [Fast] Data requirements, like Spark, Cassandra, Riak, Kafka, Akka, Play, Scala and Mesos. To resiliently and elastically support distributed, streaming data applications is one of the biggest advantages of the Typesafe Reactive Platform technologies, not to mention our recently announced developer support for Apache Spark standalone, or on Apache Mesos / Mesosphere DCOS. So, if you ever need help with anything in the Scala universe or you're looking around for Scala, Spark, Slick, Akka or Play Framework training, we've got you covered.