
How Lightbend Got Apache Spark To Work With Scala 2.12

Stavros Kontopoulos Senior Engineer, Lightbend, Inc.

Along with technologies like Akka, Lagom, Play, and Apache Kafka, Apache Spark is primarily written in Scala, and all Spark APIs are supported first in Scala. Although Scala 2.12.0 was released almost two years ago, upgrading Spark to use it was not straightforward, because Spark internals depend on details of how Scala translates language constructs to JVM bytecode. The major issues are summarized in the umbrella JIRA ticket SPARK-14220.

The Hard Parts

One of the most pressing challenges was to make Spark's ClosureCleaner work in Scala 2.12. ClosureCleaner is a module that cleans out some of the unnecessary outer variables that Scala closures may have captured, which makes serializing the closures more efficient. Additionally, some of these variables may be non-serializable, causing the serialization process to fail.
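The failure mode described above can be reproduced with plain Java serialization. The following is a minimal sketch, not Spark's ClosureCleaner: the names `ClosureCaptureDemo`, `Job`, and `isSerializable` are hypothetical and exist only for illustration. It assumes Scala 2.12 or later, where function literals are serializable by default.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Hypothetical helper: true if plain Java serialization of `obj` succeeds.
  def isSerializable(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  // Hypothetical class that is deliberately NOT Serializable.
  class Job(val factor: Int) {
    // Reads the field `factor`, so the closure captures `this` (the whole Job).
    def scaleCapturingThis: Int => Int = x => x * factor

    // Copies the field into a local val first, so only an Int is captured.
    def scaleCapturingLocal: Int => Int = {
      val f = factor
      x => x * f
    }
  }
}
```

Serializing the first closure drags the enclosing `Job` instance along and fails, while the second captures only an `Int` and succeeds. Copying fields into local values is a common manual workaround for this kind of capture.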

The main difficulty of updating ClosureCleaner to work with Scala 2.12 was the fact that closures in Scala 2.12 are now implemented using native lambdas that were introduced in Java 8. So, the new ClosureCleaner has to implement all of the steps of identifying closures, getting the list of captured variables and pinpointing which of them need to be cleaned based on the new 2.12 implementation.
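One way to identify a lambda-based closure and list its captured variables is to go through the JDK's serialization proxy, `java.lang.invoke.SerializedLambda`. The sketch below illustrates that idea under stated assumptions: `LambdaInspector`, `serializedLambda`, `capturedArgs`, and `makeLabeler` are hypothetical names, and the code is not taken from Spark's implementation. It relies on the fact that serializable lambda classes carry a private `writeReplace` method that returns a `SerializedLambda`.

```scala
import java.lang.invoke.SerializedLambda

object LambdaInspector {
  // Returns the serialization proxy if `closure` is a serializable lambda,
  // or None for ordinary objects that have no such writeReplace method.
  def serializedLambda(closure: AnyRef): Option[SerializedLambda] =
    try {
      val writeReplace = closure.getClass.getDeclaredMethod("writeReplace")
      writeReplace.setAccessible(true)
      writeReplace.invoke(closure) match {
        case sl: SerializedLambda => Some(sl)
        case _                    => None
      }
    } catch {
      case _: NoSuchMethodException => None
    }

  // The values the lambda captured from its defining scope, in order.
  def capturedArgs(sl: SerializedLambda): Seq[AnyRef] =
    (0 until sl.getCapturedArgCount).map(i => sl.getCapturedArg(i))

  // Hypothetical closure factory: the returned lambda captures `prefix`.
  def makeLabeler(prefix: String): Int => String = n => prefix + n
}
```

For example, `LambdaInspector.serializedLambda(LambdaInspector.makeLabeler("id-")).map(LambdaInspector.capturedArgs)` exposes the captured `prefix` value. A cleaner would then have to decide which of the captured references are actually needed by the lambda body.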

This long-anticipated upgrade had been discussed in many user groups (here, and here). Its absence blocked many projects that used Spark from adopting Scala 2.12. Lightbend took the lead in resolving the remaining issues to make Spark work with Scala 2.12, in collaboration with Apache Spark committer Sean Owen and other contributors. Besides updating the existing closure cleaner functionality, it was also necessary to handle a parsing ambiguity introduced by Java 8 lambdas. The details of the work can be found in this design document.

The Benefits And Capabilities Of Scala 2.12 For Spark

Scala 2.11 generated an anonymous inner class for every closure, which was considered quite space-inefficient. In comparison, Scala 2.12 generates native Java 8 lambdas for closures in the code. As the Scala 2.12 release notes mention:

Scala 2.12 emits bytecode for functions in the same style as Java 8, whether they target a FunctionN class from the standard library or a user-defined Single Abstract Method (SAM) type.

For each lambda the compiler generates a method containing the lambda body, and emits an invokedynamic that will spin up a lightweight class for this closure using the JDK’s LambdaMetaFactory.

So the main advantage over the Scala 2.11 approach is that 2.12 does not need to generate an anonymous class in most cases. There are some corner cases, though, where 2.12 still needs to generate one.
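The difference between the two encodings can be observed at runtime. The sketch below is only an illustration: `ClosureShapes` and `looksLikeIndyLambda` are hypothetical names, and the check relies on how HotSpot happens to name classes spun up by LambdaMetafactory (the name contains `$$Lambda`), which is an observation rather than a documented contract. The explicit `new Function1` is used here simply to force an anonymous class; it is not one of the compiler's corner cases.

```scala
object ClosureShapes {
  // A function literal: on Scala 2.12+ this is emitted as invokedynamic
  // and the class is spun up at runtime by LambdaMetafactory.
  val lambda: Int => Int = x => x + 1

  // An explicit anonymous class: the compiler generates a class file for it,
  // which is how Scala 2.11 encoded every closure.
  val anon: Int => Int = new Function1[Int, Int] {
    def apply(x: Int): Int = x + 1
  }

  // Heuristic only (assumption): runtime lambda classes on HotSpot
  // have "$$Lambda" in their name.
  def looksLikeIndyLambda(f: AnyRef): Boolean =
    f.getClass.getName.contains("$$Lambda")
}
```

Both values behave identically when applied; only their runtime class differs, which is exactly the detail that code inspecting closures has to account for.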

With Spark supporting Scala 2.12, it’s now easier to use other libraries that support Scala 2.12, such as Kafka, machine learning libraries like Breeze, etc.

Above all, that means that people can use Scala 2.12 across all their projects.

The Role Of Lightbend's Engineering Team

Several teams within Lightbend contributed to this effort. Stavros Kontopoulos and Debasish Ghosh from the Fast Data Team collaborated with the core Scala compiler team, in particular Lukas Rytz, Adriaan Moors, and Jason Zaugg, to design and implement the update.

Early discussions with Scala creator and Lightbend co-founder Martin Odersky also helped stimulate ideas. This detailed design document formalized the details. It was reviewed both internally within Lightbend and with the core Spark committer team. Finally, the update was implemented and merged into Spark’s master branch on August 2nd. At the same time, we also pinged the Flink community, which had borrowed the closure cleaner functionality from Spark, to upgrade their project, and as a result Flink 1.7 has come out with Scala 2.12 support.

What’s Coming In 2019 (And Beyond)

At Lightbend, we have always worked to benefit the community of Scala users and to support major projects that use Scala. This time we focused on moving forward one of the most popular Apache projects, Spark, by bringing it to the exciting world of Scala 2.12, and we will continue working hard to deal with future challenges that may appear with newer versions of Scala.

The latest Spark 2.4 release supports Scala 2.12 in experimental mode. Users can download the related build from here. Scala 2.12 has become the default Scala version for the upcoming Spark 3.0.0 release, and support for 2.11 is being removed, as there is ongoing work to support Java 11 at the same time. For the latter, check the discussion here.

Author

Stavros Kontopoulos

Senior Engineer, Lightbend, Inc.

Twitter: @s_kontopoulos GitHub: skonto

Stavros is a senior engineer on the fast data systems team at Lightbend, where he helps with the implementation of Lightbend's fast data strategy. He has worked for several years building software solutions that scale in different verticals such as telecoms and marketing. His interests include distributed system design, streaming technologies, and NoSQL databases.