Typesafe's AMA Podcast Ep. 06 feat. Ways to Make Spark Easier with Justin Pihony
For this Typesafe Podcast, I sourced a new team expert to teach me about a tool I hadn't grasped yet. I like that Justin got out info both to the data scientists and to Java devs. Spark speaks to both!
Tonya Rae Moore: Hey, gang, and welcome to Typesafe AMA podcast with our experts, number seven (Tonya's note: it’s actually #6. Ooops!). My name is Tonya Rae Moore, and now I am the Community and Field Program Manager with Typesafe. I got a promotion, y'all! I write blogs, I record our webinars, and I get to ask people questions about stuff that I don't know anything about. And today is a case in point. We have a new Typesafe expert, and his name is Justin Pihony. I met him -- Justin, I think we met, what, a couple weeks ago?
Justin Pihony: Yeah, about two weeks ago.
Tonya Rae Moore: So we'd been in some email chains, but it was the first time we got to shake hands. It was a good time. So that's a good way to lead us in to telling me a little bit about your background, and how you got to Typesafe. You're new, right? How long have you been working for us?
Justin Pihony: I am new. I -- let's see, we're in February? So I've been here about two months now.
Tonya Rae Moore: Aw, you're just a baby.
Justin Pihony: Yep. I've been in the Scala ecosphere in my -- more in my spare time for quite some time, but, realistically, I'm coming from a different viewpoint. I'm coming from the .NET world where I've done C# for ten plus years. But a few years back, I had a coworker who pointed Scala at me, I took the Coursera course...and everybody knows Jamie Allen in the Scala world, and he's an old family friend. And I saw him at a conference at CodeMash about three or four years ago, and we got to talking more about Scala, and I dug in there. From then, I started heading up the local Pittsburgh Scala meetup, answering questions on Stack Overflow because, hey, the best way to learn is to answer questions.
Tonya Rae Moore: Whether you know the answer or not.
Justin Pihony: Yeah. Definitely. I mean, I try my best, but, yeah. So, on Stack Overflow, they'll point you out. You get down votes and up votes, so. But if you don't know it, they'll call you out on it. Which is fine. And then it's kind of how I got in to Typesafe, as I am a screencast author through Pluralsight.com. I've got three courses out there now, and the course that really drew me into Typesafe, originally, was my course on Scala, so it's just an introduction to Scala, but I followed it with a blog post that really drew some eyes at the Typesafe office. And, more recently, and kind of to go with what we're talking about, is I released a -- what seems to be a very popular topic lately, and popular course of mine, is a Fundamentals of Spark, which, in fact, I'm going to be doing a part two on soon.
Tonya Rae Moore: That's what I really want to talk to you about, because, I have to be completely honest in my -- 'cause I don't know if you know, I came from the Java ecosystem. So, I've been at Typesafe about seven or eight months now, and I don't know anything about Spark. So, when a couple of people were like, hey, we haven't talked about Spark in a while. I was, like, BANG, let's do it. And you're the guy, right?
Justin Pihony: Sure, yeah. I'm the overall support engineer, but my background is very heavily Spark. Like I said, I have the Pluralsight course, and I've watched -- I like to put it out there that I've watched hundreds of hours of videos to get up to date on Spark. I mean, going back to 2013, so I know it pretty well.
Tonya Rae Moore: I am really pleased with that, because you are going to give me, and a bunch of other people, a lot of new information, this is going to be good. From the depths of my ignorance, can you start off by just giving me a description of what Spark is?
Justin Pihony: Well, I mean, realistically, you can go to the Spark site, and they have the one sentence blurb, and then I can expand on it. And that's "Apache Spark is a fast and general engine for large scale data processing." So, I mean, that's very general, right?
Tonya Rae Moore: Yes.
Justin Pihony: So what it really is, is it's kind of -- so, do you know Hadoop?
Tonya Rae Moore: I do.
Justin Pihony: Basically, it's not a replacement for Hadoop, as most people would put it; it's a replacement for Hadoop MapReduce. MapReduce has been the big data analytics tool for many, many years. Written in Java, and it's just full -- to write a Hadoop job -- MapReduce job is a horrible pain in the butt. It's lots of code, maintaining it, testing it, just everything. It's been out there, and it's been great, it's been better than your just regular command script, and things like that. But, Spark came along and it was written in Scala, so you automatically gain terseness, and expressiveness, from being written in Scala, because that's just the nature of Scala. And it does the same thing that MapReduce does, except it expands on it even more.
MapReduce is pretty much exactly what it says. It's map and reduce. Whereas Spark is -- its core is MapReduce, in a sense. You don't have to be stuck to map and reduce. You can write what looks like Scala code, it's just collections. Their core abstraction is called an RDD, and it's a Resilient Distributed Dataset. So, all that, realistically, is -- and I'm sure I'll catch some flak for this from people who really know what it is.
Tonya Rae Moore: They'll tweet at you, I promise.
Justin Pihony: Yeah, sure. But I like to say an RDD is really just a collection. So if you know the collection interface, then you can work with an RDD. You don't have to really care about if it's distributed or not, at least at the start. You just care, hey, I have a collection of data, and I need to manipulate it. So that's all Spark does. Now it made a splash because it has in memory caching, and that helped speed it up beyond Hadoop. But the core of it is that it's a big data analytics engine. You can split beyond that, split out the different frameworks for streaming and stuff.
Tonya Rae Moore: So, then, do Java developers have problems when they're trying to write Spark code?
Justin Pihony: So, with Spark code, like I said, it is written in Scala, so automatically that means it's Java also. And the Spark community has done a great job of making sure that the API is not too mangled. But, from the use case, Spark is lazy, so that's also kind of a boost on where -- over Hadoop. So, instead of automatically running some map side combiner, or something like that, it'll figure out what your whole process is, and then whenever you finally run an action, where Spark is built up by transformations and actions, whenever you finally run actions and say, hey, I want to actually save my data, or get my data, or whatever, that's when it'll run. So, behind that, it's all functional, right? So it's got laziness, and it's immutable. Any time you do a transformation, you get a new RDD back, the old one remains as it was. So that's kind of a difference in the Java world. Scala, you're used to it, but Java, you're used to mutating your data a little bit more.
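The transformation/action split described here can be sketched in plain Python. This is not the Spark API: `LazyDataset` and its `map`, `filter`, and `collect` methods are hypothetical names chosen for illustration, and everything runs on a single machine. The sketch only shows the two properties mentioned above: transformations are recorded rather than run, and each one returns a new object while leaving the old one untouched.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, single-machine, illustration only)."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: record the step and return a NEW dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: same idea, the receiver is left unchanged.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List:
        # Action: only now is the recorded pipeline actually run.
        items = iter(self._source)
        for kind, fn in self._steps:
            # Inside a method, `map` and `filter` are the Python built-ins.
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []  # records every time the mapping function really executes


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([1, 2, 3, 4])
big = base.map(double).filter(lambda x: x > 4)  # nothing has run yet
calls_before_action = list(calls)               # snapshot: still empty

result = big.collect()                          # the action triggers the work
print(calls_before_action, result, base.collect())
```

The original `base` dataset still yields its original values after the pipeline has run, which is the immutability point: a transformation never edits the dataset it was called on.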
So there's a lot of instances where I've seen Spark users -- Java Spark users, specifically, I've seen them write code where they say, hey, I have this variable, I want to update it and say -- I want to increment the counter. And if they run it locally on their machine, it will work if it's running across one core. But when they distribute it, all of a sudden, that variable is distributed, and it's just a copy. So, it gets to the fact that it's immutable, and the Java developers need to embrace that immutability and use some of the other mechanisms within Spark to make sure that they're not counting variables directly, because it's not going to come back to the main program. It will only be on each worker, and then you're not going to get anything. In fact, I've seen where they say, hey, I'm going to put a count, say, increment it, and they -- well, we're going to put a print line, prove the point that it works. And then whenever they print line on their final output, it's still zero, because it was only on the workers.
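The "counter is still zero" surprise can be reproduced without a cluster. The following is a plain-Python simulation, not Spark code: `CountingTask` and `run_on_workers` are hypothetical names, and `copy.deepcopy` stands in (as an assumption) for the step where the task is serialised and shipped to each worker.

```python
import copy


class CountingTask:
    """A task that tries to count records by mutating its own state."""

    def __init__(self):
        self.counter = 0

    def __call__(self, record):
        self.counter += 1
        return record


def run_on_workers(task, partitions):
    """Simulate distribution: every worker receives its own COPY of the task."""
    worker_counts = []
    for part in partitions:
        worker_task = copy.deepcopy(task)  # stand-in for shipping the closure
        for record in part:
            worker_task(record)
        worker_counts.append(worker_task.counter)
    return worker_counts


task = CountingTask()
partitions = [[1, 2, 3], [4, 5], [6]]

worker_counts = run_on_workers(task, partitions)
driver_count = task.counter   # the driver's copy was never touched
total = sum(worker_counts)    # counts only come back if merged explicitly

print(driver_count, worker_counts, total)
```

Each worker incremented its own copy, so the value on the driver never changed. Getting a total back requires an explicit merge step, which is the role Spark's accumulators play in a real job.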
And then the other thing that -- it's kind of a problem that I've seen in Scala in general, so it's your typical conversation, is there's -- it makes -- heavy use of implicits. So there are certain methods that automatically -- that you automatically get through inference. But, in Java, you have to be explicit about it. So if you look at the code in Scala, you might say, hey, I can use that function, but that function doesn't exist on the object directly, it's only there through implicit. And that's kind of just a Scala/Java divide in general, but as Spark grows in popularity, it's something that becomes a little more evident to those new Java developers coming in to that world.
And the only other thing that I have, is the biggest thing is, so we were talking about map and reduce, so the reduce side is -- you have to go back to your math degrees for this. Not your math degree, but your basic math, where you have to think of associativity and commutativity. Your typical reduce function is run locally, and you just have an accumulator that keeps building up. That's not true in Spark, and that's something that I've seen people get beat up on before, where they say, hey, I'm going to do a reduce, I'm going to say, here's the thing that I'm accumulating, and I just keep incrementing to it. Because Spark's distributed, it eventually merges everything together, and the order of operations is not guaranteed. So you have to, basically, think that the operation that you're running has to be associative. And that's kind of a split -- even in the Scala world you'll run in to that.
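A small plain-Python sketch makes the associativity point concrete. `distributed_reduce` is a hypothetical helper, not a Spark function: it reduces each partition on its own and then merges the partial results, which is roughly the shape of a distributed reduce. The sketch assumes the partials are merged in a fixed left-to-right order; on a real cluster that order is not guaranteed either, which is why Spark's `reduce` is documented as needing an operator that is commutative as well as associative.

```python
from functools import reduce


def distributed_reduce(op, partitions):
    """Reduce each partition separately, then merge the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


data = [1, 2, 3, 4, 5, 6, 7, 8]
split_a = [[1, 2, 3, 4], [5, 6, 7, 8]]       # same data, two partitions
split_b = [[1, 2], [3, 4, 5], [6, 7, 8]]     # same data, three partitions

# Addition is associative: the partitioning does not change the answer.
add_local = reduce(add, data)
add_a = distributed_reduce(add, split_a)
add_b = distributed_reduce(add, split_b)

# Subtraction is not: every partitioning gives a different answer.
sub_local = reduce(sub, data)
sub_a = distributed_reduce(sub, split_a)
sub_b = distributed_reduce(sub, split_b)

print(add_local, add_a, add_b)
print(sub_local, sub_a, sub_b)
```

With addition all three results agree. With subtraction the local run and the two partitioned runs each produce a different number, even though the input data is identical.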
Tonya Rae Moore: Okay. So what I'm hearing you saying is that it's actually pretty easy to get started with Spark, but, eventually, users are going to hit a point where they're going to need help. Like it just stops being intuitive.
Justin Pihony: Definitely.
Tonya Rae Moore: Okay. So, since you've been at Typesafe, which I know is relatively, but, you know, in all of your travels, where do people need the most help? Where do they hit the wall? And what services -- I have to ask -- what services does Typesafe have? What can we do to help?
Justin Pihony: Sure. So, the big thing, like I said, Spark is great to start. It's got an easy API, and they do a fairly decent job of abstracting out that it's a distributed operation. But, that distributed operation eventually leaks through in some parts. So, tuning your code can be something that's difficult. Now, there's automatic tuning in the new DataFrames API, which is basically -- you can write imperative -- I'm sorry, not imperative, declarative code, where you say, hey, I want to do this, and you let the Spark engine use the schema that's there and figure it out. But, if you're not using that, or even in those, sometimes the tuning can be an issue.
Things that I've seen that are out of memory exceptions, especially with regards to pulling your data back in to the main program, which is called the driver. So if you don't have enough memory on your workers, or if you just tried to pull too much data back, you're going to, ultimately, end up with an out of memory error. So that's something that I've seen common.
The other problem that I've seen more typically, is garbage collection. Now, like I said, the DataFrames API, behind that, there's a project called Project Tungsten, where they're trying to improve garbage collection. But, because Spark is functional, and keeps creating objects, there is a possibility of a high garbage collection overhead. And knowing the API can really help that, but, that's something that people struggle with, and that's something that they'll look for some support with. So, like I said, in my short time, like you pointed out, I've actually seen a ramp up in requests for Spark assistance, but I'm not sure if that's just more people getting in to it, or if it's just we have more customers that are starting to run in to more Spark things.
Tonya Rae Moore: I do think that we have had a lot more customers who are turning to Spark, which is one of the reasons why I wanted to get in touch with you. Because it's something that wasn't on my personal map for a long time, but now, all of a sudden, yeah, the requests are coming in, and people are talking about it. There's a lot of discussion around Spark. When Oliver did his survey, Spark kept coming up time after time again.
Justin Pihony: Yeah, definitely. Like I said, my Pluralsight course, I've put out -- I have three out there. My Scala course is doing pretty well, I have a Unit Test course that does pretty well. But, Spark, is in the top 100 pretty consistently. So, I have seen the numbers, and through support, through my course, through everything I've seen, Spark is just booming. It's really impressive to me.
Tonya Rae Moore: Okay. I wanted to ask you a little bit -- we've talked a little bit about Spark and Scala, since Spark is built on Scala, or written in Scala, it's got an easy API, blah, blah, blah, like, I get all that, but, what's the relationship with Spark and Scala compared to Java, and Python, and R? Is Spark driving developers to Scala? Or is it the other way around? How is this all working together?
Justin Pihony: I would say there has been a big influx of Spark driving people to Scala. They may start out writing their code in Java, because, like I said, it's automatically there, but the terseness, and everything that you get from Scala is definitely something that makes things a little bit more maintainable. Now they do have support for Java 8 Lambdas, and that has made a drastic difference in just the lines of code for your Spark Java code. But, I do think that a lot of people are driven to, hey, it's written in Scala, there's -- I know Matei chose -- Matei being the creator of Scala...er the creator of Spark!
Tonya Rae Moore: I almost had to call Martin to tell on you.
Justin Pihony: Yeah, right? So, I mean, there's a quote somewhere out on the internet where he directly said, he was looking for a lot of the things that have driven Scala adoption to begin with. With the immutability, being able to easily pass functions and data, and just work with it so much easier. So, I think the same reason that he chose Scala to build it in, is the same reason that people are drawn to it for using it within Spark.
Tonya Rae Moore: Yeah. Talk to me about the data languages. I want to hear about that, too.
Justin Pihony: Yeah. So, Spark is nice, because it does have this kind of broad API. So you've got Scala, and, automatically, Java, but they spend a lot of time in the other big data languages, being Python and R. It's not automatically up to parity. Python tends to be pretty good. R, I think is a little bit more of a laggard when it comes to parity in the API. But it's only been out there for -- officially out there for about six months or so. But, those languages tend to be -- from what I've seen -- now, I'm sure there's people who would disagree, but from what I've seen, they're used for prototyping more so than going to production. They tend to be the language of data scientists, where they just want to play around with the data, and Python and R being a little bit more dynamic in things, especially Python. They can just write the code they want to write, and create, almost, pseudocode, massage their data, and get it out there. And then they pass it off and it can be hardened.
So, the other thing that is important to point out here, is that we talk -- I said that Spark has driven people to Scala. Spark, at this point, is even creating some differences within Scala itself. There's a recent Martin Odersky talk that he gave at Big Data Scala By the Bay.
Tonya Rae Moore: I was there. I am familiar with it. It was a good talk.
Justin Pihony: Yeah. And, basically, I mean, you heard it, but the -- I only just recently ran in to it, and he basically said he wants to pull in some of the things that, again, it uses implicit, but to automatically figure out, hey, you have data that looks like it can benefit from schema type things, and maybe some laziness, and some other things. And, so, some of that Spark RDD collection is, and how that's written, looks like it's going to start to trickle back in to the Scala world. I'm not sure when, but he's talking about it, and that's an interesting fact right there.
Tonya Rae Moore: Don't you just love working with Martin?
Justin Pihony: Yeah, I mean, I haven't yet, so. I will be soon, I'll be going to our regular engineering meeting at the end of the month, so I'm looking forward to that.
Tonya Rae Moore: So, something that you said earlier, I'm really interested in the big data languages, and how this works. So, can you tell me a little bit about what the difference in the UX is for Spark developers, versus the big data scientists using Spark? What is the difference and what does that look like?
Justin Pihony: Yeah, so, like I said, the big data scientists tend to be people who don't necessarily care about productionizing - creating applications, out of their work. What they want to do is figure out meaningful data from it. So, they want to say, hey, I have all this data, what are some of the things I can get? I want to learn from this data, which is another nice thing about Spark, in that it has multiple different branches for its framework. It's got streaming, it's got the SQL one we talked about, it's got graphing, and the one that I was referencing there, a machine learning library. So, for those data scientists, they don't need to write their own algorithms, if they don't want to. They can use what's already built in there, play with the data, massage it, do whatever they need, and then, what is considered the -- more the big data developer, tends to be the people who take that information and turn it in to applications. Things that can be repeated, or streamed over, and they kind of just harden those algorithms so that it can be rerun over and over, and released to Amazon, and put out in to the cluster, or whatever needs to be done.
There's also a great blog post written by Cloudera called Why Apache Spark is a Crossover Hit Today for Data Scientists. And I think the biggest delineation there is really just one side of it is data scientists playing with the data, and figuring it out, and the developers tending to harden it. And Spark is starting to bridge that gap, and it makes it a little bit more of a fuzzy line.
Tonya Rae Moore: It's fascinating that in such a multipurpose tool, for developers on one end, and scientists on the other, and it can help everybody, on both ends of the spectrum. I learned a lot at this, I'm really impressed, Justin. Thank you.
Justin Pihony: Yeah. No problem. I love teaching, helping people. That's why I'm on support, right?
Tonya Rae Moore: I love it. You are a great asset to our team. Let's do something fun. So, if I'm sitting here at my laptop, which I happen to be, and I was interested in, like, I think this talk is great, I'm so glad Tonya did this podcast with Justin, I want to start with Spark right now. What should I do?
Justin Pihony: The easiest thing is -- well, there's two different ways. If you want to, you could use the Typesafe Activator, and I believe that that will pull -- there are some Spark projects that will pull in Spark for you. But if you just want to start playing with it, the easiest way is to go to spark.apache.org, they have a big download of Spark link, and, really, you just choose the -- once you go there, you choose the type that you want to go against, and I usually just choose the pre-built version that's for the latest version of Hadoop. And from there you just unzip it, put it on your computer, and inside of a bin directory, is something called the Spark shell, and it's just like the Scala REPL. So you can just go right there, you don't even need data, let's say, hey, I want to have a list and you can start playing with it from that REPL. And the other nice thing is that, whenever you spin that REPL up, anytime you spin up what's called a Spark context, it's the driver behind all of the Spark programs, is it builds a Spark UI for you. So you can start up -- you go to the website, you download this code, and pre-built for whatever you have already, being Hadoop, in my case, and spin up the REPL, play with the data, and you can even see the way that it's running by going to your localhost:4040, and there's a really nice UI visualizing your data flow.
Tonya Rae Moore: Are there pitfalls I should look out for? Like, is there anything you know people have problems with trying to ramp up immediately?
Justin Pihony: As far as ramping up, I think there aren't too many pitfalls for the initial getting started.
Tonya Rae Moore: You referenced out of memory and garbage collection earlier, I was listening.
Justin Pihony: Yeah, definitely. No, most of the pitfalls kind of come later. I guess the biggest pitfall is the out of memory one, that's something that you could run in to early. Where, if you connect to a cluster, and you call -- you run a bunch of transformations, like mapping across your data, and then you say, hey, I want to collect, that's going to pull all your data back to your main -- your one machine, and you're going to get an out of memory, most likely. Because it's across the cluster for a reason.
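The difference between pulling everything back and pulling back a small sample can be sketched in plain Python. This is an illustration only, not Spark code: `records` is a hypothetical generator standing in for a dataset too large to fit on one machine, and `take` here is a local helper built on `itertools.islice` (Spark's RDD API has a method of the same name that similarly returns just the first few elements to the driver).

```python
from itertools import islice


def records(n):
    """Hypothetical stand-in for a very large dataset, produced lazily."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def take(dataset, n):
    """Materialise only the first n records instead of the whole dataset."""
    return list(islice(dataset, n))


# Calling list(records(10**9)) would try to hold a billion records in
# memory on one machine, the local analogue of collecting a whole cluster's
# data. Taking a small sample touches only the records actually requested.
sample = take(records(10**9), 3)
print(sample)
```

Inspecting a handful of records is usually enough while exploring; the full result is better written out from the cluster than gathered on the one main machine.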
Tonya Rae Moore: Well, you have given us a complete overview. Pitfalls, products, things that we can do. Data scientists, developers, I think you've hit it all, Justin. I really appreciate that. I got one more question for you.
Justin Pihony: Sure.
Tonya Rae Moore: If that developer, or data scientist out there right now listening to this podcast is, like, $%#!@*! what do I do? Can they email you?
Justin Pihony: They can.
Tonya Rae Moore: What's your email?
Justin Pihony: My email is my name, so it's firstname.lastname@example.org
Tonya Rae Moore: That's fantastic. What's your Twitter?
Justin Pihony: I'm very transparent with my public persona, so even my Twitter ID is @JustinPihony.
Tonya Rae Moore: Oh, wow. So I'm just going to throw "Justin Pihony" out there, and something's going to find you in the ether, and it's going to get back, right?
Justin Pihony: I'm the only Justin Pihony out there, that's an interesting story for another time.
Tonya Rae Moore: And we've got you, I'm so excited! Justin, thank you so much for taking time tonight. You've worked to accommodate my strep throat schedule, and I appreciate it very much. And, I know that you have added knowledge to my world. I hope that you have helped other people out there, too.
Justin Pihony: Yeah, I was glad to help. We're here at Typesafe to help everybody expand on that.
Tonya Rae Moore: Oh, I love it. I love our team. We just want to help, that's all. All right, folks, thanks so much for listening! Justin has already told you how you can get in contact with him. You can always tweet at me @TonyaRae Moore, or tweet @Typesafe, it's really easy. We're all transparent around here. Thanks so much to my guest, Justin Pihony, and we will talk to you next time.