Wednesday, January 3, 2018

2018, Spark 2.2 and Scala 2.12

Here we are already in 2018. I haven't written much here lately but I have plenty of ideas. I just need time to write them up.

My new BigData/Scala class will begin next week and I have been thinking about what exactly it is about Scala that made it the choice for implementing Spark in the first place. Indeed, I've developed a new module for introducing Scala in just that way: if you were trying to implement Spark, what features would you require in order to do it? Here, for example, is a very promising lode of insights, starting with an answer from Matei Zaharia, the originator of Spark himself. But I was disappointed: there was nothing there that really said what Scala brought to the table, other than scalability.

In a sense, I've built my very own map-reduce-based, cluster-computing platform called Majabigwaduce. I thought it needed a less catchy name than Spark to get adopted. Just kidding! No, in truth I simply built an Akka-based map-reduce framework for pedagogical purposes. However, it essentially shares the same requirements as Spark, so it's a good starting point for thinking about this question.

Here's the list that I came up with:
  • lazy evaluation 
  • functional composition 
  • closures 
  • serializable functions 
  • JVM
Let me briefly describe these. Lazy evaluation is important because, when you have a huge potential overhead for starting up a job on a cluster, you need to ensure that you're not doing that work eagerly. You want to be lazy: put off any actual work as long as possible, until the last moment, then do it all at once.
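To make this concrete, here is a minimal sketch in plain Scala (not Spark code): a view records a transformation without executing it, and nothing happens until a terminal operation demands the result.

```scala
object LazyDemo extends App {
  // A transformation on a view is merely recorded, not executed...
  val pending = (1 to 5).view.map { i =>
    println(s"processing $i") // side effect so we can see when work actually happens
    i * i
  }
  println("nothing has been processed yet")
  // ...until something forces evaluation.
  println(pending.toList) // only now do the "processing ..." lines appear
}
```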

Another closely-related feature is functional composition. If you can compose two functions into one function, you have the potential of avoiding some overhead. For example, you have a million objects to process. You apply function f to each object and then you similarly apply function g. This requires two traversals of the million objects. But suppose that you can create a function h which has the same effect as applying f followed by g. Then you only require one traversal. This may not sound like much but, in practice, the full benefit of lazy evaluation would be impossible without functional composition.
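Here is a sketch of that idea in plain Scala (f and g are just illustrative functions): andThen composes them into a single function h, so the collection is walked only once.

```scala
object ComposeDemo extends App {
  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2
  val xs = (1 to 1000000).toList

  // Two traversals: apply f to every element, then g to every element.
  val twoPasses = xs.map(f).map(g)

  // One traversal: h has the same effect as f followed by g.
  val h: Int => Int = f andThen g
  val onePass = xs.map(h)

  assert(twoPasses == onePass)
}
```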

Closures are functions which are partially applied such that only the expected input parameters are unbound. Strictly speaking, closures aren't absolutely necessary because programmers can always capture the other variables explicitly, or add them as input parameters. But they make programming so much more logical and elegant that they are almost essential.
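A minimal sketch in plain Scala (the variable factor is just an illustration): the closure captures factor from its environment, leaving only x unbound, whereas the explicit alternative has to thread factor through as an extra parameter.

```scala
object ClosureDemo extends App {
  val factor = 3 // a free variable, captured by the closure below

  // scale closes over factor: only x remains unbound.
  val scale: Int => Int = x => x * factor

  // The explicit alternative: pass factor as an input parameter every time.
  def scaleExplicit(x: Int, factor: Int): Int = x * factor

  println(List(1, 2, 3).map(scale))                    // List(3, 6, 9)
  println(List(1, 2, 3).map(scaleExplicit(_, factor))) // same result, more ceremony
}
```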

Serializable functions: we need to be able to send "expressions" as objects over the net to the executors. This requires, at a minimum, the ability to consider said expression as an object, e.g. as a function. This further requires that a function can be represented as a serializable object. In Scala 2.11 and before, that serializable object is actually a Java Class<?> object. See my own question and answer on this on StackOverflow.
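Here is a minimal sketch of what that means in plain Scala (no Spark): round-tripping a function value through Java serialization, which is essentially what has to happen before an executor can receive the work. If the function captured something non-serializable, writeObject would fail, which is exactly the kind of problem Spark's closure cleaning is meant to address.

```scala
import java.io._

object SerializeFunctionDemo extends App {
  // A function value; the compiler generates a serializable object to back it.
  val f: Int => Int = _ * 2

  // Serialize the function to bytes, as if sending it to a remote executor...
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(f)
  out.close()

  // ...then deserialize it "on the other side" and apply it.
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  val g = in.readObject().asInstanceOf[Int => Int]
  println(g(21)) // 42
}
```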

The JVM: this isn't 100% required but it is the ecosystem where the vast majority of big data work is carried out. So it would likely be a huge tactical error to create a cluster computing platform that didn't run on the JVM. Nobody would be interested in it.

At the time work on Spark began in 2009, Java was not even close to having any of these required features (other than running on the JVM). Scala had them all, so it was the natural choice with which to implement Spark at the time. Java 8 (released in 2014) and later versions have the third and fourth items on the list, but I don't think Java can yet provide solutions for the first two. I could be wrong about that.

If you delved into my StackOverflow Q+A referenced above, you might have noticed one little fly in the ointment. Spark 2.2 cannot currently run with Scala 2.12. Ironically, this is because Scala 2.12 chose to reimplement its lambdas (anonymous functions) as Java 8 lambdas, which you would think would be a good idea. But, apparently, some of the "closure cleaning" functionality that is necessary to ensure a lambda is serializable is no longer available with that implementation, so we are waiting for a fix. Spark has historically been very slow to adopt new versions of Scala, but this one is particularly irritating. It certainly means that I have to teach my Scala class that yes, they should get used to using Scala 2.12 for all of the assignments, but when we get to the Spark assignment, they will have to explicitly require Scala 2.11 (see the build sketch below). I look forward to a fix coming soon.
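For what it's worth, here is the kind of build.sbt I have in mind for that Spark assignment: a minimal sketch, assuming Spark 2.2.x and the current Scala 2.11 patch release (the project name and exact version numbers are illustrative).

```scala
// build.sbt -- pin the Spark assignment to Scala 2.11 until Spark supports 2.12
name := "spark-assignment"

scalaVersion := "2.11.12" // the other assignments can stay on 2.12

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.1",
  "org.apache.spark" %% "spark-sql"  % "2.2.1"
)
```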