Saturday, June 17, 2017

Another cautionary tale of Spark and serialization

In my earlier post on the subject of Spark and Serialization, I forgot one big gotcha that has come up a couple of times for me (I thought I had mentioned it, but now I see that I didn't).

The problem is with Stream and other non-strict collections such as Iterator or SeqView. These are effectively unserializable (except when empty): any unevaluated tail is a closure, and that closure typically captures non-serializable state such as an iterator or a database cursor. I'm sure there are several posts on this on StackOverflow, although I couldn't actually find one when I searched just now. But, I hear you say, these are never the kind of thing you would want to store as a (non-transient) field in a class.

That's true of course, but you might inadvertently include one, as I did just recently. Let's say you have a case class like this:

case class Person(name: String, friends: Seq[String])

You create instances of this class by querying the database for the person's friends as a collection of strings. Something like this:

val x = sqlContext.sql("select ...").collect
val p = Person("Ronaldo", x.toSeq)

The problem is that if the runtime type of x is Stream[String], then x.toSeq will not change its type at all: Stream already extends Seq, so the call is a no-op. You need to write x.toList (or x.toVector) instead.
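A minimal sketch of the trap, using a hand-built Stream in place of the query result (the values are made up). The point is that toSeq leaves the lazy Stream alone, while toList forces it into a strict copy:

```scala
// Stand-in for a lazily produced query result (hypothetical data).
val x: Seq[String] = Stream("Messi", "Neymar")

// toSeq is a no-op here: a Stream is already a Seq,
// so the result is the same lazy Stream.
val stillLazy: Seq[String] = x.toSeq
assert(stillLazy.isInstanceOf[Stream[_]])

// toList forces evaluation and produces a strict List,
// which serializes without dragging along any lazy tail.
val strict: Seq[String] = x.toList
assert(strict.isInstanceOf[List[_]])
assert(!strict.isInstanceOf[Stream[_]])
```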

Actually, it's much better practice never to use Seq in persistable/serializable classes. Far better to force an explicit collection type in the definition of the case class: use either List or Vector, unless you have some specific reason to use a different sequence. In this case, that would be:

case class Person(name: String, friends: List[String])
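To convince yourself the fix holds up, you can round-trip the strict version through plain java.io serialization, which is also what Spark's default JavaSerializer uses under the hood. This is just a sketch; the roundTrip helper and the sample data are my own, not from the original post:

```scala
import java.io._

case class Person(name: String, friends: List[String])

// Round-trip an object through Java serialization -- roughly what
// happens when Spark ships a closure or shuffles data.
def roundTrip[A](a: A): A = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(a)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[A]
}

// Case classes extend Serializable, and List of String is strict and
// serializable, so this succeeds and equality survives the trip.
val p = Person("Ronaldo", List("Messi", "Neymar"))
val copy = roundTrip(p)
assert(copy == p)
```

Had friends been left as a Seq[String] holding an unevaluated Stream, writeObject could instead blow up with a NotSerializableException on whatever the lazy tail captured.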
