Wednesday, June 21, 2017

Some observations on foldLeft with reference to short-circuiting

Have you ever written code such as the following and expected it to quit accessing values of y after it hits the first true value?

    val y = Seq[Boolean](false, true, ..., false)
    val x = y.foldLeft[Boolean](false)(_ || _)

After all, the second parameter of || is call-by-name so, if it's not needed to determine the outcome of the value for x, then that should mean that the recursion of the foldLeft should stop, right?

But it doesn't in practice, and if you look at the definition of foldLeft, you can see why. There is no short-circuit defined.

I therefore created a method in the FP module of my LaScala library to be the short-circuit equivalent of foldLeft. It's called foldLeftShort and it looks like this:

  /**
    * This method performs a foldLeft with short-circuit logic
    *
    * @param xi an iterator of X objects
    * @param y0 the initial value of Y (e.g. 0 if Y is Int)
    * @param q  the quick-return function: if this evaluates to true, given the new y value, that value is immediately returned
    * @param f  the function which accumulates the result: as with foldLeft, it takes a Y and an X and yields a Y
    * @tparam X the underlying type of the iterator
    * @tparam Y the return type
    * @return the aggregate of the given iterator, according to the initial value of y0 and the accumulator function f.
    */
  def foldLeftShort[X, Y](xi: Iterator[X], y0: Y, q: Y => Boolean)(f: (Y, X) => Y): Y = {
    def inner(_y: Y, xo: => Option[X]): Y =
      xo match {
        case None => _y
        case Some(x) =>
          val y = f(_y, x)
          if (q(y)) y
          else
            inner(y, nextOption(xi))
      }

    inner(y0, nextOption(xi))
  }

  /**
    * Method to get the next element from an iterator, as an Option.
    *
    * @param xi the iterator.
    * @tparam X the underlying type.
    * @return Some(x) where x is the next element; or None where the iterator is exhausted.
    */
  def nextOption[X](xi: Iterator[X]): Option[X] = if (xi.hasNext) Some(xi.next) else None

As with foldLeft, the accumulating function is in its own parameter set. But the initial value is joined by the iterator itself and the short-circuit function in the first parameter set. It would be nice to be able to extend the Iterator type to include this.

The inner function isn't defined in the typical way, using the concatenation unapply method (::) because, for some reason that I don't quite understand, you cannot decompose an Iterator this way. Instead, we make use of another FP method: nextOption which returns the next element of the iterator as an Option (see above).

Note the short-circuit mechanism, which really does work. As soon as we have a value by applying f to the current accumulator and the newest value from the iterator, we return if f evaluates to true.

In order to use this as efficiently, as possible, we need to write code something like the following where in this example, we want to save doing a trivial lookup too many times. That's not worth the effort--but just suppose that the lookup was actually an expensive database query instead:

    val lookupTable = Map("a"->false, "b"->true)
    val boi = ns.view.map(x => lookupTable.get(x)).toIterator
    FP.foldLeftShort[Option[Boolean], Boolean](boi, false, b => b)(_ || _.getOrElse(false))

Obviously, if you want to end up with an Option[Boolean], you would write something like this instead:


FP.foldLeftShort[Option[Boolean], Option[Boolean]](boi, Some(false), b => b.getOrElse(false))(optionalOr)

/**
    * NOTE: that we can't use map2 here because we are essentially implementing three-valued (Kleenean) logic
  *
  * @param b1 the accumulator: if b2 is None or Some(false) then b1 is returned unchanged
  * @param b2 the value: if b2 is Some(true) then Some(true) will be returned
  * @return an Optional[Boolean]
  */
def optionalOr(b1: Option[Boolean], b2: Option[Boolean]): Option[Boolean] =
    b2 match {
      case Some(x) => for (b <- b1) yield b || x
      case None => b1
    }

Note that, if you use map2(_ || _) instead of optionalOr, the logic is probably not going to give you exactly what you want, because None will combine with false to give None, whereas you really want None to have no effect on the accumulator value at all.

Saturday, June 17, 2017

Another cautionary tale of Spark and serialization

In my earlier post on the subject of Spark and Serialization, I forgot one big gotcha that has come up a couple of times for me (I thought I had mentioned it, but now I see that I didn't).

The problem is with Stream and other non-strict collections such as Iterator or SeqView. These are inherently unserializable (except for empty collections). I'm sure there are several posts on this in StackOverflow, although I couldn't actually find one when I searched just now. But, I hear you say, these are never the kind of thing you would want to store as a (non-transient) field in a class.

That's true of course, but you might inadvertently include one, as I did just recently. Let's say you have a case class like this:

case class Person(name: String, friends: Seq[String])

You create instances of this class by running a database query on the person and getting his/her friends as a collection of strings. Something like this:

val x = sqlContext.sql("select ...").collect
val p = Person("Ronaldo", x.toSeq)

The problem here, is that, if the type of x is Stream[String], then x.toSeq will not change its type since Stream extends Seq. You need to write x.toList instead (or x.toVector).

Actually, it's much better practice never to use Seq for persistable/serializable objects. Far better to force an explicit collection type in the definition of the case class. In particular, you should use either List or Vector, unless you have some specific reason to use a different sequence. In this case, that would be:

case class Person(name: String, friends: List[String])