Scala Prof

Wednesday, April 6, 2016

Feeling lazy -- memoization

Scala has a neat feature in the lazy val, which is somewhere between a val and a def. A val is evaluated exactly once, while a def is evaluated any number of times (0..n). A lazy val is evaluated at most once. In that sense, it's like a single-valued cache: no evaluation takes place if it's never needed but it will be evaluated only once if it's needed several times. One particularly useful use-case is when repeatedly referencing the value of a call-by-name method parameter.

But there's no multi-valued (collection) equivalent. You can create a "view" of a collection which is implemented as an IterableView. This is particularly useful when you want your collection to undergo several transformations (for example, by invoking map). Applying map to such a view essentially composes the existing function-to-be-applied with the new function. When the view is materialized (by calling force or apply) the (composed) function is applied to the member(s) that you need. However, if you apply force (or apply) a second or third time, the composed function will be invoked again. You can, if you like, "memoize" the result of the force or apply method, but you can't memoize them one element at a time.

So I recently asked a question about it on StackOverflow just in case there was something I was missing.

Let's have a brief interlude here on the meaning of memoization. A frequent issue when optimizing a program for performance is when a function is called with the same parameter (or parameters) as before. The opportunity to optimize this only occurs, obviously, if the function is referentially transparent (i.e. its context is immutable). Otherwise, we have no choice but to simply go through the evaluation again. But in functional programming in particular we will often have the situation where we can consider remembering the parameters and their result in a lookup table and, if those parameters ever arise again, we pass back the result instead of reevaluating the function. This is called "memoization", an ugly word and one which my spell-checker absolutely refuses to allow me to type without putting up a big fight. Why not just call it a cache? Well, according to Wikipedia, the reason is that it's a special case of caching (although I'm not quite buying the argument). For this reason (and because of spell-check) I'm going to use the word cache going forward.

You'd think that something that obvious would be part of the basic Scala language. But it isn't. What we want really is just a sub-class of Function1 where we provide a means of fulfilling (evaluating) the value.

Here's the simplest way to define such a cache:

class Cache[K,V](evaluate: K=>V) extends mutable.HashMap[K,V] {
  override def apply(k: K): V = getOrElseUpdate(k,evaluate(k))
}

This works exactly as you would expect. But the class (or its equivalent) doesn't appear in the Scala collections library. And all of the other methods of HashMap are exposed which doesn't really conform to the contract of a simple memoization class.

The other drawback to this class is that it doesn't deal well with exceptions that might arise in the evaluate function. Basically, they simply bubble up to the top. And, even if they are caught somewhere, they will be thrown again the next time the wrong key is requested.

A somewhat more robust version of Cache, however, can easily be built. This is what I'm using in an (Akka) actor-based project that I'm working on:

class Cache[K,V](evaluate: K=>Try[V])(implicit carper: String=>Unit) extends (K=>Option[V]) {
  def apply(k: K): Option[V] = m.getOrElseUpdate(k,fulfill(k))
  private val m = mutable.HashMap[K,Option[V]]()
  private def fulfill(k: K): Option[V] = evaluate(k).recoverWith{case x => carper(s"Cache: evaluation exception for key $k: $x"); Failure(x)}.toOption
}

Note that the return type of evaluate has changed: it's now a Try[V] instead of just V. And the apply method now returns an Option[V] instead of simply V. This makes it much more like invoking get on a collection. As a side-effect, when evaluate yields a failure, the cause of the failure is passed to carper, along with an explanatory comment. The failure then results in None being memoized which means that future calls to apply with the same parameter will return a None without causing the exception to recur.

Sunday, March 13, 2016

Composing with Try

Because scala.util.Try is a monad, we can compose tries together conjunctively. In the excellent Neophyte's Guide to Scala, Daniel Westheide shows some examples of this. Paraphrasing his for-comprehension exemplar slightly we have:

  for {

      url <- Try(new URL(s))

      connection <- Try(url.openConnection())

      is <- Try(connection.getInputStream)

    } yield Source.fromInputStream(is).getLines()

Each try properly depends on the success of its predecessor. That's to say, this is a sequential conjunction.

But suppose we want to compose tries disjunctively? In other words, success is not defined (as above) when all components succeed but when one succeeds. And again, we want our composition to be sequential (the first try is tried first). How should we go about that?

The obvious solution is to use orElse:

  val x1 = "10.0" 

  val x: Try[Any] = Try(x1.toInt) orElse Try(x1.toDouble)

This works just as expected: the value of x is Success(10.0).

Indeed, orElse has the signature:

deforElse[U >: T](default: ⇒ Try[U]): Try[U]

Ideally, we'd like another signature something like this:

  def orElseTry[U >: T](alt: => U): Try[U] = orElse(Try(f))

This compares with orElse as does map to flatMap or recover to recoverWith. And we might invoke it thus:

def toNumber(x: String) = Try(x.toInt) orElseTry(x.toDouble)

This looks pleasantly elegant and simple--and it differs only from the normal way of doing things by one space. Indeed, I was able to implement it quite easily in a copy of scala.util.Try.scala. However, in their infinite wisdom, the fathers of Scala did not provide such a method (actually, Try derives from the folks at Twitter rather than EPFL).

Unlike with the conjunctive set of Try operations that we started with, the disjunctive set is much more likely to be based on a single input, as in the case of the toNumber example above. I therefore set about creating a class to support multiple successive Try operations:

  case class Trial[V, T](f: V=>Try[T]) extends TrialBase[V,T](f)

  abstract class TrialBase[V, T](f: V=>Try[T]) extends (V=>Try[T]) {

    def apply(v: V): Try[T] = f(v)

    private def orElse(f: V=>Try[T], g: V=>Try[T]): V=>Try[T] = {v => f(v) orElse g(v) }

    def |: (g: V => Try[T]): Trial[V,T] = Trial(orElse(g,f))

    def :| (g: V => Try[T]): Trial[V,T] = Trial(orElse(f,g))

    def ^: (g: V => T): Trial[V,T] = Trial(orElse(Lift(g),f))

    def :^ (g: V => T): Trial[V,T] = Trial(orElse(f,Lift(g)))

  }

This case class called Trial takes a function f of type V=>Try[T] such that its instances are composable with other functions of type V=>Try[T] (using |: or :|) or even V=>T functions (using ^: or :^). These latter functions are "lifted" to V=>Try[T]. Such Trial objects can be composed with other functions both leftwards and rightwards using the |: and ^: operators with the colon on the proper side. And, of course we can even use partial functions to allow us to fork the path of control. Let me provide an example taken from a generic CSV parser that I created:

  val defaultParser = Trial.none[String,Any] :|

    {case s @ (date0(_) | date4(_) | date1(_)) => dateParser(s)} :^

    {case quoted(w) => w} :^

    {case whole(s) => s.toInt} :^

    {case truth(w) => true} :^

    {case untruth(w) => false} :^

    {case s: String => s match { case floating(_) | floating(_,_) | floating(_,_,_) => s.toDouble}} :^

    {case s: String => s}

  val dateFormatStrings = Seq("yyyy-MM-dd","yyyy-MM-dd-hh:mm:ss.s") // etc.

  val dateParser = Trial[String,Any]((parseDate _)(dateFormatStrings))

  val date0 = """(\d\d\d\d-\d\d-\d\d)""".r

  val date1 = ??? // omitting for brevity

  def parseDate(dfs: Seq[String])(s: String): Try[DateTime] = ??? // omitting for brevity

  val quoted = """"([^"]*)"""".r

  val whole = """(\d+)""".r

  val floating = """-?(\d+(\.\d*)?|\d*\.\d+)([eE][+-]?\d+)?[fFdD]?""".r

  val truth = """(?i)^([ty]|true|yes)$""".r

  val untruth = """(?i)^([fn]|false|no)$""".r

In this fragment, we define a default element parser for our CsvParser object. Trial.none simply starts the chain of functions (we could also put the date parsing function as the argument to Trial). If the input String is recognized as a valid date format (in reality there are many more possible date formats), then we invoke dateParser with the String as its parameter. Otherwise, we try to interpret the String as a quoted string, an integer, a boolean, etc.

If you're interested in looking at the code you can see it here: Trial.scala, while the specification is here: TrialSpec.scala.

Thursday, March 3, 2016

Scala job interview questions

One of my students from last semester asked me what questions about Scala might arise in an upcoming job interview. Of course, there are such lists available already for example the following:

An excellent (IMO) list of questions without answers (which is good): Scala Job Interview Questions
A brief (but good) list courtesy of Typesafe... er Lightbend (including very thorough answers): 4 Interview Questions for Scala Developers
A list of questions with answers (but I'm not sure I like the answers): Top 25 interview questions on Scala
etc.

Naturally, I couldn't resist the idea of writing up some of my own questions/answers:

Question: Name two significant differences between a trait and an abstract class.

Answer: Abstract classes have constructors with zero or more parameters while traits do not; a class can extend any number of traits but only one abstract class; see also To trait or not to trait? for more differences.

Question: What type of construct follows the match keyword?

Answer: A partially-defined function.

Question: What exactly is wrong with a recursive function that is not tail-recursive?
Answer: You run the risk of running out of stack space and thus throwing an exception.

Question: Explain the concept and usage of an extractor.
Answer: An extractor is essentially the opposite of a constructor and is defined by the unapply(t: T) method defined in the companion object of a type T; for a case class, which provides a built-in unapply method, it returns a Option[TupleN] based on the number N of parameters in the constructor (exception: if N=0, then the returned type is Boolean; if N=1, the returned type is Option[T]); extractors are typically used following the case keyword in match expressions. For more information, see The Neophyte's Guide to Scala Part 1: Extractors.

Question: What relatively minor difference in the definitions of their respective non-empty-constructors results in the major distinction between a List and a Stream?
Answer: The tail parameter is defined as call-by-name.

Wednesday, January 6, 2016

The curious incident of the Scala enum

One of the strangest things about Scala is its concept of enumerations, typically implemented by extending Enumeration, for example:

  object Suit extends Enumeration {

     type Suit = Value

     val Spades, Hearts, Diamonds, Clubs = Value

}

I won't go into the details of the syntax of this (I don't understand it myself!) but I will observe that this mechanism works reasonably well. The only odd thing here is that if you actually want to refer to Spades, for instance, in your code, you will need to write something like this:

  import Suit._

But you will find that the suit values have an ordering that you would expect without you having to do anything about it, as in this bit of Scalatest code for instance:

   "suits" should "be ordered properly" in {

     val suitList = List(Clubs,Hearts,Spades,Diamonds)

     suitList.sorted shouldBe List(Spades,Hearts,Diamonds,Clubs)

}

But what if we want to know the color of a suit? How can we add a method to this enumeration (in a manner similar to the way it is done in Java?). It's tricky! -- my first (obvious) attempt did not succeed. I am indebted to Aaron Novstrup and StackOverflow for this somewhat elegant solution:

  object Suit extends Enumeration {

    type Suit = Value

    val Spades, Hearts, Diamonds, Clubs = Value

    class SuitValue(suit: Value) {

      def isRed = !isBlack

      def isBlack = suit match {

         case Clubs | Spades => true

         case _              => false

}

}

    implicit def value2SuitValue(suit: Value) = new SuitValue(suit)

}

There is a competing style of enum in Scala: based on a case object. If you're already familiar with case classes (you should be if you're reading and understanding this), then you will probably know that a case object is just like a case class but has no parameters. One big advantage of using case objects is that matching will result in warnings if we miss a case. It's also very obvious if we want to add further methods such as our color methods. Note the trait Concept. In reality, I also have defined Rank in addition to Suit, and since they have very similar implementations, they both extend Concept.

So, let's implement our suits again:

  trait Concept extends Ordered[Concept] {

    val name: String

    val priority: Int

    override def toString = name

    def compare(that: Concept) = priority-that.priority

}

  object Concept {

    implicit def ordering[A <: Concept]: Ordering[A] = Ordering.by(_.priority)

  }

  sealed trait Suit extends Concept {

    def isRed = !isBlack

    def isBlack = this match {

      case Spades | Clubs => true

      case _ => false

}

}

  case object Spades extends Suit { val name = "Spades"; val priority = 3 }

  case object Hearts extends Suit { val name = "Hearts"; val priority = 2 }

  case object Diamonds extends Suit { val name = "Diamonds"; val priority = 1 }

  case object Clubs extends Suit { val name = "Clubs"; val priority = 0 }

The key to this working correctly (and allowing, for example, List(Spades,Clubs) to be sorted without any fuss) is the ordering definition in Concept. It would be possible for Concept not to extend Ordered but then we would not be able to compare two suits directly using a "<" or similar operator.
If you decide to implement this without trait Concept and simply defining its methods directly in Suit, you would find that you do not need to define any implicit ordering. However, in order to sort a list of suits, you would have to explicitly define the type of your list as for example:

  List[Suit](Clubs,Spades)

This is because Suit appears as Suit with Product etc. This is the kind of thing that makes working with implicit values so tricky.

If you'd like to see the code for all this, including unit tests, then go to my Scalaprof site on github. and drill down to edu.neu.coe.scala.enums.

Monday, December 28, 2015

MapReduce with Akka actors

I was creating a lecture on actors, in particular Akka, for the Scala class that I teach. I wanted to do something simple and general. I showed the students an application I had previously written which used actors to manage a hedge fund. But I wanted something even more generic so I thought about implementing map/reduce. After all, actors (especially remote actors) are the preferred means of implementing concurrent processing in Scala, especially where the code to be run in parallel involves representing mutable state.

So, I did the obvious -- I looked on the internet for some code that someone had already written. I didn't want to make a big deal out of it, or spend a lot of time on it. Unfortunately, I couldn't find anything. Well, I couldn't find anything good. There were several half-hearted attempts at doing it but nothing that I would want to use. So, I set out to do it myself. The good news was that I got something (a word count application) working in about an hour. But after sharing it with the students, I realized that it, too, wasn't nearly as good as it could be.

So, I have refined it to a point (after many hours of work) that I am now proud to share it. What, to me, is particularly elegant is the functional API that I created to set up map-reduce stages that can be chained (or piped, rather) together to form a (potentially) multi-stage parallel process. Using this interface actually hides all details of actors (except for creating the actor system and a couple of implicit values) from the application programmer.

If you're interested in taking a look, contributing, or whatever, please refer to Majabigwaduce on Github. The README will explain, hopefully, everything you need to know about using the framework. And, of course, the source code is all there too. As I searched for an appropriate name for something that uses Akka to implement Map/Reduce, I couldn't help the name Majabigwaduce from popping into my mind. See the README for an explanation.

I have used my own naming convention everywhere in the code. If you're wondering about it, please see my previous blog.

Tuesday, December 8, 2015

Naming of identifiers

One of the conventions recommended by the designers of Scala is to use short identifier names. It's fine to make method names long (especially when they are private methods whose meaning may not be entirely clear otherwise) but context usually gives a pretty good clue to identifiers.

You've probably noticed code like this (taken from the Artima web site: Working with Lists):

def isort(xs: List[Int]): List[Int] =
    if (xs.isEmpty) Nil
    else insert(xs.head, isort(xs.tail))

where xs is used for a sequence of things and pronounced x's (the plural of x).

Here, I want to present a recommendation for generalizing this notion. In the same way that a sequence (Seq) of x is written xs, then a Future of x should be written xf. A Try of x might be written xt but I'm going to recommend something slightly different (xy) because t is useful for Tuple. So, the general scheme is:

val ac = C[A]

where C is some container type and A is some element type. Obviously, these can be composed such that if A happens to be a sequence of X (for example) then we end up with:

val xsc = C[Seq[X]]

Note that the order of the container/type definitions is mirrored for the name of the identifier. The advantage of all this is that when a reader (or even yourself) comes to read that code later, it is clear what the type of the identifier is, even if it isn't shown explicitly. If you're not reading the code through an IDE, then having an indication of type will be quite helpful. It also makes debugging a lot easier.

In a for-comprehension, patterns/generators look very simple: for (x <- xs) yield x, for example.

Here are my recommendations for the abbreviations (but of course you should choose your own):

Container	Abbreviation
Seq/List, etc.	s
Array	a
Map	m
Future	f
Try	y
Promise	p
Tuple	t
Option	o
Iterator/iterable	i
Either	h
RDD	r

Type	Abbreviation
My own type	x
Generic type A, B, T, etc.	a, b, t, etc.
String (line/document)	g
String (word)	s or w
URL	u
Double	d
Int	n
Function	f
ActorRef	a
Exception	e

Sometimes you might have more than one identifier of the same type in which case you can follow the single letter with 1, 2, etc. Or, alternatively, you might use a single uppercase character to add context, for example gD for a document string and gL for a line string. You might also decide to use t2 for a Tuple2, etc. In the situation where we have an Either of form Either[String,X] or Either[Throwable,X] then we simply use the identifier xe, ignoring the left-side types.

Some confusion might arise between a as an Array (i.e. a container) and a as a generic type A (or ActorRef). This should be clear in that types are found at the start of the identifier string. Similarly, f as a Future and f as a Function. And s if you use it both for a String and a Seq.

Here are some examples of code which uses this idea:

  def flatten[X](xsfs: Seq[Future[Seq[X]]])(implicit c: ExecutionContext): Future[Seq[X]] = Future.sequence(xsfs) map {_ flatten}

  def sequence[X](xy: Try[X]): Either[Throwable,X] =

    xy match {

    case Success(s) => Right(s)

    case Failure(e) => Left(e)

}

  def sequence[X](xos : Seq[Option[X]]): Option[Seq[X]] = (Option(Seq[X]()) /: xos) {

    (xso, xo) => for (xs <- xso; x <- xo ) yield xs :+ x

  }

  val ws = List("http://www.htmldog.com/examples/","http://www.google.com")

  val uys = for ( w <- ws ) yield Try(new URL(w))

Friday, November 20, 2015

Avoiding vars in Scala

One of the difficulties that everyone new to functional programming has is avoiding vars and mutable types. These aren't the same thing of course but they're similar enough for purposes of this discussion.

Of course, the first thing to understand is why are such mutable objects considered harmful--why would we want to avoid them? Without getting into an essay on referential transparency, we should aim to preserve RT in our code for the following reasons: referentially transparent programs are reliably testable and potentially provable.

So, let's say we have some code like the following and we want to avoid using vars and make it more functional (so that we can write unit tests on it):

  def printStatistics(xs: Seq[Double]) {

    var c = 0

    var s = 0.0

    var v = 0.0

    for (x <- xs) {

      if (x >= 0 && x <= 1) {

        c += 1

        s += x

        v += x*x

}

}

    println(s"Mean: ${s/c}, Std. Dev: ${math.sqrt(v/c)}")

}

I chose this particular bit of code because it involves two of the features that most seem to get in the way of functionalizing the code, viz. calculating more than one quantity and having complex logic. Here, we want to calculate both the mean and the standard deviation of a sequence of values. However, we also want to exclude, for some reason, values that are outliers.

We can deal with the multiplicity of quantities by turning to one of the functional programmer's best friends: the tuple. Let's group our three quantities together in a 3-tuple: (c,s,v) for the purposes of evaluation and then we will return a 2-tuple (mu, sigma) as the result of the method. Secondly, the logic which was complicating the calculations can in this case be easily refactored: we filter the values of xs.

Finally, there is one aspect of rolling up vars that is definitely tricky: because iteration uses vars we need to map the iteration-with-vars code into recursive code. Again, without getting into a formal proof, we can say that recursion (with immutable objects) is the dual of iteration (with mutable objects).

How are we going to turn this particular code fragment into something recursive? What's the appropriate pattern? Well, here another feature of functional programming comes to the rescue: higher-order functions. In particular, there is a family of higher-order functions which provide the necessary recursive pattern: Foldables. A Foldable is an extension of Functor (which defines the map method) and looks something like this:

  trait Foldable[F[_]] extends Functor[F] {

    def foldLeft[A,B](z: B)(f: (B, A) => B): B

    def foldRight[A,B](z: B)(f: (A, B) => B): B

}

Where F[_] represents a collection type such as Seq[A] or List[A]. z is the initial ("zero") value of our accumulator, f combines the current accumulator (first parameter) with the current element. After all elements of our collection have been visited, we return the accumulator as the result. If A happens to be a monoid (a type that has its own zero and combine methods), then we can define a reduce method:

      def reduce[A](): A

But foldLeft and foldRight don't actually require monoids because we provide the zero method and the combine method explicitly. This is what enables us, in the present instance, to keep two separate sums going, one for the mean and one for the variance. This is what the final code looks like:

  def getStatistics(xs: Seq[Double]) = {

      val y = xs filter { x => x>=0 && x<=1 }

      val r = y.foldLeft[(Int,Double,Double)]((0,0,0)){case ((c,s,v),x) => (c+1,s+x,v+x*x)}

      (r._2/r._1,math.sqrt(r._3/r._1))

}

Here, we are explicitly counting the elements of y (although y.size would do just as well). First we filter the input to remove outliers and then we perform our calculations using foldLeft. The initial value is easy, of course: (0,0,0). The combiner function is also quite straightforward: strictly speaking it is a partial function matching the case where the inputs are presented as (c,s,v) [the accumulated values] and x [the current element from the sequence]. The returned value is simply: (c+1,s+x,v+x*x).

Finally, we return a tuple of the mean and standard deviation. We can of course print these values as was done before in the printStatistics method. But we can also test these values in unit tests to ensure that we have our logic right.