Friday, November 20, 2015

Avoiding vars in Scala

One of the difficulties that everyone new to functional programming has is avoiding vars and mutable types. These aren't the same thing of course but they're similar enough for purposes of this discussion.

Of course, the first thing to understand is why are such mutable objects considered harmful--why would we want to avoid them? Without getting into an essay on referential transparency, we should aim to preserve RT in our code for the following reasons: referentially transparent programs are reliably testable and potentially provable.

So, let's say we have some code like the following and we want to avoid using vars and make it more functional (so that we can write unit tests on it):

  def printStatistics(xs: Seq[Double]) {
    var c = 0
    var s = 0.0
    var v = 0.0
    for (x <- xs) {
      if (x >= 0 && x <= 1) {
        c += 1
        s += x
        v += x*x
      }
    }
    println(s"Mean: ${s/c}, Std. Dev: ${math.sqrt(v/c)}")
  }

I chose this particular bit of code because it involves two of the features that most seem to get in the way of functionalizing the code, viz. calculating more than one quantity and having complex logic. Here, we want to calculate both the mean and the standard deviation of a sequence of values. However, we also want to exclude, for some reason, values that are outliers.

We can deal with the multiplicity of quantities by turning to one of the functional programmer's best friends: the tuple. Let's group our three quantities together in a 3-tuple: (c,s,v) for the purposes of evaluation and then we will return a 2-tuple (mu, sigma) as the result of the method.  Secondly, the logic which was complicating the calculations can in this case be easily refactored: we filter the values of xs.

Finally, there is one aspect of rolling up vars that is definitely tricky: because iteration uses vars we need to map the iteration-with-vars code into recursive code. Again, without getting into a formal proof, we can say that recursion (with immutable objects) is the dual of iteration (with mutable objects).

How are we going to turn this particular code fragment into something recursive? What's the appropriate pattern? Well, here another feature of functional programming comes to the rescue: higher-order functions. In particular, there is a family of higher-order functions which provide the necessary recursive pattern: Foldables. A Foldable is an extension of Functor (which defines the map method) and looks something like this:

  trait Foldable[F[_]] extends Functor[F] {
    def foldLeft[A,B](z: B)(f: (B, A) => B): B
    def foldRight[A,B](z: B)(f: (A, B) => B): B
  }

Where F[_] represents a collection type such as Seq[A] or List[A]. z is the initial ("zero") value of our accumulator, f combines the current accumulator (first parameter) with the current element. After all elements of our collection have been visited, we return the accumulator as the result. If A happens to be a monoid (a type that has its own zero and combine methods), then we can define a reduce method:

      def reduce[A](): A

But foldLeft and foldRight don't actually require monoids because we provide the zero method and the combine method explicitly. This is what enables us, in the present instance, to keep two separate sums going, one for the mean and one for the variance. This is what the final code looks like:

  def getStatistics(xs: Seq[Double]) = {
      val y = xs filter { x => x>=0 && x<=1 }
      val r = y.foldLeft[(Int,Double,Double)]((0,0,0)){case ((c,s,v),x) => (c+1,s+x,v+x*x)}
      (r._2/r._1,math.sqrt(r._3/r._1))
  }

Here, we are explicitly counting the elements of y (although y.size would do just as well). First we filter the input to remove outliers and then we perform our calculations using foldLeft. The initial value is easy, of course: (0,0,0). The combiner function is also quite straightforward: strictly speaking it is a partial function matching the case where the inputs are presented as (c,s,v) [the accumulated values] and x [the current element from the sequence]. The returned value is simply: (c+1,s+x,v+x*x).

Finally, we return a tuple of the mean and standard deviation. We can of course print these values as was done before in the printStatistics method. But we can also test these values in unit tests to ensure that we have our logic right.

Friday, November 13, 2015

More on Scala's Parser-combinators: building a concordance

I want to continue my discussion of Scala's parser-combinator library from a previous blog post. The ParseResult type gives us both the value that was parsed and the current state of the input. That Input value (called next in the Success case class that extends ParseResult) allows the parser not only to know what input remains for further parsing, but also can yield the current position in the original text (via the pos method). The pos method is invoked in the Failure case, but not in the Success case, which is normally what you want--once you've parsed your input successfully, you normally don't care where the various phrases actually occurred...

...except if you're goal is to build a concordance. It took me a little while to figure out how to extract the positions of words so I thought it might be helpful to show how it's done.

So, what should the type of my Parser be? The most obvious type seemed to be a Tuple of Int and String: giving us the index of the word and the word itself. And we want to be able to parse an entire String (or file) of many words so we need something like this grammar rule (in extended BNF):
sentence ::= { word }
Where word is defined using a regular expression to suit our needs (see below for specifics). The name sentence might be a bit misleading here because it's not a normal language sentence: it's simply a repetition of words and, in my definition, can include periods, commas, semi-colons, etc. Your needs might be different.

We can turn all this into Scala code using the rules described in my previous post:

  class ConcordanceParser extends RegexParsers {
    val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
    def word: Parser[String] = regex(rWord)
    def sentence: Parser[Seq[String]] = rep(word)
  }


The problem with this is that our parser still doesn't return the positional information. We need to take advantage of the method positioned that is defined for Parsers. The signature of this method is:

    def positioned[T <: Positional](p: ⇒ Parser[T]): Parser[T]

So, the type that we need for our parser must extend Positional. Well, that's not too hard, although it seems strange that there isn't one already defined. Let's define a case class:

  case class PositionalString(s: String) extends Positional
Now let's substitute PositionalString for String, pass the original word-parser, with result mapped to PositionalString, as the parameter to positioned:

  class ConcordanceParser extends RegexParsers {
    val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
    def word: Parser[PositionalString] = positioned(regex(rWord) ^^ {w => PositionalString(w)})
    def sentence: Parser[Seq[PositionalString]] = rep(word)
  }
That's it. All you have to decide now is the precise form of regex that you wish to use for the type of documents you want to make a concordance for. And how you want to treat the resulting output from the parser. For the complete source code, please go to: https://github.com/rchillyard/Scalaprof/tree/master/Concordance

[My apologies for the awful formatting, but Blogger seems to have regressed--I simply can't fix it without Blogger putting its bad formatting back]

Saturday, November 7, 2015

Simple, obvious, elegant

I am teaching a Scala for Big Data class for graduate engineering students. As we have been getting into the compositional and pure functional aspects of Scala, I've been trying to help the students solve problems. I have a mantra for these situations: Simple, obvious, elegant (SOE for short). Let's look at an example:

Let's say we are looking for a method which, given a function of type A=>B, would return a function of type F[A]=>F[B] where F is a container of some sort, such as Option, Seq, Try, etc. In particular, we will require that F be a functor because it is obviously going to require the map method. In other words we need a body for the following signature:
def lift[A,B](f: A => B): F[A] => F[B] = ???
What could possibly go in the body? Well, what information do we have? We know about f. That's it! But somehow, we need this to form a function which will operate on the content of the container. That's to say, we will need map. What will the map method operate on? Whatever is in context when it is invoked, something we don't know yet. That sounds like we need a partially applied function. But this time it is the receiver* of the method, not a parameter that is unknown. Therefore the only possible answer that satisfies our requirements (the signature) is:
def lift[A,B](f: A => B): F[A] => F[B] = _ map f
In particular, we can define the following methods, and many others, inside their respective classes/traits:
def liftOption[A,B](f: A => B): Option[A] => Option[B] = _ map f
def liftSeq[A,B](f: A => B): Seq[A] => Seq[B] = _ map f
Here's another example. I've been working on a type of number that is lazily evaluated when a function is applied to it. This will form part of a fuzzy number suite that I'm developing. I could perhaps use Real from Spire but for now I want to keep it simple without bringing in any external libraries. To begin with, my lazy number class is defined thus:
abstract class LazyNumber[X : Fractional](x: X, f: X=>X) extends Valuable[X] with Fractional[LazyNumber[X]] {
  def get = f(x)
  def construct(x: X, f: X=>X): LazyNumber[X]
  def map(g: X=>X): LazyNumber[X] = construct(x,g.compose(f)) 
...}
where Valuable defines the (abstract) get method, construct is implemented by sub-classes (we could call it unit if we liked). So, the value of a lazy number is f(x) and applying a function g to this lazy number consists of composing function g with f and leaving x as is.  What about flatMap? We'd like to be able to construct for-comprehensions with our lazy numbers, wouldn't we? So, we need a body for the following signature:
def flatMap(g: X=>LazyNumber[X]): LazyNumber[X] = ???
Again, what could possibly go in the body? Well, what do we have? We've got x, f and g. What would be simple, obvious and elegant? How about:
def flatMap(g: X=>LazyNumber[X]): LazyNumber[X] = g(f(x))
Could anything possibly be more obvious, simple or elegant? 

* We could easily get into a debate here: receiver is really an artifact of object-oriented programming -- from a functional programming point of view, the two parameters to the map function (the pre-parameter and the post-parameter) are just parameters.