Friday, October 30, 2015

Scala's Parser-combinators

There have been many situations where I've had to come up with some sort of parser in order to manipulate objects that can be defined by some domain-specific language (DSL). Examples include specifying rules and aggregators for data ingest; mathematical expression parsers for a fuzzy calculator or for dynamically specifying such things as fitness functions for genetic algorithms; bridge hand notations; reading and interpreting queries for information systems using SQL-like or other grammars; JSON, XML and other deserialization. I've probably written dozens of them in various different languages. I've learned (and forgotten) more than I ever really wanted to know about ASTs and regular expressions.

Scala's parser-combinators appear to be a perfect solution to this general sort of problem. No longer do I have to go looking for open-source libraries to help with parsing. And I can parse functionally without the use of side-effects. SPCs give me all of the goodies that regular expressions can, but the definition of the parser is in Scala -- not just a bunch of special symbols strung together. Not only that but I can convert the matched input to actual types of my choosing.

But finding good documentation on SPCs is really hard. The basics are there in Programming in Scala and there are references to trivial implementations just about everywhere on the web. But these sources simply didn't tell me what I wanted to know in order to implement one and teach about it. The Scala documentation for SPCs is of course the ultimate source but navigating it when you're not quite sure what you're doing is challenging. Given that you will mostly be creating sub-classes of RegexParsers, its API documentation is somewhat helpful (especially the calculator examples). Perhaps the most helpful for detail was Jim McBeath's blog, but he jumps pretty much into the deep end. So here I'd like to present something reasonably brief that covers the basics and ends with an actual implementation.

If you're familiar with the precepts behind functional programming, you'll know that side-effects are a bad thing (they don't allow you to prove your program's correctness and they make testing a lot harder than it should be). So, that rules out having a (mutable) source of characters (or other lexical tokens) on which you make calls to some parse method. Instead, Scala provides three related types: the trait Parsers and, nested within it, the abstract classes Parser[T] and ParseResult[T]. Of these types:
  • Parser[T] encodes logic concerned only with how to parse a source -- it doesn't know anything about the source itself. Parser[T] is actually a function which takes an Input and returns a ParseResult[T].
  • ParseResult[T] captures both a result of type T and the new state of the input.
  • Parsers provides generic parser combinators. Most usefully it has a sub-trait called RegexParsers, which you will typically extend in order to build your parser application. Not only does this sub-trait define Input specifically as Reader[Char], but it also defines a method which brings it all together, allowing you to parse a sequence of characters using a specific parser:
    • def parseAll[T](parser: Parser[T], in: CharSequence): ParseResult[T]
    • def parseAll[T](parser: Parser[T], in: Reader[Char]): ParseResult[T]
There are also other forms of parseAll, as well as parse (which matches only a prefix of the input), but this form of parseAll will get you going. There is an even more useful sub-trait of RegexParsers called JavaTokenParsers, which defines such parsers as floatingPointNumber.
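As a taste of how these pieces fit together, here is a minimal sketch (the parser name number and the regex are my own invention, not from the library) that extends RegexParsers and uses parseAll:

```scala
import scala.util.parsing.combinator.RegexParsers

// A minimal sketch: a parser that matches a single (optionally signed) integer.
object IntParser extends RegexParsers {
  def number: Parser[String] = """-?\d+""".r
}

val ok  = IntParser.parseAll(IntParser.number, "42")  // Success
val bad = IntParser.parseAll(IntParser.number, "42x") // Failure: the "x" is not consumed
```

Note that parseAll demands that the whole input be consumed; the trailing "x" is enough to turn a match into a Failure.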

So, and here's where the fun starts, how do we define our own parsers? Let's start by observing how we would define a grammar in BNF:

expr ::= term { "+" term | "-" term }.
term ::= factor { "*" factor | "/" factor }.
factor ::= floatingPointNumber | "(" expr ")".
We can easily convert this into Scala by defining three parsers:

def expr: Parser[Any] = term~rep("+"~term|"-"~term);
def term: Parser[Any] = factor~rep("*"~factor|"/"~factor);
def factor: Parser[Any] = floatingPointNumber|"("~expr~")";
The substitutions we used are:
  • newline -> "def";
  • " ::= " -> ": Parser[Any] ="
  • " " -> "~"
  • "{" -> "rep("
  • "}" -> ")"
  • "." -> ";" [Of course, we can discard the semi-colons.]
Now, let's wrap these definitions inside our desired class:

import scala.util.parsing.combinator._
class ArithmeticParser extends JavaTokenParsers {
  def expr: Parser[Any] = term~rep("+"~term|"-"~term)
  def term: Parser[Any] = factor~rep("*"~factor|"/"~factor)
  def factor: Parser[Any] = floatingPointNumber|"("~expr~")"
}
val p = new ArithmeticParser
val r = p.parseAll(p.expr,"1+2")
The type of r is a ParseResult[Any] which isn't particularly useful to us as is. But, we can see what can be done with it from the following ScalaTest code:

r should matchPattern { case p.Success(_,_) => }

r will be an instance of one of the subclasses of ParseResult: either Success, Failure, or Error. But you cannot match on Success as such (it is not imported by the import statement shown): instead you must refer to it as p.Success, where p is your parser. Don't confuse it, of course, with the Success subclass of Try or Future. The get method of ParseResult[Any] returns an Any which is basically the concatenation of the matched tokens. What we really want is to be able to get something out of the parser(s) that conforms to a type of our choosing, such as a Double, for instance, though if we are building a DSL for Xs, we would want to get back an X, whatever that might be.
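To make the three outcomes concrete, here is a sketch that handles each subclass of ParseResult explicitly (the ArithmeticParser class is the one defined above; the println messages are illustrative):

```scala
import scala.util.parsing.combinator.JavaTokenParsers

class ArithmeticParser extends JavaTokenParsers {
  def expr: Parser[Any] = term~rep("+"~term|"-"~term)
  def term: Parser[Any] = factor~rep("*"~factor|"/"~factor)
  def factor: Parser[Any] = floatingPointNumber|"("~expr~")"
}

val p = new ArithmeticParser
p.parseAll(p.expr, "1+2") match {
  case p.Success(result, _) => println(s"parsed: $result")       // the raw ~-structure
  case p.Failure(msg, next) => println(s"failure: $msg at ${next.pos}")
  case p.Error(msg, _)      => println(s"error: $msg")
}
```

Note that Failure and Error, like Success, must be referenced through the parser instance p.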

If you look in the Scala API at Parsers again, you will see among the methods, the following:
p1 ~ p2 // sequencing: must match p1 followed by p2
p1 | p2 // alternation: must match either p1 or p2, with preference given to p1
p1.?    // optionality: may match p1 or not
p1.*    // repetition: matches any number of repetitions of p1
These are the famed parser combinators and this is how they work:
  • If p is a Parser[String], for instance, then the result of p matching String s is simply s.
  • (sequencing): if p is a Parser[P] and q is a Parser[Q] then p~q results in an object of type P~Q, which can also be written ~[P,Q]. In other words, assuming that it matches successfully, it will result in both a P and a Q.
  • (alternation): if p is a Parser[P] and q is a Parser[Q] then p|q results in an object of type P, or an object of type Q. It attempts to match p first and, if that's successful, it doesn't invoke q at all.
  • (optionality):  if p is a Parser[P] then p.? or opt(p) results in an object of type Option[P].
  • (repetition):  if p is a Parser[P] then p.* or rep(p) results in an object of type List[P]. There's a similar method, repsep which takes an additional parameter, the separator.
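The result types above are checked by the compiler. Here is a sketch (the parsers word and num are my own, chosen only to exercise each combinator) that spells out the type each combinator produces:

```scala
import scala.util.parsing.combinator.RegexParsers

object Combi extends RegexParsers {
  def word: Parser[String] = """[a-z]+""".r
  def num: Parser[String]  = """\d+""".r

  def pair: Parser[String ~ String] = word ~ num  // sequencing: both results, as a ~ pair
  def alt: Parser[String]           = word | num  // alternation: one or the other
  def maybe: Parser[Option[String]] = num.?       // optionality: Option
  def words: Parser[List[String]]   = word.*      // repetition: List
}

Combi.parseAll(Combi.pair, "abc 123")  // Success((abc~123), ...)
Combi.parseAll(Combi.words, "a b c")   // Success(List(a, b, c), ...)
```

RegexParsers skips whitespace between tokens by default, which is why "abc 123" and "a b c" parse cleanly.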
We are getting close! There is one more operator that you need to know about. This operator is a method on Parser[T] and is written ^^. It's essentially a "lift" method because it takes as its parameter a function (T)=>U (i.e. T is transformed into U) and has the effect of transforming a Parser[T] into a Parser[U]. Thus, if we have written our parsers as Parser[Any] (as we did) and what we really want is a Parser[Double], then we need to create some functions which transform Any into Double and apply them to each of our parsers using the ^^ operator. This is much harder to explain (or understand) than it is to use, so let's dive right into a simple example:
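In its simplest form, here is a sketch of ^^ lifting the String matched by floatingPointNumber into a Double (the object name DoubleParser is my own):

```scala
import scala.util.parsing.combinator.JavaTokenParsers

object DoubleParser extends JavaTokenParsers {
  // ^^ transforms the Parser[String] floatingPointNumber into a Parser[Double]
  def number: Parser[Double] = floatingPointNumber ^^ { _.toDouble }
}

DoubleParser.parseAll(DoubleParser.number, "3.14") // Success(3.14, ...)
```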

package edu.neu.coe.scala.parse
import scala.util.parsing.combinator._
/**
 * @author scalaprof
 */
class Arith extends JavaTokenParsers {
  trait Expression {
    def value: Double
  }
  abstract class Factor extends Expression
  case class Expr(t: Term, ts: List[String~Term]) extends Expression {
    def term(t: String~Term): Double = t match {case "+"~x => x.value; case "-"~x => -x.value }
    def value = ts.foldLeft(t.value)(_ + term(_))
  }
  case class Term(f: Factor, fs: List[String~Factor]) extends Expression {
    def factor(t: String~Factor): Double = t match {case "*"~x => x.value; case "/"~x => 1/x.value }
    def value = fs.foldLeft(f.value)(_ * factor(_))
  }
  case class FloatingPoint(x: Any) extends Factor {
    def value = x match {
      case x: String => x.toDouble
      case _ => throw new RuntimeException("FloatingPoint: logic error: x is not a String")
    }
  }
  case class Parentheses(e: Expr) extends Factor {
    def value = e.value
  }
  def expr: Parser[Expr] = term~rep("+"~term | "-"~term | failure("expr")) ^^ { case t~r => Expr(t, r) }
  def term: Parser[Term] = factor~rep("*"~factor | "/"~factor | failure("term")) ^^ { case f~r => Term(f, r) }
  def factor: Parser[Factor] = (floatingPointNumber | "("~expr~")" | failure("factor")) ^^ { case "("~(e: Expr)~")" => Parentheses(e); case s => FloatingPoint(s) }
}
Now our test code will look a little different. We can actually test the value of the result as a Double.
val p = new Arith
val r = p.parseAll(p.expr,"1+2")
r should matchPattern { case p.Success(_,_) => }
r.get.value shouldBe 3.0
That's really all you need to know. It is fairly easy to make this a bit more robust by wrapping the arithmetic operations in Try and to make the types generic. Maybe I will post another blog showing how that's done.
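To give a flavor of the Try idea, here is a hypothetical sketch (this signature change is not part of the Arith class above): value returns Try[Double], so a malformed number yields a Failure instead of throwing an exception.

```scala
import scala.util.Try

// Hypothetical variation: value returns Try[Double] rather than Double.
trait Expression {
  def value: Try[Double]
}
case class FloatingPoint(x: String) extends Expression {
  def value: Try[Double] = Try(x.toDouble)
}

FloatingPoint("1.5").value  // Success(1.5)
FloatingPoint("oops").value // Failure(java.lang.NumberFormatException)
```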