Scala's parser combinators appear to be a perfect solution to this general sort of problem. No longer do I have to go looking for open-source libraries to help with parsing, and I can parse functionally, without the use of side-effects. SPCs give me all of the goodies that regular expressions do, but the definition of the parser is in Scala -- not just a bunch of special symbols strung together. Not only that, but I can convert the matched input to actual types of my choosing.
But finding good documentation on SPCs is really hard. The basics are there in Programming in Scala and there are references to trivial implementations just about everywhere on the web. But these sources simply didn't tell me what I wanted to know in order to implement one and teach about it. The Scala documentation for SPCs is of course the ultimate source but navigating it when you're not quite sure what you're doing is challenging. Given that you will mostly be creating sub-classes of RegexParsers, its API documentation is somewhat helpful (especially the calculator examples). Perhaps the most helpful for detail was Jim McBeath's blog, but he jumps pretty much into the deep end. So here I'd like to present something reasonably brief that covers the basics and ends with an actual implementation.
If you're familiar with the precepts behind functional programming, you'll know that side-effects are a bad thing (they don't allow you to prove your program's correctness and they make testing a lot harder than it should be). So, that rules out having a (mutable) source of characters (or other lexical tokens) on which you make calls to some parse method. Instead, Scala provides three related types: abstract classes Parser[T] and ParseResult[T], in addition to trait Parsers. Of these types:
- Parser[T] encodes logic concerned only with how to parse a source -- it doesn't know anything about the source itself. Parser[T] is actually a function which takes an Input and returns a ParseResult[T].
- ParseResult[T] captures both a result of type T and the new state of the input.
- Parsers provides generic parser combinators. Most usefully, it has a sub-trait called RegexParsers which you will typically extend in order to build your parser application. Not only does this sub-trait define Input specifically as Reader[Char], but it defines methods which bring it all together, allowing you to parse a sequence of characters using a specific parser:
- def parseAll[T](parser: Parser[T], in: CharSequence): ParseResult[T]
- def parseAll[T](parser: Parser[T], in: Reader[Char]): ParseResult[T]
There are also other forms of parseAll, as well as parse (which matches only a prefix of the input), but these forms of parseAll will get you going. There is an even more useful sub-trait called JavaTokenParsers which defines such parsers as floatingPointNumber.
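To see how the pieces fit together before we write any grammar of our own, here is a minimal sketch (NumberParser and number are names of my own choosing, not part of the library) which invokes parseAll and consumes the resulting ParseResult[T]:

import scala.util.parsing.combinator._

// Sketch: a one-parser application built on JavaTokenParsers
object NumberParser extends JavaTokenParsers {
  def number: Parser[String] = floatingPointNumber
}

NumberParser.parseAll(NumberParser.number, "3.1415") match {
  case NumberParser.Success(result, _) => println(s"parsed: $result") // the T, here a String
  case noSuccess                       => println(noSuccess)          // a Failure or Error, with message and position
}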
So, and here's where the fun starts, how do we define our own parsers? Let's start by observing how we would define a grammar in BNF:
expr ::= term { "+" term | "-" term }.
term ::= factor { "*" factor | "/" factor }.
factor ::= floatingPointNumber | "(" expr ")".

We can easily convert this into Scala by defining three parsers:
def expr: Parser[Any] = term~rep("+"~term|"-"~term);
def term: Parser[Any] = factor~rep("*"~factor|"/"~factor);
def factor: Parser[Any] = floatingPointNumber|"("~expr~")";
Notice that the conversion from BNF to Scala is essentially mechanical:
- newline -> "def"
- " ::= " -> ": Parser[Any] ="
- " " -> "~"
- "{" -> "rep("
- "}" -> ")"
- "." -> ";" [Of course, we can discard the semi-colons.]
Now, let's wrap these definitions inside our desired class:
import scala.util.parsing.combinator._

class ArithmeticParser extends JavaTokenParsers {
  def expr: Parser[Any] = term~rep("+"~term|"-"~term)
  def term: Parser[Any] = factor~rep("*"~factor|"/"~factor)
  def factor: Parser[Any] = floatingPointNumber|"("~expr~")"
}

val p = new ArithmeticParser
val r = p.parseAll(p.expr, "1+2")
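At this point r is a ParseResult[Any]: since we haven't transformed anything yet, the parsed value is just a nested structure of Strings, Lists and ~ instances. A quick, purely illustrative way to inspect it (the exact rendering may vary a little between versions):

r match {
  case p.Success(tree, _) => println(tree) // something like ((1~List())~List((+~(2~List()))))
  case failure            => println(failure)
}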
If you look in the Scala API at Parsers again, you will see, among the methods, the following:
p1 ~ p2  // sequencing: must match p1 followed by p2
p1 | p2  // alternation: must match either p1 or p2, with preference given to p1
p1.?     // optionality: may match p1 or not
p1.*     // repetition: matches any number of repetitions of p1

These are the famed parser combinators and this is how they work:
- If p is a Parser[String], for instance, then the result of p matching String s is simply s.
- (sequencing): if p is a Parser[P] and q is a Parser[Q] then p~q results in an object of type P~Q, which can also be written ~[P,Q]. In other words, assuming that it matches successfully, it will result in both a P and a Q.
- (alternation): if p is a Parser[P] and q is a Parser[Q] then p|q results in an object of type P, or an object of type Q. It attempts to match p first and, if that's successful, it doesn't invoke q at all.
- (optionality): if p is a Parser[P] then p.? or opt(p) results in an object of type Option[P].
- (repetition): if p is a Parser[P] then p.* or rep(p) results in an object of type List[P]. There's a similar method, repsep which takes an additional parameter, the separator.
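To make those result types concrete, here is a small sketch (the object and the individual parsers are my own invention, not from the library) showing what each combinator yields:

import scala.util.parsing.combinator._

// Sketch: the result type produced by each combinator
object Combinators extends RegexParsers {
  val digit: Parser[String] = """\d""".r
  val letter: Parser[String] = """[a-z]""".r

  val both: Parser[String ~ String] = digit ~ letter     // sequencing: a ~ holding both results
  val either: Parser[String]        = digit | letter     // alternation: digit wins if it matches
  val maybe: Parser[Option[String]] = digit.?            // optionality: Some(...) or None
  val many: Parser[List[String]]    = digit.*            // repetition: a (possibly empty) List
  val csv: Parser[List[String]]     = repsep(digit, ",") // repetition with a separator
}

// e.g. Combinators.parseAll(Combinators.csv, "1,2,3") succeeds with List("1", "2", "3")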
We are getting close! There is one more operator that you need to know about. This operator is a method on Parser[T] and is written ^^. It's essentially a "lift" method because it takes as its parameter a function (T)=>U (i.e. T is transformed into U) and has the effect of transforming a Parser[T] into a Parser[U]. Thus, if we have written our parsers as Parser[Any] (as we did) and what we really want is a Parser[Double], then we need to create some functions which transform Any into Double and apply them to each of our parsers using the ^^ operator. This is much harder to explain (or understand) than it is to use, so let's dive right into a simple example:
package edu.neu.coe.scala.parse

import scala.util.parsing.combinator._

/**
 * @author scalaprof
 */
class Arith extends JavaTokenParsers {

  trait Expression {
    def value: Double
  }

  abstract class Factor extends Expression

  case class Expr(t: Term, ts: List[String~Term]) extends Expression {
    def term(t: String~Term): Double = t match { case "+"~x => x.value; case "-"~x => -x.value }
    def value = ts.foldLeft(t.value)(_ + term(_))
  }

  case class Term(f: Factor, fs: List[String~Factor]) extends Expression {
    def factor(t: String~Factor): Double = t match { case "*"~x => x.value; case "/"~x => 1/x.value }
    def value = fs.foldLeft(f.value)(_ * factor(_))
  }

  case class FloatingPoint(x: Any) extends Factor {
    def value = x match {
      case x: String => x.toDouble
      case _ => throw new RuntimeException("FloatingPoint: logic error: x is not a String")
    }
  }

  case class Parentheses(e: Expr) extends Factor {
    def value = e.value
  }

  def expr: Parser[Expr] = term~rep("+"~term | "-"~term | failure("expr")) ^^ {
    case t~r => r match { case x: List[String~Term] => Expr(t, x) }
  }

  def term: Parser[Term] = factor~rep("*"~factor | "/"~factor | failure("term")) ^^ {
    case f~r => r match { case x: List[String~Factor] => Term(f, x) }
  }

  def factor: Parser[Factor] = (floatingPointNumber | "("~expr~")" | failure("factor")) ^^ {
    case "("~e~")" => e match { case x: Expr => Parentheses(x) }
    case s => FloatingPoint(s)
  }
}
Now our test code will look a little different. We can actually test the value of the result as a Double.
val p = new Arith
val r = p.parseAll(p.expr, "1+2")
r should matchPattern { case p.Success(_, _) => }
r.get.value shouldBe 3.0
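If you want to convince yourself that parentheses and operator precedence are handled, a further check along the same lines (not from the original test, but it should pass under the same ScalaTest setup) would be:

p.parseAll(p.expr, "1+2*3").get.value shouldBe 7.0   // * binds tighter than +
p.parseAll(p.expr, "(1+2)*3").get.value shouldBe 9.0 // parentheses override that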