Friday, November 13, 2015

More on Scala's Parser-combinators: building a concordance

I want to continue my discussion of Scala's parser-combinator library from a previous blog post. The ParseResult type gives us both the value that was parsed and the current state of the input. That Input value (called next in the Success case class that extends ParseResult) allows the parser not only to know what input remains for further parsing, but also can yield the current position in the original text (via the pos method). The pos method is invoked in the Failure case, but not in the Success case, which is normally what you want--once you've parsed your input successfully, you normally don't care where the various phrases actually occurred...

...except if you're goal is to build a concordance. It took me a little while to figure out how to extract the positions of words so I thought it might be helpful to show how it's done.

So, what should the type of my Parser be? The most obvious type seemed to be a Tuple of Int and String: giving us the index of the word and the word itself. And we want to be able to parse an entire String (or file) of many words so we need something like this grammar rule (in extended BNF):
sentence ::= { word }
Where word is defined using a regular expression to suit our needs (see below for specifics). The name sentence might be a bit misleading here because it's not a normal language sentence: it's simply a repetition of words and, in my definition, can include periods, commas, semi-colons, etc. Your needs might be different.

We can turn all this into Scala code using the rules described in my previous post:

  class ConcordanceParser extends RegexParsers {
    val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
    def word: Parser[String] = regex(rWord)
    def sentence: Parser[Seq[String]] = rep(word)
  }


The problem with this is that our parser still doesn't return the positional information. We need to take advantage of the method positioned that is defined for Parsers. The signature of this method is:

    def positioned[T <: Positional](p: ⇒ Parser[T]): Parser[T]

So, the type that we need for our parser must extend Positional. Well, that's not too hard, although it seems strange that there isn't one already defined. Let's define a case class:

  case class PositionalString(s: String) extends Positional
Now let's substitute PositionalString for String, pass the original word-parser, with result mapped to PositionalString, as the parameter to positioned:

  class ConcordanceParser extends RegexParsers {
    val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
    def word: Parser[PositionalString] = positioned(regex(rWord) ^^ {w => PositionalString(w)})
    def sentence: Parser[Seq[PositionalString]] = rep(word)
  }
That's it. All you have to decide now is the precise form of regex that you wish to use for the type of documents you want to make a concordance for. And how you want to treat the resulting output from the parser. For the complete source code, please go to: https://github.com/rchillyard/Scalaprof/tree/master/Concordance

[My apologies for the awful formatting, but Blogger seems to have regressed--I simply can't fix it without Blogger putting its bad formatting back]

No comments:

Post a Comment