...except if you're goal is to build a concordance. It took me a little while to figure out how to extract the positions of words so I thought it might be helpful to show how it's done.
So, what should the type of my Parser be? The most obvious type seemed to be a Tuple of Int and String: giving us the index of the word and the word itself. And we want to be able to parse an entire String (or file) of many words so we need something like this grammar rule (in extended BNF):
sentence ::= { word }Where word is defined using a regular expression to suit our needs (see below for specifics). The name sentence might be a bit misleading here because it's not a normal language sentence: it's simply a repetition of words and, in my definition, can include periods, commas, semi-colons, etc. Your needs might be different.
We can turn all this into Scala code using the rules described in my previous post:
class ConcordanceParser extends RegexParsers {
val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
def word: Parser[String] = regex(rWord)
def sentence: Parser[Seq[String]] = rep(word)
}
}
The problem with this is that our parser still doesn't return the positional information. We need to take advantage of the method positioned that is defined for Parsers. The signature of this method is:
def positioned[T <: Positional](p: ⇒ Parser[T]): Parser[T]
So, the type that we need for our parser must extend Positional. Well, that's not too hard, although it seems strange that there isn't one already defined. Let's define a case class:
case class PositionalString(s: String) extends Positional
class ConcordanceParser extends RegexParsers {
val rWord = """[\w’]+[,;\.\-\?\!\—]?""".r
def word: Parser[PositionalString] = positioned(regex(rWord) ^^ {w => PositionalString(w)})
def sentence: Parser[Seq[PositionalString]] = rep(word)
}
That's it. All you have to decide now is the precise form of regex that you wish to use for the type of documents you want to make a concordance for. And how you want to treat the resulting output from the parser. For the complete source code, please go to: https://github.com/rchillyard/Scalaprof/tree/master/Concordance
[My apologies for the awful formatting, but Blogger seems to have regressed--I simply can't fix it without Blogger putting its bad formatting back]
No comments:
Post a Comment