Monday, April 15, 2019

Scala table parser

I've been busy over the last year with some new Scala projects in my GitHub space. In this blog I will talk about TableParser. The current release is v1.0.5.

TableParser is a CSV file parser which uses type classes to facilitate the task of programming the parser. I have written elsewhere about the advantages of type classes, but in a nutshell, a type class (which is usually defined as a trait with a single parametric type, e.g. trait T[X]) can allow you to create classes which provide functionality which derives from the combination of the type class T and its underlying type X. The totality of such classes, therefore, is the Cartesian product of all type classes T and all underlying (concrete) types X.

TableParser operates on the assumption that each row in a CSV file represents some "thing," which you can model as a case class. Don't worry if your CSV is basically just a matrix of strings--we can handle that too.

But, what if the data in these rows is of disparate types, some Ints, some Doubles, some Strings? For this, we rely on some magic that derives from Scala's ability to perform type inference. This is an aspect of Scala that not generally emphasized enough, in my opinion. Of course, we like the fact that type inference can check the integrity of our program. But it does more than that--it essentially writes code for us!

As an example, let's think about a CSV file which contains a set of daily hawk count observations (i.e. each has a date, a species name, and a count). Just to keep it simple for this explanation, we will ignore the date. We describe our hawk count with a simple case class:
case class HawkCount(species: String, count: Int)
And now we create an object which extends CellParsers as follows:
object HawkCountParser extends CellParsers {
  implicit val hawkCountParser: CellParser[HawkCount] = cellParser2(HawkCount)
}
The only tricky part here was that we had to count up the number of parameters in HawkCount and use the appropriate cell parser (in this case, cellParser2). Then we had to pass in the name of the case class and we have a parser which knows how to read a species and, more significantly, how to read and convert the values in the count column to Ints.

What we are actually passing to cellParser2 is a function which takes two parameters. It is the function of HawkCount called "apply." It is the type inference of the compiler which now allows us to know how to parse the individual fields (parameters) of the case class. If you have created additional apply methods (or simply have a companion object for your type), you will have to explicitly name the apply method that you want (you can do this using the type--see README file).

Now, all we have to do is to parse our file. Something like this...
import HawkCountParser._
val hty = Table.parse("hawkmountain.csv")

Note that the returned value is a Try[Table[HawkCount]]. A Table is a monad and can easily be transformed into another Table using map or flatMap.

Sometimes, there will be too many columns to be grouped logically into one case class. But, no worries, you can set up a hierarchy of case classes. Just make sure that you define the parsers for the inner case classes before they are referenced by an outer case class.

You could simply print your table by invoking foreach and printing each row. However, if you want a little more control and logic to your output, you have two options: a simple "square" rendering, for which you will set up an output type, for example,

implicit object StringBuilderWriteable extends Writable[StringBuilder] {
  override def writeRaw(o: StringBuilder)(x: CharSequence): StringBuilder = o.append(x.toString)
  override def unit: StringBuilder = new StringBuilder
  override def delimiter: CharSequence = "|"} 
hty.map(_.render)

Alternatively, you could write your table out to a hierarchical format, such as XML or HTML.
For more detail, please see the README file.