Monday, April 15, 2019

Scala table parser

I've been busy over the last year with some new Scala projects in my GitHub space. In this blog I will talk about TableParser. The current release is v1.0.5.

TableParser is a CSV file parser which uses type classes to facilitate the task of programming the parser. I have written elsewhere about the advantages of type classes, but in a nutshell, a type class (which is usually defined as a trait with a single parametric type, e.g. trait T[X]) can allow you to create classes which provide functionality which derives from the combination of the type class T and its underlying type X. The totality of such classes, therefore, is the Cartesian product of all type classes T and all underlying (concrete) types X.

TableParser operates on the assumption that each row in a CSV file represents some "thing," which you can model as a case class. Don't worry if your CSV is basically just a matrix of strings--we can handle that too.

But, what if the data in these rows is of disparate types, some Ints, some Doubles, some Strings? For this, we rely on some magic that derives from Scala's ability to perform type inference. This is an aspect of Scala that not generally emphasized enough, in my opinion. Of course, we like the fact that type inference can check the integrity of our program. But it does more than that--it essentially writes code for us!

As an example, let's think about a CSV file which contains a set of daily hawk count observations (i.e. each has a date, a species name, and a count). Just to keep it simple for this explanation, we will ignore the date. We describe our hawk count with a simple case class:
case class HawkCount(species: String, count: Int)
And now we create an object which extends CellParsers as follows:
object HawkCountParser extends CellParsers {
  implicit val hawkCountParser: CellParser[HawkCount] = cellParser2(HawkCount)
}
The only tricky part here was that we had to count up the number of parameters in HawkCount and use the appropriate cell parser (in this case, cellParser2). Then we had to pass in the name of the case class and we have a parser which knows how to read a species and, more significantly, how to read and convert the values in the count column to Ints.

What we are actually passing to cellParser2 is a function which takes two parameters. It is the function of HawkCount called "apply." It is the type inference of the compiler which now allows us to know how to parse the individual fields (parameters) of the case class. If you have created additional apply methods (or simply have a companion object for your type), you will have to explicitly name the apply method that you want (you can do this using the type--see README file).

Now, all we have to do is to parse our file. Something like this...
import HawkCountParser._
val hty = Table.parse("hawkmountain.csv")

Note that the returned value is a Try[Table[HawkCount]]. A Table is a monad and can easily be transformed into another Table using map or flatMap.

Sometimes, there will be too many columns to be grouped logically into one case class. But, no worries, you can set up a hierarchy of case classes. Just make sure that you define the parsers for the inner case classes before they are referenced by an outer case class.

You could simply print your table by invoking foreach and printing each row. However, if you want a little more control and logic to your output, you have two options: a simple "square" rendering, for which you will set up an output type, for example,

implicit object StringBuilderWriteable extends Writable[StringBuilder] {
  override def writeRaw(o: StringBuilder)(x: CharSequence): StringBuilder = o.append(x.toString)
  override def unit: StringBuilder = new StringBuilder
  override def delimiter: CharSequence = "|"} 
hty.map(_.render)

Alternatively, you could write your table out to a hierarchical format, such as XML or HTML.
For more detail, please see the README file.

Saturday, March 2, 2019

Functionalizing code

I'm not sure if functionalizing is really a word. But let's suppose it is.

Some time ago, I wrote an application to allow me to view the output of the "Results" option that Blackboard Lean (oops: Learn) provides us. Since most answers use HTML, the content of individual answers can be obscured at best. At worst, impossible.

So, I created a filter that takes the CSV file and turns it into an HTML file with a table (aspects of questions are columns, student submissions are rows). I recently upgraded it to allow me to specify a particular set of columns that I was interested in. But I was very dissatisfied when I looked at the HTML class which I had previously used to set up the output tags. This was how it looked:

/**  * Mutable class to form an HTML string  */
class HTML() {
  val content = new StringBuilder("")
  val tagStack: mutable.Stack[String] = mutable.Stack[String]()

  def tag(w: String): StringBuilder = {
    tagStack.push(w)
    content.append(s"<$w>")
  }

  def unTag: StringBuilder = content.append(s"</${tagStack.pop()}>")

  def append(w: String): StringBuilder = content.append(w)

  def close(): Unit = while (tagStack.nonEmpty) {
    unTag
  }

  override def toString: String = content.toString + "\n"
}


And this was how it was used:


def preamble(w: String): String = {
  val result = new HTML
  result.tag("head")
  result.tag("title")
  result.append(w)
  result.close()
  result.toString
}
def parseStreamIntoHTMLTable(ws: Stream[String], title: String): String = {
  val result = new HTML
  result.tag("html")
  result.append(preamble(title))
  result.tag("body")
  result.tag("table")
  ws match {
    case header #:: body =>
      result.append(parseRowIntoHTMLRow(header, header = true))
      for (w <- body) result.append(parseRowIntoHTMLRow(w))
  }
  result.close()
  result.toString
}
 

The rendering of the code here doesn't actually show that the mutable Stack class is deprecated. How could I be using a deprecated class--and for mutation too! Ugh! Well, it was a utility, not an exemplar for my students. So, it was acceptable.
 
But I decided to functionalize it. First, I needed to create a trait which had the basic behavior I needed: 


 
/**
  * Trait Tag to model an Markup Language-type document.
  */
trait Tag {

  /**
    * Method to yield the name of this Tag
    *
    * @return the name, that's to say what goes between &lt; and &gt;
    */
  def name: String

  /**
    * Method to yield the attributes of this Tag.
    *
    * @return a sequence of attributes, each of which is a String
    */
  def attributes: Seq[String]

  /**
    * Method to yield the content of this Tag.
    *
    * @return the content as a String.
    */
  def content: String

  /**
    * Method to yield the child Tags of this Tag.
    *
    * @return a Seq of Tags.
    */
  def tags: Seq[Tag]

  /**
    * Method to add a child to this Tag
    *
    * @param tag the tag to be added
    * @return a new version of this Tag with the additional tag added as a child
    */
  def :+(tag: Tag): Tag

  /**
    * Method to yield the tag names depth-first in a Seq
    *
    * @return a sequence of tag names
    */
  def \\ : Seq[String] = name +: (for (t <- tags; x <- t.\\) yield x)
}
Together with an abstract base class:
abstract class BaseTag(name: String, attributes: Seq[String], content: String, tags: Seq[Tag])(implicit rules: TagRules) extends Tag {

  override def toString: String = s"""\n${tagString()}$content$tagsString${tagString(true)}"""

  private def attributeString(close: Boolean) = if (close || attributes.isEmpty) "" else " " + attributes.mkString(" ")

  private def tagsString = if (tags.isEmpty) "" else tags mkString ""

  private def nameString(close: Boolean = false) = (if (close) "/" else "") + name

  private def tagString(close: Boolean = false) = s"<${nameString(close)}${attributeString(close)}>"
}

And a case class for HTML with its companion object:
/** * Case class to model an HTML document. * @param name the name of the tag at the root of the document. * @param attributes the attributes of the tag. * @param content the content of the tag. * @param tags the child tags. * @param rules the "rules" (currently ignored) but useful in the future to validate documents. */ case class HTML(name: String, attributes: Seq[String], content: String, tags: Seq[Tag])(implicit rules: TagRules) extends BaseTag(name, attributes, content, tags) { /** * Method to add a child to this Tag * * @param tag the tag to be added * @return a new version of this Tag with the additional tag added as a child */ override def :+(tag: Tag): Tag = HTML(name, attributes, content, tags :+ tag) } /** * Companion object to HTML */ object HTML { implicit object HtmlRules extends TagRules def apply(name: String, attributes: Seq[String], content: String): HTML = apply(name, attributes, content, Nil) def apply(name: String, attributes: Seq[String]): HTML = apply(name, attributes, "") def apply(name: String): HTML = apply(name, Nil) def apply(name: String, content: String): HTML = apply(name, Nil, content) }
Here's one of the places the tags in the document are set up:
  def parseStreamProjectionIntoHTMLTable(columns: Seq[String], wss: Stream[Seq[String]], title: String): Try[Tag] = Try {
    val table = HTML("table", Seq("""border="1"""")) :+ parseRowProjectionIntoHTMLRow(columns, header = true)
    val body = HTML("body") :+ wss.foldLeft(table)((tag, ws) => tag :+ parseRowProjectionIntoHTMLRow(ws))
    HTML("html") :+ preamble(title) :+ body
  }
Now, all we have to do is to use .toString on the document to render it for the final HTML source!