Monday, April 15, 2019

Scala table parser

I've been busy over the last year with some new Scala projects in my GitHub space. In this blog I will talk about TableParser. The current release is v1.0.5.

TableParser is a CSV file parser which uses type classes to facilitate the task of programming the parser. I have written elsewhere about the advantages of type classes but, in a nutshell, a type class (usually defined as a trait with a single parametric type, e.g. trait T[X]) allows you to create classes which provide functionality deriving from the combination of the type class T and its underlying type X. The totality of such classes, therefore, is the Cartesian product of all type classes T and all underlying (concrete) types X.

TableParser operates on the assumption that each row in a CSV file represents some "thing," which you can model as a case class. Don't worry if your CSV is basically just a matrix of strings--we can handle that too.

But, what if the data in these rows is of disparate types: some Ints, some Doubles, some Strings? For this, we rely on some magic that derives from Scala's ability to perform type inference. This is an aspect of Scala that is not generally emphasized enough, in my opinion. Of course, we like the fact that type inference can check the integrity of our program. But it does more than that--it essentially writes code for us!

As an example, let's think about a CSV file which contains a set of daily hawk count observations (i.e. each has a date, a species name, and a count). Just to keep it simple for this explanation, we will ignore the date. We describe our hawk count with a simple case class:
case class HawkCount(species: String, count: Int)
And now we create an object which extends CellParsers as follows:
object HawkCountParser extends CellParsers {
  implicit val hawkCountParser: CellParser[HawkCount] = cellParser2(HawkCount)
}
The only tricky part here was that we had to count up the number of parameters in HawkCount and use the appropriate cell parser (in this case, cellParser2). Then we had to pass in the name of the case class and we have a parser which knows how to read a species and, more significantly, how to read and convert the values in the count column to Ints.

What we are actually passing to cellParser2 is a function which takes two parameters: the function of HawkCount called "apply." It is the type inference of the compiler which now allows us to know how to parse the individual fields (parameters) of the case class. If you have created additional apply methods (or simply have a companion object for your type), you will have to name the apply method that you want explicitly (you can do this using the type--see the README file).
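To make the mechanism concrete, here is a self-contained sketch--emphatically not TableParser's actual implementation--of how a cellParser2-style combinator lets the compiler's type inference locate the right parser for each field of the case class:

```scala
// A cell-parser type class: one instance per column type.
trait CellParser[T] {
  def parse(cell: String): T
}

implicit val intParser: CellParser[Int] = new CellParser[Int] {
  def parse(cell: String): Int = cell.trim.toInt
}
implicit val stringParser: CellParser[String] = new CellParser[String] {
  def parse(cell: String): String = cell.trim
}

// Given a two-parameter function (such as a case class's apply), the compiler
// infers P1 and P2 and thus locates the matching CellParser instances implicitly.
def cellParser2[P1: CellParser, P2: CellParser, T](construct: (P1, P2) => T): Seq[String] => T =
  row => construct(implicitly[CellParser[P1]].parse(row(0)), implicitly[CellParser[P2]].parse(row(1)))

case class HawkCount(species: String, count: Int)

// Passing the companion object works because it extends (String, Int) => HawkCount.
val parser = cellParser2(HawkCount)
val hc = parser(Seq("Broad-winged Hawk", "1024"))
assert(hc == HawkCount("Broad-winged Hawk", 1024))
```

Note that nothing here names the field types explicitly: they are inferred from HawkCount's apply, which is the point of the design.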

Now, all we have to do is to parse our file. Something like this...
import HawkCountParser._
val hty = Table.parse("hawkmountain.csv")

Note that the returned value is a Try[Table[HawkCount]]. A Table is a monad and can easily be transformed into another Table using map or flatMap.
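Purely as an illustration of that monadic pattern (using a plain Seq of pairs to stand in for a Table of HawkCounts, since the transformations compose the same way):

```scala
import scala.util.{Success, Try}

// A Try[Seq[...]] standing in for Try[Table[HawkCount]]:
val hty: Try[Seq[(String, Int)]] = Try(Seq(("Sharp-shinned Hawk", 42), ("Red-tailed Hawk", 7)))

// Transform the rows without ever unwrapping the Try:
val total: Try[Int] = => rows.map(_._2).sum)
assert(total == Success(49))
```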

Sometimes, there will be too many columns to be grouped logically into one case class. But, no worries, you can set up a hierarchy of case classes. Just make sure that you define the parsers for the inner case classes before they are referenced by an outer case class.

You could simply print your table by invoking foreach and printing each row. However, if you want a little more control and logic to your output, you have two options. The first is a simple "square" rendering, for which you will set up an output type, for example:

implicit object StringBuilderWriteable extends Writable[StringBuilder] {
  override def writeRaw(o: StringBuilder)(x: CharSequence): StringBuilder = o.append(x.toString)
  override def unit: StringBuilder = new StringBuilder
  override def delimiter: CharSequence = "|"
}

Alternatively, you could write your table out to a hierarchical format, such as XML or HTML.
For more detail, please see the README file.

Saturday, March 2, 2019

Functionalizing code

I'm not sure if functionalizing is really a word. But let's suppose it is.

Some time ago, I wrote an application to allow me to view the output of the "Results" option that Blackboard Lean (oops: Learn) provides us. Since most answers use HTML, the content of individual answers can be obscured at best--and, at worst, impossible to read.

So, I created a filter that takes the CSV file and turns it into an HTML file with a table (aspects of questions are columns, student submissions are rows). I recently upgraded it to allow me to specify a particular set of columns that I was interested in. But I was very dissatisfied when I looked at the HTML class which I had previously used to set up the output tags. This was how it looked:

/**
  * Mutable class to form an HTML string
  */
class HTML() {
  val content = new StringBuilder("")
  val tagStack: mutable.Stack[String] = mutable.Stack[String]()

  def tag(w: String): StringBuilder = {
    tagStack.push(w)
    content.append(s"<$w>")
  }

  def unTag: StringBuilder = content.append(s"</${tagStack.pop()}>")

  def append(w: String): StringBuilder = content.append(w)

  def close(): Unit = while (tagStack.nonEmpty) {
    unTag
  }

  override def toString: String = content.toString + "\n"
}

And this was how it was used:

def preamble(w: String): String = {
  val result = new HTML
  // ... tags and the title appended here ...
  result.toString
}

def parseStreamIntoHTMLTable(ws: Stream[String], title: String): String = {
  val result = new HTML
  ws match {
    case header #:: body =>
      result.append(parseRowIntoHTMLRow(header, header = true))
      for (w <- body) result.append(parseRowIntoHTMLRow(w))
    case _ =>
  }
  result.close()
  result.toString
}

The rendering of the code here doesn't actually show that the mutable Stack class is deprecated. How could I be using a deprecated class--and for mutation too! Ugh! Well, it was a utility, not an exemplar for my students. So, it was acceptable.
But I decided to functionalize it. First, I needed to create a trait which had the basic behavior I needed: 

/**
  * Trait Tag to model a Markup Language-type document.
  */
trait Tag {

  /**
    * Method to yield the name of this Tag.
    *
    * @return the name, that's to say what goes between &lt; and &gt;
    */
  def name: String

  /**
    * Method to yield the attributes of this Tag.
    *
    * @return a sequence of attributes, each of which is a String
    */
  def attributes: Seq[String]

  /**
    * Method to yield the content of this Tag.
    *
    * @return the content as a String.
    */
  def content: String

  /**
    * Method to yield the child Tags of this Tag.
    *
    * @return a Seq of Tags.
    */
  def tags: Seq[Tag]

  /**
    * Method to add a child to this Tag.
    *
    * @param tag the tag to be added
    * @return a new version of this Tag with the additional tag added as a child
    */
  def :+(tag: Tag): Tag

  /**
    * Method to yield the tag names depth-first in a Seq.
    *
    * @return a sequence of tag names
    */
  def \\ : Seq[String] = name +: (for (t <- tags; x <- t.\\) yield x)
}
Together with an abstract base class:
abstract class BaseTag(name: String, attributes: Seq[String], content: String, tags: Seq[Tag])(implicit rules: TagRules) extends Tag {

  override def toString: String = s"""\n${tagString()}$content$tagsString${tagString(true)}"""

  private def attributeString(close: Boolean) = if (close || attributes.isEmpty) "" else " " + attributes.mkString(" ")

  private def tagsString = if (tags.isEmpty) "" else tags mkString ""

  private def nameString(close: Boolean = false) = (if (close) "/" else "") + name

  private def tagString(close: Boolean = false) = s"<${nameString(close)}${attributeString(close)}>"
}

And a case class for HTML with its companion object:
/**
  * Case class to model an HTML document.
  *
  * @param name the name of the tag at the root of the document.
  * @param attributes the attributes of the tag.
  * @param content the content of the tag.
  * @param tags the child tags.
  * @param rules the "rules" (currently ignored) but useful in the future to validate documents.
  */
case class HTML(name: String, attributes: Seq[String], content: String, tags: Seq[Tag])(implicit rules: TagRules) extends BaseTag(name, attributes, content, tags) {

  /**
    * Method to add a child to this Tag.
    *
    * @param tag the tag to be added
    * @return a new version of this Tag with the additional tag added as a child
    */
  override def :+(tag: Tag): Tag = HTML(name, attributes, content, tags :+ tag)
}

/**
  * Companion object to HTML
  */
object HTML {
  implicit object HtmlRules extends TagRules

  def apply(name: String, attributes: Seq[String], content: String): HTML = apply(name, attributes, content, Nil)

  def apply(name: String, attributes: Seq[String]): HTML = apply(name, attributes, "")

  def apply(name: String): HTML = apply(name, Nil)

  def apply(name: String, content: String): HTML = apply(name, Nil, content)
}
Here's one of the places the tags in the document are set up:
  def parseStreamProjectionIntoHTMLTable(columns: Seq[String], wss: Stream[Seq[String]], title: String): Try[Tag] = Try {
    val table = HTML("table", Seq("""border="1"""")) :+ parseRowProjectionIntoHTMLRow(columns, header = true)
    val body = HTML("body") :+ wss.foldLeft(table)((tag, ws) => tag :+ parseRowProjectionIntoHTMLRow(ws))
    HTML("html") :+ preamble(title) :+ body
  }
Now, all we have to do is to use .toString on the document to render it for the final HTML source!

Wednesday, January 3, 2018

2018, Spark 2.2 and Scala 2.12

Here we are already in 2018. I haven't written much here lately but I have plenty of ideas. I just need time to write them up.

My new BigData/Scala class will begin next week and I have been thinking about what exactly it is about Scala that made it the choice for implementing Spark in the first place. Indeed, I've developed a new module for introducing Scala in exactly that way: if you were trying to implement Spark, what features would you require in order to be able to do it? Here, for example, is a very promising lode of insights, starting with an answer from Matei Zaharia, the originator of Spark, himself. But I was disappointed: there was nothing there that really said what Scala brought to the table, other than scalability.

In a sense, I've built my very own map-reduce-based, cluster-computing platform called Majabigwaduce. I thought it needed a less catchy name than Spark to get adopted. Just kidding! No, in truth I simply built an Akka-based map-reduce framework for pedagogical purposes. However, it does essentially share the same requirements as Spark so it's a good starting point to think about this question.

Here's the list that I came up with:
  • lazy evaluation 
  • functional composition 
  • closures 
  • serializable functions 
  • JVM
Let me briefly describe these. Lazy evaluation is important because, when you have a huge potential overhead for starting up a job on a cluster, you need to ensure that you're not doing it eagerly. You want to be lazy--put off as much as possible any actual work until the last moment--then do it all at once.
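Plain Scala collections illustrate the idea (this is not Spark itself, just the same principle in miniature): a view defers work until a result is actually demanded.

```scala
var evaluations = 0
def expensive(x: Int): Int = { evaluations += 1; x * 2 }

// A view defers the map until a result is actually demanded:
val v = (1 to 5) map expensive
assert(evaluations == 0) // the work has been declared, but none of it done

val first = v.head       // forces evaluation of just the first element
assert(first == 2 && evaluations == 1)
```

A cluster framework plays the same game at a much larger scale: transformations build up a description of the job, and nothing runs until an action demands a result.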

Another closely-related feature is functional composition. If you can compose two functions into one function, you have the potential of avoiding some overhead. For example, you have a million objects to process. You apply function f to each object and then you similarly apply function g. This requires two traverses of the million objects. But suppose that you can create a function h which has the same effect as applying f followed by g. Then you only require one traversal. This may not sound like much but, in practice, the full benefit of lazy evaluation would be impossible without functional composition.
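In plain Scala, the composition described above is just andThen:

```scala
val f: Int => Int = _ + 1
val g: Int => Int = _ * 2
val h: Int => Int = f andThen g // h(x) == g(f(x))

val xs = List(1, 2, 3, 4, 5)
val twoPasses = (xs map f) map g // traverses xs, then an intermediate list
val onePass = xs map h           // a single traversal, same result
assert(onePass == twoPasses && onePass == List(4, 6, 8, 10, 12))
```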

Closures are functions which are partially applied such that only the expected input parameters are unbound. Strictly speaking, closures aren't absolutely necessary because programmers can always explicitly capture the other variables--or add them as input parameters. But they make programming so much more logical and elegant that they are almost essential.
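A minimal illustration of a closure versus passing the captured value explicitly:

```scala
val factor = 3

// 'triple' closes over 'factor' from the enclosing scope: x is the only unbound parameter.
val triple: Int => Int = x => x * factor
assert(triple(7) == 21)

// Without closures, the captured value must be threaded through explicitly every time:
def multiplyExplicit(x: Int, factor: Int): Int = x * factor
assert(multiplyExplicit(7, 3) == 21)
```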

Serializable functions: we need to be able to send "expressions" as objects over the net to the executors. This requires, at a minimum, the ability to consider said expression as an object, e.g. as a function. This further requires that a function can be represented as a serializable object. In Scala 2.11 and before, that serializable object is actually a Java Class<?> object. See my own question and answer on this in StackOverflow.
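As a quick experiment showing what "serializable functions" means in practice, here is a round trip of a Scala function value through standard Java serialization (a local sketch of what a framework does when it ships a function to an executor):

```scala
import{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Scala's compiled function values are serializable objects.
val f: Int => Int = _ + 1

val baos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(baos)
oos.writeObject(f)
oos.close()

// "Receive" the function on the other side and apply it:
val ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray))
val g = ois.readObject.asInstanceOf[Int => Int]
assert(g(41) == 42)
```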

The JVM: this isn't 100% required but it is the ecosystem where the vast majority of big data work is carried out. So, it would likely be a huge tactical error to create a cluster computing platform that didn't run on the JVM. Nobody would be interested in it.

At the time that work on Spark was begun in 2009, Java was not even close to having any of these required features (other than running on the JVM). Scala had them all, so it was the natural choice with which to implement Spark at the time. Java 8 (released in 2014) and later has the third and fourth items on the list, but I don't think it can yet provide solutions for the first two. I could be wrong about that.

If you delved into my StackOverflow Q+A referenced above, you might have noticed one little fly in the ointment: Spark 2.2 cannot currently run with Scala 2.12. This is, ironically, because Scala 2.12 chose to reimplement its lambdas (anonymous functions) as Java 8 lambdas, which you would think a good idea. But, apparently, some of the "closure cleaning" that is necessary to ensure a lambda is serializable is no longer available with that implementation, and so we are waiting for a fix. Spark has historically been very slow to adopt new versions of Scala, but this one seems to be particularly irritating. It certainly means that I have to teach my Scala class that, yes, they should get used to using Scala 2.12 for all of the assignments, but when we get to the Spark assignment, they will have to explicitly require Scala 2.11. I look forward to a fix coming soon.

Monday, October 30, 2017

Making comparisons more functional

In Scala 2.7 we got Ordering, a type class that essentially made Ordered, the original Scala extension of Java's Comparable, somewhat obsolete. Ordering (and type classes in general) is great, of course, and a big improvement. But what a shame they didn't go all the way and have Ordering provide a much more functional way of comparing.

Let me explain what I mean. Take any Java/Scala class with at least two fields in it. Implementing the comparison function, whether in Scala or Java, is typically going to look like this (from the source of java.time.LocalDate):

int compareTo0(LocalDate otherDate) {
    int cmp = (year - otherDate.year);
    if (cmp == 0) {
        cmp = (month - otherDate.month);
        if (cmp == 0) {
            cmp = (day -;
        }
    }
    return cmp;
}

Do you see how ugly (and non-functional) that code is? It's fine in Java, because Java code expects to be a mix of assignments, if clauses, loops, etc. But Scala code expects to be expression-oriented, not statement-oriented. And there's no reason not to do it better in Scala, because we can create a type which can represent either equality, or difference. To be fair, Java could do this too, but that ship sailed long ago.
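Even with only the standard library, Scala lets you write the same cascade as a single expression: Ordering can be derived from a tuple of fields, so year-then-month-then-day needs no mutable cmp at all. (Date here is my own illustrative case class, not java.time.LocalDate.)

```scala
case class Date(year: Int, month: Int, day: Int)

// Compare by year, then month, then day, as one expression--no mutable cmp in sight:
implicit val dateOrdering: Ordering[Date] =, d.month,

val ds = List(Date(2019, 4, 15), Date(2017, 6, 21), Date(2019, 3, 2))
assert(ds.sorted == List(Date(2017, 6, 21), Date(2019, 3, 2), Date(2019, 4, 15)))
```

Of course, this still bottoms out in an Int-based compare underneath, which is where a richer result type comes in.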

So, how about this type?--Comparison--a trait which extends Function0[Option[Boolean]]:

sealed trait Comparison extends (() => Option[Boolean]) {
  override def toString(): String = apply.toString

  def toInt: Int = apply match {
    case Some(b) => if (b) -1 else 1
    case _ => 0
  }

  def orElse(c: => Comparison): Comparison = Comparison(apply.orElse(c()))

  def flip: Comparison = Comparison(for (v <- apply) yield !v)
}

case class Different(less: Boolean) extends Comparison {
  def apply: Option[Boolean] = Some(less)
}

case object Same extends Comparison {
  def apply: Option[Boolean] = None
}

object Comparison {
  val more = Different(false)
  val less = Different(true)

  def apply(x: Option[Boolean]): Comparison = x match {
    case Some(b) => Different(b)
    case _ => Same
  }

  def apply(x: Int): Comparison = x match {
    case 0 => Same
    case _ => Comparison(Some(x < 0))
  }
}

Thus, invoking the apply method of a Comparison can yield one of three different values: Some(true), Some(false), or None. These are quite sufficient to represent the result of comparing two objects.

Now that we have a way of expressing the result of a comparison, we can create a comparer type (note that the names Comparable, Comparator, Ordered, and Ordering are already in use). Let's choose the name Orderable. Of course, like Ordering, we would like it to be a type class (actually we'd like it to replace Ordering). So we need another trait (unsealed, this time) like the following:

trait Orderable[T] extends (((T, T)) => Comparison) {
  self =>

  def toOrdering: Ordering[T] = new Ordering[T]() {
    def compare(x: T, y: T): Int = self(x, y).toInt
  }

  def >(tt: (T, T)): Boolean = apply(tt.swap)().getOrElse(false)
  def <(tt: (T, T)): Boolean = apply(tt)().getOrElse(false)
  def ==(tt: (T, T)): Boolean = apply(tt)().isEmpty
  def >=(tt: (T, T)): Boolean = ! <(tt)
  def <=(tt: (T, T)): Boolean = ! >(tt)
  def !=(tt: (T, T)): Boolean = ! ==(tt)

  def compose(f: Comparison => Comparison): Orderable[T] = new Orderable[T]() {
    def apply(tt: (T, T)): Comparison = f(self(tt))
  }

  def orElse(o: Orderable[T]): Orderable[T] = new Orderable[T]() {
    def apply(tt: (T, T)): Comparison = self(tt).orElse(o(tt))
  }

  def invert: Orderable[T] = compose(_ flip)
}

object Orderable {
  implicit val intOrderable: Orderable[Int] = Ordering[Int]
  implicit val strOrderable: Orderable[String] = Ordering[String]

  implicit def convert[T](x: Ordering[T]): Orderable[T] = new Orderable[T] {
    def apply(tt: (T, T)) = Comparison(, tt._2))
  }
}
Note that, as Ordering has implicit converters to/from Ordered, so Orderable has explicit/implicit converters to/from Ordering, although perhaps it would be better to have them to/from Ordered.

There is one other minor difference here. Since we always implement the comparison function (here, it is the apply method) on a pair of T elements, we might as well make the input a Tuple2 of T. We can always convert to untupled form when necessary.

Note also the compose method which allows us to create a new Orderable based on this Orderable and a function from Comparison to Comparison.

And, in the following, we can see how the whole mechanism allows us to create a lazy sorted object, which only does an actual sort when the result is required:

case class Sorted[T](ts: Seq[T])(implicit f: Orderable[T]) extends (() => Seq[T]) {
  implicit val ordering: Ordering[T] = f.toOrdering

  def sort(o: Orderable[T]): Sorted[T] = Sorted(ts)(f orElse o)

  def apply: Seq[T] = ts.sorted

  def async(implicit ec: ExecutionContext): Future[Seq[T]] = Future(apply)

  def parSort(implicit ec: ExecutionContext): Future[Seq[T]] = Sorted.mergeSort(ts)
}

object Sorted {
  def create[T: Ordering](ts: Seq[T]): Sorted[T] = Sorted(ts)(implicitly[Ordering[T]])

  def verify[T: Orderable](xs: Seq[T]): Boolean = => implicitly[Orderable[T]].<=(z))

  def parSort[T: Ordering](tst: (Seq[T], Seq[T]))(implicit ec: ExecutionContext): Future[Seq[T]] =
    map2(Future(tst._1.sorted), Future(tst._2.sorted))(merge)

  def mergeSort[T: Ordering](ts: Seq[T])(implicit ec: ExecutionContext): Future[Seq[T]] = parSort(ts splitAt (ts.length / 2))

  def merge[T: Ordering](ts1: Seq[T], ts2: Seq[T]): Seq[T] = {
    val ordering = implicitly[Ordering[T]]

    @tailrec def inner(r: Seq[T], xs: Seq[T], ys: Seq[T]): Seq[T] = (xs, ys) match {
      case (_, Nil) => r ++ xs
      case (Nil, _) => r ++ ys
      case (x :: xs1, y :: ys1) =>
        if (, y)) inner(r :+ x, xs1, ys)
        else inner(r :+ y, xs, ys1)
    }

    inner(Nil, ts1, ts2)
  }

  def map2[T: Ordering](t1f: Future[Seq[T]], t2f: Future[Seq[T]])(f: (Seq[T], Seq[T]) => Seq[T])(implicit ec: ExecutionContext): Future[Seq[T]] =
    for (t1 <- t1f; t2 <- t2f) yield f(t1, t2)
}
If you're interested to see a specification file, or an App, for this, you can find them in LaScala, under the package com/phasmid/laScala/sort.

Wednesday, June 21, 2017

Some observations on foldLeft with reference to short-circuiting

Have you ever written code such as the following and expected it to quit accessing values of y after it hits the first true value?

    val y = Seq[Boolean](false, true, ..., false)
    val x = y.foldLeft[Boolean](false)(_ || _)

After all, the second parameter of || is call-by-name so, if it's not needed to determine the outcome of the value for x, then that should mean that the recursion of the foldLeft should stop, right?

But it doesn't in practice, and if you look at the definition of foldLeft, you can see why. There is no short-circuit defined.
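It's easy to convince yourself of this with a counter:

```scala
var visits = 0
val ys = Seq(false, true, false, false)
val x = ys.foldLeft(false) { (b, y) => visits += 1; b || y }
assert(x)
assert(visits == 4) // all four elements were consumed, despite ||
```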

I therefore created a method in the FP module of my LaScala library to be the short-circuit equivalent of foldLeft. It's called foldLeftShort and it looks like this:

  /**
    * This method performs a foldLeft with short-circuit logic.
    *
    * @param xi an iterator of X objects
    * @param y0 the initial value of Y (e.g. 0 if Y is Int)
    * @param q  the quick-return function: if this evaluates to true, given the new y value, that value is immediately returned
    * @param f  the function which accumulates the result: as with foldLeft, it takes a Y and an X and yields a Y
    * @tparam X the underlying type of the iterator
    * @tparam Y the return type
    * @return the aggregate of the given iterator, according to the initial value of y0 and the accumulator function f.
    */
  def foldLeftShort[X, Y](xi: Iterator[X], y0: Y, q: Y => Boolean)(f: (Y, X) => Y): Y = {
    def inner(_y: Y, xo: => Option[X]): Y =
      xo match {
        case None => _y
        case Some(x) =>
          val y = f(_y, x)
          if (q(y)) y
          else inner(y, nextOption(xi))
      }

    inner(y0, nextOption(xi))
  }

  /**
    * Method to get the next element from an iterator, as an Option.
    *
    * @param xi the iterator.
    * @tparam X the underlying type.
    * @return Some(x) where x is the next element; or None where the iterator is exhausted.
    */
  def nextOption[X](xi: Iterator[X]): Option[X] = if (xi.hasNext) Some( else None

As with foldLeft, the accumulating function is in its own parameter set. But the initial value is joined by the iterator itself and the short-circuit function in the first parameter set. It would be nice to be able to extend the Iterator type to include this.

The inner function isn't defined in the typical way, using the cons unapply method (::), because, for some reason that I don't quite understand, you cannot decompose an Iterator this way. Instead, we make use of another FP method, nextOption, which returns the next element of the iterator as an Option (see above).

Note the short-circuit mechanism, which really does work. As soon as we have a new value, by applying f to the current accumulator and the newest value from the iterator, we return it if q evaluates to true.

In order to use this as efficiently as possible, we need to write code something like the following, where in this example we want to avoid doing a trivial lookup too many times. That's not worth the effort--but just suppose that the lookup was actually an expensive database query instead:

    val lookupTable = Map("a"->false, "b"->true)
    val boi = => lookupTable.get(x)).toIterator
    FP.foldLeftShort[Option[Boolean], Boolean](boi, false, b => b)(_ || _.getOrElse(false))

Obviously, if you want to end up with an Option[Boolean], you would write something like this instead:

FP.foldLeftShort[Option[Boolean], Option[Boolean]](boi, Some(false), b => b.getOrElse(false))(optionalOr)

  /**
    * NOTE: we can't use map2 here because we are essentially implementing three-valued (Kleenean) logic.
    *
    * @param b1 the accumulator: if b2 is None or Some(false) then b1 is returned unchanged
    * @param b2 the value: if b2 is Some(true) then Some(true) will be returned
    * @return an Option[Boolean]
    */
  def optionalOr(b1: Option[Boolean], b2: Option[Boolean]): Option[Boolean] =
    b2 match {
      case Some(x) => for (b <- b1) yield b || x
      case None => b1
    }

Note that, if you use map2(_ || _) instead of optionalOr, the logic is probably not going to give you exactly what you want, because None will combine with false to give None, whereas you really want None to have no effect on the accumulator value at all.

Saturday, June 17, 2017

Another cautionary tale of Spark and serialization

In my earlier post on the subject of Spark and Serialization, I forgot one big gotcha that has come up a couple of times for me (I thought I had mentioned it, but now I see that I didn't).

The problem is with Stream and other non-strict collections such as Iterator or SeqView. These are inherently unserializable (except for empty collections). I'm sure there are several posts on this in StackOverflow, although I couldn't actually find one when I searched just now. But, I hear you say, these are never the kind of thing you would want to store as a (non-transient) field in a class.

That's true of course, but you might inadvertently include one, as I did just recently. Let's say you have a case class like this:

case class Person(name: String, friends: Seq[String])

You create instances of this class by running a database query on the person and getting his/her friends as a collection of strings. Something like this:

val x = sqlContext.sql("select ...").collect
val p = Person("Ronaldo", x.toSeq)

The problem here is that, if the type of x is Stream[String], then x.toSeq will not change its type, since Stream extends Seq. You need to write x.toList instead (or x.toVector).
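To make the trap concrete, here is a self-contained illustration (using a literal Stream in place of the query result):

```scala
val x: Seq[String] = Stream("a", "b", "c")

// toSeq is a no-op here: the result is still a (lazy) Stream
assert(x.toSeq.isInstanceOf[Stream[_]])

// toList forces a strict List, which serializes as you'd expect
assert(x.toList.isInstanceOf[List[_]])
```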

Actually, it's much better practice never to use Seq for persistable/serializable objects. Far better to force an explicit collection type in the definition of the case class. In particular, you should use either List or Vector, unless you have some specific reason to use a different sequence. In this case, that would be:

case class Person(name: String, friends: List[String])