Thursday, September 29, 2016

Installation of PredictionIO on MacOS

I'm kicking the tires of PredictionIO, an open-source machine learning server. I'm hoping that this might be a great platform for my BigData/Scala class. It looks really promising.

However, kicking the tires is a lot easier at the car dealership. In order to get something like this running, there are many hurdles to jump. It would no doubt be easier if I was running on Windows, or if I wanted to install to virtual box or via Docker or something like that. The manual installation process for Mac took me several hours to crack as the error messages do not make it clear what the problem is. It's not unusual for an early-stage (incubating) project to have documentation/installation issues. On the whole, they've done an outstanding job. But, in case anyone else out there is having similar problems, here are my fixes to the installation docs. For the Mac (or linux), the installation procedure I followed is at Installing Apache PredictionIO (incubating) on Linux / Mac OS X.

PredictionIO is based on Spark, as well as some combination of storage solutions. You can use Postgres, MySQL or a combination of HBase and ElasticSearch. I chose the latter as it appears to be the recommended solution.

After following the instructions, including editing the conf/pio-env.sh
file to specify my own existing Spark installation, by editing SPARK_HOME as follows:

  SPARK_HOME=Applications/spark-1.6.1

I started the PIO suite using $PIO_HOME/bin/pio-start-all, which apparently went OK.

However, the first problem I ran into resulted in the following cryptic error message (when checking the status via pio status):

/Applications/spark-1.6.1 is probably an Apache Spark development tree. Please make sure you are using at least 1.3.0.

It turns out that you need a version of Spark pre-built with Hadoop (I had built this one from source). Before realizing that was the problem, I simply downloaded their recommended version into the vendors directory (rather than trying to put it in my own area) and edited the pio-env.sh file:

  SPARK_HOME=$PIO_HOME/vendors/spark-1.5.1-bin-hadoop2.6

Now, I tried it again. Again, getting the status showed a problem:

[ERROR] [Console$] Unable to connect to all storage backends successfully. The following shows the error message from the storage backend.
[ERROR] [Console$] Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections. (org.postgresql.util.PSQLException)

It isn't enough to set up the HBase and ElasticSearch configuration as instructed by the installation guide. You also have comment out the Postgres and MySQL variables and set up the PIO_STORAGE_REPOSITORIES_xxx variables as follows:

PIO_STORAGE_REPOSITORIES_METADATA_NAME=predictionio_metadata
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH

PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=predictionio_eventdata
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE

PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_

PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

Then don't forget to configure local storage...

PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models

You also need to configure the PIO_STORAGE_SOURCES_xxx variables but the installation instructions already tell you to do that (I think).

Restarting and checking the status generated this error:

[ERROR] [RecoverableZooKeeper] ZooKeeper exists failed after 1 attempts

This took me a while to figure out. I actually installed ZooKeeper, although it turns out that I didn't need to -- and in fact it's incorrect to have ZooKeeper running if you're simply running HBase locally. The problem turned out to be that I hadn't properly configured HBase in the vendors/hbase-1.0.0/conf/hbase-site.xml configuration file. Look carefully at the values that you will find here. They do not match your setup!

Once I had fixed that problem, and rechecked the status, I saw the following wonderful message:

[INFO] [Console$] Your system is all ready to go.

You can also check that all is well by running jps. You should see something like the following:

41109 Console
41014 Elasticsearch
41079 HMaster
41127 Jps
29769 

If you don't see HMaster included, then HBase is not running correctly.

Now, it's time to clone a template. I'm interested in recommender systems so I navigated to the following set of instructions: which state in section 2. Create a new Engine from an Engine Template that you should run the following command:

pio template get PredictionIO/template-scala-parallel-recommendation MyRecommendation

However, that is (currently at least) incorrect. Instead, use the following:

pio template get PredictionIO/template-scala-parallel-universal-recommendation MyUniversalRecommendation

Once I figured out the correct mumbo-jumbo to clone the template, I was able to create my own recommendation engine. Now, I shall (hopefully) enjoy actually doing some implementation!

Stay tuned in case I have more tips.

Update 9/30/2016: there's one more thing you have to do, at least for the universal recommendation engine, and which is not spelled out in the installation instructions. Before you can import the sample data, you must install the Python predictionio package. Do this by entering the following:

pip install prediction

If that doesn't work for you, then consult this github project: https://github.com/apache/incubator-predictionio-sdk-python

One more problem which arose during the training phase:

[INFO] [Engine$] Data sanity check is on.
Exception in thread "main" java.lang.NumberFormatException: For input string: "558:feed::1"

at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)

The solution this time is something that I never would have imagined in a million years, let alone solved! Here's the link to the issue. But the bottom line is, if you're running on a Mac, change the TCP/IP wifi network settings so that Configure IPv6 is manual, not automatic.

And another problem after deploying and trying to make a query:

[ERROR] [OneForOneStrategy] Cannot support TLS_RSA_WITH_AES_256_CBC_SHA with currently installed providers

Here's a discussion of the issue.The solution is to upgrade your Java encryption policy to unlimited. Follow the instructions in the discussion. Ensure that you put the jars into the directory lib/security which lives under the JRE home, typically that would be at $JAVA_HOME/jre.

It turned out, however, that this particular solution was necessary but not sufficient. In the same discussion there's a mention of the SSL problem that I ran into next:

[ERROR] [HttpServerConnection] Aborting encrypted connection to /0:0:0:0:0:0:0:1:52194 due to [SSLException:Unrecognized SSL message, plaintext connection?]

You need to edit the shell script  examples/single-query-handmade.sh which sends queries to the PredictionIO server as follows (add "k" to the curl options and "s" to the http protocol value):

curl -kH "Content-Type: application/json" -d '
{
    "user": "u1"

}' https://localhost:8000/queries.json

Finally, after all this work, I was able to see the blessedly wonderful response from my recommendation engine:

Recommendations for user: u1


{"itemScores":[{"item":"Galaxy","score":0.8880454897880554},{"item":"Nexus","score":0.24007925391197205},{"item":"Surface","score":0.043848853558301926}]}

Monday, September 19, 2016

Functions vs. methods in Scala

Here's a question that often comes up while teaching Scala: what's the difference between functions and methods. If you look this up on the web, you'll find many different answers, some of which are mathematically grounded but hardly helpful, others which are just plain confusing.

Here's how I like to explain it: a function is an object, just like any other expression: it has a type (which will be a subclass of FunctionN[T] where T is the type of the result and N is 0... according to the number of parameters the function requires). But note that the type is not a T but a function which must be invoked to yield a value of T. In Scala, this means that to get an actual T, you need to follow the function with a "(...)" where the number of parameters between the parentheses is N.

In Scala (or any other functional language for that matter), we can label expressions and functions by writing a statement like the following:

val a = 42
val f = { x: Int => x*x }

In the first case, our expression is quite simple: the constant 42 ("The answer to life, the universe and everything"). In the second case, it's a bit more complex (it's a function literal or "lamda"). But there's essentially no difference between the intent of the two statements. In each case, we have asserted that, from here on, whenever we write the variable "a", we mean 42 and wherever we write "f", we mean { x: Int => x*x }.

So much for functions. What about methods?

Methods aren't objects like functions. They are properties of a class or object. In an object-oriented language, if they are properties of a class, that implies that they are "instance methods", if of an object then they are just plain "methods".  A method is really just like a variable that is a property of a class/object. But it has a special syntax for its declaration which is more human-friendly than the syntax used above for the variable "f".

If we wanted to use the same function as a method in an object, then we could write it like this:

def m(x: Int) = x*x

This has exactly the same effect as with "f" above, except this time its label is "m" and it belongs to an object--it doesn't exist beyond the scope of that object. You can think of this form as syntactic sugar for "val m = { x: Int => x*x }".

So, what about instance methods? Instance methods are special to object-oriented programming because there is an implicit parameter which is generally written as this. It refers to the instance that is the "receiver" (class, owner, etc..) of the method.

Therefore we might define the following class:

class MyClass(x: Int) {
  def sqr = x*x
}

This time, we don't need to provide our own parameter to satisfy the need for "x" in the function (remember that this is essentially syntactic sugar for something that really looks like the definition of f above). The compiler supplies the parameter "this" and so "x" is now syntactic sugar for "this.x".

I do hope that I haven't made things even more confusing! And, I further hope that I haven't said anything that is clearly incorrect. But I don't believe so. However, if you consider yourself a Scala expert or are closely related to Martin Odersky, then feel free to add a clarifying or disputational comment.

Wednesday, September 7, 2016

Working with mutable state

The second of the questions I asked my students (see my earlier article) is essentially this:

In functional programming, how do we work with mutable state? This is a complex question in general as we may want to deal with I/O (i.e. external mutable state) or we may want to wrap the whole problem inside an actor and make it "go away." So, let's confine ourselves to dealing with internal mutable state in a referentially transparent way. We have an object whose internal state is changing in an unpredictable manner and we have another object which operates on that other mutable object. The first question that arises is how to test our object?
Let me propose a concrete example. An evolutionary computation framework uses a random number generator (RNG) to provide the entropy needed to get coverage of a solution space. If you simply create a scala.util.Random and take numbers from it, you will, sooner or later, run into the problem that your unit tests will fail. Not because the logic is incorrect but because, even if the order of running test cases is fixed (it usually is not so guaranteed), you will introduce new logic that pulls one more, or one fewer, numbers from the RNG and your expected results will be incorrect. Here's the kind of thing I have in mind (greatly simplified):

val random = new scala.util.Random(0L)
val someInt: Int = ???
val x1 = random.nextInt(100)
val x2 = random.nextInt(100)
// ...
val y = random.nextInt(100)
assert(y==someInt)
There's a simple solution for this. Just like a snail, that constructs and carries its home around with it, we can transform the code as follows:
class RNG[+A](f: Long=>A)(seed: Long) {
  private val random = new scala.util.Random(seed)
  private lazy val state = random.nextLong
  def next = new RNG(f)(state)
  def value = f(state)
}
val r = new RNG[Int](x => (x.toInt + 100) % 100)(0L)
val someInt: Int = ???
val r1 = r.next
val r2 = r.next
// ...
val rN = r2.next
val y = rN.value
assert(y==someInt)

There are many variations on this general theme, of course. But the important point is that we carry around our current RNG from one step to the next, rather than simply taking its value.

Here's another example which helps us strip arguments (from a variable number of args) with the appropriate types. The get method returns a Try of a tuple of a T and the remaining Args:

  case class Args(args: List[Any]) extends (() => Try[(Any,Args)]) {
    def apply(): Try[(Any, Args)] = args match {
      case Nil => Failure(new Exception("Args is empty: this typically means you didn't provide sufficient arguments"))
      case h :: t => Success(h, Args(t))
    }
    def get[T](clazz: Class[T]): Try[(T, Args)] = {
      apply match {
        case Success((r,a)) =>
          if (clazz.isInstance(r) || clazz.isAssignableFrom(r.getClass))
            Success(r.asInstanceOf[T], a)
          else throw new Exception(s"args head is not of type: $clazz but is of type ${r.getClass}")
        case f @ Failure(t) => f.asInstanceOf[Try[(T, Args)]]
      }
    }
    def isEmpty: Boolean = args.isEmpty
}

And here is how you would extract the arguments (in a FlatSpec with Matchers):

  def mainMethod(args: List[Any]): Unit = {
    val a1 = Args(args)
    val (x: String, a2) = a1.get(classOf[String]).get
    val (y, a3) = a2.get(classOf[java.lang.Boolean]).get
    a3.isEmpty shouldBe true    x shouldBe "hello"    y shouldBe true  }
  mainMethod(List("hello", true))

Note how we should end up with an empty list after stripping off all the arguments. And note also that if we want to add a type annotation (for example for x, as shown above) then we can do that and be quite sure that the get method is working properly. In this example, we didn't try to avoid the exceptions that may be thrown by the get methods on Try objects. We could of course do a nicer job with nested match statements.

On a completely different subject, I asked and answered my own question on StackOverflow today, regarding an issue with serializing/deserializing case class instances via Json.