Wednesday, August 22, 2012

Learning Mahout : Collaborative Filtering

My Mahout in Action (MIA) book has been collecting dust for a while now, waiting for me to get around to learning about Mahout. Mahout is evolving quite rapidly, so the book is a bit dated now, but I decided to use it as a guide anyway as I work through the various modules in the (currently GA) 0.7 distribution.

My objective is to learn about Mahout initially from a client perspective, ie, find out what ML modules (eg, clustering, logistic regression, etc) are available, and which algorithms are supported within each module, and how to use them from my own code. Although Mahout provides non-Hadoop implementations for almost all its features, I am primarily interested in the Hadoop implementations. Initially I just want to figure out how to use it (with custom code to tweak behavior). Later, I would like to understand how the algorithm is represented as a (possibly multi-stage) M/R job so I can build similar implementations.

I am going to write about my progress, mainly in order to populate my cheat sheet in the sky (ie, for future reference). Any code I write will be available in this GitHub (Scala) project.

The first module covered in the book is Collaborative Filtering. Essentially, it is a technique for predicting a user's preferences given the preferences of others in the group. There are two main approaches - user-based and item-based. In user-based filtering, the objective is to find users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In item-based recommendation, similarities between pairs of items are computed first, and preferences for the given user are then predicted from a combination of the user's current item preferences and the item similarity matrix.

Anatomy of a Mahout Recommender

The input to such a system is either a 3-tuple of (UserID, ItemID, Rating) or a 2-tuple of (UserID, ItemID). In the latter case, the preferences are assumed to be boolean (ie, 1 where the (UserID, ItemID) pair exists, 0 otherwise). In Mahout, this input is represented by a DataModel class, which can be created from a file of these tuples (either CSV or TSV), one per line. Other ways to populate the DataModel also exist, for example, programmatically or from a database.
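
For example, something along these lines should build a DataModel from a CSV file of (UserID, ItemID, Rating) lines; the file path here is just a placeholder:

import java.io.File
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel

// load (userID,itemID,rating) triples from a CSV file into a DataModel
val model = new FileDataModel(new File("data/intro.csv"))
println("users=" + model.getNumUsers + ", items=" + model.getNumItems)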

A user-based Recommender is built out of a DataModel, a UserNeighborhood and a UserSimilarity. A UserNeighborhood defines the concept of a group of users similar to the current user - the two available implementations are Nearest-N and Threshold. The nearest-N neighborhood consists of the N users nearest to the given user, where nearness is defined by the similarity implementation. The threshold neighborhood consists of all users whose similarity to the given user meets a minimum threshold. The UserSimilarity defines the similarity between two users - implementations include Euclidean Distance, Pearson Correlation, Uncentered Cosine, Caching, City Block, Dummy, Generic User, Log Likelihood, Spearman Correlation and Tanimoto Coefficient similarity.
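
Putting these together, a user-based recommender can be wired up with something like the sketch below. It reuses the DataModel from above; the neighborhood size (10) and the choice of similarity are arbitrary illustrations, not a recommendation.

import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity

// similarity -> neighborhood -> recommender
val userSimilarity = new PearsonCorrelationSimilarity(model)
val neighborhood = new NearestNUserNeighborhood(10, userSimilarity, model)
val userRecommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity)
val topItems = userRecommender.recommend(1L, 5)  // top 5 recommendations for user 1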

An item-based Recommender is built out of a DataModel and an ItemSimilarity. Implementations of ItemSimilarity include Euclidean Distance, Pearson Correlation, Uncentered Cosine, City Block, Dummy, Log Likelihood, Tanimoto Coefficient, Caching Item, File Item, and Generic Item similarity. The MIA book describes each algorithm in greater detail.
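
The item-based equivalent is even simpler since no neighborhood is needed - again a sketch reusing the DataModel from above, with an arbitrary choice of similarity:

import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity

// item-item similarity feeds the recommender directly
val itemSimilarity = new LogLikelihoodSimilarity(model)
val itemRecommender = new GenericItemBasedRecommender(model, itemSimilarity)
val itemRecs = itemRecommender.recommend(1L, 5)  // top 5 recommendations for user 1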

In both cases, an IDRescorer object can be used to modify the recommendations with some domain logic, either by filtering out some of the recommendations (using isFiltered(itemID : Long) : Boolean) or by boosting/deboosting the recommendation score (using rescore(itemID : Long, originalScore : Double) : Double).
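
For example, a rescorer that filters out a (hypothetical) blacklist of items and boosts a (hypothetical) set of promoted items might look roughly like this:

import org.apache.mahout.cf.taste.recommender.IDRescorer

// hypothetical domain logic: drop blacklisted items, boost promoted ones
class PromoRescorer(blacklist : Set[Long], promoted : Set[Long]) extends IDRescorer {
  def isFiltered(itemID : Long) : Boolean = blacklist.contains(itemID)
  def rescore(itemID : Long, originalScore : Double) : Double =
    if (promoted.contains(itemID)) originalScore * 1.2 else originalScore
}

// usage: recommender.recommend(userID, howMany, new PromoRescorer(Set(101L), Set(104L)))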

Evaluating Recommenders

A Mahout user would build a Recommender using the "right" mix of the components described above. The right mix is not readily apparent, so we can just ask the computer. Mahout provides three evaluation metrics: Average Absolute Difference, Root Mean Square Difference, and IR Stats (which provides precision and recall at N). Here is some code that uses the IRStats evaluator to run a sample of the input against various combinations and report the precision and recall at a given point.

package com.mycompany.mia.cf

import java.io.File
import scala.collection.mutable.HashMap
import org.apache.mahout.cf.taste.common.Weighting
import org.apache.mahout.cf.taste.eval.{RecommenderBuilder, IRStatistics}
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.model.{GenericDataModel, GenericBooleanPrefDataModel}
import org.apache.mahout.cf.taste.impl.neighborhood.{ThresholdUserNeighborhood, NearestNUserNeighborhood}
import org.apache.mahout.cf.taste.impl.recommender.{GenericUserBasedRecommender, GenericItemBasedRecommender}
import org.apache.mahout.cf.taste.impl.similarity.{UncenteredCosineSimilarity, TanimotoCoefficientSimilarity, PearsonCorrelationSimilarity, LogLikelihoodSimilarity, EuclideanDistanceSimilarity, CityBlockSimilarity}
import org.apache.mahout.cf.taste.model.DataModel
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood
import org.apache.mahout.cf.taste.recommender.Recommender
import org.apache.mahout.cf.taste.similarity.{UserSimilarity, ItemSimilarity}
import scala.io.Source
import scala.util.Random
import java.io.PrintWriter

object RecommenderEvaluator extends App {

  val neighborhoods = Array("nearest", "threshold")
  val similarities = Array("euclidean", "pearson", "pearson_w", 
                           "cosine", "cosine_w", "manhattan", 
                           "llr", "tanimoto")
  val simThreshold = 0.5
  val sampleFileName = "/tmp/recommender-evaluator-sample.tmp.csv"
    
  val argmap = parseArgs(args)
  
  val filename = if (argmap.contains("sample_pct")) {
    val nlines = Source.fromFile(argmap("filename")).getLines.size
    val sampleSize = nlines * (argmap("sample_pct").toFloat / 100.0)
    val rand = new Random(System.currentTimeMillis())
    var sampleLineNos = Set[Int]()
    do {
      sampleLineNos += rand.nextInt(nlines)
    } while (sampleLineNos.size < sampleSize)
    val out = new PrintWriter(sampleFileName)
    var currline = 0
    for (line <- Source.fromFile(argmap("filename")).getLines) {
      if (sampleLineNos.contains(currline)) {
        out.println(line)
      }
      currline += 1
    }
    out.close()
    sampleFileName
  } else {
    argmap("filename")
  }
  
  val model = argmap("bool") match {
    case "false" => new GenericDataModel(
      GenericDataModel.toDataMap(new FileDataModel(
      new File(filename)))) 
    case "true" => new GenericBooleanPrefDataModel(
      GenericBooleanPrefDataModel.toDataMap(new FileDataModel(
      new File(filename))))
    case _ => throw new IllegalArgumentException(
      invalidValue("bool", argmap("bool")))
  }
  val evaluator = new GenericRecommenderIRStatsEvaluator()

  argmap("type") match {
    case "user" => {
      for (neighborhood <- neighborhoods;
           similarity <- similarities) {
        println("Processing " + neighborhood + " / " + similarity)
        try {
          val recommenderBuilder = userRecoBuilder(
            neighborhood, similarity, 
            model.asInstanceOf[GenericDataModel])
          val stats = evaluator.evaluate(recommenderBuilder, 
            null, model, 
            null, argmap("precision_point").toInt, 
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 
            argmap("eval_fract").toDouble)
            printResult(neighborhood, similarity, stats)
        } catch {
          case e : Exception => {
            println("Exception caught: " + e.getMessage)
          }
        }
      }
    }
    case "item" => {
      for (similarity <- similarities) {
        println("Processing " + similarity)
        try {
          val recommenderBuilder = itemRecoBuilder(similarity, 
            model.asInstanceOf[GenericDataModel])
          val stats = evaluator.evaluate(recommenderBuilder, null, 
            model, null, argmap("precision_point").toInt,
            GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,
            argmap("eval_fract").toDouble)
          printResult(null, similarity, stats)
        } catch {
          case e : Exception => {
            println("Exception caught: " + e.getMessage)
          }
        }
      }
    }
    case _ => throw new IllegalArgumentException(
      invalidValue("type", argmap("type")))
  }

  def usage() : Unit = {
    println("Usage:")
    println("com.mycompany.mia.cf.RecommenderEvaluator [-key=value...]")
    println("where:")
    println("sample_pct=0-100 (use sample_pct of original input)")
    println("type=user|item (the type of recommender to build)")
    println("bool=true|false (whether to use boolean or actual preferences)")
    println("precision_point=n (the precision at n desired)")
    println("eval_fract=n (fraction of data to use for evaluation)")
    System.exit(1)
  }
  
  def parseArgs(args : Array[String]) : HashMap[String,String] = {
    val argmap = new HashMap[String,String]()
    for (arg <- args) {
      val nvp = arg.split("=")
      argmap(nvp(0)) = nvp(1)
    }
    argmap
  }

  def invalidValue(key : String, value : String) : String = {
    "Invalid value for '" + key + "': " + value
  }
  
  def itemRecoBuilder(similarity : String, 
      model : GenericDataModel) : RecommenderBuilder = {
    val s : ItemSimilarity = similarity match {
      case "euclidean" => new EuclideanDistanceSimilarity(model)
      case "pearson" => new PearsonCorrelationSimilarity(model)
      case "pearson_w" => new PearsonCorrelationSimilarity(
        model, Weighting.WEIGHTED)
      case "cosine" => new UncenteredCosineSimilarity(model)
      case "cosine_w" => new UncenteredCosineSimilarity(
        model, Weighting.WEIGHTED)
      case "manhattan" => new CityBlockSimilarity(model)
      case "llr" => new LogLikelihoodSimilarity(model)
      case "tanimoto" => new TanimotoCoefficientSimilarity(model)
      case _ => throw new IllegalArgumentException(
        invalidValue("similarity", similarity))
    }
    new RecommenderBuilder() {
      override def buildRecommender(model : DataModel) : Recommender = {
        new GenericItemBasedRecommender(model, s)
      }
    }
  }
  
  def userRecoBuilder(neighborhood : String, 
      similarity : String,
      model : GenericDataModel) : RecommenderBuilder = {
    val s : UserSimilarity = similarity match {
      case "euclidean" => new EuclideanDistanceSimilarity(model)
      case "pearson" => new PearsonCorrelationSimilarity(model)
      case "pearson_w" => new PearsonCorrelationSimilarity(
        model, Weighting.WEIGHTED)
      case "cosine" => new UncenteredCosineSimilarity(model)
      case "cosine_w" => new UncenteredCosineSimilarity(
        model, Weighting.WEIGHTED)
      case "manhattan" => new CityBlockSimilarity(model)
      case "llr" => new LogLikelihoodSimilarity(model)
      case "tanimoto" => new TanimotoCoefficientSimilarity(model)
      case _ => throw new IllegalArgumentException(
        invalidValue("similarity", similarity))
    }
    val neighborhoodSize = if (model.getNumUsers > 10) 
      (model.getNumUsers / 10) else (model.getNumUsers)
    val n : UserNeighborhood = neighborhood match {
      case "nearest" => new NearestNUserNeighborhood(
        neighborhoodSize, s, model) 
      case "threshold" => new ThresholdUserNeighborhood(
        simThreshold, s, model)
      case _ => throw new IllegalArgumentException(
        invalidValue("neighborhood", neighborhood))
    }
    new RecommenderBuilder() {
      override def buildRecommender(model : DataModel) : Recommender = {
        new GenericUserBasedRecommender(model, n, s)
      }
    }
  }
  
  def printResult(neighborhood : String, 
      similarity : String, 
      stats : IRStatistics) : Unit = {
    println(">>> " + 
      (if (neighborhood != null) neighborhood else "") + 
      "\t" + similarity +
      "\t" + stats.getPrecision.toString + 
      "\t" + stats.getRecall.toString)
  }
}

The command below runs the evaluator for all the item-based recommenders using 10% of the MovieLens 1M ratings file and reports the precision and recall at 2 for each recommender. Note that some recommenders may not return results because there is not enough data. This evaluator is a work in progress, so I may change it to use one of the other evaluation metrics.

sujit@cyclone:mia-scala-examples$ sbt 'run-main \
  com.mycompany.mia.cf.RecommenderEvaluator \
  filename=data/ml-ratings.dat \
  type=item bool=false \
  precision_point=2 sample_pct=10 eval_fract=1'

Running the Recommender

As you can see, Mahout provides nice building blocks for building Recommenders. Recommenders built using this framework can be run on a single machine (local mode) or on Hadoop via the so-called pseudo-distributed RecommenderJob, which splits the input file across multiple Recommender reducers. To run in the latter mode, the recommender must have a constructor that takes a DataModel. Presumably if I am building a recommender with this framework, I would want it to scale up in this manner, so it makes sense to build it as the framework requires. Here is a custom item-based recommender that uses the Pearson Correlation similarity. The file also contains a runner that can be used for calling/testing in local mode.

package com.mycompany.mia.cf

import java.io.File
import java.util.List

import scala.collection.JavaConversions.asScalaBuffer
import scala.collection.mutable.HashSet
import scala.io.Source

import org.apache.mahout.cf.taste.common.Refreshable
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity
import org.apache.mahout.cf.taste.model.DataModel
import org.apache.mahout.cf.taste.recommender.{Recommender, RecommendedItem, IDRescorer}

object MovieLensRecommenderRunner extends App {
  // grab the input file name
  val filename = if (args.length == 1) args(0) else "unknown"
  if ("unknown".equals(filename)) {
    println("Please specify input file")
    System.exit(-1)
  }
  // train recommender
  val recommender = new MovieLensRecommender(
    new FileDataModel(new File(filename)))
  // test recommender
  val alreadySeen = new HashSet[Long]()
  val lines = Source.fromFile(filename).getLines
  for (line <- lines) {
    val user = line.split(",")(0).toLong
    if (! alreadySeen.contains(user)) {
      val items = recommender.recommend(user, 100)
      println(user + " =>" + items.map(x => x.getItemID).
        foldLeft("")(_ + " " + _))
    }
    alreadySeen += user
  }
}

class MovieLensRecommender(model : DataModel) extends Recommender {

  val similarity = new PearsonCorrelationSimilarity(model)
  val delegate = new GenericItemBasedRecommender(model, similarity)

  // everything below this is boilerplate. We could use the
  // RecommenderWrapper if it were part of Mahout-Core, but it's part
  // of Mahout-Integration for the webapp
  
  def recommend(userID: Long, howMany: Int): List[RecommendedItem] = {
    delegate.recommend(userID, howMany)
  }

  def recommend(userID: Long, howMany: Int, rescorer: IDRescorer): List[RecommendedItem] = {
    delegate.recommend(userID, howMany, rescorer)
  }

  def estimatePreference(userID: Long, itemID: Long): Float = {
    delegate.estimatePreference(userID, itemID)
  }

  def setPreference(userID: Long, itemID: Long, value: Float): Unit = {
    delegate.setPreference(userID, itemID, value)
  }

  def removePreference(userID: Long, itemID: Long): Unit = {
    delegate.removePreference(userID, itemID)
  }

  def getDataModel(): DataModel = {
    delegate.getDataModel()
  }

  def refresh(alreadyRefreshed: java.util.Collection[Refreshable]): Unit = {
    delegate.refresh(alreadyRefreshed)
  }
}

We can run this in local mode on the command line using sbt as shown below:

sujit@cyclone:mia-scala-examples$ sbt 'run-main \
  com.mycompany.mia.cf.MovieLensRecommenderRunner data/intro.csv'
1 => 104
2 =>
3 => 103 102
4 => 102
5 =>

For running within the Hadoop pseudo-distributed RecommenderJob, it may be nice to have a little more data. We can use the MovieLens 1M ratings file, with a few Unix transformations to make it look like a larger version of the intro.csv file. (You could also use a custom DataModel to read the native format; according to the MIA book, one is already available in Mahout.)

sujit@cyclone:data$ cat ~/Downloads/ml-ratings.dat | \
  awk 'BEGIN{FS="::"; OFS=",";} {print $1, $2, $3}' > ml-ratings.dat 

Next, you should have Hadoop set up. I have it set up to work in pseudo-distributed mode, ie, a cluster of one (not to be confused with the Mahout pseudo-distributed RecommenderJob).

Finally, make sure that the $MAHOUT_HOME/bin/mahout script works. It runs in the context of the Mahout distribution, so it can barf if it does not find the JARs in the places it expects them to be. To remedy this, run mvn -DskipTests install to create all the required JAR files. Then make sure that the HADOOP_HOME and HADOOP_CONF_DIR environment variables point to your Hadoop installation. Granted, the script is not very useful right now (see below for why not), but it will come in handy when trying to use standard Mahout functionality without having to worry about pesky classpath issues, so making it work is time well spent.

Once I got the bin/mahout script working, I discovered that it wouldn't find my custom Recommender class, even though the JAR appeared in the script's classpath. Ultimately, based on advice on this Stack Overflow page and error messages from Hadoop, I ended up un-jarring mahout-core-0.7-job.jar, scala-library.jar and the classes under my project, and jarring them back together into one big fat JAR. Apparently sbt has an assembly plugin you can customize to do this, but my sbt-fu is weak to non-existent, so I used a shell script instead[1]. Once the fat JAR is built, you can invoke the pseudo-distributed RecommenderJob like so:

hduser@cyclone:mahout-distribution-0.7$ hadoop jar \
  my-mahout-fatjar.jar \
  org.apache.mahout.cf.taste.hadoop.pseudo.RecommenderJob \
  --recommenderClassName com.mycompany.mia.cf.MovieLensRecommender \
  --numRecommendations 100 \
  --input input/ml-ratings.dat --output output \
  --usersFile input/users.txt

Mahout also provides an item-based RecommenderJob that is not based on the framework described above. Rather, it is a sequence of Hadoop M/R jobs that use an item-item similarity matrix to calculate the recommendations. For customization, it only allows you to pass in a similarity measure (which implements VectorSimilarityMeasure). Mahout already provides quite a few to choose from (City Block, Co-occurrence, Cosine, Euclidean, Log Likelihood, Pearson, and Tanimoto). I did not investigate writing my own custom implementation. Here is the command to run this job using bin/mahout (since no customization was needed).

hduser@cyclone:mahout-distribution-0.7$ bin/mahout \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input input/ml-ratings.dat --output output \
  --numRecommendations 10 --usersFile input/users.txt \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

Alternatively, if you did need to do customizations, you could just invoke the fat JAR using the bin/hadoop script instead.

hduser@cyclone:mahout-distribution-0.7$ hadoop jar \
  my-mahout-fatjar.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  --input input/ml-ratings.dat --output output \
  --numRecommendations 10 --usersFile input/users.txt \
  --similarityClassname SIMILARITY_PEARSON_CORRELATION

Finally, Mahout provides the ALS (Alternating Least Squares) implementation of RecommenderJob. Like the previous RecommenderJob, this also does not use the Recommender framework. It uses an algorithm that was used in one of the entries for the Netflix Prize. I did not try it because it needs feature vectors built first, which I don't yet know how to do. You can find more information in this Mahout JIRA.

So anyway, that's all I have for today. Next up on my list is Clustering with Mahout.

[1] Update 2012-08-31 - Copying scala-library.jar into $HADOOP_HOME/lib eliminates the need to repackage it in the fat JAR, but you will have to remember to do so (once) whenever your Hadoop installation changes. I have decided to follow this strategy, and I have made changes to the fatjar.sh script in GitHub.

Sunday, August 12, 2012

Scalding for the Impatient

A few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues), but there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I've provided links to the original articles for the theory (which is very nicely explained there) and links to the source code for both the Cascading and Scalding versions.

Article[1]   Description                            #-mappers   #-reducers   Code[2]
Part 1       Distributed File Copy                  1           0
Part 2       Word Count                             1           1
Part 3       Word Count with Scrub                  1           1
Part 4       Word Count with Scrub and Stop Words   1           1
Part 5       TF-IDF                                 11          9

[1] - links point to the "Cascading for the Impatient" articles.
[2] - Cascading version links point to Paco Nathan's GitHub repo; Scalding versions point to mine.

The code for the Scalding version is fairly easy to read if you know Scala (somewhat harder, but still possible, if you don't). The first thing to note is the relative sizes - the Scalding code is shorter and more succinct than the Cascading version. The second thing to note is that the Scalding-based code uses method calls that are not Cascading methods. You can read about the Scalding methods in the API Reference (I used the Fields-based reference exclusively). The tutorial and example code in the Scalding distribution are also helpful.
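
To get a feel for the Fields-based API, here is a minimal word count job sketch (essentially the canonical Scalding example, so treat it as illustrative rather than as part of my TF-IDF code):

import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  // read lines, split into words, count occurrences per word, write TSV output
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}

This would be run through scald.rb (described below), with --input and --output passed as named arguments.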

Project Setup

As you can see, I created my own Scala project and used Scalding as a dependency. I describe the steps here so you can do the same if you are so inclined.

Assuming you are going to be using Scalding in your applications, you need to download and build the Scalding JAR, then publish it to your local (or corporate) code repository (sbt uses ivy2). To do this, run the following sequence of commands:

sujit@cyclone:scalding$ git clone https://github.com/twitter/scalding.git
sujit@cyclone:scalding$ cd scalding
sujit@cyclone:scalding$ sbt assembly # build scalding jar
sujit@cyclone:scalding$ sbt publish-local # to add to local ivy2 repo.

Scalding also comes with a Ruby script, scald.rb, that you use to run Scalding jobs. It is quite convenient to use - it forces all arguments to be named (resulting in cleaner, explicit argument handling code) and allows switching from local development to Hadoop mode using a single switch. It is available in the scripts subdirectory of your Scalding download. To use it outside Scalding (ie, in your own project), you will need to soft-link it into a directory in your PATH. Copying it does not work because it has dependencies on other parts of the Scalding download.

The next step is to generate and setup your Scalding application project. Follow these steps:

  1. Generate your project using giter8 - type g8 typesafehub/scala-sbt at the command line, and answer the prompts. Your project is created as a directory named after the Scala Project Name you supply.
  2. Move to the project directory - type cd scalding-impatient (in my case).
  3. Build a basic build.sbt - create a file build.sbt in the project base directory and populate it with the key-value pairs for name, version and scalaVersion (blank lines between pairs are mandatory); a minimal sketch appears after this list.
  4. Copy over scalding libraryDependencies - copy over the libraryDependencies lines from Scalding's build.sbt file and drop them into your project's build.sbt. I am not sure if this is really necessary, or whether Scalding declares its transitive dependencies so that they are picked up with the single dependency declaration on Scalding (see below). You may want to try omitting this step and see - if you succeed, please let me know and I will update accordingly.
  5. Add the scalding libraryDependency - define the libraryDependency to the scalding JAR you just built and published. The line to add is libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.7.3".
  6. Rebuild your Eclipse files - Check my previous post for details about SBT-Eclipse setup. If you are all set up, then type sbt eclipse to generate these. Your project is now ready for development using Eclipse.
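
Putting steps 3 and 5 together, a minimal build.sbt would look roughly like this (the name and version are just the values chosen at the giter8 prompts, and the extra libraryDependencies lines from step 4 are omitted):

name := "scalding-impatient"

version := "0.1-SNAPSHOT"

scalaVersion := "2.9.2"

libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.7.3"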

And that's pretty much it. Hope you have as much fun coding in Scala/Scalding as I did.

Saturday, August 11, 2012

Learning Scala ... again

Recently, I read that Java (my primary language and what Steve Yegge calls "your father's language" :-)) was borrowing features from Scala for its upcoming version 8 release. This is obviously good news for Java, but as one of the commenters on the article pointed out, Java programmers now have two choices - either wait for Scala features to trickle into Java and then figure out how to use them, or learn Scala now in anticipation of those features coming to Java. And once you learn Scala, why not just start using it instead of Java?

I looked at Scala some three years ago (see here, here and here), when I was experimenting with its Actor model. Although I liked the language at the time, the audience seemed to be more language-enthusiast types than application-programmer types. Plus, there was almost no supporting ecosystem (frameworks, tooling, etc) for Scala. So I decided to set it aside for a while and come back to it once it got a bit more mature.

Fast forward to 2012, and Scala has come a long way. Scala always had a dedicated and smart community of developers, but now it is being increasingly adopted by quite a few big name companies. There is much more (application programmer friendly) documentation and better frameworks and tooling around. So, all in all, a good time to learn and start working in Scala for me.

I don't anticipate using Scala at work, at least not in the immediate future, but I figured it may be good to start using it for my own stuff instead of Java - that way, I can start putting in my 10,000 hours towards Scala proficiency. So this post is mostly about my experience picking up Scala again, and about setting up a standard Scala project with the Typesafe software stack.

The last time I attempted to learn Scala, I used the Odersky book, the first and at that time the only book on Scala. This time round, almost coincidentally (or through some very effective contextual ad targeting), I came across Cay Horstmann's Scala for the Impatient (SFTI) book, which helped me pick up a working knowledge of Scala in about 1.5 weeks.

The SFTI book is written for Java programmers rather than the novice, so it assumes that you know the basic stuff. At the same time, the focus is on doing things (by example) in Scala that you can do either poorly or not at all in Java. Each book chapter (and sometimes section) is annotated with the Scala expertise levels (A1 to A3 for Application developers, L1 to L3 for Library developers), so you can decide what proficiency level you want to aim for initially (A2/L1 for me), and not feel too guilty or waste too much time if you don't fully understand some concept or can't solve an exercise problem above that level. All in all, a book geared to get you writing useful Scala code as quickly as possible.

It's a bit of a no-brainer, but speaking of exercises, don't skip them. They help reinforce the concepts you've learned, and by the end of the book you will be the proud owner of 150 or so machine-searchable (via grep) and potentially reusable Scala code snippets that you have written (and therefore understand intimately). Also, do yourself a favor and download ScalaConsole - it's a JAR file which you invoke as "scala /path/to/scalaconsole.jar", and it provides a GUI which is much nicer to edit code in than the Scala REPL. Another advantage of doing the exercises is that your mind learns better by doing than by seeing, so you are better prepared when the time comes to write real code.

So anyway, after you go through the book and have learned enough Scala to be comfortable striking out on your own, it's time to set up a Scala project and your IDE to work comfortably with Scala code. I chose the Typesafe Stack consisting of Scala, sbt (Scala Build Tool) and giter8 (to generate the project). In any case, to create a new project:

sujit@cyclone:LearnScala$ g8 typesafehub/scala-sbt

Scala Project Using sbt 

organization [org.example]: com.mycompany
name [Scala Project]: hello-world
scala_version [2.9.2]: 
version [0.1-SNAPSHOT]: 

Applied typesafehub/scala-sbt.g8 in hello-world

This creates a standard sbt-enabled Scala project similar in structure to one created by Maven. Like Maven, your source code resides under src/main/scala and your unit tests reside under src/test/scala. Unlike Maven, your build is customized by a file of key-value pairs called build.sbt in the project directory. There is also a project subdirectory containing generated Scala code for a default build, which you can change if you want to customize the build.

The standard tasks in sbt are similar to those in Maven. A list of common commands can be found in the sbt Getting Started Guide.

The next step is to install the ScalaIDE plugin into MyEclipse. I did this using the update site and everything worked fine.

The final step is to generate the Eclipse .classpath and .project files. There is an sbteclipse plugin from Heiko Seeberger which works without problems (unlike the earlier sbteclipsify, which I couldn't get working after multiple tries). Simply add the following line to your $HOME/.sbt/plugins/build.sbt:

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.1.0")

Now run "sbt eclipse" in your project directory. This will create the .classpath and .project files. Note that as your project's library dependencies change, you can simply update your project's build.sbt and rerun "sbt eclipse" to regenerate the Eclipse files.

Finally, you can open up the Scala project in Eclipse. Unlike three years ago, when Eclipse/Scala integration was completely unusable, this time around it's actually quite nice. It flags syntax errors and has decent (still not as nice as Java's, but usable) code completion. You can compile, run and debug Scala code from within the IDE.