Thursday, December 23, 2010

Quick and Dirty Reporting with Awk

A few years ago, I wrote about starting to use awk after a long time. Over the last couple of years, I've used awk on and off, although not very frequently, and never beyond the basic pattern described in that post.

Last week, I described an AspectJ aspect that wrote out timing information into the server logs. I initially thought of using OpenOffice Calc, but then stumbled upon Jadu Saikia's post on implementing group-by with awk, and I realized that awk could do the job just as easily, and would be easier to use (for multiple invocations).

Here is the script I came up with.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#!/usr/bin/env awk -f
# Builds a report out of aggregate elapsed time data similar to the Spring 
# StopWatch prettyPrint report.
# Usage:
# my_profile_report.awk [-v qt=query_term] filename
#
BEGIN {
  FS="|";
}
(qt == "" || qt == $3) {
  counts[$4]++;
  times[$4] += $5;
  if (mins[$4] + 0 != 0) {
    if (mins[$4] > $5) {
      mins[$4] = $5;
    }
  } else {
    mins[$4] = $5;
  }
  if (maxs[$4] + 0 != 0) {
    if (maxs[$4] < $5) {
      maxs[$4] = $5;
    }
  } else {
    maxs[$4] = $5;
  }
  totals += $5;
}
END {
  totavg = 0;
  for (t in times) {
    totavg += times[t]/counts[t];
  }
  printf("---------------------------------------------------------------------\n");
  printf("%-32s %8s %8s %8s %8s\n", "Controller", "Avg(ms)", "%", "Min(ms)", "Max(ms)");
  printf("---------------------------------------------------------------------\n");
  for (t in times) {
    avg = times[t]/counts[t];
    perc = avg * 100 / totavg;
    printf("%-32s %8.2f %8.2f %8.2f %8.2f\n", t, avg, perc, mins[t], maxs[t]);
  }
  printf("---------------------------------------------------------------------\n");
  printf("%-32s %8.2f\n", "Average Elapsed (ms):", totavg);
  printf("---------------------------------------------------------------------\n");
}

As shown in my previous post, the input file looks something like this (without the header, I just grep the server log file with the aspect name).

1
2
3
4
5
6
# aspect_name|request_uuid|query|tile_name|elapsed_time_millis
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileAController|132
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileBController|49
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileCController|2
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileDController|3
...

The script can be invoked with or without a qt parameter. Without parameters, the script computes the average, minimum and maximum elapsed times across all the queries that were given to the application during the profiling run. The report looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
sujit@cyclone:aspects$ ./my_profile_report.awk input_file
---------------------------------------------------------------------
Controller                        Avg(ms)        %  Min(ms)  Max(ms)
---------------------------------------------------------------------
TileFController                      1.42     0.02     0.00    42.00
TileAController                    187.67     2.83    27.00   685.00
TileJController                    169.97     2.56     7.00  3140.00
TileEController                      9.14     0.14     2.00    44.00
TileNController                     45.91     0.69     4.00   234.00
TileIController                    444.91     6.71     0.00  3427.00
TileDController                   1506.30    22.72   792.00 12140.00
TileMController                    184.88     2.79     0.00  3078.00
TileHController                     13.67     0.21     7.00    50.00
TileCController                     34.06     0.51    14.00   108.00
TileLController                    759.73    11.46     3.00  9921.00
TileGController                   2473.24    37.31   579.00 10119.00
TileBController                     24.48     0.37     7.00   132.00
TileKController                    773.97    11.67     0.00  8554.00
---------------------------------------------------------------------
Average Elapsed (ms):             6629.35
---------------------------------------------------------------------

If we want only the times for a specific query term, then we can specify it on the command line as shown below. If the query has been invoked multiple times, then the report will show the average of the calls for the query. In this case, there is only a single invocation for the query, so the minimum and maximum times are redundant.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
sujit@cyclone:aspects$ ./my_profile_report.awk -v qt=some_query input_file
---------------------------------------------------------------------
Controller                        Avg(ms)        %  Min(ms)  Max(ms)
---------------------------------------------------------------------
TileFController                      0.00     0.00     0.00     0.00
TileAController                    106.00     4.57   106.00   106.00
TileJController                      7.00     0.30     7.00     7.00
TileEController                     10.00     0.43    10.00    10.00
TileNController                      8.00     0.34     8.00     8.00
TileIController                      0.00     0.00     0.00     0.00
TileDController                   1309.00    56.42  1309.00  1309.00
TileMController                      0.00     0.00     0.00     0.00
TileHController                     12.00     0.52    12.00    12.00
TileCController                     25.00     1.08    25.00    25.00
TileLController                     64.00     2.76    64.00    64.00
TileGController                    767.00    33.06   767.00   767.00
TileBController                     12.00     0.52    12.00    12.00
TileKController                      0.00     0.00     0.00     0.00
---------------------------------------------------------------------
Average Elapsed (ms):             2320.00
---------------------------------------------------------------------

The script uses awk's associative arrays and the post processing block. Before this script, I had read about awk's BEGIN/END blocks but didn't know why I would want to use them, and I didn't know about awk's associative arrays at all. Thanks to Daniel Robbin's three part series - Common threads: Awk by example, Part 1, 2 and 3, I now have a greater understanding and appreciation of awk.

Friday, December 17, 2010

Profiling Spring Tiles with Custom Aspects

Introduction

One way of "componentizing" a web page is by composing it out of "tiles". Each tile encapsulates the logic needed to render itself. Once the tiles are built, they can be easily mixed and matched to provide different page renderings using different groupings of tiles. The Spring Framework provides support for tiles via the Apache Struts Tiles subproject.

With Spring, the page itself is controlled by a Spring Controller. When a request comes in, the controller instantiates a Model object, optionally does some pre-processing on it, sticks the model into the request, and sends off the request to its JSP. The JSP contains one or more tiles:insert calls referencing named tiles in the tiles-def.xml file. Each named tile is associated with a Tile Controller and a JSP. As the main JSP renders, it calls upon each of its tiles to render themselves, which invokes the associated Tile Controller. Each Tile Controller grabs the model off the request, does its own processing to populate the model, then sticks the model back into the request, where it is picked up by its JSP for rendering.

Motivation

Recently, one such page some of us have been working on was rendering real slow. So I figured it would be useful to get some timing information on how long each individual Tile Controller was taking, and optimize them. Unfortunately, I could not think of a good way to do this with the standard Spring approach of wrapping beans with interceptor proxies, since the Tile Controllers are defined as class names in the tiles-def.xml configuration, not bean references.

So the solution I came up with was to write an AspectJ aspect that would be woven into the application's JAR file at compile time to produce the "profiling" version of the JAR. This "profiling" JAR would be swapped into the WEB-INF/lib sub-directory of the servlet container's (in our case Tomcat) context directory for the application, along with the AspectJ Runtime JAR, followed by a container restart.

With this approach, there is no change to the application code. The aspect code is in a completely different project. Once profiling is done and we have our data, we simply replace the application's JAR file with the original one.

Since I was new to AspectJ, I had to do a bit of reading and experimenting before I got this stuff to work. There is a lot of AspectJ information available on the web, but none of them describes in detail how one would go about doing something like this. It seems to me that a lot of people would want to do this if they knew how, so hopefully this post would help someone. If you can think of other ways to do this, would appreciate hearing from you.

Steps

Here are the steps to build and compile-time weave an aspect into a web application's JAR file.

  • Install AspectJ and configure (one time).
  • Create an AspectJ project and write the Aspect.
  • Generate the source JAR (ant jar) from the web application to be profiled.
  • Compile and Weave the Aspect into the source JAR to produce a woven version of the JAR.
  • Generate web application WAR file and deploy to Tomcat.
  • Stop Tomcat, replace the JAR file in /webapps/${app}/WEB-INF/lib with the woven version of the JAR.
  • Copy the aspectjrt.jar into the same /webapps/${app}/WEB-INF/lib directory.
  • Restart Tomcat.
  • Execute HTTP GETs against the application, profiling output is captured into the log file.

AspectJ Installation

I downloaded the latest AspectJ JAR (1.6.10 for me) from the AspectJ Download site. Following the instructions on the site, I just ran the downloaded JAR (java -jar) to install AspectJ under my /opt/aspectj1.6 (ASPECTJ_HOME) directory.

I then added $ASPECTJ_HOME/bin to my PATH environment variable so the AspectJ commands (ajc, the AspectJ analog of javac) are available on the command line.

I also had to increase the -Xmx setting for ajc (a shell script in $ASPECTJ_HOME/bin) from 64M to 256M because it was running out of memory during a compile. The resulting JAR is only slightly larger after weaving.

Creating the Aspect

I initially planned on building this as a private little hack for my own use, so I could tell how the code changes I was doing affected overall performance, so I thought of using the Spring StopWatch to report on this. The diagram below illustrates the setup I was trying to profile:

As you can see, there is a single Spring controller which hands off to a Tiles view with three tiles A, B and C. I wanted to find how much time was being spent overall, and how much time was being spent in each of the three tiles.

Capturing the elapsed time for each tile was easy. The problem was to track when all the tiles were done, so I could print the contents of the stopwatch. Since the Spring controller hands off control to the Tiles view, which then calls the Tiles Controller for each tile, the only way to do this was to do it at the beginning of the next request. Also the static StopWatch reference is not thread-safe, so its state can get corrupted if you hit the application too fast (so one request may not have finished before you hand it another one). So its kind of useful, as long as you are willing to live with its warts. Here is the aspect code.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
// Source: src/profilers/MyAspect1.aj
package profilers;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.springframework.util.StopWatch;

/**
 * MyController Tile Profiler for individual use. This dumps out the 
 * stopwatch contents for a matched request on the request /following/ 
 * the one that was profiled. Because it uses a static StopWatch instance,
 * it is not thread safe, so be sure to wait a bit between calls, otherwise, 
 * you will get errors.
 */
public aspect MyAspect1 {

  private static StopWatch stopwatch = null;
  
  pointcut handleRequestInternal(HttpServletRequest request, 
      HttpServletResponse response) :
    execution(public * com.myco.myapp.MyController.handleRequestInternal(..))
    && args(request, response);

  pointcut doPerform() : 
    execution(public * com.myco.myapp.tiles..*.doPerform(..)); 

  before(HttpServletRequest request, HttpServletResponse response) : 
      handleRequestInternal(request, response) {
    if (stopwatch != null) {
      System.out.println(stopwatch.prettyPrint());
    }
    String q1 = request.getParameter("q1");
    stopwatch = new StopWatch(q1 == null ? "UNKNOWN" : q1);
  }
  
  before() : doPerform() {
    stopwatch.start(thisJoinPoint.getTarget().getClass().getSimpleName());
  }
  
  after() : doPerform() {
    if (stopwatch.isRunning()) {
      stopwatch.stop();
    }
  }
}

Applying this aspect to the JAR file and running the web application gave me output in my logs that looked like this. As you can see, the output (with names and data changed to protect the innocent and not-so-innocent) is pretty but the information is not.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
StopWatch 'Some Query': running time (millis) = 12000
-----------------------------------------
ms     %     Task name
-----------------------------------------
00870  007%  TileAController
00265  002%  TileBController
00059  000%  TileCController
00002  000%  TileDController
00003  000%  TileEController
00000  000%  TileFController
08880  074%  TileGController
00031  000%  TileHController
00209  002%  TileIController
00118  001%  TileJController
00965  008%  TileKController
00000  000%  TileLController
00003  000%  TileMController
00592  005%  TileNController
00003  000%  TileOController

To get around the two limitations of the implementation above, I decided to forego the convenience of the Spring StopWatch and capture plain elapsed times into the log file (with additional information to tie the elapsed time entries together), and then use scripts to post-process this information. My new setup looks something like this:

I use the fact that a call to Spring's modelAndView.addObject(name, value) is really the same as request.setAttribute(name, value). Before the handleRequest() call in the Spring controller, I generate a unique UUID for the request and stick it into the attribute so it can be picked up by the before() and after() advice in the Tile Controllers. Here is the second version.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
// Source: src/profilers/MyAspect2.aj
package profilers;

import java.util.UUID;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.lang.StringUtils;
import org.apache.struts.tiles.ComponentContext;

/**
 * Tile Profiler for batch use. We just capture elapsed times for
 * each tile as a CSV formatted string in the logs. The logs need to be 
 * post-processed to get profiling information.
 */
public aspect MyAspect2 {

  private static final String ATTR_REQ_UUID = "_req_uuid";
  private static final String ATTR_START_TIME = "_start_time";
  private static final String PROFILER_NAME = MyAspect2.class.getSimpleName();
  
  pointcut handleRequestInternal(HttpServletRequest request, 
    HttpServletResponse response) :
    execution(public * com.myco.myapp.MyController.handleRequestInternal(..))
    && args(request, response);

  pointcut doPerform(ComponentContext context, HttpServletRequest request, 
      HttpServletResponse response) : 
    execution(public * com.myco.myapp.tiles..*.doPerform(..))
    && args(context, request, response);

  before(HttpServletRequest request, HttpServletResponse response) : 
      handleRequestInternal(request, response) {
    UUID uuid = UUID.randomUUID();
    request.setAttribute(ATTR_REQ_UUID, uuid.toString());
  }
  
  before(ComponentContext context, HttpServletRequest request, 
      HttpServletResponse response) : doPerform(context, request, response) {
    Long startTime = System.currentTimeMillis();
    request.setAttribute(ATTR_START_TIME, startTime);
  }
  
  after(ComponentContext context, HttpServletRequest request, 
      HttpServletResponse response) : doPerform(context, request, response) {
    Long endTime = System.currentTimeMillis();
    Long startTime = (Long) request.getAttribute(ATTR_START_TIME);
    String requestUuid = (String) request.getAttribute(ATTR_REQ_UUID);
    String query = request.getParameter("q1");
    System.out.println(StringUtils.join(new String[] {
      PROFILER_NAME, requestUuid, query, 
      thisJoinPoint.getTarget().getClass().getSimpleName(),
      String.valueOf(endTime - startTime)
    }, "|"));
  }
}

Running this with a batch of URLs produces output in the Tomcat logs, which can be grepped using the PROFILER_NAME into a text file. You could then feed it into your data analysis tool of choice and slice and dice the data to produce actionable information. Here is some sample data from the file (I added the header line to make it a bit easier to read):

1
2
3
4
5
6
# aspect_name|request_uuid|query|tile_name|elapsed_time_millis
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileAController|132
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileBController|49
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileCController|2
MyAspect2|ed9ce777-263e-4fe5-a8af-4ea84d78add3|some query|TileDController|3
...

Compiling and Weaving

The aspect source code needs to be compiled and woven into the web application's JAR file. You will need all the JARs referenced by the web application in the classpath when you do this. I did this manually by making a copy of the build.xml or pom.xml, then creating a CLASSPATH declaration, then putting it into a bash script that looks something like this:

1
2
3
4
5
6
7
8
#!/bin/bash
CP=$ASPECTJ_HOME/lib/aspectjrt.jar:\
$ASPECTJ_HOME/lib/aspectjweaver.jar:\
#other classes in your web application's CLASSPATH
...

ajc -classpath $CP -1.6 -inpath myapp.jar -outjar myapp-woven.jar \
  src/profilers/MyAspect1.aj

Here myapp.jar is the JAR file for my web application myapp, and the output of this script is myapp-woven.jar, which you must copy to $TOMCAT_HOME/webapps/myapp/WEB-INF/lib.

References

I found these resources very helpful as I was developing the aspects described in this post.

  • AspectJ Project Documentation: I mostly read the Getting Started Guide, which gave me enough background to start reading/writing basic AspectJ code.
  • AspectJ Cookbook: I bought this book couple years ago hoping to use it to learn AspectJ, but did not have a problem to solve, so it ended up gathering dust. The shell script and the two aspects described are heavily based on recipes in this book. Better late than never, I guess :-).

Conclusion

So anyway, here is a poor man's profiler which you can whip up quickly if you know AspectJ well enough (it took me a while for the first one, but not so much time for the second one). Its also light on resources. My experience with profilers is that it takes a very long time to set up and run, since its busy collecting all sorts of information which you would probably never use. Knowing AspectJ gives you the ability to profile code in a very focused manner - definitely a useful skill to have at your disposal in my opinion.

Saturday, December 04, 2010

A PKI based CAPTCHA Implementation

I'm thinking of implementing CAPTCHA verification for an upcoming application, so I decided to check out JCaptcha, which provides libraries for image and sound CAPTCHAs, as well as an implementation you can directly use in your web application. Based on my understanding of the code from its 5-minute tutorial, the way it works is that it generates a random word, sticks it into the session and renders an image of the random word on the browser. When the user types the word back in, it matches the value in the session against the user input.

Most production systems are load balanced across a bank of servers. In such cases, the JCaptcha wiki either advises using either a distributed singleton or sticky sessions. This is because the session is used to hold on to the original generated word.

An alternative approach that eliminates the need for sessions would be to carry an encrypted version of the original word as a hidden field in the CAPTCHA challege form. Various encryption approaches could be used:

  • One-way hash - the original word is encrypted with something like MD5 and put into the hidden field. On the way back, the user's response is also encrypted and the digests compared.

  • Symmetric two-way hash - the original word is encrypted with something like DES and put into the hidden field. On the way back, we can either decrypt the hidden field back and compare to the user input, or encrypt the user input and compare with the hidden field.

  • Asymetric two-way hash - the original word is encrypted with the public part of a RSA keypair (in my case), and on the way back, decrypted with the private part, and compared to the user input.

I guess that any of these approaches would be secure enough to serve as a CAPTCHA implementation, but I hadn't used the Java PKI API before, and I figured that this would be a good time to learn. The code presented here includes a service (that manages the keypair and does the encryption and decryption), a Spring controller (that calls a JCaptcha WordToImage implementation to generate a CAPTCHA image, and methods to show and submit a CAPTCHA challenge form), and a JSPX file that can be used as part of a Spring Roo application.

Service

Asymmetric two-way hashes (around which the PKI API is based) are based on the premise that:

  d(e(m, q), p) == m

  where:
    m = the message body.
    p = private part of a keypair {p,q}.
    q = public part of a keypair {p,q}.
    d = decoding function.
    e = encoding function.

In typical usage, user A sends a message to user B with keypair {p,q} by encrypting it with B's public key q who then decrypts it with his own private key p In our case, there is only keypair, which belongs to the server. On the way out, the generated word is encrypted with the server's public key, and on the way back, the encrypted value in the hidden field is decrypted back with the server's private key.

At startup, each server loads up its keypair from a pair of key files, deployed as part of the web application. That way each server decrypts the encrypted key exactly the same way. Also, since the encrypted version is carried through the request-response cycle, a CAPTCHA generated on any one machine in the cluster can be verified on any other machine.

To generate the word, the service generates a 32 character MD5 checksum of the Date string at the instant the form is called from the browser, then pull a random substring from it between 3 to 5 characters in length. Here's the code for the service:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
// Source: src/main/java/com/mycompany/ktm/security/CaptchaCryptoService.java
package com.mycompany.ktm.security;

import java.io.File;
import java.security.KeyFactory;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.spec.PKCS8EncodedKeySpec;
import java.security.spec.X509EncodedKeySpec;
import java.util.Date;

import javax.crypto.Cipher;

import org.apache.commons.codec.binary.Hex;
import org.apache.commons.io.FileUtils;
import org.apache.log4j.Logger;
import org.springframework.stereotype.Service;
import org.springframework.util.DigestUtils;

@Service("captchaCryptoService")
public class CaptchaCryptoService {

  private final Logger logger = Logger.getLogger(getClass());
  
  private static final File PRIV_KEY_FILE = 
    new File("src/main/resources/captcha.pri");
  private static final File PUB_KEY_FILE = 
    new File("src/main/resources/captcha.pub");
  
  private Cipher cipher;
  private PublicKey publicKey;
  private PrivateKey privateKey;
  
  public CaptchaCryptoService() {
    try {
      byte[] pri = FileUtils.readFileToByteArray(PRIV_KEY_FILE);
      PKCS8EncodedKeySpec priSpec = new PKCS8EncodedKeySpec(pri);
      KeyFactory priKeyFactory = KeyFactory.getInstance("RSA");
      this.privateKey = priKeyFactory.generatePrivate(priSpec);
      byte[] pub = FileUtils.readFileToByteArray(PUB_KEY_FILE);
      X509EncodedKeySpec pubSpec = new X509EncodedKeySpec(pub);
      KeyFactory pubKeyFactory = KeyFactory.getInstance("RSA");
      this.publicKey = pubKeyFactory.generatePublic(pubSpec);
      this.cipher = Cipher.getInstance("RSA/ECB/PKCS1Padding");
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  
  public String getEncodedCaptchaString() {
    byte[] currentTime = new Date().toString().getBytes();
    String md5 = DigestUtils.md5DigestAsHex(currentTime);
    int start = random(0, 32, 100);
    int length = random(3, 5, 10);
    String plainText = slice(md5, start, length);
    return encrypt(plainText);
  }

  public String getDecodedCaptchaString(String encoded) {
    String decoded = decrypt(encoded);
    return decoded;
  }
  
  public boolean validate(String encoded, String response) {
    return response.equalsIgnoreCase(getDecodedCaptchaString(encoded));
  }

  private boolean keyFilesExist() {
    return PRIV_KEY_FILE.exists() && PUB_KEY_FILE.exists();
  }

  private String encrypt(String plainText) {
    try {
      cipher.init(Cipher.ENCRYPT_MODE, this.publicKey);
      byte[] encrypted = cipher.doFinal(plainText.getBytes());
      Hex hex = new Hex();
      return new String(hex.encode(encrypted));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  private String decrypt(String encodedText) {
    try {
      cipher.init(Cipher.DECRYPT_MODE, this.privateKey);
      Hex hex = new Hex();
      byte[] encrypted = hex.decode(encodedText.getBytes());
      byte[] decrypted = cipher.doFinal(encrypted);
      return new String(decrypted);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
  
  private int random(int min, int max, int multiplier) {
    while (true) {
      double rand = Math.round(Math.random() * multiplier);
      if (rand >= min && rand <= max) {
        return (int) rand;
      }
    }
  }

  private String slice(String str, int start, int length) {
    StringBuilder buf = new StringBuilder();
    int pos = start;
    for (int i = 0; i < length; i++) {
      if (pos == str.length()) {
        pos = 0;
      }
      buf.append(str.charAt(pos));
      pos++;
    }
    return buf.toString();
  }
}

To generate the key pair, I just used the following Java code, although you can also use ssh-keygen or something similar to generate them from the command line.

1
2
3
4
5
6
7
8
9
    KeyPairGenerator keyPairGenerator = KeyPairGenerator.getInstance("RSA");
    keyPairGenerator.initialize(1024);
    KeyPair keyPair = keyPairGenerator.generateKeyPair();
    FileUtils.writeByteArrayToFile(
      PRIV_KEY_FILE, keyPair.getPrivate().getEncoded());
    FileUtils.writeByteArrayToFile(
      PUB_KEY_FILE, keyPair.getPublic().getEncoded());
    this.privateKey = keyPair.getPrivate();
    this.publicKey = keyPair.getPublic();

Controller

I want to add the CAPTCHA into a Spring Roo application (which does not exist at the moment), so I just tried to add it into an existing app to test out the code. The controller code consists of three methods:

  1. Load the CAPTCHA form - in real-life, this would be a create/edit form for a domain object with the CAPTCHA stuff added to it. For testing, all it does is show the image and offer a single text box.
  2. Generate the CAPTCHA image - Generates the image using JCaptcha's WordToImage implementation and writes to as a PNG file.
  3. Handle the CAPTCHA form submit - as before, this would be part of another form in real-life. This takes the encrypted string and the user input and validates it against each other.
  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
// Source: src/main/java/com/mycompany/ktm/web/CaptchaController.java
package com.mycompany.ktm.web;

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.net.BindException;

import javax.annotation.PostConstruct;
import javax.imageio.ImageIO;
import javax.servlet.ServletOutputStream;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Logger;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.validation.BindingResult;
import org.springframework.validation.ObjectError;
import org.springframework.web.bind.ServletRequestUtils;
import org.springframework.web.bind.annotation.ModelAttribute;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.servlet.ModelAndView;

import com.mycompany.ktm.fbo.CaptchaForm;
import com.mycompany.ktm.security.CaptchaCryptoService;
import com.octo.captcha.component.image.backgroundgenerator.EllipseBackgroundGenerator;
import com.octo.captcha.component.image.fontgenerator.TwistedRandomFontGenerator;
import com.octo.captcha.component.image.textpaster.SimpleTextPaster;
import com.octo.captcha.component.image.wordtoimage.ComposedWordToImage;
import com.octo.captcha.component.image.wordtoimage.WordToImage;

@RequestMapping(value="/captcha/**")
@Controller
public class CaptchaController {

  private final Logger logger = Logger.getLogger(getClass());
  
  @Autowired private CaptchaCryptoService captchaCryptoService;

  private WordToImage imageGenerator;
  
  @ModelAttribute("captchaForm")
  public CaptchaForm formBackingObject() {
    return new CaptchaForm();
  }
  
  @PostConstruct
  protected void init() throws Exception {
    this.imageGenerator = new ComposedWordToImage(
      new TwistedRandomFontGenerator(40, 50),
      new EllipseBackgroundGenerator(150, 75),
      new SimpleTextPaster(3, 5, Color.YELLOW)
    );
  }
  
  @RequestMapping(value = "/captcha/form", method = RequestMethod.GET)
  public ModelAndView form(
      @RequestParam(value="ec", required=false) String ec,
      @ModelAttribute("captchaForm") CaptchaForm form,
      BindingResult result) {
    ModelAndView mav = new ModelAndView();
    if (StringUtils.isEmpty(ec)) {
      form.setEc(captchaCryptoService.getEncodedCaptchaString());
    } else {
      result.addError(new ObjectError("dc", "Bzzt! You missed! Try again!"));
      form.setEc(ec);
    }
    mav.addObject("form", form);
    mav.addObject("ec", form.getEc());
    mav.setViewName("captcha/create");
    return mav;
  }

  @RequestMapping(value="/captcha/image", method=RequestMethod.GET)
  public ModelAndView image(HttpServletRequest req, HttpServletResponse res)
      throws Exception {
    String encoded = ServletRequestUtils.getRequiredStringParameter(req, "ec");
    String decoded = captchaCryptoService.getDecodedCaptchaString(encoded);
    BufferedImage image = imageGenerator.getImage(decoded);
    res.setHeader("Cache-Control", "no-store");
    res.setHeader("Pragma", "no-cache");
    res.setDateHeader("Expires", 0);
    res.setContentType("image/png");
    ServletOutputStream ostream = res.getOutputStream();
    ImageIO.write(image, "png", ostream);
    ostream.flush();
    ostream.close();
    return null;
  }

  @RequestMapping(value = "/captcha/validate", method = RequestMethod.POST)
  public ModelAndView validate(
      @ModelAttribute("captchaForm") CaptchaForm form,
      BindingResult result) {
    ModelAndView mav = new ModelAndView();
    String ec = form.getEc();
    String dc = form.getDc();
    if (captchaCryptoService.validate(ec, dc)) {
      mav.setViewName("redirect:/captcha/form");
    } else {
      mav.setViewName("redirect:/captcha/form?ec=" + ec);
    }
    return mav;
  }
}

The form backing object is a plain POJO with only two fields. If we are using a domain object as our form backing object, then I think we can just add these two fields as @Transient, although I haven't tried it. Theres not much to this POJO, I show it below, with getters and setters omitted.

1
2
3
4
5
6
7
8
9
// Source: src/main/java/com/mycompany/ktm/fbo/CaptchaForm.java
package com.mycompany.ktm.fbo;

public class CaptchaForm {

  private String ec; // encrypted string
  private String dc; // plaintext string
  ...  
}

And finally, the JSPX file. This is a copy of one of the Roo generated JSPX files, with the image generated using an <img> tag, a single text field to get the user input, and a submit button.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
<div xmlns:c="http://java.sun.com/jsp/jstl/core" 
     xmlns:form="http://www.springframework.org/tags/form"
     xmlns:jsp="http://java.sun.com/JSP/Page"
     xmlns:spring="http://www.springframework.org/tags" version="2.0">
  <jsp:output omit-xml-declaration="yes"/>
  <script type="text/javascript">
    dojo.require('dijit.TitlePane');
  </script>
  <div id="_title_div">
    <spring:message var="title_msg" text="Captcha Challenge"/>
    <script type="text/javascript">
      Spring.addDecoration(new Spring.ElementDecoration({
        elementId : '_title_div', 
        widgetType : 'dijit.TitlePane', 
        widgetAttrs : {title: '${title_msg}'}
      })); 
    </script>
    <form:form action="/ktm/captcha/validate" method="POST" 
        modelAttribute="captchaForm">
      <form:errors cssClass="errors" delimiter="&lt;p/&gt;"/>
      <div id="roo_captcha_image">
        <img src="/ktm/captcha/image?ec=${ec}"/>
      </div>
      <div id="roo_captcha_dc">
        <label for="_captcha_dc">
          Enter the characters you see in the image above:
        </label>
        <form:input cssStyle="width:250px" id="_captcha_dc" 
          maxlength="5" path="dc" size="0"/>
        <br/>
        <form:errors cssClass="errors" id="_dc_error_id" path="dc"/>
      </div>
      <div class="submit" id="roo_captcha_submit">
        <spring:message code="button.save" var="save_button"/>
        <script type="text/javascript">
          Spring.addDecoration(new Spring.ValidateAllDecoration({
            elementId:'proceed', 
            event:'onclick'
          }));
        </script>
        <input id="proceed" type="submit" value="${save_button}"/>
      </div>
      <form:hidden id="_roo_captcha_ec" path="ec"/>
    </form:form>
  </div>
</div>

I also had to create a new views.xml for this controller, similar to other Roo custom controllers. It only has a single entry, here it is:

1
2
3
4
5
6
7
8
9
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE tiles-definitions 
  PUBLIC "-//Apache Software Foundation//DTD Tiles Configuration 2.1//EN" 
  "http://tiles.apache.org/dtds/tiles-config_2_1.dtd">
<tiles-definitions>
  <definition extends="default" name="captcha/create">
    <put-attribute name="body" value="/WEB-INF/views/captcha/create.jspx"/>
  </definition>
</tiles-definitions>

Putting it all together

As I said before, a "real" use-case of this would be to embed it into an edit form for a domain object - here, I am only interested in testing the functionality, so its being used standalone. Here are some screenshots demonstrating the flow:

Entering /captcha/form in my browser would take me to the challenge form with a CAPTCHA challenge image generated:

If I enter an incorrect value, it shows me an error message and regenerates the same string into a different CAPTCHA image, hopefully more readable this time. This is not the standard behavior for most CAPTCHA's I've used, they generally give you a new image of a different string in these cases. Notice that the URL now has the ec parameter which contains the encrypted string.

If I enter a correct value, it is silently accepted and the form is regenerated with a new CAPTCHA challenge.

References

Apart from the references above, I found the following links very helpful. All the APIs here (except Spring and Roo) are new to me, so I had to do a bit of reading before everything came together.

  • Bouncy Castle Crypto package - contains lots of code examples to get you started quickly with this encryption stuff. My CaptchaCryptoService is based heavily on the PublicExample.java here.

  • Using Public Key encryption in Java - contains general information about PKI and Java. I found out how to store keypairs into files and load them from my service.
  • JCaptcha and Spring Framework - contains a discussion on how to configure JCaptcha beans with Spring. I ended up building my WordToImage implementation in the Controller's @PostConstruct method, but the information in here was invaluable in figuring out whats available.

Thursday, November 25, 2010

Modeling Hierarchical Data with RSS and ROME

Overview

Our API provides a set of REST services over RSS 2.0. One of them is a content service, which allows you to call up a piece of content by URL or ID. So far, we have been serving single pieces of content - recently the need came up for serving server generated mashups which would be presented to an external client, who would then restyle and rearrange the components as they see fit. This means that we have to serve multiple pieces of content that form a coherent whole - in other words, structured hierarchical data.

The RSS 2.0 Specification does not provide any support for hierarchical data. I briefly describe the approaches that occurred to me, then focus on the one I liked the most, which I also suggested in response to a question on StackOverflow. Here they are:

  1. Serve the entire content inside a content:encoded tag.
  2. Use the old style content tags.
  3. Write a custom ROME module to support inlined item hierarchies.
  4. Provide structure information as a category tag.

The first approach is one we already use for our current content service. While the assumption is that the client would use the content as-is on their site, the content is provided as XHTML, so the client can build an XML parser to parse the content if they so desired.

I briefly looked at the old-style content tags, but the embedded RDF tags within the content:items and content:item tags appeared quite confusing. They seem to be used only in RSS 1.0 however, so not quite sure if they can even be used (via ROME) with RSS 2.0. Also, I kind of lost interest when I realized that you can only model a single level with this approach.

I also briefly dallied with the idea of writing my own custom module which would allow nesting of items within items. It would have been interesting to do, but it seemed like a lot of unnecessary work. Moreover, the resulting XML would not have rendered in a standard browser, and would have needed custom parsing on the client side as well. There is a proposal to extend Atom to do something similar, so the idea is probably not that far-fetched.

The last idea (which I am going to expand on) is to return each component of the mashup as a separate item, each with its own content:encoded tag containing the XHTML content for that component. Each item has a category tag with an XPath like value that indicates "where" this component is on the page. The nice thing with this approach is that this requires no additional modules, and the output is renderable in a web browser.

A Concrete Example

As an example, lets see how this approach would play out with a service that provided RSS feeds of pages on this blog. These pages consist of various components or sections - the dek, the Post component, the Tag Cloud component, the main content component, etc. The RSS for this would look something like this:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" 
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:media="http://search.yahoo.com/mrss/"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
     version="2.0">
  <channel>
    <title>My feed title</title>
    <link>http://www.foo.com/path/to/hypothetical/feed/service</link>
    <description>Test for Hierarchical Data in RSS Feed</description>
    <language>en_US</language>
    <dc:language>en_US</dc:language>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</link>
      <description>For quite some time now, I have been looking...</description>
      <category domain="path">/</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</guid>
    </item>
    <item>
      <title>Salmon Run</title>
      <category domain="path">/dek</category>
      <content:encoded><![CDATA[Swimming upstream...blah, blah, blah]]></content:encoded>
    </item>
    <item>
      <title>Posts</title>
      <category domain="path">/toc</category>
      <content:encoded><![CDATA[<div class="widget-content">...</div>]]></content:encoded>
    </item>
    <item>
      <title>Labels</title>
      <category domain="path">/cloud</category>
      <content:encoded><![CDATA[<div class='widget-content cloud-label-widget-content'>...</div>]]></content:encoded>
    </item>
    <item>
      <title>About me</title>
      <link>http://profiles.blogspot.com/path/to/my/complete/profile.html</link>
      <category domain="path">/section</category>
      <guid>http://profiles.blogspot.com/path/to/my/complete/profile.html</guid>
      <content:encoded><![CDATA[I am a programmer...blah, blah, blah]]></content:encoded>
      <media:content url="http://img.blogspot.com/path/to/my/picture.jpg" />
    </item>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke.html</link>
      <category domain="path">/content</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke.html</guid>
      <content:encoded><![CDATA[<p>For quite some time now, I have been looking...</p>]]></content:encoded>
    </item>
    <item>
      <title>Share/Save</title>
      <category domain="path">/labels</category>
      <content:encoded><![CDATA[javascript, lucene, luke, scripting, search]]></content:encoded>
    </item>
    <item>
      <title>2010</title>
      <link>http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2011-01-01T00%3A00%3A00-08%3A00&amp;max-results=33</link>
      <category domain="path">/toc/2010</category>
      <guid>http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2011-01-01T00%3A00%3A00-08%3A00&amp;max-results=33</guid>
    </item>
    <item>
      <title>November</title>
      <link>http://sujitpal.blogspot.com/2010_11_01_archive.html</link>
      <category domain="path">/toc/2010/11</category>
      <guid>http://sujitpal.blogspot.com/2010_11_01_archive.html</guid>
    </item>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</link>
      <category domain="path">/toc/2010/11/03</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</guid>
    </item>
    <item>
      <title>March</title>
      <link>http://sujitpal.blogspot.com/2010_03_01_archive.html</link>
      <category domain="path">/toc/2010/03</category>
      <guid>http://sujitpal.blogspot.com/2010_03_01_archive.html</guid>
    </item>
    <item>
      <title>Java REST Client for Ehcache server</title>
      <link>http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html</link>
      <category domain="path">/toc/2010/03/03</category>
      <guid>http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html</guid>
    </item>
    <item>
      <title>2009</title>
      <link>http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2010-01-01T00%3A00%3A00-08%3A00&amp;max-results=44</link>
      <category domain="path">/toc/2009</category>
      <guid>http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2010-01-01T00%3A00%3A00-08%3A00&amp;max-results=44</guid>
    </item>
    <item>
      <title>December</title>
      <category domain="path">/toc/2009/12</category>
    </item>
    <item>
      <title>A Lucene POS Tagging Tokenfilter</title>
      <link>http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html</link>
      <category domain="path">/toc/2009/12/02</category>
      <guid>http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html</guid>
    </item>
  </channel>
</rss>

The above XML was created from a bean which was manually populated with values. The XML generation code is not particularly interesting, but it is included below as part of the full code for this post. Real-life code would probably call multiple backing services to generate the bean, which would then be rendered into RSS by ROME.

Parsing this XML to extract the components is almost trivially simple. In this case, we are interested in the top level components, ie, those at level 1 of the tree. Our parsing code looks like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  @Test
  public void testParseSectionStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new SectionFilterPredicate());
    print(entries, false);
  }

  private class SectionFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return StringUtils.split(path, "/").length == 1;
    }
  }

and results in the following list. The print() method above only prints the title and the path for this test, but obviously the client can do whatever he wants with the SyndEntry object.

1
2
3
4
5
6
    [junit] Salmon Run (/dek)
    [junit] Posts (/toc)
    [junit] Labels (/cloud)
    [junit] About me (/section)
    [junit] Scripting Luke with Javascript (/content)
    [junit] Share/Save (/labels)

What about multi-level elements such as the Posts element? Well, depending on whether the client will simply dump the component on his page or change it around, we can either provide the XHTML in a content:encoded tag or break it up into individual items. We have chosen to do both here, for illustration. To extract only the navigation elements, the parsing code could do something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new NavigationFilterPredicate());
    Collections.sort(entries, new PathComparator());
    print(entries, true);
  }

  private class NavigationFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return path.startsWith("/toc") && 
        StringUtils.split(path, "/").length > 1;
    }
  }

  private class PathComparator implements Comparator<SyndEntry> {
    @Override public int compare(SyndEntry entry1, SyndEntry entry2) {
      String[] paths1 = 
        StringUtils.split(getCategoryValue(entry1, "path"), "/");
      String[] paths2 = 
        StringUtils.split(getCategoryValue(entry2, "path"), "/");
      Integer y1 = paths1.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths1[1]);
      Integer y2 = paths2.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths2[1]);
      if (y1 == y2) {
        Integer m1 = paths1.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths1[2]);
        Integer m2 = paths2.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths2[2]);
        if (m1 == m2) {
          Integer d1 = paths1.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths1[3]);
          Integer d2 = paths2.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths2[3]);
          if (d1 == d2) {
            String title1 = entry1.getTitle();
            String title2 = entry2.getTitle();
            return title1.compareTo(title2);
          } else {
            return d2.compareTo(d1);
          }
        } else {
          return m2.compareTo(m1);
        }
      } else {
        return y2.compareTo(y1);
      }
    }
  }

Which results in the following output. Note that the sorting is really not required if the server lists them out in the correct order, I only add it in as an example of how simple it is. In our case, the sorting was a bit more complex since we have ordering information built into the path components and we need to sort in reverse chronological order - in most cases, a simple lexical sort would be all that would be required, even if the server did not return them in the correct order (which would be a really strange and non-intuitive thing for the server to not do).

1
2
3
4
5
6
7
8
    [junit] ..2010 (/toc/2010)
    [junit] ...November (/toc/2010/11)
    [junit] ....Scripting Luke with Javascript (/toc/2010/11/03)
    [junit] ...March (/toc/2010/03)
    [junit] ....Java REST Client for Ehcache server (/toc/2010/03/03)
    [junit] ..2009 (/toc/2009)
    [junit] ...December (/toc/2009/12)
    [junit] ....A Lucene POS Tagging Tokenfilter (/toc/2009/12/02)

Equally simple is the ability to manipulate the structure. For example, the client may want to only show the navigation elements for the current year posts. To do that, the code would look something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new AndPredicate(
      new NavigationFilterPredicate(), new CurrentYearFilterPredicate()));
    print(entries, true);
  }

  private class CurrentYearFilterPredicate implements Predicate<SyndEntry> {

    private int currentYear;
    
    public CurrentYearFilterPredicate() {
      Calendar cal = Calendar.getInstance();
      this.currentYear = cal.get(Calendar.YEAR);
    }
    
    @Override public boolean evaluate(SyndEntry entry) {
      String[] pathElements = 
        StringUtils.split(getCategoryValue(entry, "path"), "/");
      if (pathElements.length >= 2) {
        return currentYear == Integer.valueOf(pathElements[1]);
      }
      return false;
    }
  }

Which results in the following output.

1
2
3
4
5
    [junit] ..2010 (/toc/2010)
    [junit] ...November (/toc/2010/11)
    [junit] ....Scripting Luke with Javascript (/toc/2010/11/03)
    [junit] ...March (/toc/2010/03)
    [junit] ....Java REST Client for Ehcache server (/toc/2010/03/03)

Conclusion

I personally think that the approach of "marking up" RSS item tags to indicate the location is quite simple and elegant. It does not require any extension or modification of the RSS specification, just a shared understanding about a single tag between server and client. But since I came up with it, I'm probably not being totally objective.

In my example, I have used ROME on both the server (XML generation) and client (XML parsing) ends to build and parse the RSS feed (as I would in real life too). Obviously, there is no guarantee that clients would use ROME (although it would simplify their lives enormously if they did). But the parsing stuff I talked about above can be done very easily with almost any XML parsing toolkit.

What do you think? Do you think the burden imposed on this client with this approach is excessive? Would it be better to come up with a custom microformat that handles this sort of data instead of using RSS? Would you have tackled the problem differently? Would appreciate hearing your concerns or if you have better/different ways of solving this problem.

Full Code

The full code for the above is modelled as a single JUnit test class, which is shown below. The interesting parts of the code has already been discussed above, this is here only for completeness.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
package com.mycompany.feeds.hierarchicaldata;

import java.io.File;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.collections15.CollectionUtils;
import org.apache.commons.collections15.Predicate;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

import com.sun.syndication.feed.WireFeed;
import com.sun.syndication.feed.module.content.ContentModule;
import com.sun.syndication.feed.module.content.ContentModuleImpl;
import com.sun.syndication.feed.module.mediarss.MediaEntryModuleImpl;
import com.sun.syndication.feed.module.mediarss.types.MediaContent;
import com.sun.syndication.feed.module.mediarss.types.UrlReference;
import com.sun.syndication.feed.synd.SyndCategory;
import com.sun.syndication.feed.synd.SyndCategoryImpl;
import com.sun.syndication.feed.synd.SyndContent;
import com.sun.syndication.feed.synd.SyndContentImpl;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndEntryImpl;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.feed.synd.SyndFeedImpl;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.WireFeedOutput;

public class HierarchyDataTest {

  private final Log log = LogFactory.getLog(getClass());
  
  @Test
  public void testBuild() throws Exception {
    Page page = buildTestPage();
    SyndFeed feed = buildEmptyFeed();
    // main entry
    SyndEntry mainEntry = new SyndEntryImpl();
    mainEntry.setTitle(page.title);
    mainEntry.setLink(page.link);
    setDescription(mainEntry, page.description);
    addCategory(mainEntry, "path", "/");
    addEntry(feed, mainEntry);
    // sections
    for (Section section : page.sections) {
      SyndEntry entry = new SyndEntryImpl();
      entry.setTitle(section.title);
      entry.setLink(section.link);
      setContent(entry, section.content);
      setImage(entry, section.imageUrl);
      addCategory(entry, "path", section.path);
      addEntry(feed, entry);
    }
    WireFeedOutput output = new WireFeedOutput();
    WireFeed wirefeed = feed.createWireFeed();
    FileUtils.writeStringToFile(new File("/tmp/hierarchy_test.xml"), 
      output.outputString(wirefeed), "UTF-8");
  }
  
  @Test
  public void testParseSectionStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new SectionFilterPredicate());
    print(entries, false);
  }

  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new NavigationFilterPredicate());
    Collections.sort(entries, new PathComparator());
    System.out.println("All navigation elements");
    print(entries, true);
    // show only current year navigation
    System.out.println("Current year navigation elements");
    CollectionUtils.filter(entries, new CurrentYearFilterPredicate());
    print(entries, true);
  }

  // ==== bean building methods ====
  
  private Page buildTestPage() {
    Page page = new Page();
    page.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html";
    page.title = "Scripting Luke with Javascript";
    page.description = "For quite some time now, I have been looking...";
    // dek
    Section dek = new Section();
    dek.title = "Salmon Run";
    dek.content = "Swimming upstream...blah, blah, blah";
    dek.path = "/dek";
    page.sections.add(dek);
    // table of contents
    Section toc = new Section();
    toc.title = "Posts";
    toc.content = "<div class=\"widget-content\">...</div>";
    toc.path = "/toc";
    page.sections.add(toc);
    // cloud
    Section cloud = new Section();
    cloud.title = "Labels";
    cloud.content = "<div class='widget-content cloud-label-widget-content'>...</div>";
    cloud.path = "/cloud";
    page.sections.add(cloud);
    // profile
    Section profile = new Section();
    profile.title = "About me";
    profile.content = "I am a programmer...blah, blah, blah";
    profile.imageUrl = "http://img.blogspot.com/path/to/my/picture.jpg";
    profile.link = "http://profiles.blogspot.com/path/to/my/complete/profile.html";
    profile.path = "/section";
    page.sections.add(profile);
    // main content
    Section content = new Section();
    content.title = "Scripting Luke with Javascript";
    content.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke.html";
    content.path = "/content";
    content.content = "<p>For quite some time now, I have been looking...</p>";
    page.sections.add(content);
    // labels
    Section labels = new Section();
    labels.title = "Share/Save";
    labels.path = "/labels";
    labels.content = "javascript, lucene, luke, scripting, search";
    page.sections.add(labels);
    // navigation elements
    Section nav1 = new Section();
    nav1.title = "2010";
    nav1.link = "http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&updated-max=2011-01-01T00%3A00%3A00-08%3A00&max-results=33";
    nav1.path = "/toc/2010";
    page.sections.add(nav1);
    Section nav11 = new Section();
    nav11.title = "November";
    nav11.link = "http://sujitpal.blogspot.com/2010_11_01_archive.html";
    nav11.path = "/toc/2010/11";
    page.sections.add(nav11);
    Section nav111 = new Section();
    nav111.title = "Scripting Luke with Javascript";
    nav111.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html";
    nav111.path = "/toc/2010/11/03";
    page.sections.add(nav111);
    Section nav12 = new Section();
    nav12.title = "March";
    nav12.link = "http://sujitpal.blogspot.com/2010_03_01_archive.html";
    nav12.path = "/toc/2010/03";
    page.sections.add(nav12);
    Section nav121 = new Section();
    nav121.title = "Java REST Client for Ehcache server";
    nav121.link = "http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html";
    nav121.path = "/toc/2010/03/03";
    page.sections.add(nav121);
    Section nav2 = new Section();
    nav2.title = "2009";
    nav2.link = "http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&updated-max=2010-01-01T00%3A00%3A00-08%3A00&max-results=44";
    nav2.path = "/toc/2009";
    page.sections.add(nav2);
    Section nav21 = new Section();
    nav21.title = "December";
    nav21.path = "/toc/2009/12";
    page.sections.add(nav21);
    Section nav211 = new Section();
    nav211.title = "A Lucene POS Tagging Tokenfilter";
    nav211.link = "http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html";
    nav211.path = "/toc/2009/12/02";
    page.sections.add(nav211);
    // return the page bean
    return page;
  }

  private class Page {
    public String title;
    public String link;
    public String description;
    public List<Section> sections = new ArrayList<Section>();
  }
  
  private class Section {
    public String title;
    public String link;
    public String content;
    public String path;
    public String imageUrl;
  }
  
  // ==== feed building methods ====
  
  private SyndFeed buildEmptyFeed() {
    SyndFeed feed = new SyndFeedImpl();
    feed.setFeedType("rss_2.0");
    feed.setTitle("My feed title");
    feed.setDescription("Test for Hierarchical Data in RSS Feed");
    feed.setLink("http://www.foo.com/path/to/hypothetical/feed/service");
    feed.setLanguage("en_US");
    return feed;
  }

  private void addEntry(SyndFeed feed, SyndEntry entry) {
    List<SyndEntry> entries = feed.getEntries();
    if (entries == null) {
      entries = new ArrayList<SyndEntry>();
    }
    entries.add(entry);
  }
  
  private void setImage(SyndEntry entry, String imageUrl) {
    if (imageUrl != null) {
      try {
        MediaContent[] contents = new MediaContent[] {
          new MediaContent(new UrlReference(imageUrl))
        };
        MediaEntryModuleImpl module = new MediaEntryModuleImpl();
        module.setMediaContents(contents);
        entry.getModules().add(module);
      } catch (MalformedURLException e) {
        log.warn(e);
      }
    }
  }
  
  private void setContent(SyndEntry entry, String content) {
    if (content != null) {
      List<String> contents = new ArrayList<String>();
      contents.add(content);
      ContentModule module = new ContentModuleImpl();
      module.setEncodeds(contents);
      entry.getModules().add(module);
    }
  }
  
  private void setDescription(SyndEntry entry, String description) {
    SyndContent content = new SyndContentImpl();
    content.setValue(description);
    entry.setDescription(content);
  }
  
  private void addCategory(SyndEntry entry, String name, String value) {
    SyndCategory category = new SyndCategoryImpl();
    category.setName(value);
    category.setTaxonomyUri(name);
    entry.getCategories().add(category);
  }
  
  // ==== feed parsing methods ====
  
  private class SectionFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return StringUtils.split(path, "/").length == 1;
    }
  }

  private class NavigationFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return path.startsWith("/toc") && 
        StringUtils.split(path, "/").length > 1;
    }
  }

  private class CurrentYearFilterPredicate implements Predicate<SyndEntry> {

    private int currentYear;
    
    public CurrentYearFilterPredicate() {
      Calendar cal = Calendar.getInstance();
      this.currentYear = cal.get(Calendar.YEAR);
    }
    
    @Override public boolean evaluate(SyndEntry entry) {
      String[] pathElements = 
        StringUtils.split(getCategoryValue(entry, "path"), "/");
      if (pathElements.length >= 2) {
        return currentYear == Integer.valueOf(pathElements[1]);
      }
      return false;
    }
  }

  // if we need to change the implicit ordering
  private class PathComparator implements Comparator<SyndEntry> {

    @Override public int compare(SyndEntry entry1, SyndEntry entry2) {
      String[] paths1 = 
        StringUtils.split(getCategoryValue(entry1, "path"), "/");
      String[] paths2 = 
        StringUtils.split(getCategoryValue(entry2, "path"), "/");
      Integer y1 = paths1.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths1[1]);
      Integer y2 = paths2.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths2[1]);
      if (y1 == y2) {
        Integer m1 = paths1.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths1[2]);
        Integer m2 = paths2.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths2[2]);
        if (m1 == m2) {
          Integer d1 = paths1.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths1[3]);
          Integer d2 = paths2.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths2[3]);
          if (d1 == d2) {
            String title1 = entry1.getTitle();
            String title2 = entry2.getTitle();
            return title1.compareTo(title2);
          } else {
            return d2.compareTo(d1);
          }
        } else {
          return m2.compareTo(m1);
        }
      } else {
        return y2.compareTo(y1);
      }
    }
  }

  private String getCategoryValue(SyndEntry entry, String name) {
    List<SyndCategory> categories = entry.getCategories();
    String value = null;
    for (SyndCategory category : categories) {
      if (name.equals(category.getTaxonomyUri())) {
        value = category.getName();
        break;
      }
    }
    return value;
  }

  private void print(List<SyndEntry> entries, boolean indent) {
    for (SyndEntry entry : entries) {
      String path = getCategoryValue(entry, "path");
      System.out.println((indent ? 
        StringUtils.repeat(".", StringUtils.split(path, "/").length) : "") + 
        entry.getTitle() + " (" + path + ")");
    }
  }
}