Thursday, November 25, 2010

Modeling Hierarchical Data with RSS and ROME

Overview

Our API provides a set of REST services over RSS 2.0. One of them is a content service, which allows you to call up a piece of content by URL or ID. So far, we have been serving single pieces of content. Recently, the need came up for serving server-generated mashups to an external client, who would then restyle and rearrange the components as they see fit. This means that we have to serve multiple pieces of content that form a coherent whole - in other words, structured hierarchical data.

The RSS 2.0 Specification does not provide any support for hierarchical data. I briefly describe the approaches that occurred to me, then focus on the one I liked the most, which I also suggested in response to a question on StackOverflow. Here they are:

  1. Serve the entire content inside a content:encoded tag.
  2. Use the old style content tags.
  3. Write a custom ROME module to support inlined item hierarchies.
  4. Provide structure information as a category tag.

The first approach is one we already use for our current content service. While the assumption is that the client would use the content as-is on their site, the content is provided as XHTML, so the client can build an XML parser to pick it apart if they so desire.
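For illustration, a ROME-using client could pull the XHTML payload out of the content:encoded tag along these lines; this is a minimal sketch assuming the ROME content module is on the classpath, and feedUrl is a hypothetical placeholder for the content service URL.

import java.net.URL;

import com.sun.syndication.feed.module.content.ContentModule;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

// minimal client-side sketch; feedUrl is a hypothetical placeholder
public static String fetchContentXhtml(String feedUrl) throws Exception {
  SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));
  SyndEntry entry = (SyndEntry) feed.getEntries().get(0);
  ContentModule module = (ContentModule) entry.getModule(ContentModule.URI);
  // the encoded content is the XHTML payload, which the client can
  // either use as-is or feed to an XML parser
  return (String) module.getEncodeds().get(0);
}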

I briefly looked at the old-style content tags, but the embedded RDF tags within the content:items and content:item tags appeared quite confusing. They seem to be used only in RSS 1.0, however, so I am not sure they can even be used (via ROME) with RSS 2.0. I also lost interest when I realized that this approach can only model a single level.

I also briefly dallied with the idea of writing my own custom module which would allow nesting of items within items. It would have been interesting to do, but it seemed like a lot of unnecessary work. Moreover, the resulting XML would not have rendered in a standard browser, and would have needed custom parsing on the client side as well. There is a proposal to extend Atom to do something similar, so the idea is probably not that far-fetched.

The last idea (which I am going to expand on) is to return each component of the mashup as a separate item, each with its own content:encoded tag containing the XHTML content for that component. Each item has a category tag with an XPath-like value that indicates "where" the component is on the page. The nice thing about this approach is that it requires no additional modules, and the output is renderable in a web browser.

A Concrete Example

As an example, let's see how this approach would play out with a service that provides RSS feeds of the pages on this blog. These pages consist of various components or sections - the dek, the Posts component, the Tag Cloud component, the main content component, etc. The RSS for this would look something like this:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" 
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:media="http://search.yahoo.com/mrss/"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
     version="2.0">
  <channel>
    <title>My feed title</title>
    <link>http://www.foo.com/path/to/hypothetical/feed/service</link>
    <description>Test for Hierarchical Data in RSS Feed</description>
    <language>en_US</language>
    <dc:language>en_US</dc:language>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</link>
      <description>For quite some time now, I have been looking...</description>
      <category domain="path">/</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</guid>
    </item>
    <item>
      <title>Salmon Run</title>
      <category domain="path">/dek</category>
      <content:encoded><![CDATA[Swimming upstream...blah, blah, blah]]></content:encoded>
    </item>
    <item>
      <title>Posts</title>
      <category domain="path">/toc</category>
      <content:encoded><![CDATA[<div class="widget-content">...</div>]]></content:encoded>
    </item>
    <item>
      <title>Labels</title>
      <category domain="path">/cloud</category>
      <content:encoded><![CDATA[<div class='widget-content cloud-label-widget-content'>...</div>]]></content:encoded>
    </item>
    <item>
      <title>About me</title>
      <link>http://profiles.blogspot.com/path/to/my/complete/profile.html</link>
      <category domain="path">/section</category>
      <guid>http://profiles.blogspot.com/path/to/my/complete/profile.html</guid>
      <content:encoded><![CDATA[I am a programmer...blah, blah, blah]]></content:encoded>
      <media:content url="http://img.blogspot.com/path/to/my/picture.jpg" />
    </item>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke.html</link>
      <category domain="path">/content</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke.html</guid>
      <content:encoded><![CDATA[<p>For quite some time now, I have been looking...</p>]]></content:encoded>
    </item>
    <item>
      <title>Share/Save</title>
      <category domain="path">/labels</category>
      <content:encoded><![CDATA[javascript, lucene, luke, scripting, search]]></content:encoded>
    </item>
    <item>
      <title>2010</title>
      <link>http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2011-01-01T00%3A00%3A00-08%3A00&amp;max-results=33</link>
      <category domain="path">/toc/2010</category>
      <guid>http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2011-01-01T00%3A00%3A00-08%3A00&amp;max-results=33</guid>
    </item>
    <item>
      <title>November</title>
      <link>http://sujitpal.blogspot.com/2010_11_01_archive.html</link>
      <category domain="path">/toc/2010/11</category>
      <guid>http://sujitpal.blogspot.com/2010_11_01_archive.html</guid>
    </item>
    <item>
      <title>Scripting Luke with Javascript</title>
      <link>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</link>
      <category domain="path">/toc/2010/11/03</category>
      <guid>http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html</guid>
    </item>
    <item>
      <title>March</title>
      <link>http://sujitpal.blogspot.com/2010_03_01_archive.html</link>
      <category domain="path">/toc/2010/03</category>
      <guid>http://sujitpal.blogspot.com/2010_03_01_archive.html</guid>
    </item>
    <item>
      <title>Java REST Client for Ehcache server</title>
      <link>http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html</link>
      <category domain="path">/toc/2010/03/03</category>
      <guid>http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html</guid>
    </item>
    <item>
      <title>2009</title>
      <link>http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2010-01-01T00%3A00%3A00-08%3A00&amp;max-results=44</link>
      <category domain="path">/toc/2009</category>
      <guid>http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&amp;updated-max=2010-01-01T00%3A00%3A00-08%3A00&amp;max-results=44</guid>
    </item>
    <item>
      <title>December</title>
      <category domain="path">/toc/2009/12</category>
    </item>
    <item>
      <title>A Lucene POS Tagging Tokenfilter</title>
      <link>http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html</link>
      <category domain="path">/toc/2009/12/02</category>
      <guid>http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html</guid>
    </item>
  </channel>
</rss>

The above XML was created from a bean which was manually populated with values. The XML generation code is not particularly interesting, but it is included below as part of the full code for this post. Real-life code would probably call multiple backing services to generate the bean, which would then be rendered into RSS by ROME.

Parsing this XML to extract the components is almost trivially simple. In this case, we are interested in the top-level components, i.e., those at level 1 of the tree. Our parsing code looks like this:

  @Test
  public void testParseSectionStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new SectionFilterPredicate());
    print(entries, false);
  }

  private class SectionFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return StringUtils.split(path, "/").length == 1;
    }
  }

This results in the following list. The print() method only prints the title and the path for this test, but obviously the client can do whatever they want with the SyndEntry object.

    [junit] Salmon Run (/dek)
    [junit] Posts (/toc)
    [junit] Labels (/cloud)
    [junit] About me (/section)
    [junit] Scripting Luke with Javascript (/content)
    [junit] Share/Save (/labels)

What about multi-level elements such as the Posts element? Depending on whether the client will simply dump the component on their page or rearrange it, we can either provide the XHTML in a content:encoded tag or break it up into individual items. We have chosen to do both here, for illustration. To extract only the navigation elements, the parsing code could do something like this:

  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new NavigationFilterPredicate());
    Collections.sort(entries, new PathComparator());
    print(entries, true);
  }

  private class NavigationFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return path.startsWith("/toc") && 
        StringUtils.split(path, "/").length > 1;
    }
  }

  private class PathComparator implements Comparator<SyndEntry> {
    @Override public int compare(SyndEntry entry1, SyndEntry entry2) {
      String[] paths1 = 
        StringUtils.split(getCategoryValue(entry1, "path"), "/");
      String[] paths2 = 
        StringUtils.split(getCategoryValue(entry2, "path"), "/");
      Integer y1 = paths1.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths1[1]);
      Integer y2 = paths2.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths2[1]);
      if (y1.equals(y2)) { // equals(), not ==, which compares Integer references
        Integer m1 = paths1.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths1[2]);
        Integer m2 = paths2.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths2[2]);
        if (m1.equals(m2)) {
          Integer d1 = paths1.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths1[3]);
          Integer d2 = paths2.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths2[3]);
          if (d1.equals(d2)) {
            String title1 = entry1.getTitle();
            String title2 = entry2.getTitle();
            return title1.compareTo(title2);
          } else {
            return d2.compareTo(d1);
          }
        } else {
          return m2.compareTo(m1);
        }
      } else {
        return y2.compareTo(y1);
      }
    }
  }

This results in the following output. Note that the sorting is not really required if the server lists the items in the correct order; I added it only as an example of how simple it is. In our case the sorting was a bit more complex, since ordering information is built into the path components and we need to sort in reverse chronological order. In most cases a simple lexical sort (a sketch follows the output below) would suffice, even if the server did not return the items in the correct order (which would be a strange and non-intuitive thing for a server to do).

    [junit] ..2010 (/toc/2010)
    [junit] ...November (/toc/2010/11)
    [junit] ....Scripting Luke with Javascript (/toc/2010/11/03)
    [junit] ...March (/toc/2010/03)
    [junit] ....Java REST Client for Ehcache server (/toc/2010/03/03)
    [junit] ..2009 (/toc/2009)
    [junit] ...December (/toc/2009/12)
    [junit] ....A Lucene POS Tagging Tokenfilter (/toc/2009/12/02)
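(If a plain lexical sort on the path were enough, the whole PathComparator above could shrink to the few lines below. This sketch reuses the getCategoryValue() helper from the full code at the end of this post, and yields ascending rather than reverse-chronological order.)

// lexical ordering: parents sort before children, siblings sort ascending
Collections.sort(entries, new Comparator<SyndEntry>() {
  @Override public int compare(SyndEntry entry1, SyndEntry entry2) {
    return getCategoryValue(entry1, "path").compareTo(
      getCategoryValue(entry2, "path"));
  }
});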

Manipulating the structure is equally simple. For example, the client may want to show only the navigation elements for the current year's posts. To do that, the code would look something like this:

  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new AndPredicate<SyndEntry>(
      new NavigationFilterPredicate(), new CurrentYearFilterPredicate()));
    print(entries, true);
  }

  private class CurrentYearFilterPredicate implements Predicate<SyndEntry> {

    private int currentYear;
    
    public CurrentYearFilterPredicate() {
      Calendar cal = Calendar.getInstance();
      this.currentYear = cal.get(Calendar.YEAR);
    }
    
    @Override public boolean evaluate(SyndEntry entry) {
      String[] pathElements = 
        StringUtils.split(getCategoryValue(entry, "path"), "/");
      if (pathElements.length >= 2) {
        return currentYear == Integer.valueOf(pathElements[1]);
      }
      return false;
    }
  }

This results in the following output.

    [junit] ..2010 (/toc/2010)
    [junit] ...November (/toc/2010/11)
    [junit] ....Scripting Luke with Javascript (/toc/2010/11/03)
    [junit] ...March (/toc/2010/03)
    [junit] ....Java REST Client for Ehcache server (/toc/2010/03/03)

Conclusion

I personally think that the approach of "marking up" RSS item tags to indicate the location is quite simple and elegant. It does not require any extension or modification of the RSS specification, just a shared understanding about a single tag between server and client. But since I came up with it, I'm probably not being totally objective.

In my example, I have used ROME on both the server (XML generation) and client (XML parsing) ends to build and parse the RSS feed, as I would in real life too. Obviously, there is no guarantee that clients would use ROME (although it would simplify their lives enormously if they did), but the parsing described above can be done easily with almost any XML toolkit.
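As a rough illustration (this is not part of the original test class, and the class name is made up), the same section-level filtering could be done with just the JDK's JAXP and XPath support:

import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PlainXmlSectionParser {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().parse(new File("/tmp/hierarchy_test.xml"));
    XPath xpath = XPathFactory.newInstance().newXPath();
    // grab every item that carries our path-valued category tag
    NodeList items = (NodeList) xpath.evaluate(
      "/rss/channel/item[category/@domain='path']",
      doc, XPathConstants.NODESET);
    for (int i = 0; i < items.getLength(); i++) {
      Element item = (Element) items.item(i);
      String path = xpath.evaluate("category[@domain='path']", item);
      // a level-1 path such as "/dek" splits into ["", "dek"]
      if (path.split("/").length == 2) {
        System.out.println(xpath.evaluate("title", item) + " (" + path + ")");
      }
    }
  }
}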

What do you think? Is the burden this approach imposes on the client excessive? Would it be better to come up with a custom microformat that handles this sort of data instead of using RSS? Would you have tackled the problem differently? I would appreciate hearing your concerns, or about better or different ways of solving this problem.

Full Code

The full code for the above is modeled as a single JUnit test class, shown below. The interesting parts have already been discussed above; the full listing is included here only for completeness.

package com.mycompany.feeds.hierarchicaldata;

import java.io.File;
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.commons.collections15.CollectionUtils;
import org.apache.commons.collections15.Predicate;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

import com.sun.syndication.feed.WireFeed;
import com.sun.syndication.feed.module.content.ContentModule;
import com.sun.syndication.feed.module.content.ContentModuleImpl;
import com.sun.syndication.feed.module.mediarss.MediaEntryModuleImpl;
import com.sun.syndication.feed.module.mediarss.types.MediaContent;
import com.sun.syndication.feed.module.mediarss.types.UrlReference;
import com.sun.syndication.feed.synd.SyndCategory;
import com.sun.syndication.feed.synd.SyndCategoryImpl;
import com.sun.syndication.feed.synd.SyndContent;
import com.sun.syndication.feed.synd.SyndContentImpl;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndEntryImpl;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.feed.synd.SyndFeedImpl;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.WireFeedOutput;

public class HierarchyDataTest {

  private final Log log = LogFactory.getLog(getClass());
  
  @Test
  public void testBuild() throws Exception {
    Page page = buildTestPage();
    SyndFeed feed = buildEmptyFeed();
    // main entry
    SyndEntry mainEntry = new SyndEntryImpl();
    mainEntry.setTitle(page.title);
    mainEntry.setLink(page.link);
    setDescription(mainEntry, page.description);
    addCategory(mainEntry, "path", "/");
    addEntry(feed, mainEntry);
    // sections
    for (Section section : page.sections) {
      SyndEntry entry = new SyndEntryImpl();
      entry.setTitle(section.title);
      entry.setLink(section.link);
      setContent(entry, section.content);
      setImage(entry, section.imageUrl);
      addCategory(entry, "path", section.path);
      addEntry(feed, entry);
    }
    WireFeedOutput output = new WireFeedOutput();
    WireFeed wirefeed = feed.createWireFeed();
    FileUtils.writeStringToFile(new File("/tmp/hierarchy_test.xml"), 
      output.outputString(wirefeed), "UTF-8");
  }
  
  @Test
  public void testParseSectionStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new SectionFilterPredicate());
    print(entries, false);
  }

  @Test
  public void testParseNavigationStructure() throws Exception {
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new File("/tmp/hierarchy_test.xml"));
    List<SyndEntry> entries = feed.getEntries();
    CollectionUtils.filter(entries, new NavigationFilterPredicate());
    Collections.sort(entries, new PathComparator());
    System.out.println("All navigation elements");
    print(entries, true);
    // show only current year navigation
    System.out.println("Current year navigation elements");
    CollectionUtils.filter(entries, new CurrentYearFilterPredicate());
    print(entries, true);
  }

  // ==== bean building methods ====
  
  private Page buildTestPage() {
    Page page = new Page();
    page.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html";
    page.title = "Scripting Luke with Javascript";
    page.description = "For quite some time now, I have been looking...";
    // dek
    Section dek = new Section();
    dek.title = "Salmon Run";
    dek.content = "Swimming upstream...blah, blah, blah";
    dek.path = "/dek";
    page.sections.add(dek);
    // table of contents
    Section toc = new Section();
    toc.title = "Posts";
    toc.content = "<div class=\"widget-content\">...</div>";
    toc.path = "/toc";
    page.sections.add(toc);
    // cloud
    Section cloud = new Section();
    cloud.title = "Labels";
    cloud.content = "<div class='widget-content cloud-label-widget-content'>...</div>";
    cloud.path = "/cloud";
    page.sections.add(cloud);
    // profile
    Section profile = new Section();
    profile.title = "About me";
    profile.content = "I am a programmer...blah, blah, blah";
    profile.imageUrl = "http://img.blogspot.com/path/to/my/picture.jpg";
    profile.link = "http://profiles.blogspot.com/path/to/my/complete/profile.html";
    profile.path = "/section";
    page.sections.add(profile);
    // main content
    Section content = new Section();
    content.title = "Scripting Luke with Javascript";
    content.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke.html";
    content.path = "/content";
    content.content = "<p>For quite some time now, I have been looking...</p>";
    page.sections.add(content);
    // labels
    Section labels = new Section();
    labels.title = "Share/Save";
    labels.path = "/labels";
    labels.content = "javascript, lucene, luke, scripting, search";
    page.sections.add(labels);
    // navigation elements
    Section nav1 = new Section();
    nav1.title = "2010";
    nav1.link = "http://sujitpal.blogspot.com/search?updated-min=2010-01-01T00%3A00%3A00-08%3A00&updated-max=2011-01-01T00%3A00%3A00-08%3A00&max-results=33";
    nav1.path = "/toc/2010";
    page.sections.add(nav1);
    Section nav11 = new Section();
    nav11.title = "November";
    nav11.link = "http://sujitpal.blogspot.com/2010_11_01_archive.html";
    nav11.path = "/toc/2010/11";
    page.sections.add(nav11);
    Section nav111 = new Section();
    nav111.title = "Scripting Luke with Javascript";
    nav111.link = "http://sujitpal.blogspot.com/2010/11/scripting-luke-with-javascript.html";
    nav111.path = "/toc/2010/11/03";
    page.sections.add(nav111);
    Section nav12 = new Section();
    nav12.title = "March";
    nav12.link = "http://sujitpal.blogspot.com/2010_03_01_archive.html";
    nav12.path = "/toc/2010/03";
    page.sections.add(nav12);
    Section nav121 = new Section();
    nav121.title = "Java REST Client for Ehcache server";
    nav121.link = "http://sujitpal.blogspot.com/2010/03/java-rest-client-interface-for-ehcache.html";
    nav121.path = "/toc/2010/03/03";
    page.sections.add(nav121);
    Section nav2 = new Section();
    nav2.title = "2009";
    nav2.link = "http://sujitpal.blogspot.com/search?updated-min=2009-01-01T00%3A00%3A00-08%3A00&updated-max=2010-01-01T00%3A00%3A00-08%3A00&max-results=44";
    nav2.path = "/toc/2009";
    page.sections.add(nav2);
    Section nav21 = new Section();
    nav21.title = "December";
    nav21.path = "/toc/2009/12";
    page.sections.add(nav21);
    Section nav211 = new Section();
    nav211.title = "A Lucene POS Tagging Tokenfilter";
    nav211.link = "http://sujitpal.blogspot.com/2009/12/lucene-pos-tagging-tokenfilter.html";
    nav211.path = "/toc/2009/12/02";
    page.sections.add(nav211);
    // return the page bean
    return page;
  }

  private class Page {
    public String title;
    public String link;
    public String description;
    public List<Section> sections = new ArrayList<Section>();
  }
  
  private class Section {
    public String title;
    public String link;
    public String content;
    public String path;
    public String imageUrl;
  }
  
  // ==== feed building methods ====
  
  private SyndFeed buildEmptyFeed() {
    SyndFeed feed = new SyndFeedImpl();
    feed.setFeedType("rss_2.0");
    feed.setTitle("My feed title");
    feed.setDescription("Test for Hierarchical Data in RSS Feed");
    feed.setLink("http://www.foo.com/path/to/hypothetical/feed/service");
    feed.setLanguage("en_US");
    return feed;
  }

  private void addEntry(SyndFeed feed, SyndEntry entry) {
    List<SyndEntry> entries = feed.getEntries();
    if (entries == null) {
      entries = new ArrayList<SyndEntry>();
      feed.setEntries(entries); // attach the new list, or the entry is lost
    }
    entries.add(entry);
  }
  
  private void setImage(SyndEntry entry, String imageUrl) {
    if (imageUrl != null) {
      try {
        MediaContent[] contents = new MediaContent[] {
          new MediaContent(new UrlReference(imageUrl))
        };
        MediaEntryModuleImpl module = new MediaEntryModuleImpl();
        module.setMediaContents(contents);
        entry.getModules().add(module);
      } catch (MalformedURLException e) {
        log.warn(e);
      }
    }
  }
  
  private void setContent(SyndEntry entry, String content) {
    if (content != null) {
      List<String> contents = new ArrayList<String>();
      contents.add(content);
      ContentModule module = new ContentModuleImpl();
      module.setEncodeds(contents);
      entry.getModules().add(module);
    }
  }
  
  private void setDescription(SyndEntry entry, String description) {
    SyndContent content = new SyndContentImpl();
    content.setValue(description);
    entry.setDescription(content);
  }
  
  private void addCategory(SyndEntry entry, String name, String value) {
    SyndCategory category = new SyndCategoryImpl();
    category.setName(value);
    category.setTaxonomyUri(name);
    entry.getCategories().add(category);
  }
  
  // ==== feed parsing methods ====
  
  private class SectionFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return StringUtils.split(path, "/").length == 1;
    }
  }

  private class NavigationFilterPredicate implements Predicate<SyndEntry> {
    @Override public boolean evaluate(SyndEntry entry) {
      String path = getCategoryValue(entry, "path");
      return path.startsWith("/toc") && 
        StringUtils.split(path, "/").length > 1;
    }
  }

  private class CurrentYearFilterPredicate implements Predicate<SyndEntry> {

    private int currentYear;
    
    public CurrentYearFilterPredicate() {
      Calendar cal = Calendar.getInstance();
      this.currentYear = cal.get(Calendar.YEAR);
    }
    
    @Override public boolean evaluate(SyndEntry entry) {
      String[] pathElements = 
        StringUtils.split(getCategoryValue(entry, "path"), "/");
      if (pathElements.length >= 2) {
        return currentYear == Integer.valueOf(pathElements[1]);
      }
      return false;
    }
  }

  // if we need to change the implicit ordering
  private class PathComparator implements Comparator<SyndEntry> {

    @Override public int compare(SyndEntry entry1, SyndEntry entry2) {
      String[] paths1 = 
        StringUtils.split(getCategoryValue(entry1, "path"), "/");
      String[] paths2 = 
        StringUtils.split(getCategoryValue(entry2, "path"), "/");
      Integer y1 = paths1.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths1[1]);
      Integer y2 = paths2.length < 2 ? 
        Integer.MAX_VALUE : Integer.valueOf(paths2[1]);
      if (y1.equals(y2)) { // equals(), not ==, which compares Integer references
        Integer m1 = paths1.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths1[2]);
        Integer m2 = paths2.length < 3 ? 
          Integer.MAX_VALUE : Integer.valueOf(paths2[2]);
        if (m1.equals(m2)) {
          Integer d1 = paths1.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths1[3]);
          Integer d2 = paths2.length < 4 ? 
            Integer.MAX_VALUE : Integer.valueOf(paths2[3]);
          if (d1.equals(d2)) {
            String title1 = entry1.getTitle();
            String title2 = entry2.getTitle();
            return title1.compareTo(title2);
          } else {
            return d2.compareTo(d1);
          }
        } else {
          return m2.compareTo(m1);
        }
      } else {
        return y2.compareTo(y1);
      }
    }
  }

  private String getCategoryValue(SyndEntry entry, String name) {
    List<SyndCategory> categories = entry.getCategories();
    String value = null;
    for (SyndCategory category : categories) {
      if (name.equals(category.getTaxonomyUri())) {
        value = category.getName();
        break;
      }
    }
    return value;
  }

  private void print(List<SyndEntry> entries, boolean indent) {
    for (SyndEntry entry : entries) {
      String path = getCategoryValue(entry, "path");
      System.out.println((indent ? 
        StringUtils.repeat(".", StringUtils.split(path, "/").length) : "") + 
        entry.getTitle() + " (" + path + ")");
    }
  }
}

Friday, November 19, 2010

Scripting Luke with Javascript

For quite some time now, I have been looking for a good way to build scripts that (a) allow people to "look inside" Lucene indexes on a remote machine, and (b) can be embedded in larger scripts which run automatically via cron.

My first attempt was to write Python scripts with PyLucene, but they ended up never getting used, because our ops team prefers bash, and because of the effort of installing PyLucene on the production Unix boxes. PyLucene development has picked up again recently, but there was quite a long period during which we were using Lucene 2.x while PyLucene was only available for Lucene 1.x, so it was probably not such a bad thing.

My next attempt was with Lucli, the Lucene-CLI tool - I added a bunch of functionality to it, including the ability to run "lucli macros" (you can find the code here if you are interested). That never caught on with our scripting folks, however, probably because of the added complexity of having to maintain additional ".lucli" macro files in the repository - the preferred approach seems to be to send the commands into Lucli with a here document and parse the results with awk. Since we were not using the extra functionality of the local Lucli, when the time came to upgrade to Lucene 3.x, we simply pointed to the new JARs and it was business as usual.

Nowadays, when I need to do quick one-off analysis or debugging of Lucene indexes, I just use Jython and copy-paste from one of my older scripts. It is not too different from writing Java code (i.e., we don't get PyLucene's Pythonic interface), but it is slightly more concise and easier to run from the command line.

When I need to "look inside" a Lucene index on a remote machine, I ssh in with the -X option, then run Luke against the index on the remote machine. This points the DISPLAY on the remote machine to that of my local machine, so Luke shows up on my computer. The sequence of commands goes like this:

sujit@cyclone:~$ ssh -X sujit@avalanche
sujit@avalanche's password: xxxx
sujit@avalanche:~$ luke.sh -index /path/to/index -ro

However, I recently downloaded Luke 1.0.1, and discovered that it came with a Javascript scripting console. It also takes a -script /path/to/script.js parameter on its command line, which got me all excited about the possibility of merging requirements (a) and (b) above into a single tool, and running against a codebase that (historically at least) has been faithfully tracking Lucene releases. Here's a screenshot:

[Screenshot: Luke 1.0.1 with the Javascript scripting console open]

However, a little testing showed that all the -script parameter does is run the script within Luke's Javascript console - the behavior I was intuitively expecting was for Luke to run the script, dump the results to STDOUT, and exit. I have an issue open on Luke's Issue Tracker - feel free to vote for it if you agree.

Assuming that the above expectation is reasonable, and at some point in the future the -script parameter will behave as I think it should, I set about trying to figure out what I could do with the Javascript console. Here are some of the operations that I would use the scripting interface for:

  count([query]): With no query string, should return the number of records in the index; with a query string, should return the number of matching records.
  search(query): Should execute the search specified by the query and return the results.
  find(fieldname, fieldvalue): Should read the index sequentially, returning documents where fieldname equals fieldvalue.
  get(docid): Should return the document with the given docId.
  terms([fieldname]): With no field name, should return a map of all field names and their term counts; with a field name, should return only the count for that field.

Javascript is not my favorite scripting language, nor am I very good at it, but since it appears to be quite popular as an embedded scripting engine for Java-based apps (Alfresco, and now Luke), I figured it was worth learning, and this would be a good opportunity. Here are the Javascript functions corresponding to the operations listed above. The ones prefixed with an underscore are "private" functions used by the "public" (i.e., corresponding to an operation) functions.

function _is_undefined(value) {
  return typeof(value) === "undefined";
}

function _get_query(q) {
  var analyzer = 
    new Packages.org.apache.lucene.analysis.standard.StandardAnalyzer(
    app.getLuceneVersion());
  var parser = new Packages.org.apache.lucene.queryParser.QueryParser(
    app.getLuceneVersion(), "f", analyzer);
  return parser.parse(q);
}

function count(q) {
  // Javascript does not overload functions, so a single count() handles
  // both cases - the q argument is optional
  if (_is_undefined(q)) {
    print("count:" + ir.numDocs());
    return;
  }
  var searcher = new Packages.org.apache.lucene.search.IndexSearcher(ir);
  var query = _get_query(q);
  var hits = searcher.search(query, 100).scoreDocs;
  print("count(" + q + "):" + hits.length);
  searcher.close();
}

function search(q) {
  var searcher = new Packages.org.apache.lucene.search.IndexSearcher(ir);
  var query = _get_query(q);
  var hits = searcher.search(query, 100).scoreDocs;
  for (var i = 0; i < hits.length; i++) {
    get(hits[i].doc, hits[i].score);
  }
  searcher.close();
}

function find(key, val) {
  var numDocs = ir.numDocs();
  for (var i = 0; i < numDocs; i++) {
    var doc = ir.document(i);
    var docval = doc.get(key);
    // skip documents that do not have this field at all
    if (docval == null) {
      continue;
    }
    if (val == String(docval)) {
      get(i);
    }
  }
}

function get(docId, score) {
  if (_is_undefined(score)) {
    print("-- docId: " + docId + " --");
  } else {
    print("-- docId:" + docId + " (score:" + score + ") --");
  }
  var doc = ir.document(docId);
  var fields = doc.getFields();
  for (var i = 0; i < fields.size(); i++) {
    var field = fields.get(i);
    var fieldname = field.name();
    print(fieldname + ":" + doc.get(fieldname));
  }
}

function terms(fieldname) {
  var te = ir.terms();
  var termDict = {};
  while (te.next()) {
    var fldname = te.term().field();
    if (_is_undefined(termDict[fldname])) {
      termDict[fldname] = 1;
    } else {
      termDict[fldname] = termDict[fldname] + 1;
    }
  }
  if (_is_undefined(fieldname) || fieldname == "") {
    var sortable = [];
    for (var key in termDict) {
      sortable.push([key, termDict[key]]);
    }
    var sortedTermDict = sortable.sort(function(a,b) { return b[1] - a[1]; });
    for (var i = 0; i < sortedTermDict.length; i++) {
      print(sortedTermDict[i][0] + ":" + sortedTermDict[i][1]);
    }
  } else {
    if (_is_undefined(termDict[fieldname])) {
      print("Field not found:" + fieldname);
    } else {
      print(fieldname + ":" + termDict[fieldname]);
    }
  }
}

// unit tests
print("#-docs in index");
count();
print("#-docs for title:bone");
count("title:bone");

print("Search for title:bone");
search("title:bone");

print("get doc 0");
get(0);

print("Find record with title: Broken bone");
find("title", "Broken bone");

print("printing all term counts");
terms("");
print("printing term counts for idx");
terms("idx");
print("printing term counts for non-existent field foo");
terms("foo");

The functions are pretty basic at the moment. I would want to be able to plug in custom analyzers, less frequently custom similarity implementations, and (even less frequently) custom sorts into the search function. But this can be accomplished easily by passing extra parameters into the search function and adding a little bit of extra code.

I was unable to get a reference to the Version enum from within Javascript, so I had to add a new method getLuceneVersion() in Luke.java (so it's now accessible as app.getLuceneVersion() from Javascript). This seems a reasonable thing to do, since specific versions of Luke do track specific versions of Lucene. I added the method and ran "ant dist" to rebuild the JARs so my shell script (see below) could find it.

  public Version getLuceneVersion() {
    return Version.LUCENE_30;
  }

To call Luke, I created a luke.sh file in the luke-1.0.1 bin subdirectory.

#!/bin/bash
# Source: Downloads/luke-1.0.1/bin/luke.sh
BASEDIR=/Users/sujit/Downloads/luke-1.0.1
export CLASS_PATH=\
$BASEDIR/lib/hadoop/commons-cli-1.2.jar:\
$BASEDIR/lib/hadoop/commons-codec-1.3.jar:\
$BASEDIR/lib/hadoop/commons-httpclient-3.0.1.jar:\
$BASEDIR/lib/hadoop/commons-logging-1.0.4.jar:\
$BASEDIR/lib/hadoop/commons-logging-api-1.0.4.jar:\
$BASEDIR/lib/hadoop/commons-net-1.4.1.jar:\
$BASEDIR/lib/hadoop/ehcache-1.6.0.jar:\
$BASEDIR/lib/hadoop/hadoop-0.20.2-core.jar:\
$BASEDIR/lib/hadoop/jets3t-0.6.1.jar:\
$BASEDIR/lib/hadoop/kfs-0.2.2.jar:\
$BASEDIR/lib/hadoop/log4j-1.2.15.jar:\
$BASEDIR/lib/hadoop/oro-2.0.8.jar:\
$BASEDIR/lib/hadoop/slf4j-api-1.4.3.jar:\
$BASEDIR/lib/hadoop/slf4j-log4j12-1.4.3.jar:\
$BASEDIR/lib/hadoop/xmlenc-0.52.jar:\
$BASEDIR/lib/js.jar:\
$BASEDIR/lib/lucene-analyzers-3.0.1.jar:\
$BASEDIR/lib/lucene-core-3.0.1.jar:\
$BASEDIR/lib/lucene-misc-3.0.1.jar:\
$BASEDIR/lib/lucene-queries-3.0.1.jar:\
$BASEDIR/lib/lucene-snowball-3.0.1.jar:\
$BASEDIR/lib/lucene-xml-query-parser-3.0.1.jar:\
$BASEDIR/dist/luke-1.0.1.jar
java -cp $CLASS_PATH org.getopt.luke.Luke $*

During development, I edited the functions in a single test.js file outside Luke (the Javascript console does not have command history, so it is not the best place to do development). Then I called Luke once as follows:

sujit@cyclone:luke-1.0.1$ bin/luke.sh -index /path/to/index -ro

Then, in the Javascript console, the full test.js file can be loaded with a load("/path/to/test.js") call, which runs the whole thing.

For regular use, one could write a script (bash, although I would prefer Python) that takes inputs such as the path to the index, the query string, etc., as command line parameters, creates a temporary file that imports the function definitions, and appends the function call to make (similar to my unit tests) at the end of that file. It would then launch Luke with the -script option pointing to this temporary file, which would run the script, write its results to STDOUT, and exit. The wrapper script would then parse the output (for downstream use) or return it as-is.
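To make the mechanics concrete, here is roughly what that wrapper might look like. I sketch it in Java for consistency with the rest of this post (the real thing would be bash or Python, as noted), and it assumes the wished-for behavior where -script runs the file, writes to STDOUT, and exits; the class name and file paths are made up.

import java.io.File;
import java.io.FileWriter;

public class LukeScriptRunner {
  public static void main(String[] args) throws Exception {
    String index = args[0]; // path to the Lucene index
    String query = args[1]; // query string to run
    // build a temporary script: load the function definitions, then
    // append the call to make (similar to the unit tests above)
    File script = File.createTempFile("luke", ".js");
    FileWriter writer = new FileWriter(script);
    writer.write("load(\"/path/to/test.js\");\n");
    writer.write("search(\"" + query + "\");\n");
    writer.close();
    // launch Luke against the index, pointing -script at the temp file;
    // the output goes to STDOUT for downstream parsing
    Process p = new ProcessBuilder("bin/luke.sh", "-index", index, "-ro",
      "-script", script.getAbsolutePath()).inheritIO().start();
    p.waitFor();
    script.delete();
  }
}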

Thinking about this some more, though, it does seem like a lot of work and complexity. The main advantage of this approach is that you can probably stick with bash scripting, delegating to Luke for the Lucene stuff. That aside, now that PyLucene is an official Apache Lucene subproject, and one can be reasonably certain that it, too, will track Lucene releases as faithfully as Luke does, it may be time to just dust off the old PyLucene-based Python scripts and keep things simple.