Friday, February 27, 2009

Summarization with Lucene

You may have noticed that over the last couple of months, I haven't been writing too much about text mining. So I got sidetracked a bit - there's life beyond text mining, y'know :-). In any case, I am back on track to working through the remaining chapters of my TMAP book. This week, I describe a summarizer application built using Lucene.

Summarization involves reading a body of text and restating it in your own words; done algorithmically, it requires a fair amount of AI code and domain knowledge (about the text being summarized). Having seen this in action (as a consumer of the end product) at my previous job at CNET, I can tell you that a lot of work goes into something of this sort. My goals are far less ambitious - I just want to find the "best" (most relevant) few sentences from a body of text and call it my summary.

A similar approach is taken by the two open-source summarizer applications I looked at, namely Classifier4J (C4J) and Open Text Summarizer (OTS). Based on reading this Linux.com article describing OTS and using it a bit, and looking at the sources for C4J's SimpleSummarizer, I realized that I could pick up good ideas from both applications and build one myself. For example, OTS uses XML files to specify grammar rules and a dictionary of excluded words, which can be implemented using Lucene's PorterStemFilter and StopFilter respectively. Classifier4J tokenizes words and uses an in-memory HashMap to store a word-frequency map, which Lucene provides via the terms() and docFreq() methods of IndexReader.

Algorithm and Code

My LuceneSummarizer tokenizes the input into paragraphs, and the paragraphs into sentences, then writes each sentence out to an in-memory Lucene index. It then computes the term frequency map of the index to find the most frequent words in the document, takes the top few terms, and hits the index with a BooleanQuery built from them to find the most relevant sentences. The top few sentences thus found (ordered by docId) constitute the summary. The summarizer code is shown below - referenced classes are described further down.

// Source: src/main/java/com/mycompany/myapp/summarizers/LuceneSummarizer.java
package com.mycompany.myapp.summarizers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.ParagraphTokenizer;
import com.mycompany.myapp.tokenizers.SentenceTokenizer;

/**
 * Lucene based summarizer. Tokenizes a document into paragraphs and
 * paragraphs into sentences, then builds an in-memory Lucene index for
 * the document, with one sentence per single-field Lucene document and
 * index-time boosts derived from the paragraph and sentence numbers.
 * Extracts the top terms from the in-memory index, issues a Boolean OR
 * query built from these terms, and returns the top few sentences found,
 * ordered by Lucene document id.
 */
public class LuceneSummarizer {

  private Analyzer analyzer = new StandardAnalyzer();
  private int numSentences = 2;
  private float topTermCutoff;
  // these two values are used to implement a simple linear deboost. If 
  // a different algorithm is desired, these variables are likely to be
  // no longer required.
  private float sentenceDeboost;
  private float sentenceDeboostBase = 0.5F;
  
  private ParagraphTokenizer paragraphTokenizer;
  private SentenceTokenizer sentenceTokenizer;
  
  /**
   * Allows setting a custom analyzer. Default is StandardAnalyzer
   * if not specified.
   * @param analyzer the analyzer to set.
   */
  public void setAnalyzer(Analyzer analyzer) {
    this.analyzer = analyzer;
  }
  
  /**
   * The number of sentences required in the summary. Default is 2.
   * @param numSentences the number of sentences in summary.
   */
  public void setNumSentences(int numSentences) {
    this.numSentences = numSentences;
  }
  
  /**
   * This value specifies where to cutoff the term list for query.
   * The text is loaded into an in-memory index, a sentence per
   * Lucene Document. Then the index is queried for terms and their
   * associated frequency in the index. The topTermCutoff is a 
   * ratio from 0 to 1 which specifies how far to go down the 
   * frequency ordered list of terms. The terms considered have
   * a frequency greater than or equal to topTermCutoff * topFrequency.
   * @param topTermCutoff a ratio specifying where the term list
   *        will be cut off. Must be between 0 and 1. Default is
   *        to consider all terms if this variable is not set,
   *        ie topTermCutoff == 0. But it is recommended to set
   *        an appropriate value (such as 0.5). 
   */
  public void setTopTermCutoff(float topTermCutoff) {
    if (topTermCutoff < 0.0F || topTermCutoff > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= topTermCutoff <= 1.0F");
    }
    this.topTermCutoff = topTermCutoff;
  }
  
  /**
   * Applies an index-time deboost to the sentences after the first
   * one in all the paragraphs after the first one. This attempts to
   * model the summarization heuristic that a summary can be generated
   * by reading the first paragraph (in full) of a document, followed
   * by the first sentence in every succeeding paragraph. The first 
   * paragraph is not deboosted at all. For the second and succeeding
   * paragraphs, the deboost is calculated as (1 - sentence_pos * deboost)
   * until the value reaches sentenceDeboostBase (default 0.5) or less, 
   * and then no more deboosting occurs. 
   * @param sentenceDeboost the deboost value to set. Must be between 
   *        0 and 1. Default is no deboosting, ie sentenceDeboost == 0.
   */
  public void setSentenceDeboost(float sentenceDeboost) {
    if (sentenceDeboost < 0.0F || sentenceDeboost > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= sentenceDeboost <= 1.0F");
    }
    this.sentenceDeboost = sentenceDeboost;
  }
  
  /**
   * This parameter is used in conjunction with sentenceDeboost. This
   * value defines the base until which deboosting will occur and then
   * stop. Default is set to 0.5 if not set. Must be between 0 and 1.
   * @param sentenceDeboostBase the sentenceDeboostBase to set.
   */
  public void setSentenceDeboostBase(float sentenceDeboostBase) {
    if (sentenceDeboostBase < 0.0F || sentenceDeboostBase > 1.0F) {
      throw new IllegalArgumentException(
        "Invalid value: 0.0F <= sentenceDeboostBase <= 1.0F");
    }
    this.sentenceDeboostBase = sentenceDeboostBase;
  }

  /**
   * The init method pre-instantiates the Paragraph and Sentence tokenizers,
   * both of which are based on ICU4J RuleBasedBreakIterators and are
   * expensive to set up, so we set them up once and reuse them.
   * @throws Exception if one is thrown.
   */
  public void init() throws Exception {
    this.paragraphTokenizer = new ParagraphTokenizer();
    this.sentenceTokenizer = new SentenceTokenizer();
  }
  
  /**
   * This is the method that will be called by a client after setting up
   * the summarizer, configuring it appropriately by calling the setters,
   * and calling init() on it to instantiate its expensive objects.
   * @param text the text to summarize. At this point, the text should
   *        be plain text, converters ahead of this one in the chain
   *        should have done the necessary things to remove HTML tags,
   *        etc.
   * @return the summary in the specified number of sentences. 
   * @throws Exception if one is thrown.
   */
  public String summarize(String text) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    buildIndex(ramdir, text);
    Query topTermQuery = computeTopTermQuery(ramdir);
    String[] sentences = searchIndex(ramdir, topTermQuery);
    return StringUtils.join(sentences, " ... ");
  }

  /**
   * Builds an in-memory index of the sentences in the text with the
   * appropriate document boosts if specified.
   * @param ramdir the RAM Directory to use.
   * @param text the text to index.
   * @throws Exception if one is thrown.
   */
  private void buildIndex(Directory ramdir, String text) throws Exception {
    if (paragraphTokenizer == null || sentenceTokenizer == null) {
      throw new IllegalArgumentException(
        "Please call init() to instantiate tokenizers");
    }
    IndexWriter writer = new IndexWriter(ramdir, analyzer,
      MaxFieldLength.UNLIMITED);
    paragraphTokenizer.setText(text);
    String paragraph = null;
    int pno = 0;
    while ((paragraph = paragraphTokenizer.nextParagraph()) != null) {
      sentenceTokenizer.setText(paragraph);
      String sentence = null;
      int sno = 0;
      while ((sentence = sentenceTokenizer.nextSentence()) != null) {
        Document doc = new Document();
        doc.add(new Field("text", sentence, Store.YES, Index.ANALYZED));
        doc.setBoost(computeDeboost(pno, sno));
        writer.addDocument(doc);
        sno++;
      }
      pno++;
    }
    writer.commit();
    writer.close();
  }

  /**
   * Applies a linear deboost function to simulate the manual heuristic of
   * summarizing by skimming the first few sentences off a paragraph.
   * @param paragraphNumber the paragraph number (0-based).
   * @param sentenceNumber the sentence number (0-based).
   * @return the deboost to apply to the current document.
   */
  private float computeDeboost(int paragraphNumber, int sentenceNumber) {
    if (paragraphNumber > 0) {
      if (sentenceNumber > 0) {
        float deboost = 1.0F - (sentenceNumber * sentenceDeboost);
        return (deboost < sentenceDeboostBase) ? 
          sentenceDeboostBase : deboost; 
      }
    }
    return 1.0F;
  }

  /**
   * Computes a term frequency map for the index at the specified location.
   * Builds a Boolean OR query out of the "most frequent" terms in the index 
   * and returns it. "Most Frequent" is defined as the terms whose frequencies
   * are greater than or equal to the topTermCutoff * the frequency of the
   * top term, where topTermCutoff is a number between 0 and 1.
   * @param ramdir the directory where the index is created.
   * @return a Boolean OR query.
   * @throws Exception if one is thrown.
   */
  private Query computeTopTermQuery(Directory ramdir) throws Exception {
    final Map<String,Integer> frequencyMap = 
      new HashMap<String,Integer>();
    List<String> termlist = new ArrayList<String>();
    IndexReader reader = IndexReader.open(ramdir);
    TermEnum terms = reader.terms();
    while (terms.next()) {
      Term term = terms.term();
      String termText = term.text();
      int frequency = reader.docFreq(term);
      frequencyMap.put(termText, frequency);
      termlist.add(termText);
    }
    reader.close();
    // sort the term map by frequency descending
    Collections.sort(termlist, new ReverseComparator<String>(
      new ByValueComparator<String,Integer>(frequencyMap)));
    // retrieve the top terms based on topTermCutoff
    List<String> topTerms = new ArrayList<String>();
    float topFreq = -1.0F;
    for (String term : termlist) {
      if (topFreq < 0.0F) {
        // first term, capture the value
        topFreq = (float) frequencyMap.get(term);
        topTerms.add(term);
      } else {
        // not the first term, compute the ratio and discard if below
        // topTermCutoff score
        float ratio = (float) ((float) frequencyMap.get(term) / topFreq);
        if (ratio >= topTermCutoff) {
          topTerms.add(term);
        } else {
          break;
        }
      }
    }
    StringBuilder termBuf = new StringBuilder();
    BooleanQuery q = new BooleanQuery();
    for (String topTerm : topTerms) {
      termBuf.append(topTerm).
        append("(").
        append(frequencyMap.get(topTerm)).
        append(");");
      q.add(new TermQuery(new Term("text", topTerm)), Occur.SHOULD);
    }
    System.out.println(">>> top terms: " + termBuf.toString());
    System.out.println(">>> query: " + q.toString());
    return q;
  }
  
  /**
   * Executes the query against the specified index, and returns a bounded
   * collection of sentences ordered by document id (so the sentence ordering
   * is preserved in the collection).
   * @param ramdir the directory location of the index.
   * @param query the Boolean OR query computed from the top terms.
   * @return an array of sentences.
   * @throws Exception if one is thrown.
   */
  private String[] searchIndex(Directory ramdir, Query query) 
      throws Exception {
    SortedMap<Integer,String> sentenceMap = 
      new TreeMap<Integer,String>();
    IndexSearcher searcher = new IndexSearcher(ramdir);
    TopDocs topDocs = searcher.search(query, numSentences);
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
      int docId = scoreDoc.doc;
      Document doc = searcher.doc(docId);
      sentenceMap.put(scoreDoc.doc, StringUtils.chomp(doc.get("text")));
    }
    searcher.close();
    return sentenceMap.values().toArray(new String[0]);
  }
}

Custom Analyzer

To provide stemming functionality (so that words such as concurrent and concurrency are treated as occurrences of the same word), I use Lucene's PorterStemFilter and the StopFilter with a custom stopword set from here to build a custom Analyzer for indexing the sentences. I found this article about Lucene Analyzers on Marcus Tripp's blog quite helpful. The code for my analyzer is shown below:

// Source: src/main/java/com/mycompany/myapp/summarizers/SummaryAnalyzer.java
package com.mycompany.myapp.summarizers;

import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Special purpose analyzer that uses a chain of PorterStemFilter, 
 * StopFilter, LowerCaseFilter and StandardFilter to wrap a 
 * StandardTokenizer. The StopFilter uses a custom stop word set
 * adapted from:
 * http://www.onjava.com/onjava/2003/01/15/examples/EnglishStopWords.txt
 * For ease of maintenance, we put these words in a flat file and
 * import them on analyzer construction.
 */
public class SummaryAnalyzer extends Analyzer {

  private Set<String> stopset;
  
  public SummaryAnalyzer() throws IOException {
    String[] stopwords = filterComments(StringUtils.split(
      FileUtils.readFileToString(new File(
      "src/main/resources/stopwords.txt"), "UTF-8")));
    this.stopset = StopFilter.makeStopSet(stopwords, true);
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
      new StopFilter(
        new LowerCaseFilter(
          new StandardFilter(
            new StandardTokenizer(reader))), stopset));
  }
  
  private String[] filterComments(String[] input) {
    List<String> stopwords = new ArrayList<String>();
    for (String stopword : input) {
      if (! stopword.startsWith("#")) {
        stopwords.add(stopword);
      }
    }
    return stopwords.toArray(new String[0]);
  }
}

The nice thing about using pre-built Lucene components is that they can easily be extended by adding new filters into the chain. For example, if we wanted to restrict our terms to a certain part-of-speech (say nouns), it would be quite simple to build a POSFilter that uses either the Wordnet database or a rule-based tagger such as the Brill tagger, as sketched below.
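
Here is a minimal sketch of what such a filter might look like against the Lucene 2.4-era TokenStream API. This is not part of the original code, and the isNoun() helper is hypothetical - in practice it would be backed by Wordnet lookups or a Brill-style tagger:

// Sketch only - not part of the original code. The isNoun() helper is
// hypothetical and would need to be backed by Wordnet or a POS tagger.
package com.mycompany.myapp.summarizers;

import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class POSFilter extends TokenFilter {

  public POSFilter(TokenStream input) {
    super(input);
  }

  @Override
  public Token next(Token reusableToken) throws IOException {
    Token token;
    // pass through only those tokens the tagger considers nouns
    while ((token = input.next(reusableToken)) != null) {
      String term = new String(token.termBuffer(), 0, token.termLength());
      if (isNoun(term)) {
        return token;
      }
    }
    return null;
  }

  // Hypothetical helper: look the term up in Wordnet or run it through a
  // rule-based tagger. The placeholder implementation keeps every token.
  private boolean isNoun(String term) {
    return true;
  }
}

Such a filter could then be slotted into the chain in SummaryAnalyzer.tokenStream(), for example between the PorterStemFilter and the StopFilter, so that tagging happens on unstemmed words.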

Heuristics - Modeling Skim

One strategy used by human summarizers is skimming, which involves reading the first paragraph fully and then only the first few sentences of each succeeding paragraph. Our LuceneSummarizer models this using a linear deboost function applied to the sentences of every paragraph after the first.

b = 1.0 - (f * i), if 1.0 - (f * i) >= b0
b = b0, otherwise
where:
b  = computed boost factor
b0 = the floor below which the boost does not drop (${sentenceDeboostBase})
f  = (de-)boost factor (${sentenceDeboost})
i  = position of sentence in paragraph (0-based)
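
For example, with sentenceDeboost = 0.2 and the default sentenceDeboostBase = 0.5 (the values used in the test case further down), the sentences of any paragraph after the first would receive the following boosts:

i = 0: b = 1.0 (the first sentence of a paragraph is never deboosted)
i = 1: b = 1.0 - 0.2 * 1 = 0.8
i = 2: b = 1.0 - 0.2 * 2 = 0.6
i = 3: b = 1.0 - 0.2 * 3 = 0.4, which falls below b0, so b = 0.5
i > 3: b = 0.5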

This results in successive sentences (from the second paragraph onwards) being treated as less and less important, until the boost reaches a floor, at which point subsequent sentences in the paragraph get the same importance. The graph below shows the actual boost values from one of my test cases, and should help in visualizing the behavior of the above function.

Sentence Tokenization

To model the heuristic described above, I needed a paragraph tokenizer and a sentence tokenizer. I first tried using my old SentenceTokenizer based on Java's BreakIterator, but it would not recognize a line break as a sentence boundary, so I modified it to use the RuleBasedBreakIterator (RBBI) with default sentence break rules from the ICU4J project. That worked better, except that it would break over abbreviations such as "Mr." terminated by a period, so I had to put in a rule to suppress the break.

I describe the SentenceTokenizer first. As described here, I generate my default rules by dumping the rules from a sentence instance with RBBI.toString(). However, I had basically gotten lucky when customizing the RBBI for word tokenization - this time around, I really needed to understand the default rules, so looking through the commented sentence rules source file and the Unicode spec upon which it is based was practically mandatory for doing any customization. My rule file, including my $AbbrevWord rule, appears after the short rule-dumping sketch below. Along with my own comments, I pulled in the source file comments as well, so it's fairly easy to understand.
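
For reference, dumping the default rules can be done with a few lines like the following (a sketch of the approach described above - the class name is mine, not from the original code):

// Sketch: dump ICU4J's default sentence break rules to use as a starting
// point for a custom rule file.
package com.mycompany.myapp.tokenizers;

import java.util.Locale;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

public class DumpSentenceRules {
  public static void main(String[] args) {
    // the sentence instance is a RuleBasedBreakIterator under the hood,
    // and its toString() returns the source form of its rules
    RuleBasedBreakIterator rbbi = (RuleBasedBreakIterator)
      BreakIterator.getSentenceInstance(Locale.US);
    System.out.println(rbbi.toString());
  }
}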

# Source: src/main/resources/sentence_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application.

# Character categories as defined in TR29 (see URL above).
$VoiceMarks   = [\uff9e\uff9f];
$Sep       = [\p{Sentence_Break = Sep}];
$Format    = [\p{Sentence_Break = Format}];
$Sp        = [\p{Sentence_Break = Sp}];
$Lower     = [\p{Sentence_Break = Lower}];
$Upper     = [\p{Sentence_Break = Upper}];
$OLetter   = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric   = [\p{Sentence_Break = Numeric}];
$ATerm     = [\p{Sentence_Break = ATerm}];
$STerm     = [\p{Sentence_Break = STerm}];
$Close     = [\p{Sentence_Break = Close}];
$CR         = \u000d;
$LF         = \u000a;
$Extend     = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx       = $Sp      ($Extend | $Format)*;
$LowerEx    = $Lower   ($Extend | $Format)*;
$UpperEx    = $Upper   ($Extend | $Format)*;
$OLetterEx  = $OLetter ($Extend | $Format)*;
$NumericEx  = $Numeric ($Extend | $Format)*;
$ATermEx    = $ATerm   ($Extend | $Format)*;
$STermEx    = $STerm   ($Extend | $Format)*;
$CloseEx    = $Close   ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char 
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | ($UpperEx $LowerEx{2}) \
  | ($UpperEx $LowerEx{3})) $ATerm;

!!chain;

!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if its immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other 
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
  ($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
  ($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing 
# punctuation, trailing spaces and paragraph separator (if present).
($STermEx | $ATermEx) $CloseEx* $SpEx* $Sep? {107};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* . {108};
[[^$STerm $ATerm $Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};
# Rule 12 - otherwise... Don't break

!!reverse;
$SpEx_R       = ($Extend | $Format)* $Sp;
$ATermEx_R    = ($Extend | $Format)* $ATerm;
$STermEx_R    = ($Extend | $Format)* $STerm;
$CloseEx_R    = ($Extend | $Format)* $Close;
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R* \
  ($STermEx_R | $ATermEx_R))*;

!!safe_forward;

!!safe_reverse;

The code for the SentenceTokenizer is almost unchanged; the one difference is that we instantiate the BreakIterator differently. Instead of using a static sentence instance, we instantiate the RBBI from the sentence rule file. The choice of RBBI over Java's BreakIterator is also the reason for the additional init() method on the summarizer, where I instantiate the Paragraph and Sentence tokenizers: they need to compile their rules on startup, and are hence expensive resources to instantiate.

// Source: src/main/java/com/mycompany/myapp/tokenizers/SentenceTokenizer.java
package com.mycompany.myapp.tokenizers;

import java.io.File;
import java.text.BreakIterator;

import org.apache.commons.io.FileUtils;

import com.ibm.icu.text.RuleBasedBreakIterator;

/**
 * Tokenizes the input into sentences. Uses ICU4J's RuleBasedBreakIterator
 * with rule file adapted from a dump of RBBI.sentenceInstance.
 */
public class SentenceTokenizer {

  private String text;
  private int index = 0;
  private RuleBasedBreakIterator breakIterator;
  
  public SentenceTokenizer() throws Exception {
    super();
    this.breakIterator = new RuleBasedBreakIterator(
      FileUtils.readFileToString(
      new File("src/main/resources/sentence_break_rules.txt"), "UTF-8"));
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public String nextSentence() {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String sentence = text.substring(index, end);
    index = end;
    return sentence;
  }
}

Paragraph Tokenization

Neither Java's BreakIterator nor ICU4J's RBBI provides a paragraph iterator, but it is fairly simple (once you understand the rules) to modify the sentence rules so that they no longer break on sentence terminators, only on paragraph separators, and thereby create a set of rules for paragraph tokenization. Here is my file containing the paragraph tokenization rules for RBBI.

# Source: src/main/resources/paragraph_break_rules.txt
# Default sentence break rules augmented with custom rules
# See: http://unicode.org/reports/tr29
# For a description of the variables involved.
# The initial file was generated by instantiating an RBBI.sentenceInstance
# and then RBBI.toString() to dump out the ruleset, then modified to suit
# the application. Specifically, rules 9-11 have been modified to not match
# on STerm (sentence terminator) and ATerm (ambiguous terminator).

# Character categories as defined in TR29 (see URL above).
$VoiceMarks   = [\uff9e\uff9f];
$Sep       = [\p{Sentence_Break = Sep}];
$Format    = [\p{Sentence_Break = Format}];
$Sp        = [\p{Sentence_Break = Sp}];
$Lower     = [\p{Sentence_Break = Lower}];
$Upper     = [\p{Sentence_Break = Upper}];
$OLetter   = [\p{Sentence_Break = OLetter}-$VoiceMarks];
$Numeric   = [\p{Sentence_Break = Numeric}];
$ATerm     = [\p{Sentence_Break = ATerm}];
$STerm     = [\p{Sentence_Break = STerm}];
$Close     = [\p{Sentence_Break = Close}];
$CR         = \u000d;
$LF         = \u000a;
$Extend     = [[:Grapheme_Extend = TRUE:]$VoiceMarks];
# Extended forms of the character classes, incorporate
# trailing Extend or Format chars. Rules 4 and 5
$SpEx       = $Sp      ($Extend | $Format)*;
$LowerEx    = $Lower   ($Extend | $Format)*;
$UpperEx    = $Upper   ($Extend | $Format)*;
$OLetterEx  = $OLetter ($Extend | $Format)*;
$NumericEx  = $Numeric ($Extend | $Format)*;
$ATermEx    = $ATerm   ($Extend | $Format)*;
$STermEx    = $STerm   ($Extend | $Format)*;
$CloseEx    = $Close   ($Extend | $Format)*;
# Abbreviations: words such as Mr. or George W. Bush should not break
# a sentence. An abbreviation is defined as an uppercase alpha char 
# followed by {0,3} lowercase alphas followed by a period
$AbbrevWord = ($UpperEx | ($UpperEx $LowerEx) | \
  ($UpperEx $LowerEx{2}) | ($UpperEx $LowerEx{3})) $ATerm;

!!chain;

!!forward;
# Rule 3 - break after separators. Keep CR/LF together.
$CR $LF {100};
# Rule 4 - Break after paragraph separator $Sep
# Rule 5 - Ignore $Format and $Extend
[^$Sep]? ($Extend | $Format)* {101};
# Rule 6 - Don't break after ambiguous terminator if its immediately
# followed by a number or lowercase letter.
# Replaced this condition to include optional space between the ambiguous
# terminator and number.
#$ATermEx $NumericEx {102};
$ATermEx ($SpEx)? $NumericEx {102};
# Rule 7 - Don't break if ambiguous terminator $ATerm is between two
# uppercase letters
$UpperEx $ATermEx $UpperEx {103};
# Rule 8 - Don't break if ambiguous terminator followed by other 
# continuation punctuation such as comma, colon, semicolon, etc.
$NotLettersEx = [^$OLetter $Upper $Lower $Sep $ATerm $STerm] \
  ($Extend | $Format)* {104};
$ATermEx $CloseEx* $SpEx* $NotLettersEx* $Lower {105};
# Rule added for recognizing $AbbrevWord
$AbbrevWord $CloseEx* $SpEx* $NotLettersEx* \
  ($Lower | $Upper | $NumericEx) {110};
# Rule 8a
($STermEx | $ATermEx) $CloseEx* $SpEx* ($STermEx | $ATermEx) {106};
# Rule 9, 10, 11 - Break after sentence terminator, but include closing 
# punctuation, trailing spaces and paragraph separator (if present).
# NOTE: the above sentence rules have been modified for use for paragraph
# tokenization. Specifically, $STerm (sentence terminator) and $ATerm
# (ambiguous terminator) have been removed from the match for figuring
# out paragraph breaks.
$CloseEx* $SpEx* $Sep? {107};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* . {108};
[[^$Close $Sp $Sep $Format $Extend]{bof}] \
  ($Extend | $Format | $Close | $Sp)* ([$Sep{eof}] | $CR $LF) {109};

!!reverse;
$SpEx_R       = ($Extend | $Format)* $Sp;
$CloseEx_R    = ($Extend | $Format)* $Close;
# See modification in !!forward rule to make the sentence rules a bit
# more lenient to allow paragraph tokenization.
[{bof}] (.? | $LF $CR) [^$Sep]* [$Sep {eof}] ($SpEx_R* $CloseEx_R*)*;

!!safe_forward;

!!safe_reverse;

And here is the code for the ParagraphTokenizer. It is quite similar to the SentenceTokenizer; the only real differences are that it is created from a different rule file and that it exposes a nextParagraph() method instead of a nextSentence() method.

// Source: src/main/java/com/mycompany/myapp/tokenizers/ParagraphTokenizer.java
package com.mycompany.myapp.tokenizers;

import java.io.File;
import java.text.BreakIterator;

import org.apache.commons.io.FileUtils;

import com.ibm.icu.text.RuleBasedBreakIterator;

/**
 * Tokenizes the input into paragraphs, using ICU4J's rule-based break
 * iterator. Rule file is adapted from the rule file used internally by
 * the RBBI sentence tokenizer.
 */
public class ParagraphTokenizer {

  private String text;
  private int index = 0;
  private RuleBasedBreakIterator breakIterator;
  
  public ParagraphTokenizer() throws Exception {
    super();
    this.breakIterator = new RuleBasedBreakIterator(
      FileUtils.readFileToString(
      new File("src/main/resources/paragraph_break_rules.txt"), "UTF-8"));
  }
  
  public void setText(String text) {
    this.text = text;
    this.breakIterator.setText(text);
    this.index = 0;
  }
  
  public String nextParagraph() {
    int end = breakIterator.next();
    if (end == BreakIterator.DONE) {
      return null;
    }
    String paragraph = text.substring(index, end);
    index = end;
    return paragraph;
  }
}

Some sample results

My test case has code to run the C4J summarizer as well as my LuceneSummarizer. I also ran OTS on the command line and pulled out the first 2 sentences of its summary. Here is the code for the JUnit test. It tries to summarize, using C4J and then the LuceneSummarizer, three files from different Internet sources - I have manually cut and pasted the text content of these pages into local files, so I don't have to worry about HTML parsing for now.

// Source: src/test/java/com/mycompany/myapp/summarizers/SummarizerTest.java
package com.mycompany.myapp.summarizers;

import java.io.File;

import net.sf.classifier4J.summariser.ISummariser;
import net.sf.classifier4J.summariser.SimpleSummariser;

import org.apache.commons.io.FileUtils;
import org.junit.Test;

public class SummarizerTest {

  private static String[] TEST_FILES = {
    "src/test/resources/data/sujitpal-actors.txt",
    "src/test/resources/data/nytimes-obama.txt",
    "src/test/resources/data/resample-nb.txt"
  };
  
  @Test
  public void testC4JSummarizer() throws Exception {
    for (String testFile : TEST_FILES) {
      String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
      ISummariser summarizer = new SimpleSummariser();
      System.out.println("Input: " + testFile);
      String summary = summarizer.summarise(text, 2);
      // replace newlines with ellipses
      summary = summary.replaceAll("\n+", "...");
      System.out.println(">>> Summary (from C4J): " + summary);
    }
  }

  @Test
  public void testLuceneSummarizer() throws Exception {
    for (String testFile : TEST_FILES) {
      String text = FileUtils.readFileToString(new File(testFile), "UTF-8");
      LuceneSummarizer summarizer = new LuceneSummarizer();
      summarizer.setAnalyzer(new SummaryAnalyzer());
      summarizer.setNumSentences(2);
      summarizer.setTopTermCutoff(0.5F);
      summarizer.setSentenceDeboost(0.2F);
      summarizer.init();
      System.out.println("Input: " + testFile);
      String summary = summarizer.summarize(text);
      System.out.println(
        ">>> Summary (from LuceneSummarizer): " + summary);
    }
  }
}

For testing, I used the text (content cut-n-pasted from the web page to local disk) of three web pages. Results are shown below.

For each test document, the original source is listed first, followed by the summaries produced by OTS, C4J and the LuceneSummarizer.

Original: One of my own posts
OTS: Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. Taking the println() calls out from both the Scala and the Jetlang examples improved the performance of both significantly, and the Jetlang example ended up with lower elapsed time numbers than the Scala examples. ...
C4J: Over the past few weeks, I've been looking at various (Java and Scala based) Actor frameworks. ...Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala.
LuceneSummarizer: Mike was kind enough to take a look at the Jetlang code, and he suggested that the excessive amounts of console IO that the actors were making were causing it to perform worse than Scala. ... This week, I provide the updated code for Kilim and Jetlang, and code to work with Actor's Guild and ActorFoundry, and provide the elapsed time comparison between these frameworks (as well as the Scala examples from last week).

Original: News Page from New York Times
OTS: WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. One such bad habit has been “the casual dishonesty of hiding irresponsible spending,” Mr. Obama said, citing the Bush administration’s technique of “budgeting zero dollars for the Iraq war — zero — for future years, even when we knew the war would continue.” ...
C4J: WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ...The president said that the bank-rescue plan and the broader economic stimulus program are necessary not merely to jolt the economy but because the country has “long-term challenges — health care, energy, education and others — that we can no longer afford to ignore.”...“But I want to be very clear,” he said at a White House economic conference involving legislators, business and labor leaders and others.
LuceneSummarizer: WASHINGTON — Despite the huge sums the federal government is spending to aid banks and stimulate the economy, President Obama said on Monday that his administration will slash the federal budget deficit, in part by ending the “casual dishonesty” that has accompanied Washington budgets of late. ... The deficit is the year-by-year gap between what the federal government spends and the revenue it takes in.

Original: Page Explaining Naive Bayes
OTS: For classification, we want to determine P (H|X) -- the probability that the hypothesis H holds, given the observed data record X. For example, the probability that a fruit is an apple, given the condition that it is red and round.
C4J: Suppose your data consist of fruits, described by their color and shape. However, bias in estimating probabilities often may not make a difference in practice -- it is the order of the probabilities, not their exact values, that determine the classifications.
LuceneSummarizer: P (H|X) is the posterior probability of H conditioned on X. For example, the probability that a fruit is an apple, given the condition that it is red and round. ... P(X) is the prior probability of X, i.e., it is the probability that a data record from our set of fruits is red and round.

Summarizing HTML pages

So far, I have been deliberately dealing with plain text pages in order to keep things simple. However, the use case that prompted this work has to do with generating summaries for HTML pages. In the past, I have built an HTML parser using Jericho as part of another simple-minded summarizer that pulled web-page metadata to generate summaries, and I plan to extend and reuse that parser here.

Since I am now going to parse the body, I would like to skip over at least the H[1-6], SCRIPT, STYLE and TABLE tags in order to keep the summaries clean. Jericho allows you to build a TextExtractor object from your Source, which can be configured with various tags to exclude from the body, something like this:

    TextExtractor extractor = new TextExtractor(source) {
      public boolean excludeElement(StartTag startTag) {
        return 
          startTag.getName() == HTMLElementName.HEAD || 
          startTag.getName() == HTMLElementName.H1 || 
          startTag.getName() == HTMLElementName.H2 ||
          startTag.getName() == HTMLElementName.H3 ||
          startTag.getName() == HTMLElementName.H4 ||
          startTag.getName() == HTMLElementName.H5 ||
          startTag.getName() == HTMLElementName.H6 ||
          startTag.getName() == HTMLElementName.SCRIPT ||
          startTag.getName() == HTMLElementName.STYLE ||
          startTag.getName() == HTMLElementName.TABLE;
      }
    };
    extractor.setConvertNonBreakingSpaces(false);
    extractor.setExcludeNonHTMLElements(true);
    extractor.setIncludeAttributes(false);
    String text = extractor.toString();
    ...
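
Once the extractor has produced plain text, it can be fed to the summarizer exactly as the JUnit test does with the local files. A sketch of the glue code (the extractor variable is the one built in the snippet above; the settings mirror the test case):

    // sketch: summarize the text extracted by Jericho above
    String text = extractor.toString();
    LuceneSummarizer summarizer = new LuceneSummarizer();
    summarizer.setAnalyzer(new SummaryAnalyzer());
    summarizer.setNumSentences(2);
    summarizer.setTopTermCutoff(0.5F);
    summarizer.setSentenceDeboost(0.2F);
    summarizer.init();
    String summary = summarizer.summarize(text);
    ...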

Conclusion

My Lucene Summarizer seems to work decently, but I personally like the results from OTS and C4J better. However, this may be somewhat subjective, and could be the result of my own bias. I think the theory behind the application is sound, but perhaps I need to either play with the tuning parameters a bit more, or add extra heuristics to make the results better. If you have done similar stuff before, I would love to hear suggestions about how this can be improved.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

Saturday, February 21, 2009

Python SQL Runner

Recently, I have been working quite a bit with our taxonomy database, using the data in it as input to some data mining programs I am working on. Initially, I built the database calls into the programs themselves, but this approach turned out to be inconvenient for a number of reasons. One of these reasons is that the program is harder to restart in case of failures. Having these programs work with input files instead of a database query also results in a cleaner and more extensible design, since they can then work with inputs other than SQL.

The databases I work with are MySQL and Oracle. In the past, when I needed this sort of thing, I would whip up a Python script on demand using either the MySQLdb or cx_Oracle packages. One crashed hard disk and a couple of OS re-installs later, these scripts no longer work because the packages need to be re-installed as well. I have mostly gotten by since the crash using the MyEclipse Database Explorer and cut-n-paste, since the datasets I was working with were smaller, but the size of the data I am dealing with now (100K plus rows) is likely to cause Eclipse to clutch its throat and do a crash-n-burn, taking my other unsaved code along with it, so I needed something that I could use from the command line.

This time around, I planned to do things a bit smarter than I did in the past, so I thought of building a generic script that would dump out rows given any SQL, something similar to my fledgling attempt many, many years ago. Funny how things come full circle from time to time. I also wanted to make sure its usage survives disk crashes and OS re-installs, so rather than have a script that requires extra dependencies, I wanted to use the JDBC drivers that I was going to install anyway. So my choices boiled down to either Jython or Scala.

I ended up writing it using Jython because my other scripts are all written in Python, and because writing database code in Python is slightly less verbose than in Java (or Scala, since that would have been the same thing). The name of the script is db2csv.py, and the code is shown below:

#!/usr/bin/env /opt/jython-2.2.1/jython
import sys
import getopt
import os.path
import traceback
from com.ziclix.python.sql import zxJDBC

class SqlRunner:
  """
  Class to run the SQL and print it out into the output file as a CSV
  """
  def __init__(self, dbconfig, sqlfile, outputfile, sepchar):
    """
    @param dbconfig the database configuration file name
    @param sqlfile the name of the file containing the SQL to be executed
    @param outputfile the name of the file where output will be written
    @param sepchar the separator character (or string) to use
    """
    self.dbconfig = dbconfig
    self.sqlfile = sqlfile
    self.outputfile = outputfile
    self.sepchar = sepchar

  def getDbProps(self):
    """
    Return the database properties as a map of name value pairs.
    @return a map of name value pairs
    """
    if (not os.path.exists(self.dbconfig)):
      raise Exception("File not found: %s" % (self.dbconfig))
    props = {}
    pfile = open(self.dbconfig, 'rb')
    for pline in pfile:
      (name, value) = pline[:-1].split("=")
      props[name] = value
    pfile.close()
    return props

  def getSql(self):
    """
    Validate the sql file name, and parse out the SQL to be run. The
    method will skip SQL single-line comments. Blocks enclosed by free
    standing multi-line comments are also skipped.
    @return the SQL as a string
    """
    if (not os.path.exists(self.sqlfile)):
      raise Exception("File not found: %s" % (self.sqlfile))
    sql = []
    sfile = open(self.sqlfile, 'rb')
    incomment = False
    for sline in sfile:
      sline = sline.rstrip('\n')
      if (sline.startswith("--") or len(sline.rstrip()) == 0):
        # SQL Comment line, skip
        continue
      if (sline.rstrip() == "/*"):
        # start of SQL comment block
        incomment = True
        continue
      if (sline.rstrip() == "*/"):
        # end of SQL comment block
        incomment = False
        continue
      if (not incomment):
        sql.append(sline)
    sfile.close()
    return " ".join(sql)

  def runSql(self):
    """
    Runs the SQL and prints it out into the specified output file as a CSV
    file delimited by sepchar.
    """
    props = self.getDbProps()
    sql = self.getSql()
    print "Running SQL: %s" % (sql)
    ofile = open(self.outputfile, 'wb')
    db = zxJDBC.connect(props["url"], props["user"], props["password"],
      props["driver"])
    cur = db.cursor(True)
    cur.execute(sql)
    # print the header
    meta = cur.description
    print "Writing output to: %s" % (self.outputfile)
    ofile.write(self.sepchar.join(map(lambda x: x[0], meta)) + "\n")
    for row in cur.fetchall():
      strrow = map(lambda x: str(x), row)
      ofile.write(self.sepchar.join(strrow) + "\n")
    ofile.close()
    cur.close()
    db.close()
    
def usage(error=""):
  """
  Print the usage information. If an error message is supplied, print that
  on top of the usage information.
  """
  if (len(str(error)) > 0):
    print "ERROR: %s" % (error)
    print "STACK TRACE:"
    traceback.print_exc()
  print "USAGE:"
  print "%s -d dbconfig -q queryfile -s sepchar -o outputfile" % (sys.argv[0])
  print "OR: %s -h" % (sys.argv[0])
  print "OPTIONS:"
  print "--dbconfig | -d  : database configuration file"
  print "  configuration file must be in properties format, with the following"
  print "  keys defined: driver, url, user and password"
  print "--queryfile | -q : name of file containing SQL to be run"
  print "--outputfile | -o: name of file where results should be written"
  print "--sep | -s       : the separator character to use in output"
  print "--help | -h      : print this information"
  sys.exit(2)

def extractOptions(argv):
  """
  Extract command line options and return a tuple
  @param argv the sys.argv object
  @return a tuple containing the information for running the SQL
  """
  try:
    (opts, args) = getopt.getopt(argv[1:], "d:q:s:o:h",
      ["dbconfig=", "queryfile=", "sep=", "outputfile=", "help"])
  except getopt.GetoptError:
    usage()
  if (len(filter(lambda x: x[0] in ("-h", "--help"), opts)) == 1):
    usage()
  if (len(opts) != 4):
    usage()
  for opt in opts:
    (key, value) = opt
    if (key in ("-d", "--dbconfig")):
      dbconfig = value
    elif (key in ("-q", "--queryfile")):
      sqlfile = value
    elif (key in ("-o", "--outputfile")):
      outputfile = value
    elif (key in ("-s", "--sep")):
      sepchar = value
    else:
      usage()
  return (dbconfig, sqlfile, outputfile, sepchar)
  
def main():
  """
  This is how we are called
  """
  (dbconfig, sqlfile, outputfile, sepchar) = extractOptions(sys.argv)
  sqlrunner = SqlRunner(dbconfig, sqlfile, outputfile, sepchar)
  try:
    sqlrunner.runSql()
  except Exception, e:
    usage(e)

if __name__ == "__main__":
  main()

As you can see, there is nothing really new here; you can get most of this from the relevant section of the Jython User Guide. I did start to use Python lambdas a bit, which, ironically, I learned about on my recent foray into Scala-land.

In the spirit of not having to install anything but the base language, I use the zxJDBC package, which comes standard with the Jython version (2.2.1) that I am using. Since I already use Java, I will have it installed anyway, along with the JDBC drivers for the different databases that I will talk to.

You can call the script using command line parameters as shown in the help output below:

sujit@sirocco:~$ ./db2csv.py -h
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d  : database configuration file
  configuration file must be in properties format, with the following
  keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s       : the separator character to use in output
--help | -h      : print this information
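
For reference, the dbconfig file is a plain name=value properties file with the four keys listed above. A hypothetical MySQL example (all values are placeholders):

# hypothetical dbconfig example - replace the values with your own
driver=com.mysql.jdbc.Driver
url=jdbc:mysql://localhost:3306/mydb
user=scott
password=tiger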

I had described in a previous post how one can add JAR files to the Jython classpath by appending their paths to sys.path in the code, but that approach didn't work for me here, perhaps because the driver class is being loaded explicitly by the code. So I fell back to loading it from the Java CLASSPATH instead. It's not such a bad thing, though - with this approach, the script can be used as-is for any database, as long as the JDBC driver is on the CLASSPATH when the script is invoked. And if the driver is not on the CLASSPATH, the script tells you:

sujit@sirocco:~$ ./db2csv.py -d /path/to/config/file -q /path/to/sql/file.sql \
  -o /path/to/report/file -s "|"
Running SQL: select foo from bar where baz = 'baz'
ERROR: driver [com.mysql.jdbc.Driver] not found
STACK TRACE:
Traceback (most recent call last):
  File "./db2csv.py", line 156, in main
    sqlrunner.runSql()
  File "./db2csv.py", line 83, in runSql
    db = zxJDBC.connect(props["url"], props["user"], props["password"],
DatabaseError: driver [com.mysql.jdbc.Driver] not found
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d  : database configuration file
  configuration file must be in properties format, with the following
  keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s       : the separator character to use in output
--help | -h      : print this information

I also had to increase the JVM heap size because Jython was running out of heap space when running the queries. I did this directly in the Jython startup script, by adding -Xmx2048m to the java call.

In spite of the simplicity of the script, I am already finding it immensely useful. If you have reached here looking for a similar script, I hope you will find it useful as well.