Saturday, July 30, 2011

Lucene: A Token Concatenating TokenFilter

In my previous post, I described my Neo4j/Lucene combo service that allows you to look up concepts in the Neo4j graph database by name or ID. The lookup happens using Lucene. My plan is to build an entity-recognition system (with an interface similar to OpenCalais - you pass in a body of text, and the service returns a list of entities in the text that match a given vocabulary).

I hadn't thought this through completely before, but for the name lookup I need an exact match, just like the ID lookup. For example, if my text contains "Heart Attack", I want to recognize the entity "Heart Attack", not "Heart Attack Prevention". So my initial approach of stuffing all the names into the same Lucene record as a multi-valued field had to change.

Based on the advice found in this solr-user mailing list discussion, I changed the code to write each name and synonym in a separate Lucene record, changed the analysis to use omitNorms(), and introduced an un-analyzed field for exact match boosting.

On the query side, I changed the query (analyzed with the analyzer chain described in my previous post) to add an optional exact-match clause against the un-analyzed syn_s field, to boost the record whose name matches the query exactly. Here is the updated code for both methods.

  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    Set<String> syns = new HashSet<String>();
    syns.add(concept.getPname());
    syns.add(concept.getQname());
    syns.addAll(concept.getSynonyms());
    for (String syn : syns) {
      Document doc = new Document();
      doc.add(new Field("oid", String.valueOf(concept.getOid()), 
        Store.YES, Index.ANALYZED));
      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED_NO_NORMS));
      doc.add(new Field("syn_s", StringUtils.lowerCase(syn), Store.YES, 
       Index.NOT_ANALYZED));
      doc.add(new Field("nid", String.valueOf(nid), 
        Store.YES, Index.NO));
      writer.addDocument(doc);
    }
    writer.commit();
  }

  ...

  public List<Long> getNids(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
    Query q = parser.parse("+syn:\"" + name + 
        "\" syn_s:\"" + StringUtils.lowerCase(name) + "\"^100");
    ScoreDoc[] hits = searcher.search(q, Integer.MAX_VALUE).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }

Essentially, the query changes from syn:"foo" to +syn:"foo" syn_s:"foo"^100. This way, the first clause ensures that records with "foo" in the name are matched, and the second clause boosts the record whose name is exactly "foo" to the top of the results. Of course, this is just part of the solution - since my client expects to see exact matches (and there can be more than one exact match), getNids() needs some post-processing code to remove the non-exact matches.
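
That post-processing could look something like the hedged sketch below (not in the original post; the filterExactMatches() method is hypothetical). It re-reads the stored syn_s value for each hit and keeps only those whose lowercased name equals the lowercased query:

  // Hypothetical post-processing sketch (not part of the original code):
  // keep only hits whose stored syn_s value matches the query exactly.
  private List<Long> filterExactMatches(ScoreDoc[] hits, String name) 
      throws IOException {
    List<Long> nodeIds = new ArrayList<Long>();
    String target = StringUtils.lowerCase(name);
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      if (target.equals(doc.get("syn_s"))) {
        nodeIds.add(Long.valueOf(doc.get("nid")));
      }
    }
    return nodeIds;
  }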

Of course, as Chris Hostetter points out in the discussion, if you want exact matches, then you shouldn't tokenize in the first place. However, in my case, I do need to tokenize as part of my normalization process, which injects synonyms via dictionary and pattern replacement, removes stopwords, and selectively stems the tokens. The solution for me, then, is to join the tokens back into a single term before writing it out to the index.

This post describes a TokenFilter that I wrote to put at the end of my Tokenizer/TokenFilter chain, which takes the tokens produced by upstream tokenizers and creates a set of phrase tokens out of them.

My code is based on a similar component written by Robert Gründler. It differs in that it uses a slightly newer Lucene API (in the trunk version, the Token next() method is no longer available for overriding), and it generates multiple phrase tokens when synonyms are encountered in the token stream (identified by a position increment of 0). Here is the code:

// Source: src/main/java/com/mycompany/tgni/lucene/TokenConcatenatingTokenFilter.java

package com.mycompany.tgni.lucene;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public class TokenConcatenatingTokenFilter extends TokenFilter {

  private CharTermAttribute termAttr;
  private PositionIncrementAttribute posIncAttr;
  
  private AttributeSource.State current;
  private LinkedList<List<String>> words;
  private LinkedList<String> phrases;

  private boolean concat = false;
  
  protected TokenConcatenatingTokenFilter(TokenStream input) {
    super(input);
    this.termAttr = addAttribute(CharTermAttribute.class);
    this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
    this.words = new LinkedList<List<String>>();
    this.phrases = new LinkedList<String>();
  }

  @Override
  public boolean incrementToken() throws IOException {
    // buffer the entire upstream token stream first; tokens with a position
    // increment of 0 (synonyms) are grouped with the preceding token
    while (input.incrementToken()) {
      String term = new String(termAttr.buffer(), 0, termAttr.length());
      List<String> word = posIncAttr.getPositionIncrement() > 0 ?
        new ArrayList<String>() : words.removeLast();
      word.add(term);
      words.add(word);
    }
    // now write out as a single token
    if (! concat) {
      makePhrases(words, phrases, 0);
      concat = true;
    }
    while (phrases.size() > 0) {
      String phrase = phrases.removeFirst();
      restoreState(current);
      clearAttributes();
      termAttr.copyBuffer(phrase.toCharArray(), 0, phrase.length());
      termAttr.setLength(phrase.length());
      current = captureState();
      return true;
    }
    concat = false;
    return false;
  }
  
  // recursively builds the cross-product of the buffered word groups, e.g.
  // [[quick, fast], [fox]] yields ["quick fox", "fast fox"]
  private void makePhrases(List<List<String>> words, 
      List<String> phrases, int currPos) {
    if (currPos == words.size()) {
      return;
    }
    if (phrases.size() == 0) {
      phrases.addAll(words.get(currPos));
    } else {
      List<String> newPhrases = new ArrayList<String>();
      for (String phrase : phrases) {
        for (String word : words.get(currPos)) {
          newPhrases.add(StringUtils.join(new String[] {phrase, word}, " "));
        }
      }
      phrases.clear();
      phrases.addAll(newPhrases);
    }
    makePhrases(words, phrases, currPos + 1);
  }
}
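
To see what the filter emits, here is a minimal usage sketch (not from the original post). It drives the filter with a plain WhitespaceTokenizer instead of the real UIMA-based chain shown below, and it assumes the code lives in the same package as the filter, since the constructor above is declared protected:

  // Minimal sketch, not part of the original code: feed a WhitespaceTokenizer
  // through the filter and print the emitted phrase tokens.
  public void printPhrases(String text) throws IOException {
    TokenStream stream = new TokenConcatenatingTokenFilter(
      new WhitespaceTokenizer(Version.LUCENE_40, new StringReader(text)));
    CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // for "myocardial infarction" this prints the single phrase token
      // "myocardial infarction"; synonyms injected upstream (position
      // increment 0) would yield one phrase per combination
      System.out.println(termAttr.toString());
    }
    stream.close();
  }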

I just stick it at the end of the analyzer chain I've been using and rebuild my index. My revised analyzer chain looks like this (see the last line in the tokenStream() method).

public class QueryMappingAnalyzer extends Analyzer {

  private String aeDescriptor;
  private Set<?> stopset;
  
  public QueryMappingAnalyzer(String stopwordsFile, String aeDescriptor) 
      throws IOException {
    this.stopset = StopFilter.makeStopSet(Version.LUCENE_40, 
      new File(stopwordsFile));
    this.aeDescriptor = aeDescriptor;
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    SynonymMap synonymMap = new SynonymMap();
    TokenStream input = new UimaAETokenizer(reader, aeDescriptor, 
      null, synonymMap);
    input = new SynonymFilter(input, synonymMap);
    input = new LowerCaseFilter(Version.LUCENE_40, input, true);
    input = new StopFilter(Version.LUCENE_40, input, stopset, false);
    input = new PorterStemFilter(input);
    // concatenate tokens produced by upstream analysis into phrase token
    input = new TokenConcatenatingTokenFilter(input);
    return input;
  }
}

Using this updated analyzer chain, I am able to revert to my previous indexing structure of storing synonyms together in one record (since each phrase or phrase synonym is stored as a complete unit), and revert my query to just use the single syn field, like so:

  public void addNode(TConcept concept, Long nid) 
      throws IOException {
    Set<String> syns = new HashSet<String>();
    syns.add(concept.getPname());
    syns.add(concept.getQname());
    syns.addAll(concept.getSynonyms());
    Document doc = new Document();
    doc.add(new Field("oid", String.valueOf(concept.getOid()),
      Store.YES, Index.ANALYZED));
    for (String syn : syns) {
      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED));
    }
    doc.add(new Field("nid", String.valueOf(nid), Store.YES, Index.NO));
    writer.addDocument(doc);
//    for (String syn : syns) {
//      Document doc = new Document();
//      doc.add(new Field("oid", String.valueOf(concept.getOid()), 
//        Store.YES, Index.ANALYZED));
//      doc.add(new Field("syn", syn, Store.YES, Index.ANALYZED_NO_NORMS));
//      doc.add(new Field("syn_s", StringUtils.lowerCase(syn), Store.YES, Index.NOT_ANALYZED));
//      doc.add(new Field("nid", String.valueOf(nid), 
//        Store.YES, Index.NO));
//      writer.addDocument(doc);
//    }
    writer.commit();
  }

  ...

  public List<Long> getNids(String name) throws Exception {
    QueryParser parser = new QueryParser(Version.LUCENE_40, null, analyzer);
//    Query q = parser.parse("+syn:\"" + name + 
//        "\" syn_s:\"" + StringUtils.lowerCase(name) + "\"^100");
    Query q = parser.parse("syn:\"" + name + "\"");
    ScoreDoc[] hits = searcher.search(q, Integer.MAX_VALUE).scoreDocs;
    List<Long> nodeIds = new ArrayList<Long>();
    for (int i = 0; i < hits.length; i++) {
      Document doc = searcher.doc(hits[i].doc);
      nodeIds.add(Long.valueOf(doc.get("nid")));
    }
    return nodeIds;
  }

Of course, this does bring up the question of whether I really need to use Lucene for lookup at all, since my requirements seem to be exact-match lookup by both OID and name. It seems to me that an embedded database such as HSQLDB or SQLite may be a better choice. I would still use the Lucene analysis API to do the normalization before writing to the database and before reading from it, but the data would not be stored in a Lucene index. I have to think this through a bit before changing the lookup mechanism, because I want to make sure that I will never need Lucene's search capabilities in this project.
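
If I do go down that route, the lookup might look something like the sketch below (hypothetical; the concept_syns table, the getNidsFromDb() method, and the JDBC wiring are not part of any existing code). The name is normalized through the same analyzer chain, and each resulting phrase token is looked up with an exact SQL match:

  // Hypothetical sketch of the embedded-database alternative: normalize the
  // name with the analyzer chain, then look up each normalized phrase with
  // an exact match against a plain (syn, nid) table over JDBC.
  public List<Long> getNidsFromDb(Connection conn, String name) 
      throws Exception {
    List<String> phrases = new ArrayList<String>();
    TokenStream stream = analyzer.tokenStream("syn", new StringReader(name));
    CharTermAttribute termAttr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      phrases.add(termAttr.toString());
    }
    stream.close();
    List<Long> nodeIds = new ArrayList<Long>();
    PreparedStatement ps = conn.prepareStatement(
      "select nid from concept_syns where syn = ?");
    for (String phrase : phrases) {
      ps.setString(1, phrase);
      ResultSet rs = ps.executeQuery();
      while (rs.next()) {
        nodeIds.add(rs.getLong(1));
      }
      rs.close();
    }
    ps.close();
    return nodeIds;
  }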

6 comments (moderated to prevent spam):

shoonya said...

Hi Sujit,
I have been following you for a long time.
I have a query: we can keep multiple tokens for a single word in the same field, like the field MappedFSN:Quick,
and we have tokens for Quick such as Faster, Fast, Fasting and so on.
So can we have numeric tokens like this in one field?
Like CATEGORYID:1234
and tokens 324, 34345, 5467, 34329 etc.
Regards
Prashant

Sujit Pal said...

Thanks Prashant, and yes, that would work as well. The idea here is to accumulate all possible combinations of the terms seen so far, along with synonyms added to some of the terms in the phrase. Assume that our phrase is "quick fox" and our dictionary has {quick => fast, faster}. The synonym tokenizer in the analyzer chain will produce the terms {quick, fast, faster} fox, and this TokenFilter will emit the phrase tokens {quick fox, fast fox, faster fox}. This should be possible for numeric tokens also.

Anonymous said...

Hi Sujit,
I came across your blog when trying to learn how to write my own custom filter. I feel that I have a good grasp on how to start now, but I have a quick question regarding this piece of code you posted:

while (phrases.size() > 0) {
String phrase = phrases.removeFirst();
restoreState(current);
clearAttributes();
termAttr.copyBuffer(phrase.toCharArray(), 0, phrase.length());
termAttr.setLength(phrase.length());
current = captureState();
return true;
}

I was wondering why you return true within the while loop. If the incrementToken function returns, wouldn't you lose all the data stored in phrases?

Thanks,

Alex

Sujit Pal said...

Hi Alex, the phrases data structure is global, so it retains state across calls to incrementToken(). The idea is that the caller (Analyzer) will instantiate the Tokenizer, then call analyze() to pass in the data to be tokenized. The analyze() method calls incrementToken() in a loop and expects tokens back. So rather than changing upstream code, we make the splitting and buffering happen within the Tokenizer, and the Tokenizer is responsible for maintaining state (i.e., it needs to fill and drain phrases).

Anonymous said...

Hi Sujit,

Thanks a lot for posting this! I also wanted to point out that changing

concat = false;
return false;

to

concat = false;
phrases.clear();
words.clear();
return false;

would make it work with multi-valued fields. Thank you once again!

Sujit Pal said...

You are welcome, and thank you for the code.