Friday, October 14, 2011

Lucene Wildcard Query and Permuterm Index

At a project I am working on, I needed to allow wildcard queries against the titles of documents stored in a Lucene index. The behavior expected is similar to a database LIKE query. My first thought (and current implementation) was to delegate the lookup to a database using a custom component, then using the results as the basis for a secondary search against Lucene. This approach has some limitations, though, so I decided to see if I could support this natively within Lucene.

Its not like Lucene doesn't provide ways to do this - there is the PrefixQuery which supports wildcards at the end, and the WildcardQuery which supports wildcards (both "*" and "?") at the beginning, middle, and end. However, the performance of a wildcard at the beginning is bad enough that it comes with a warning in the Javadocs. The way both of these work is by enumerating through the terms starting at the prefix, creating a massive BooleanQuery with all matching terms, wrapping it in a ConstantScoreQuery and executing it. So if you have a leading wildcard, the enumeration has to happen across all the terms in the index.

I needed a way to efficiently support the "*" wildcard in the beginning, middle and end of a string, and using the WildcardQuery with its performance caveat was not an option. Looking around, I discovered this rather old (circa 2006) thread on MarkMail that discussed various possibilities. One of these that that felt like it would suit my requirements was the use of the Permuterm index, which is basically an index of your terms and all its permutations.

In this post, I am going to describe my implementation (including code) of wildcard query using the permuterm index approach. Obviously the idea is not new, but a working implementation may help other people with similar problems, as well as suggest ideas for improvements, so here goes...

Permuterm Generating Token Filter

First, you need to create a permuterm index for the terms in the fields you want to query using wildcards. In my case it was only the title field, so I built a TokenFilter that placed the original term (a word in the title), as well as all its permuters at the same virtual location in the index. Here is the code for the TokenFilter.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
// Source: src/main/java/com/mycompany/permuterm/PermutermTokenFilter.java
package com.mycompany.permuterm;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * TokenFilter to inject permuterms.
 */
public class PermutermTokenFilter extends TokenFilter {

  private CharTermAttribute termAttr;
  private PositionIncrementAttribute posIncAttr;
  private AttributeSource.State current;
  private TokenStream input;
  private Stack<char[]> permuterms;
  
  protected PermutermTokenFilter(TokenStream input) {
    super(input);
    this.termAttr = addAttribute(CharTermAttribute.class);
    this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
    this.input = input;
    this.permuterms = new Stack<char[]>();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (permuterms.size() > 0) {
      char[] permuterm = permuterms.pop();
      restoreState(current);
      termAttr.copyBuffer(permuterm, 0, permuterm.length);
      posIncAttr.setPositionIncrement(0);
      return true;
    }
    if (! input.incrementToken()) {
      return false;
    }
    if (addPermuterms()) {
      current = captureState();
    }
    return true;
  }

  private boolean addPermuterms() {
    char[] buf = termAttr.buffer();
    char[] obuf = new char[termAttr.length() + 1];
    for (int i = 0; i < obuf.length - 1; i++) {
      obuf[i] = buf[i];
    }
    obuf[obuf.length-1] = '$';
    for (int i = 0; i < obuf.length; i++) {
      char[] permuterm = getPermutermAt(obuf, i);
      permuterms.push(permuterm);
    }
    return true;
  }

  private char[] getPermutermAt(char[] obuf, int pos) {
    char[] pbuf = new char[obuf.length];
    int curr = pos;
    for (int i = 0; i < pbuf.length; i++) {
      pbuf[i] = obuf[curr];
      curr++;
      if (curr == obuf.length) {
        curr = 0;
      }
    }
    return pbuf;
  }

}

The token filter takes each term in the incoming token stream, permute its characters to produce all possible permutations of the characters, and places them at the same (virtual) position as the original term. Using this, a query with a wildcard at the beginning or middle can be converted to an equivalent query with a wildcard at the end. So for example, the term "cataract" is permuted to the following terms:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
cataract
$cataract
t$catarac
ct$catara
act$catar
ract$cata
aract$cat
taract$ca
ataract$c
cataract$

Building the Permuterm Index

The token filter can be invoked and terms loaded into an index using code such as the following. I had an old index lying around, so I just used the titles in it to create a new index with the titles analyzed using a chain containing the permuterm generating token filter described above.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
  @Test
  public void testLoadTitles() throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream tokens = new WhitespaceTokenizer(Version.LUCENE_31, reader);
        tokens = new LowerCaseFilter(Version.LUCENE_31, tokens);
        tokens = new PermutermTokenFilter(tokens);
        return tokens;
      }
    };
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/old/index")));
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_31, analyzer);
    iwconf.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("/path/to/my/new/index")), iwconf);
    int numdocs = reader.maxDoc();
    for (int i = 0; i < numdocs; i++) {
      Document idoc = reader.document(i);
      Document odoc = new Document();
      String title = idoc.get("title");
      writer.addDocument(odoc);
    }
    reader.close();
    writer.commit();
    writer.optimize();
    writer.close();
  }

Since this is a test index it doesn't contain anything outside what I need. In a real life situation, you could invoke the custom analyzer using a PerFieldAnalyzerWrapper or define it in your fieldtype definition in Solr.

Wildcard Query using Permuterm Index

Once you have your index, a query with a wildcard at the beginning or middle can be rewritten to a Prefix query (ie one with a wildcard at the end). For example, a query "*act" would become "act$*" and a query such as "cat*act" would become "act$cat*". Queries which already have a wildcard at the end are modified slightly - "cat*" becoems "$cat*". The "$" marker indicates a word boundary.

I decided to implement this logic as a custom Query type, using a pattern similar to the WildcardQuery, somewhat unimaginatively named PermutermWildcardQuery. Here it is.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
// Source: src/main/java/com/mycompany/permuterm/PermutermWildcardQuery.java
package com.mycompany.permuterm;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredTermEnum;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixTermEnum;
import org.apache.lucene.search.SingleTermEnum;

public class PermutermWildcardQuery extends MultiTermQuery {

  private static final long serialVersionUID = 1245766111925148372L;

  private Term term;
  
  public PermutermWildcardQuery(Term term) {
    this.term = term;
  }
  
  @Override
  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    String field = term.field();
    String text = term.text();
    if (text.indexOf('*') == -1) {
      // no wildcards, use a term query
      return new SingleTermEnum(reader, term);
    } else {
      if (text.charAt(0) == '*' && 
          text.charAt(text.length()-1) == '*') {
        // leading and trailing '*', remove trailing '*'
        // and permute to make suffix query
        String ptext = permute(
          text.substring(0, text.length()-1), 0);
        return new PrefixTermEnum(reader, new Term(field, new String(ptext)));
      } else if (text.charAt(text.length()-1) == '*') {
        // trailing '*', pad with '$' on front and convert
        // to prefix query
        String ptext = "$" + text.substring(0, text.length()-1);
        return new PrefixTermEnum(reader, new Term(field, ptext));
      } else {
        // leading/within '*', pad with '$" on end and permute
        // to convert to prefix query
        String ptext = permute(text + "$", text.indexOf('*'));
        return new PrefixTermEnum(reader, new Term(field, ptext));
      }
    }
  }

  private String permute(String text, int starAt) {
    char[] tbuf = new char[text.length()];
    int tpos = 0;
    for (int i = starAt + 1; i < tbuf.length; i++) {
      tbuf[tpos++] = text.charAt(i);
    }
    for (int i = 0; i < starAt; i++) {
      tbuf[tpos++] = text.charAt(i);
    }
    return new String(tbuf, 0, tpos);
  }

  @Override
  public String toString(String field) {
    return "PermutermWildcardQuery(" + term + ")";
  }
}

The getEnum() method, which all subclasses of MultiTermQuery are required to implement, contains the logic to detect different types of query patterns and handling them slightly differently. The FilteredTermEnum that is generated is used by the superclass to construct a BooleanQuery of all the terms and to return the results.

Running the Query

To run the query with different query terms, I used the following simple JUnit test.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
  private static final String[] SEARCH_TERMS = new String[] {
    "cataract", "cat*", "*act", "cat*act", "*ara*"
  };
  @Test
  public void testPermutermQuery() throws Exception {
    IndexSearcher searcher = new IndexSearcher(
      FSDirectory.open(new File("/path/to/new/index")));
    for (String searchTerm : SEARCH_TERMS) {
      PermutermWildcardQuery q = new PermutermWildcardQuery(
        new Term("title", searchTerm));
      TopScoreDocCollector collector = 
        TopScoreDocCollector.create(10, true);
      searcher.search(q, collector);
      int numResults = collector.getTotalHits();
      ScoreDoc[] hits = collector.topDocs().scoreDocs;
      System.out.println("=== " + numResults + 
          " results for: " + searchTerm + 
          " (query=" + q.toString() + ") ===");
      for (int i = 0; i < hits.length; i++) {
        Document doc = searcher.doc(hits[i].doc);
        System.out.println(".. " + doc.get("title"));
      }
    }
    searcher.close();
  }

The results from this run are shown below. As you can see, the results look accurate enough, and they also return fast.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
=== 3 results for: cataract (query=PermutermWildcardQuery(title:cataract)) ===
.. Cataract
.. Congenital cataract
.. Cataract removal
=== 17 results for: cat* (query=PermutermWildcardQuery(title:cat*)) ===
.. Catheter-associated UTI
.. Cataract
.. Cat scratch disease
.. Congenital cataract
.. Caterpillars
.. Cataract removal
.. Cardiac catheterization
.. Catecholamines - blood
.. Catecholamines - urine
.. Urine culture - clean catch
=== 8 results for: *act (query=PermutermWildcardQuery(title:*act)) ===
.. Urinary tract infection - children
.. Urinary tract infection - adults
.. Contact dermatitis
.. Cataract
.. Developmental disorders of the female reproductive tract
.. Congenital cataract
.. Cataract removal
.. Biopsy - biliary tract
=== 3 results for: cat*act (query=PermutermWildcardQuery(title:cat*act)) ===
.. Cataract
.. Congenital cataract
.. Cataract removal
=== 44 results for: *ara* (query=PermutermWildcardQuery(title:*ara*)) ===
.. Parapneumonic pulmonary effusion
.. Periodic paralysis with hypokalemia
.. Hypokalemic periodic paralysis
.. Hyperkalemic periodic paralysis
.. Secondary hyperparathyroidism
.. Thyrotoxic periodic paralysis
.. Pseudohypoparathyroidism
.. Primary hyperparathyroidism
.. Hypoparathyroidism
.. Subarachnoid hemorrhage

This implementation does have limitations, of course. For one, multiple wildcards (except one each at the beginning and end) are not supported. I need to figure out what I should do in that case. It is likely that I will end up building my own custom FilteredTermEnum to iterate through the TermEnum using the first part (until the first occurrence of the wildcard), then match each subsequent using regular expressions. I haven't thought this through completely yet. A lot depends on whether we decide to switch out the current database backed implementation with this one (in that case I will think about improving it). If I do end up improving it, I will describe it here.

Be the first to comment. Comments are moderated to prevent spam.