Saturday, August 15, 2009

Semantic Tag Suggester in Python

A couple of weeks ago, I mentioned how nice it would be to have a program that suggested "tags" (called Labels in Blogger) for my blog posts, using my existing tags and the content of the post as input. The motivation was to clean up my rather unfocused, but nevertheless useful, Tag Cloud on the left rail.

I initially thought of rolling the tags up into a set of 10 or so categories, but I like being able to find an obscure post by clicking on the specific technology, something I would lose with that approach. So the solution I came up with was to keep both: the specific tags remain, but the obscure ones also roll up into larger category tags. In addition, because I often refer to the same thing in slightly different ways, synonymy had to be considered, both for the base tag as well as when rolling it up into one or more categories. So my "dictionary" (full file available here) of tags looks something like this.

# Source: src/main/resources/blog_dict.txt
# Format:
# Label:Synonyms:Categories
acegi::java,security
actor:actors:concurrent-programming,patterns
actorfoundry::java,concurrent-programming,patterns,actor
ajax::javascript,web-development
algorithms::
annotations::java,programming
ant::java,programming
apache-cxf::java,webservices,soap
apache-solr::java,search,lucene
atom::xml,webservices
...

The fields are separated by the colon character. Label is the only required field; synonyms and categories are optional. The original list of labels (the first column) was built using a download of my blogs from Blogger and a few simple Unix commands. The second and third columns are built by hand, and are, of course, rather subjective and context-dependent.
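Parsing a line of this format is just a split on colons followed by a split on commas. Here is a minimal sketch of the idea (the function name parse_dict_line is mine, not part of the script below):

```python
def parse_dict_line(line):
    """Split a 'label:synonyms:categories' dictionary line into its parts.

    Multi-word entries are hyphenated; an empty synonyms or categories
    field becomes an empty list.
    """
    def split_csv(field):
        # "a,b".split(",") -> ["a", "b"]; "".split(",") -> [""], so
        # filter out blank entries to map an empty field to []
        return [x for x in field.split(",") if x.strip()]

    label, synonyms, categories = (
        line.strip().lower().replace(" ", "-").split(":"))
    return label, split_csv(synonyms), split_csv(categories)

# prints the label, its synonym list, and its category list
print(parse_dict_line("actor:actors:concurrent-programming,patterns"))
print(parse_dict_line("algorithms::"))
```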

The usage of this tool depends a lot on the peculiarities of my blogging habits. Because I write my blog offline and copy-paste it into the Blogger editor tool for upload, I have access to the original text to pass into the script. The script parses the blog into an inverted index of words to positions, then loops through the labels and their synonyms in the dictionary to find the occurrences of these words or phrases. The output is a list of labels sorted in descending order by their frequency.
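The phrase lookup is standard positional-index matching: a multi-word phrase occurs wherever each word's position is exactly one past a position of the previous word. A toy sketch of the idea, independent of the script below:

```python
def phrase_count(index, phrase):
    """Count occurrences of a hyphenated phrase in a
    word -> set-of-positions inverted index."""
    prev = None
    for word in phrase.split("-"):
        if word not in index:
            return 0
        if prev is None:
            # first word: its positions are the starting candidates
            prev = set(index[word])
        else:
            # keep only positions that directly follow a surviving
            # position of the previous word
            prev = index[word] & {p + 1 for p in prev}
    return len(prev)

# toy index for the text: "web services and web development"
index = {"web": {1, 4}, "services": {2}, "and": {3}, "development": {5}}
print(phrase_count(index, "web-services"))    # 1
print(phrase_count(index, "web"))             # 2
```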

As before, the logic for parsing the body of the blog post into an inverted index depends on some home-grown markup that I use when writing my blogs. For one, I like to decide the title beforehand, and put it on the first line of my post in HTML <title> tags. I also put code and data inside <pre> tags. So the parsing logic uses the <title> tag marker to count any dictionary phrases in the title TITLE_BOOST times, on the assumption that the title describes the content quite well. Further, the parser skips the (usually multi-line) content between <pre> tags.
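In outline, the markup handling is a small state machine over the lines of the post: suppress everything inside a <pre> block, and weight title lines by repeating them. A simplified sketch (my own, not the parser below):

```python
TITLE_BOOST = 5  # same idea as in the script: title lines count 5x

def effective_lines(lines):
    """Yield (line, repeat) pairs: drop <pre>...</pre> blocks,
    and give <title> lines a repeat count of TITLE_BOOST."""
    in_pre = False
    for line in lines:
        if line.startswith("<pre>"):
            in_pre = True
        if in_pre and line.startswith("</pre>"):
            in_pre = False
            continue  # drop the closing tag line itself
        if in_pre:
            continue  # drop everything inside the code block
        boost = (TITLE_BOOST
                 if line.startswith("<title>") and line.endswith("</title>")
                 else 1)
        yield line, boost
```

Repeating the title line (rather than multiplying a score) keeps the positional adjacency of the title words intact, so multi-word phrases in the title still match.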

The script is written in Python and initially creates two data structures, the Dictionary and the InvertedIndex. The Dictionary provides methods to return the list of labels, and the synonyms and categories for a given label. The inverted index provides a method to return the occurrence count for a given word or phrase. Here is the script:

#!/usr/bin/python
import sys
import getopt
import os.path
import re

### Configuration ###
# path to dict file
DEFAULT_DICT_PATH = "/home/sujit/bin/blog_dict.txt" 
TITLE_BOOST = 5      # score boost for title occurrences
COUNT_CUTOFF = 4     # report only terms occurring more than this many times
### Configuration ###

class InvertedIndex:
  """
  Class to convert the input file into an inverted index of words 
  and their corresponding word positions. The class uses this data 
  structure to return a count of occurrences of single or multi-
  word phrases in the document.
  """
  def __init__(self, inpath):
    self.index = {}
    self.wordSplitter = re.compile(r'[-_]')
    infile = open(inpath, 'r')
    # split on runs of punctuation/whitespace and strip inline HTML tags
    regex = re.compile(r'[;,.\-_" ]+|<[^>]*>')
    pos = 0
    ignorable = False    # words are not counted while this is True
    while (True):
      line = infile.readline()
      if (not line):
        break
      # code blocks start with <pre> and end with </pre>, 
      # which we ignore.
      if (line.startswith("<pre>")):
        ignorable = True
      if (ignorable and line.startswith("</pre>")):
        ignorable = False
      if (ignorable):
        continue
      line = line[:-1].lower()
      words = regex.split(line)
      # bump pos so that phrase matches cannot span line boundaries
      pos = pos + 1
      boost = 1
      if line.startswith("<title>") and \
          line.endswith("</title>"):
        # word matches in title are scored TITLE_BOOST times normal
        # this is done by considering the title line TITLE_BOOST 
        # times, that way, neighboring words are still neighboring 
        # words, but they just "occur" TITLE_BOOST times in the title.
        boost = TITLE_BOOST
      for i in range(0, boost):
        for word in words:
          try:
            positions = self.index[word]
          except KeyError:
            positions = set()
          pos = pos + 1
          positions.add(pos)
          self.index[word] = positions
    infile.close()

  def getCount(self, term):
    words = self.wordSplitter.split(term.lower())
    prevPositions = set()
    newPositions = set()
    for word in words:
      try:
        positions = self.index[word]
        if (len(prevPositions) == 0):
          # the previous positions set is empty, which can only
          # happen for the first word of the term, so there is
          # nothing to filter against; the current positions become
          # the basis for filtering. If the loop ends here (a
          # single-word term), the count is simply the size of
          # this set.
          prevPositions.update(positions)
        else:
          # we have a previous position set, so we keep each entry
          # in the current positions that is exactly 1 ahead of an
          # entry in the previous positions. The intersection forms
          # the basis of the previous positions for the next word
          # in the phrase.
          newPositions = positions.intersection(
            map(lambda x: x+1, prevPositions))
          prevPositions.clear()
          prevPositions.update(newPositions)
          newPositions.clear()
      except KeyError:
        return 0
    return len(prevPositions)

class Dictionary():
  """
  Class to encapsulate the creation of a data structure from the 
  dictionary file. Format of the dictionary file is as follows:
  label:label_synonyms:label_categories
  where:
  label = the word or phrase that represents a label. If the label 
          is a multi-word label, it should be hyphenated.
  label_synonyms = a comma-separated list of synonyms for the label. 
          For example, webservice may be used instead of web-service. 
          As in the label field, multi-word synonyms should be 
          hyphenated.
  label_categories = a comma-separated list of categories the label 
          should roll up to. For example, cx_oracle could roll up to
          databases, oracle, python and scripting.
  """
  def __init__(self, dictpath):
    self.cats = {}
    self.syns = {}
    dictfile = open(dictpath, 'r')
    while (True):
      line = dictfile.readline()
      if (not line):
        break
      if (line.startswith("#")):
        # comment line, skip
        continue
      # strip out the line terminator, lowercase and replace all
      # whitespace with hyphen.
      line = line[:-1].lower().replace(" ", "-")
      (label, synonyms, categories) = line.split(":")
      # an empty synonyms or categories field would otherwise be
      # stored as a single empty element; the filter removes it so
      # an empty string maps to an empty list
      self.cats[label] = filter(
        lambda x: len(x.strip()) > 0, categories.split(","))
      self.syns[label] = filter(
        lambda x: len(x.strip()) > 0, synonyms.split(","))
    dictfile.close()

  def labels(self):
    """
    Return the list of all labels in the dictionary.
    @return the list of all labels.
    """
    return self.cats.keys()

  def synonyms(self, label):
    """
    Return the synonyms for the specified label. If no synonyms
    exist, returns an empty List.
    @param label the label to look up in the dictionary.
    @return a List of synonyms mapped to the label
    """
    try:
      return self.syns[label]
    except KeyError:
      return []

  def categories(self, label):
    """
    Return the categories for the specified label. If no categories
    exist, returns an empty List.
    @param label the label to look up in the dictionary.
    @return a List of categories mapped to the label.
    """
    try:
      return self.cats[label]
    except KeyError:
      return []
    
def usage(message=""):
  if (len(message) > 0):
    print "Error: %s" % (message)
  print "Usage: %s --help|([--dict=dict_file] --file=input_file)" % \
    (sys.argv[0])
  print "-d|--dict: full path to dictionary file"
  print "-f|--file: full path to input file to be analyzed"
  print "-i|--interactive: prompt for each label and create a label string"
  print "-h|--help: print this information"
  sys.exit(2)

def validate():
  """
  Parses the command line parameters and extracts the relevant info
  from it. Returns a triple of (dictpath, filepath, interactive).
  @return a triple extracted from the command line parameters.
  """
  (opts, args) = getopt.getopt(sys.argv[1:], "d:f:ih",
    ["dict=", "file=", "interactive", "help"])
  dictpath = None
  filepath = None
  interactive = False
  for option, argval in opts:
    if (option in ("-h", "--help")):
      usage()
    if (option in ("-d", "--dict")):
      dictpath = argval
      if (not os.path.exists(dictpath)):
        usage("Dictionary File [%s] does not exist" % (dictpath))
    if (option in ("-f", "--file")):
      filepath = argval
      if (not os.path.exists(filepath)):
        usage("Input File [%s] does not exist" % filepath)
    if (option in ("-i", "--interactive")):
      interactive = True
  if (dictpath == None):
    dictpath = DEFAULT_DICT_PATH
  if (filepath == None):
    usage("Input file must be specified")
  return (dictpath, filepath, interactive)

def addCount(countmap, term, count):
  """
  Adds the count to the term count mapping. If no term count mapping
  exists, then one is created and the count updated.
  @param countmap the map of term to term counts to update.
  @param term the term to update for.
  @param count the term count to update.
  """
  try:
    origCount = countmap[term]
  except KeyError:
    origCount = 0
  countmap[term] = origCount + count

def main():
  (dictpath, inpath, interactive) = validate()
  dictionary = Dictionary(dictpath)
  invertedIndex = InvertedIndex(inpath)
  occurrences = {}
  # grab the basic occurrence counts for each label and its synonyms
  for term in dictionary.labels():
    termCount = invertedIndex.getCount(term)
    addCount(occurrences, term, termCount)
    for synonym in dictionary.synonyms(term):
      # if synonyms exist, also look for occurrences of the synonyms and
      # add it to the occurrence count for the label
      addCount(occurrences, term, invertedIndex.getCount(synonym))
    for category in dictionary.categories(term):
      # add the updated term count (for base label and all its synonyms)
      # to the category occurrences.
      addCount(occurrences, category, occurrences[term])
  terms = occurrences.keys()
  # filter out terms whose counts do not exceed our cutoff
  terms = filter(lambda x: occurrences[x] > COUNT_CUTOFF, terms)
  # sort the remaining (filtered) terms by their count descending
  terms.sort(lambda x, y: occurrences[y] - occurrences[x])
  if (interactive):
    labels = []
    for term in terms:
      yorn = raw_input("%s (%d) - include[Y/n]? " % 
        (term, occurrences[term]))
      if (yorn == 'n' or yorn == 'N'):
        continue
      labels.append(term)
    print "Labels: %s" % (",".join(labels))
  else:
    for term in terms:
      print "%s (%d)" % (term, occurrences[term])
    print "Labels: %s" % (",".join(terms))
    
if __name__ == "__main__":
  main()

The default dictionary file location, the title boost, and the minimum occurrence cutoff are all specific to a particular setup, so they live in the Configuration block at the beginning of the script. The code is fairly heavily commented, so I won't bore you by describing it here. In interactive mode, the script presents the labels one by one in descending order of frequency, and you can choose whether to keep each one, based on your judgement.

I ran the script on this post, and here is what I got. To the suggested labels, I also added data-structure and algorithms, since apart from the user-defined InvertedIndex and Dictionary structures, the script also makes use of the very handy Python set structure.

sujit@sirocco:$ ./tagsuggest.py --file=/path/to/my/post.txt
python (6)
scripting (6)
Labels: python,scripting

Overall, I think this script is going to be pretty useful for me. The quality of its suggestions depends on the content of the dictionary, so they are not 100% accurate. Fortunately, since the tool is interactive, they do not have to be - they just have to be better than my own memory, since the dictionary "knows" about tags that I have used but no longer remember.
