Friday, August 07, 2009

Webservice Wrapper for Python scripts with CherryPy

In my previous post, I mentioned that loading dictionary.txt in the first Mapper job may need revisiting for a distributed environment, since the slaves will not have access to the master's local filesystem. One way to solve this is to wrap the data structure in an HTTP webservice that is accessible from all the slaves. I had started down this path when I learned that Hadoop's DistributedCache is tailor-made for this situation. So this approach is probably not too useful in a Hadoop environment, but I am sure it will come in useful elsewhere, so here goes.

The idea is that you build up the data structure once, and wrap it with a simple HTTP webserver. Calls that you would normally make to the data structure from client code are replaced by HTTP GET requests. If a method call needs arguments, they are passed in through the request parameters. I initially envisioned the response from the request to be a comma-separated text string but I changed it later to emit a JSON string instead, since using JSON gives me the ability to return structured objects if I need to. The client deserializes the JSON string back into the appropriate data structure for the application.
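
As a rough illustration of how a method call turns into a GET request, here is a minimal sketch. It is written for Python 3 (where urlencode lives in urllib.parse), and the helper name request_path is hypothetical, not something from the service itself.

```python
from urllib.parse import urlencode

def request_path(method, **arguments):
    """Map a method name and its keyword arguments onto a GET request path."""
    path = "/" + method
    if arguments:
        # Arguments travel as request parameters; sorting keeps the path stable.
        path += "?" + urlencode(sorted(arguments.items()))
    return path

labels_path = request_path("labels")
synonyms_path = request_path("synonyms", label="ruby on rails")
```

Here labels_path is "/labels" and synonyms_path is "/synonyms?label=ruby+on+rails", since urlencode takes care of escaping the spaces.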

The JSON serialization and deserialization is handled by the simplejson library, so all the application developer has to do is wrap the result in json.dumps() on the server side and pass the response through json.loads() on the client side. In Python 2.6 and above, JSON support is built in, but I use Python 2.5, so I had to install simplejson, which has an API identical to the built-in json module in Python 2.6 (going by the documentation for both).
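
The round trip can be sketched in a few lines. The import fallback below is one common way to run the same code with either the built-in module or simplejson; the sample data and the two helper functions are hypothetical and only mirror the shape of the dictionary used later.

```python
try:
    import json                 # standard library in Python 2.6 and later
except ImportError:
    import simplejson as json   # same dumps()/loads() API on Python 2.5

# Hypothetical sample data in the shape the server keeps: label -> synonyms.
sample = {"ror": ["ruby-on-rails", "scripting", "webapp-development"]}

def encode_synonyms(dictionary, label):
    """Server side: serialize the synonyms for a label (empty list if unknown)."""
    return json.dumps(dictionary.get(label, []))

def decode_synonyms(payload):
    """Client side: turn the JSON text back into a Python list."""
    return json.loads(payload)

wire = encode_synonyms(sample, "ror")   # JSON text, as sent over HTTP
restored = decode_synonyms(wire)        # a plain Python list again
```

The value in wire is just text, which is why a client in any language with a JSON parser could consume it.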

For the HTTP server, I originally planned to write a simple one based on the many examples on the Internet, but I stumbled upon CherryPy while looking for one. CherryPy is actually a framework for writing standalone web applications (it comes with a built-in server), so it's a bit larger than what I had in mind. However, it is so unobtrusive that it becomes almost invisible (a hallmark of great framework design in my opinion), so I decided to go ahead and use it. As you can see from the code for the server below, the code is mostly about the application, not about the framework.

#!/usr/bin/python
# Source: src/main/python/dict_server.py
import cherrypy
import os.path
import simplejson as json

DICTFILE = "/home/sujit/tmp/dictionary.txt"

class Dictionary:

  def __init__(self, dictpath=DICTFILE):
    """
    Loads up the file into an internal data structure {String:Set(String)}
    @param dictpath the path to the dictionary file, defaults to DICTFILE
    """
    # always create the dictionary so the handlers work even without a file
    self.dictionary = {}
    if (os.path.exists(dictpath)):
      dictfile = open(dictpath, 'rb')
      for line in dictfile:
        line = line.strip()             # drop the newline, if any
        if (not line):
          continue                      # skip blank lines instead of stopping
        line = line.lower()             # lowercase the line
        line = line.replace(" ", "-")   # replace all spaces by "-"
        vals = []
        if (line.find(":") > -1):
          # split on the first ":" only, in case the RHS contains one too
          (lhs, rhs) = line.split(":", 1)
          vals.extend(rhs.split(","))
          self.dictionary[lhs] = vals
        else:
          self.dictionary[line] = vals
      dictfile.close()

  def index(self):
    """
    This method is called when the / request is made. This is just an
    informational page, mostly for human consumption, it will not be
    called from the client.
    """
    return """<br/><b>To get back all labels in dictionary, enter:</b>
           <br/>http://localhost:8080/labels
           <br/><b>To get back synonyms for a given label, enter:</b>
           <br/>http://localhost:8080/synonyms?label=${label}"""

  def labels(self):
    """
    Returns a JSON list of dictionary keys. Each element in the list
    corresponds to a "label" in one of my blog posts.
    @return a JSON list of dictionary keys.
    """
    cherrypy.response.headers["Content-Type"] = "application/json"
    return json.dumps(self.dictionary.keys())

  def synonyms(self, label):
    """
    Given a dictionary key, returns its human-generated "synonyms". This
    corresponds to the RHS of the dictionary.txt file. The synonyms are
    returned as a JSON list.
    @param label the dictionary key, must be provided
    @return a JSON list of synonyms for the dictionary keys.
    """
    try:
      cherrypy.response.headers["Content-Type"] = "application/json"
      return json.dumps(self.dictionary[label])
    except KeyError:
      return "[]"

  index.exposed = True
  labels.exposed = True
  synonyms.exposed = True

def main():
  cherrypy.quickstart(Dictionary())
  
if __name__ == "__main__":
  main()

The server exposes three URLs: the index page (an informational page about the service, for human consumption), a labels page which returns a JSON list of the labels in the dictionary, and a synonyms page which returns a JSON list of synonyms for a given label. I used a browser to debug the server, and CherryPy provides very nice error reporting.
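
To make the dictionary format concrete, the per-line logic in Dictionary.__init__ can be pulled out into a small function. This is only a sketch: the name parse_line and the sample lines are hypothetical (the contents of dictionary.txt are not shown in this post), and it uses strip() to drop the newline.

```python
def parse_line(line):
    """Turn one dictionary line into a (label, synonyms) pair.

    A line is either "label" or "label:synonym1,synonym2,...". Text is
    lowercased and spaces become "-", as in the server's loader.
    """
    line = line.strip().lower().replace(" ", "-")
    if ":" in line:
        lhs, rhs = line.split(":", 1)
        return lhs, rhs.split(",")
    return line, []

with_synonyms = parse_line("RoR:ruby on rails,scripting,webapp development\n")
without_synonyms = parse_line("Dojo\n")
```

With these inputs, with_synonyms is ("ror", ["ruby-on-rails", "scripting", "webapp-development"]) and without_synonyms is ("dojo", []).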

To get at the information in these pages programmatically, one can build a client inside the application that is consuming the service, similar to the one shown below. The client below is a simple command line application that takes its parameters as command line arguments and returns values from the server, but I have designed it as I would a "real" application, with the main() method delegating to a remote facade that is responsible for talking to the server, so hopefully the example makes a bit more sense.

#!/usr/bin/python
# Source: src/main/python/dict_client.py
import getopt
import httplib
import simplejson as json
import sys
import urllib

DEFAULT_SERVER = "localhost:8080"

class RemoteDictionary:
  """
  A client facade for the Dictionary object exposed by the server.
  Contains methods with identical signatures as the remote object,
  so client application can treat it as a local object in application
  code.
  """
  def __init__(self, server=DEFAULT_SERVER):
    self.server = server

  def labels(self):
    data = self.__transport__("/labels")
    return json.loads(data)

  def synonyms(self, label):
    params = urllib.urlencode({"label" : label})
    data = self.__transport__("/synonyms?%s" % (params))
    return json.loads(data)

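  def __transport__(self, url):
    # create the connection before the try block so that the finally
    # clause never refers to an undefined name
    conn = httplib.HTTPConnection(self.server)
    try:
      conn.request("GET", url)
      response = conn.getresponse()
      if (response.status == 200):
        data = response.read()
        return data
      # fail loudly instead of handing None to json.loads()
      raise IOError("GET %s returned HTTP %d" % (url, response.status))
    finally:
      conn.close()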

def usage(message=""):
  if (len(message) > 0):
    print "Error: %s" % (message)
  print "Usage: %s [--label=${label}] labels|synonyms" % (sys.argv[0])
  print "--label|-l: the label for which synonyms are needed."
  print "labels    : show all labels in the dictionary."
  print "synonyms  : show all synonyms for a given label. The --label"
  print "parameter is required."
  print "One of labels or synonyms is required."
  sys.exit(-1)

def main():
  # extract and validate command parameters
  (opts, args) = getopt.getopt(sys.argv[1:], "l:h", ["label=", "help"])
  operation = ""
  if (len(args) > 0):
    operation = args[0]
    if (operation != "labels" and operation != "synonyms"):
      usage("Invalid operation [%s], should be 'labels' or 'synonyms'" % (operation))
  else:
    usage("One of 'labels' or 'synonyms' must be specified")
  label = ""
  for option, argval in opts:
    if option in ("-h", "--help"):
      usage()
    if option in ("-l", "--label"):
      label = argval
  if (operation == "synonyms" and label == ""):
    usage("No label provided for 'synonyms' request")
  # pretend that this is a real application and delegate to the Remote
  # Dictionary Facade
  dictionary = RemoteDictionary()
  if (operation == "labels"):
    print dictionary.labels()
  else:
    print dictionary.synonyms(label)

if __name__ == "__main__":
  main()

To start the server, invoke dict_server.py without arguments from the command line; it starts up an HTTP listener on port 8080. To terminate it when you are done using the service, press CTRL+C.

sujit@sirocco:~$ ./dict_server.py 
...
[06/Aug/2009:23:29:10] ENGINE Serving on 127.0.0.1:8080
[06/Aug/2009:23:29:10] ENGINE Bus STARTED
...

Here are some examples of calling this from client code. My client is written in Python, but it could be written in any language, since all it is doing is making an HTTP GET request and reading the response.

sujit@sirocco:~$ ./dict_client.py labels
['edb', 'dojo', 'remoting', 'decorator-pattern', ...]
sujit@sirocco:~$ ./dict_client.py --label=ror synonyms
['ruby-on-rails', 'scripting', 'webapp-development']
sujit@sirocco:~$
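
Since the client only issues a GET request and parses JSON, the whole exchange can also be tried without CherryPy. The sketch below is written for Python 3 and swaps in the standard library's http.server as a stand-in for the real service (it is not the CherryPy server above), with http.client playing the role of httplib. The names StandInHandler, fetch_json and demo, and the sample data, are all hypothetical.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical stand-in data; the real service loads dictionary.txt.
DATA = {"ror": ["ruby-on-rails", "scripting", "webapp-development"]}

class StandInHandler(BaseHTTPRequestHandler):
    """Answers /labels and /synonyms?label=... with JSON."""

    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path == "/labels":
            result = sorted(DATA)
        elif parsed.path == "/synonyms":
            label = parse_qs(parsed.query).get("label", [""])[0]
            result = DATA.get(label, [])
        else:
            self.send_error(404)
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def fetch_json(host, port, path, params=None):
    """Issue a GET request and decode the JSON response body."""
    if params:
        path += "?" + urlencode(params)
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        if response.status != 200:
            raise IOError("GET %s returned HTTP %d" % (path, response.status))
        return json.loads(response.read().decode("utf-8"))
    finally:
        conn.close()

def demo():
    """Start the stand-in server on a free port, query it, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), StandInHandler)  # port 0: any free port
    worker = threading.Thread(target=server.serve_forever)
    worker.daemon = True
    worker.start()
    try:
        port = server.server_address[1]
        return (fetch_json("127.0.0.1", port, "/labels"),
                fetch_json("127.0.0.1", port, "/synonyms", {"label": "ror"}))
    finally:
        server.shutdown()
        server.server_close()

if __name__ == "__main__":
    print(demo())
```

Pointing fetch_json at localhost and port 8080 should work against the CherryPy server in the same way, assuming it is running with the URLs described earlier.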

The approach of wrapping a service in a webserver container is quite common, and is very useful when trying to share an expensive resource among a network of multiple clients. I was quite impressed by how simple this was with CherryPy. Although I don't need this at the moment, it is good to know that this is so easy to do, should I ever need to.

2 comments (moderated to prevent spam):

fumanchu said...

Hi Sujit,

Good article, and I'm glad you found CherryPy simple enough--we work hard at staying out of the way until needed.

One thing you might want to add in order for your CherryPy app to be more interoperable: specify the Content-Type response header when you emit JSON. That will help other clients which might hit your server to correctly parse the response.

In both of your page handler methods that return "json.dumps(xyz)" you can easily accomplish this by preceding each of those lines with the single line "cherrypy.response.headers['Content-Type'] = 'application/json'".

Sujit Pal said...

Thank you for your comment and your suggestion - I have added the content type declaration in the response header as you suggested. I took a look at your blog, and I see that the CherryPy "unobtrusiveness" that I was so impressed with is actually kind of a coding philosophy with you - which makes me look forward to using your other projects at some point in the future :-). I also want to see if I can mimic your programming approach with Jetty in Java.