Sunday, November 25, 2007

Python Database hacks

I seem to have writer's block this weekend. Or if you want to be more pragmatic, I did not take the time to work on a mini-project that I can write about. In reality, I spent last week goofing off, decompressing after a nearly completed project, and putting the finishing touches on a bunch of other little projects that have been in the pipeline for a while. As a result, I don't have anything to write about this weekend. A little disappointing, given that this was Thanksgiving weekend, and I had four days of vacation to do something interesting. But as so often happens with long vacations, the time got sucked up into other little inconsequential things.

So this week, I will write about three little features that I came across while writing Python scripts to work against an Oracle database. These should work with other databases as well. Maybe they don't even deserve to be called hacks, since they are adequately documented if you look hard enough, but it's probably new stuff for beginning Python programmers such as myself. Hopefully this will be useful to others who are in the same situation as me.

Exception Handling

I found this out when trying to update a bunch of linked database tables using a Python script. An error in my program or an unhandled problem in the data would result in a partial update to the tables, forcing me to manually roll back the changes that had been applied. To handle this, Python has a try:...except: construct that can be used to catch exceptions and commit or roll back accordingly. Here is the code snippet to do this:

  import traceback
  ...
  conn = cx_Oracle.connect(connection_string)
  cursor1 = conn.cursor()
  cursor2 = conn.cursor()
  ...
  try:
    ... do something with various cursors derived from conn
    conn.commit()
  except Exception, inst:
    print "Error:", inst
    traceback.print_exc()
    conn.rollback()

The first print in the exception block prints the database error message, and the traceback.print_exc() call prints the stack trace to stderr. You can find more information in the Python tutorial page for Errors and Exceptions.
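
One refinement I would add on top of this (not part of the original snippet) is a finally: block, so that the cursors and the connection are always closed whether the transaction commits or rolls back. A minimal sketch; note that combining except: and finally: in a single statement needs Python 2.5 or later:

  try:
    # ... do something with various cursors derived from conn ...
    conn.commit()
  except Exception, inst:
    print "Error:", inst
    traceback.print_exc()
    conn.rollback()
  finally:
    # release database resources regardless of success or failure
    cursor1.close()
    cursor2.close()
    conn.close()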

Accessing Database Metadata

I had posted code for a Python CSV to Excel converter in my blog sometime back. Most of the time, I would use this to convert CSV files dumped from a Java program or Python script into an Excel spreadsheet for consumption by my flat-file-challenged clients. I decided to extend it recently to take in a SQL SELECT statement, and then dump out the results into an Excel spreadsheet. In addition to the rows, I also needed to get at the column names for the header. Python's DB-API exposes a description attribute on the cursor which contains this information, like so:

  conn = cx_Oracle.connect(connection_string)
  cursor = conn.cursor()
  cursor.execute(query)
  rows = cursor.fetchall()
  colnames = []
  metadaten = cursor.description  # an attribute, not a method
  for metadata in metadaten:
    colnames.append(metadata[0])
  print ",".join(colnames)
  ...

The description attribute actually holds a lot more information about each column, but I was only interested in the column name (the first element of each tuple). More information about description and the contents of the data structure it holds can be found in the ASPN Python Cookbook page here.
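
To show how the pieces fit together, here is a small sketch (my own illustration, not the original converter) that dumps a query's header row and data rows into a CSV file with the csv module; the query, table and output file names are placeholders, and connection_string is assumed to be defined as before:

  import csv
  import cx_Oracle

  conn = cx_Oracle.connect(connection_string)
  cursor = conn.cursor()
  cursor.execute("select title, url, summary from sometable")
  # header row comes from cursor.description, data rows from fetchall()
  colnames = [metadata[0] for metadata in cursor.description]
  writer = csv.writer(open("dump.csv", "wb"))
  writer.writerow(colnames)
  for row in cursor.fetchall():
    writer.writerow(row)
  conn.close()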

Unicode Handling

If your database is set up to store only ASCII characters, and your data has characters which fall outside the 0..127 range, then you will have problems storing this data. To get around this, you can have your script replace the non-ASCII characters with "?" using the following call before you insert or update:

  asciifield = unicodefield.encode("ascii", "replace")

The encode call accepts other error-handling modes and character sets; see the Python Unicode Tutorial for more information and examples.
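
For example, here is what a few of the different error handlers do with a small test string (my own illustration):

  s = u"caf\u00e9"
  print s.encode("ascii", "replace")            # caf?
  print s.encode("ascii", "ignore")             # caf
  print s.encode("ascii", "xmlcharrefreplace")  # caf&#233;
  print s.encode("iso-8859-1")                  # the accented character survives in Latin-1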

Saturday, November 17, 2007

To Acegi or not to Acegi

In a webservice application I am currently working on, users are identified by a userId. Like any typical webservice, each request to the service needs to have the userId parameter tacked on to the request parameters. The application has a fairly sophisticated, but home-grown, authentication and authorization mechanism that it inherited from the previous version of the application. Since I was using a new architecture and new libraries for this application anyway, I figured I would give Acegi, the popular Spring-based security framework, a try. I ultimately decided not to use Acegi, but I describe here the new classes and configuration I needed to build to integrate my application with Acegi, why I ended up not using it, and what parts of Acegi I decided to reimplement in my code.

The original application

The webservice is accessed via HTTP GET calls. It returns XML (or JSON) as a response. The webservice application carries out the following authentication checks. In all cases, if the check fails, an HTTP 403 (Forbidden) status code with an appropriate error message is returned.

  • Is the userId supplied in the request parameters? A userId request parameter must be specified in the request.
  • If supplied, is it valid? We do a match against our user database to find out.
  • If valid, is it coming from an IP address for which the user is registered? When the user is registered in the system, zero or more subnet masks are also registered. If no subnet masks are registered, then this authentication step is skipped. Otherwise, the user's IP address must match one of the registered subnet masks for the user.

Once these checks are done, the following authorization checks are made. As before, if a check fails, an HTTP 403 (Forbidden) status code with an appropriate error message is returned.

  • Is the user authorized to make the webservice method call he is making? The user is authorized to one or more webservice methods in the user database.
  • Is the user authorized for any explicitly requested content? A method may return one or more response elements. If the user specifically requests one or more elements in the request, then this check is made.
  • While the response is being generated, each element is checked against the user's content authorization, and if the user is not authorized for the content, generation of that element is skipped.

The webservice is written using the Spring Framework. The first five checks are done by a Spring HandlerInterceptor which intercepts all method calls to the Controller object that the DispatcherServlet delegates to. If any of these fail, the response is sent to an error page with the appropriate HTTP status code (403).
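
For context, the shape of that interceptor is roughly as follows. This is a simplified sketch with hypothetical names, not the actual application code; the two isAuthorizedFor* calls stand in for the checks described above:

// SecurityHandlerInterceptor.java (sketch)
public class SecurityHandlerInterceptor extends HandlerInterceptorAdapter {

  private UserDao userDao;

  public void setUserDao(UserDao userDao) {
    this.userDao = userDao;
  }

  @Override
  public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
      Object handler) throws Exception {
    String userId = request.getParameter("userId");
    User user = (userId == null) ? null : userDao.getUserById(userId);
    if (user == null
        || ! user.isAuthorizedForSubnet(request.getRemoteAddr())
        || ! user.isAuthorizedForService(request.getServletPath())) {
      response.sendError(HttpServletResponse.SC_FORBIDDEN, "Not authorized");
      return false;  // stop here, the Controller is never invoked
    }
    return true;     // let the request through to the Controller
  }
}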

The sixth check is done in my ResponseGenerator class. The Controller delegates to a suitably configured ResponseGenerator based on the webservice method requested. The ResponseGenerator itself is simply a delegator that takes a chain of ResponseElementGenerator objects and defines a strategy for processing these objects. The chain is injected into the ResponseGenerator using Spring configuration. Because of this, the authorization code only occurs once in the application.

As you can see, the code is fairly well structured (most of the credit goes to my predecessor who structured the first 5 authentication checks in a Servlet Filter implementation in a previous non-Spring version of the application). However, I had heard a lot about Acegi, most recently from an ex-colleague who was actually integrating it into one of his client's applications, so I decided that I should at least check it out.

Changes needed for Acegi

With Acegi, I would have to configure a FilterToBeanProxy as a filter in my web.xml file, which would proxy a FilterChainProxy defined in my Spring application context. The FilterChainProxy would be configured with a custom Filter implementation that would extract the userId from the request and authenticate it through a configured AuthenticationManager, which in turn would retrieve the user by name using a custom UserDetailsService and populate a custom UserDetails bean. As you can see, quite a lot of custom work here. The filter declaration in web.xml is shown below:

<!-- web.xml -->
<web-app ...>

  <filter>
    <filter-name>Acegi Filter Chain Proxy</filter-name>
    <filter-class>org.acegisecurity.util.FilterToBeanProxy</filter-class>
    <init-param>
      <param-name>targetClass</param-name>
      <param-value>org.acegisecurity.util.FilterChainProxy</param-value>
    </init-param>
  </filter>
  <filter-mapping>
    <filter-name>Acegi Filter Chain Proxy</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

  ...
</web-app>

The supporting beans for the Acegi FilterChainProxy are defined in the Spring application context below:

<beans ...>

  <bean id="filterChainProxy" class="org.acegisecurity.util.FilterChainProxy">
    <property name="filterInvocationDefinitionSource">
      <value>
        CONVERT_URL_TO_LOWERCASE_BEFORE_COMPARISON
        PATTERN_TYPE_APACHE_ANT
        /**=requestProcessingFilter
      </value>
    </property>
  </bean>

  <bean id="requestProcessingFilter" class="com.mycompany.myapp.MyRequestProcessingFilter">
    <property name="authenticationManager" ref="authenticationManager"/>
  </bean> 

  <bean id="authenticationManager" class="org.acegisecurity.providers.ProviderManager">
    <property name="providers">
      <list>
        <ref local="authenticationProvider"/>
      </list>
    </property>
  </bean>
  
  <bean id="authenticationProvider" class="com.mycompany.myapp.MyAuthenticationProvider">
    <property name="userDetailsService" ref="userDetailsService"/>
    <property name="userCache" ref="userCache"/>
  </bean>

  <bean id="userDetails" class="com.mycompany.myapp.MyUserDetails"/>
  
  <bean id="userDetailsService" class="com.mycompany.myapp.MyUserDetailsService">
    <!-- reference to a DAO to retrieve users from db -->
    <property name="userDao" ref="userDao"/>
  </bean>

</beans>

The code for the various classes referenced above is shown below. The filterChainProxy proxies the bean named requestProcessingFilter, which lives in the Spring application context and can therefore be injected with the AuthenticationManager. The requestProcessingFilter pulls the userId out of the request parameters and creates a seed Authentication token with the principal set to the userId. It then delegates to the AuthenticationManager (a ProviderManager wrapping our AuthenticationProvider), which returns an Authentication token populated with the UserDetails object corresponding to the userId. This is followed by the authentication and authorization checks 1-5 described earlier.

// MyRequestProcessingFilter.java
package com.mycompany.myapp;

public class MyRequestProcessingFilter implements Filter {

  private AuthenticationManager authenticationManager;
  
  @Required
  public void setAuthenticationManager(AuthenticationManager authenticationManager) {
    this.authenticationManager = authenticationManager;
  }
  
  public void init(FilterConfig filterConfig) throws ServletException { /* NOOP */ }

  public void destroy() { /* NOOP */ }

  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
    String userId = ServletRequestUtils.getRequiredStringParameter(request, "userId");
    Authentication authentication = new UsernamePasswordAuthenticationToken(userId, userId);
    authentication = authenticationManager.authenticate(authentication);
    if (authentication.getPrincipal() instanceof UserDetails) {
      // this is the only place in the code where we have access to the
      // request and remote IP address, so do the additional checks here
      MyUserDetails myuserdetails = (MyUserDetails) authentication.getPrincipal();
      if (myuserdetails.isAuthorizedForSubnet(request.getRemoteAddr()) &&
          myuserdetails.isAuthorizedForService(getServiceFromRequest(request))) {
        SecurityContextHolder.getContext().setAuthentication(authentication);
      } else {
        // failed the subnet or service check, clear the security context
        SecurityContextHolder.getContext().setAuthentication(null);
      }
    } else {
      SecurityContextHolder.getContext().setAuthentication(null);
    }
    chain.doFilter(request, response);
  }
}

The AuthenticationProvider in turn delegates to the UserDetailsService to load the user from the database. If a user is not found, then the AuthenticationProvider returns the original Authentication object passed in.

// MyAuthenticationProvider.java
package com.mycompany.myapp;

public class MyAuthenticationProvider implements AuthenticationProvider {

  private UserDetailsService userDetailsService;
  
  @Required
  public void setUserDetailsService(UserDetailsService userDetailsService) {
    this.userDetailsService = userDetailsService;
  }

  public Authentication authenticate(Authentication authentication) throws AuthenticationException {
    Object principal = authentication.getPrincipal();
    if (principal instanceof String) {
      UserDetails userdetails = userDetailsService.loadUserByUsername((String) principal);
      if (userdetails != null) {
        // the three-argument constructor marks the token as authenticated;
        // calling setAuthenticated(true) on this token type is not allowed
        return new UsernamePasswordAuthenticationToken(
            userdetails, userdetails.getPassword(), userdetails.getAuthorities());
      }
    }
    return authentication;
  }

  public boolean supports(Class authentication) {
    return (UsernamePasswordAuthenticationToken.class.isAssignableFrom(authentication));
  }
}

The UserDetailsService delegates to an application-specific UserDao that returns the UserDetails object. In reality, the DAO returns a User object which is run through some adapter code to convert it to a UserDetails object, but I have omitted that for brevity.

// MyUserDetailsService.java
package com.mycompany.myapp;

public class MyUserDetailsService implements UserDetailsService {

  private UserDao userDao;
  
  @Required
  public void setUserDao(UserDao userDao) {
    this.userDao = userDao;
  }
  
  public UserDetails loadUserByUsername(String userId) throws UsernameNotFoundException, DataAccessException {
    return userDao.getUserById(userId);
  }
}

The MyUserDetails.java class internally stores the GrantedAuthority objects as a Map of Set of Strings, so it can be used more efficiently to look up a user's authority for a specified service or content. It also returns an array of GrantedAuthority objects as per the interface's contract. I have not included the code for this as it is long and not relevant to the discussion.

// MyUserDetails.java
public class MyUserDetails implements UserDetails {

  private String username;
  private Map<String,Set<String>> authorities = new HashMap<String,Set<String>>();
  ...
}

For the sixth check (content level authorization), the Acegi recommended approach is to use a MethodInterceptor to wrap each ResponseElementGenerator's generate() method and use a Spring AOP Proxy to inject into the ResponseGenerator. This approach, however, would add a lot of bulk to the applicationContext.xml file, since each ResponseElementGenerator implementation (and there are quite a few) would need to be declaratively wrapped into an AOP Proxy. It is far easier and more maintainable, though perhaps not as neat, to just have this logic inside the ResponseGenerator itself, which is the current approach.

// ResponseGenerator.java
public class ResponseGenerator {

  private List<IResponseElementGenerator> generators;
  
  @Required
  public void setResponseElementGenerators(List<IResponseElementGenerator> generators) {
    this.generators = generators;
  }

  public List<? extends IResponse> generate(ParameterBean parameters, OpenSearchInfo openSearchInfo) throws Exception {
    // at this point, our authentication should already contain our
    // UserDetails class
    Authentication authentication = SecurityContextHolder.getContext().getAuthentication();
    MyUserDetails userdetails = (MyUserDetails) authentication.getPrincipal();
    for (IResponseElementGenerator generator : generators) {
      if (! userdetails.isAuthorizedForContent(generator.getContentType())) {
        // user not authorized for content, skip this generator
        continue;
      }
      ...
    }
  }
}

Why I ended up not using Acegi

After all this, I decided not to use Acegi after all. Although Acegi comes with a lot of built-in components, none of them seemed reusable for my application, which is why I ended up writing the custom implementations described above. At this point, I had written about as many lines of code integrating Acegi as there were in the original security handling code in the application. Using Acegi in my application at this point would have been just for the sake of using it.

Adding Acegi features to my application

There were a few things that I saw in Acegi that I decided to implement in my application: the use of the ThreadLocal-backed SecurityContextHolder to make the authenticated User object available throughout the application, and the built-in caching of the User object.

My original application was using a reference to the UserDao inside the HandlerInterceptor as well as in the ResponseGenerator. I also needed it in some of the ResponseElementGenerator implementations, so there were lots of references to the userDao all over my applicationContext.xml file. Modeled on how the SecurityContextHolder works, I now pull in the User object in the HandlerInterceptor and put it into a ThreadLocal-backed holder called UserHolder (a sketch is shown below). The ResponseGenerator, as well as the ResponseElementGenerators, then call UserHolder.get() within the same request thread to retrieve the authenticated user from anywhere in the application.
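
A minimal sketch of such a holder (my own illustration of the pattern; the real class may differ) looks like this:

// UserHolder.java: ThreadLocal holder for the authenticated user,
// modeled on Acegi's SecurityContextHolder
public class UserHolder {

  private static final ThreadLocal<User> holder = new ThreadLocal<User>();

  private UserHolder() { /* static access only */ }

  public static void set(User user) {
    holder.set(user);
  }

  public static User get() {
    return holder.get();
  }

  public static void clear() {
    // call at the end of the request so users do not leak across
    // pooled threads in the servlet container
    holder.remove();
  }
}

In this sketch, the HandlerInterceptor would call UserHolder.set(user) in preHandle() and UserHolder.clear() in afterCompletion(), and anything downstream on the same request thread just calls UserHolder.get().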

The AbstractAuthenticationProvider class provides built-in caching using EhCache. My current application had a home-grown in-memory cache of users which would get refreshed at a specific time interval, inherited from the previous version of the application. I was already using EhCache for caching webservice responses, so I decided to make my UserDao cached using EhCache with a fixed time-to-live setting, along the lines of the sketch below.
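
One way to wire this up is a caching decorator around the existing DAO. The sketch below uses the net.sf.ehcache API directly and assumes a cache named userCache is declared in ehcache.xml with the desired time-to-live; the class and cache names are mine, not from the actual application:

// CachingUserDao.java (sketch): decorates the real UserDao and caches users by id
public class CachingUserDao implements UserDao {

  private UserDao delegate;
  private Cache userCache;   // net.sf.ehcache.Cache configured in ehcache.xml

  public CachingUserDao(UserDao delegate, Cache userCache) {
    this.delegate = delegate;
    this.userCache = userCache;
  }

  public User getUserById(String userId) {
    Element element = userCache.get(userId);
    if (element != null) {
      return (User) element.getObjectValue();   // cache hit
    }
    User user = delegate.getUserById(userId);   // cache miss, go to the database
    if (user != null) {
      userCache.put(new Element(userId, user));
    }
    return user;
  }
}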

When would I use Acegi

I think Acegi would have been a good choice if my application did not already have a security mechanism built in, or if the security mechanism it used was based on a username/password scheme, for which Acegi has built-in components. It would also have been a good choice if I were going to connect to one of the popular authentication engines that Acegi already provides connectors for. Because of the way it uses chains of AuthenticationProviders, it may also be a good choice if I were looking to change or add authentication mechanisms in the future.

Saturday, November 10, 2007

Rediscovering awk

My first job after college was with HCL Limited (now HCL Infosystems), the #1 Unix-based minicomputer manufacturer (at least at that time) in India. Unix was not as ubiquitous as it is today, at least in India, but it was making aggressive inroads into shops dominated by proprietary mainframe and midrange OSes. At that point I knew nothing about Unix - our college computers had DEC VAX/VMS and MS-DOS installed, and I had done some work with ICL's mainframes and a proprietary minicomputer OS called BEST as a summer intern.

The first thing HCL did was to ship the new recruits to a two-week boot camp at their training school at Dehra Doon, a little mountain city in North India. There, after about a two-day introduction to Unix and common commands, we were broken up into groups of two, and each group was handed an approximately 80-100 page paper manual and given 24 hours to prepare a presentation for our classmates on a common Unix command. Ours was awk. It wasn't the hardest, considering that two other groups got lex and yacc, but it wasn't the easiest thing to do either, considering our experience with Unix so far, and the fact that we had a green-screen terminal and csh for command history. Needless to say, the presentation did not go too well.

Over the years, I have used various Unix commands like sed, grep, cut, paste, tr, etc., hooked them up into shell scripts, and worked with scripting languages like Perl and Python, but I always steered clear of awk. Probably because I could get my work done without using awk, or maybe it was some sort of subconscious fear thing. Anyway, I never used awk since that presentation, that is, until a few weeks ago.

I had a Java program which was crawling a set of pages and pulling down the title, summary and URL for each page as pipe-delimited records in a flat file. The Java program would dump a file that looked like this (the top two lines are not part of the output; they are there only to describe the file format for this blog):

# output.txt
# TITLE|URL|SUMMARY
r1c1|r1c2|r1c3
r2c1|r2c2|r2c3
r3c1|r3c2|r3c3
r4c1|r4c2|r4c3
r5c1|r5c2|r5c3
...

I would then use sed scripts to replace the pipe characters appropriately, so my output file would look something like this:

# output.txt
# TITLE|URL|SUMMARY
insert into sometable(title,url,summary)values('r1c1','r1c2','r1c3');
...

And then use this file as an input SQL script to load all the information into a database table. Some time later, I found a bug in the summary generation algorithm, so I needed to update the summaries in the database. The SQL to be generated would look like this:

# output.txt
# TITLE|URL|SUMMARY
update sometable set summary='r1c3' where title='r1c1' and url='r1c2';
...

I could simply have changed the Java code to rewrite the columns appropriately, but this seemed to be an almost textbook application of awk, so I bit. My awk script to reorder the columns looked like this:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{printf("%s|%s|%s\n",$3,$1,$2)}' \
  output.txt > output1.txt

I then applied another sed script to replace the pipes with the appropriate text to create the update SQL calls, along the lines of the sketch shown below.
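
Given the summary|title|url ordering produced by the awk command above, something like this would do it (a sketch, not necessarily the exact script I used):

sujit@sirocco:/tmp$ sed -e "s/^/update sometable set summary='/" \
    -e "s/|/' where title='/" \
    -e "s/|/' and url='/" \
    -e "s/$/';/" \
    output1.txt > update.sql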

A few days later, I had another situation where I was given a file which had originally been generated from one of my programs, uploaded to a database table, and later annotated by a human being. Specifically, certain items had been marked as 'delete', and I knew that I could use the URL as a unique key for the delete SQL. So the annotated file looked like this:

# output1.txt
# TITLE|URL|SUMMARY|ANNOTATION
r1c1|r1c2|r1c3|keep
r2c1|r2c2|r2c3|delete
r3c1|r3c2|r3c3|keep
r4c1|r4c2|r4c3|keep
r5c1|r5c2|r5c3|delete
...

Flushed with my recent success with awk, I decided to give it a go again. This time, all I needed was the rows annotated with "delete", and the URL to use as the key. So my new awk script looked like:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{if ($4=="delete") printf("%s\n",$2)}' \
  output1.txt > output2.txt

It turns out that awk (or gawk, its GNU cousin that I am using) is actually pretty much a full-fledged programming language, and has quite a comprehensive set of built-in string processing functions. One of my colleagues remarked that he had seen entire web sites written in awk. I don't know if I would want to do that, but from what I have seen, it is a very powerful tool for writing compact string transformations on the command line. I am definitely going to use it more in the future, replacing use-cases where I used a combination of cut, paste and sed to do command-line text file processing.
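
As an illustration (my own sketch, not part of the original workflow), the reorder-plus-sed pipeline above could probably be collapsed into a single gawk call; emitting the embedded single quotes with printf's %c conversion and the ASCII code 39 sidesteps the shell quoting problem:

sujit@sirocco:/tmp$ gawk -F"|" \
  --source '{printf("update sometable set summary=%c%s%c where title=%c%s%c and url=%c%s%c;\n", 39,$3,39, 39,$1,39, 39,$2,39)}' \
  output.txt > update.sql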

Even after so many years, the awk man pages still appear a bit dense to me. However, there is a lot of free information available on awk on the Internet. I used Greg Goebel's awk tutorial, An Awk Primer, which I found to be very useful. There is also the Awk reference page on the Unix-Manuals site, which can come in useful if you already know how to write scripts in awk and need a function reference.

Saturday, November 03, 2007

lighttpd+PHP setup on Fedora Core 4

The Apache httpd webserver on my desktop at work is configured to work with Tomcat, on which I develop web applications. For various reasons, there is a high degree of coupling between Apache and Tomcat, with custom httpd.conf and rewrite.conf files. Recently I needed to build a dynamic page which would not be powered by a Java web application. To do this under Apache, I would have to swap out the custom configs, swap in the original configs that came with Apache, and then restart Apache. To get back to doing web development (which I would very likely need to do), I would have to swap the config files in the other direction and restart again. So using the existing Apache server was not going to be very convenient, given my situation.

I had read about the lighttpd webserver when playing with Ruby on Rails in the past. Lighttpd is a very fast and lightweight web server, much like Jetty in the Java application server world. So I decided to install it as an extra webserver on my Fedora Core 4 desktop along with Apache so I could develop my dynamic page on it.

To install lighttpd, I found some useful information in this Digital Media Minute Howto page for Ruby on Rails. Essentially, to install lighttpd, I ran yum, like so:

[root@pluto ~]# yum install lighttpd
[root@pluto ~]# yum install lighttpd-fastcgi

For the actual script, I thought about using Python or Ruby as a CGI script, then ultimately settled on PHP, which was pre-installed on the box as part of the default Fedora Core 4 install. The nice thing about PHP is that it does not need to be set up as an actual CGI program; you can simply embed PHP calls into an HTML page served by a PHP-enabled webserver. This was ideal for me, since my intention was to build the dynamic page and hand it off to someone else to run, so all that person would have to do is drop the page into a PHP-enabled server and run it.

To configure lighttpd for PHP, I found this Nixcraft article on Lighttpd/PHP/fastcgi configuration which I pretty much followed to the letter. The only difference between the configuration described and mine was that I reassigned the server.port variable to 81 from the default 80, since Apache was already listening on port 80.

## bind to port (default: 80)
server.port                = 81
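
For completeness, the PHP-over-FastCGI part of that configuration looks roughly like this; the php-cgi binary path and socket name below are assumptions for a stock Fedora install, so check the Nixcraft article for the exact settings:

## hand *.php files to the PHP CGI binary via mod_fastcgi
server.modules             += ( "mod_fastcgi" )
fastcgi.server              = ( ".php" =>
                                (( "bin-path" => "/usr/bin/php-cgi",
                                   "socket"   => "/tmp/php-fastcgi.socket" )) )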

To start lighttpd, I use the standard command to manually start it from /etc/init.d. Obviously, if I wanted it started always by default (I didn't), I could use chkconfig (or just symlinks) to do this.

[root@pluto ~]# /etc/init.d/lighttpd restart
Stopping lighttpd:                                         [FAILED]
Starting lighttpd:                                         [  OK  ]

To verify that everything came up correctly, I pointed my browser to http://localhost:81 and saw the "Powered by Lighttpd" start page. Now I actually needed to build my PHP script page. Since I was getting back to PHP after a long time, I decided to build a "Hello World" page to get things moving. From the lighttpd.conf file, I knew that the docroot for lighttpd was /srv/www/lighttpd/ (see server.document-root), so I built a test.php file in it, which is shown below:

<html>
  <head><title>PHP Test</title></head>
  <body>
    <?php $name="Sujit"; ?>
    Hello <?php echo $name; ?>
  </body>
</html>

And bringing up this page at http://localhost:81/test.php predictably showed "Hello Sujit" on the page.

I hope this post has been useful. About two jobs back, I would routinely write dynamic pages (first in Perl CGI and then in PHP) and test them on my local Apache webserver. Since then, however, I have become a Java web developer, and I usually find that the default Apache webserver has been repurposed as a dedicated front end to the Java application server. I found it mildly annoying that whenever I needed a vanilla webserver to share content and information (there have been a few occasions), I would have to think of other ways to do it. With lighttpd, I no longer have to. If you are or have been in a similar situation, I hope you will find the installation and configuration instructions for lighttpd and PHP easy to understand and use.