Saturday, February 10, 2007

Embedded Databases (for Java) smackdown

I have been looking for opportunities to use embedded databases in my Java programs, but so far I have not been able to justify doing so. If the data volume were large, I would end up using a fast relational database such as MySQL with application caching; if it were small, I would use in-memory data structures.

In my previous post, I compared three auto-completion implementations, two of which used in-memory data structures. The third used an in-memory HSQLDB implementation, and its performance was an order of magnitude worse than the other two. However, in many cases it may be acceptable to trade some performance for cleaner, more maintainable code, and to compensate for the difference with faster hardware.

For this post, I compare three popular embedded database solutions on a set of simple exact-match lookups. To provide a sense of scale, I also built an implementation using a simple HashMap on one end, and an implementation against a local MySQL database on the other. I will begin with my test harness code, then give a short description and the code for each implementation, and finish with my results.

Test Harness

The test harness is a simple JUnit class, with StopWatch calls to measure times and a MemoryMXBean to measure the change in JVM heap size before and after the data is loaded. All the implementations implement the IEmbeddedDb interface, and each test exercises one implementation in turn, passing a reference to it into the doPerformanceTest() method. The data consists of about 2,800 name-value pairs, read in from a flat file.

public class EmbeddedDbTest extends TestCase {

  File inputFile = new File("...");
  String[] terms = {...};

  public void testInMemoryDb() throws Exception {
    IEmbeddedDb embeddedDb = new InMemoryEmbeddedDb();
    doPerformanceTest(embeddedDb, null);
  }

  public void testHsqlDb() throws Exception {
    IEmbeddedDb embeddedDb = new HsqlEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbPropsFile", "/tmp/hpairdb.properties");
    props.put("dbLogFile", "/tmp/hpair.log");
    props.put("dbScriptFile", "/tmp/hpairdb.script");
    props.put("driverClassName", "org.hsqldb.jdbcDriver");
    props.put("jdbcUrl", "jdbc:hsqldb:file:/tmp/hpairdb");
    props.put("username", "sa");
    props.put("password", "");
    doPerformanceTest(embeddedDb, props);
  }

  public void testHsqlCachedDb() throws Exception {
    IEmbeddedDb embeddedDb = new HsqlEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbPropsFile", "/tmp/hpaircdb.properties");
    props.put("dbLogFile", "/tmp/hpairc.log");
    props.put("dbScriptFile", "/tmp/hpaircdb.script");
    props.put("driverClassName", "org.hsqldb.jdbcDriver");
    props.put("jdbcUrl", "jdbc:hsqldb:file:/tmp/hpaircdb");
    props.put("username", "sa");
    props.put("password", "");
    props.put("useCachedTable", "true");
    doPerformanceTest(embeddedDb, props);
  }

  public void testDerbyDb() throws Exception {
    IEmbeddedDb embeddedDb = new DerbyEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dbDir", "/tmp/dpairdb");
    props.put("driverClassName", "org.apache.derby.jdbc.EmbeddedDriver");
    props.put("jdbcUrl", "jdbc:derby:/tmp/dpairdb;create=true");
    doPerformanceTest(embeddedDb, props);
  }

  public void testBerkeleyDb() throws Exception {
    IEmbeddedDb embeddedDb = new BdbEmbeddedDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("dirName", "/tmp");
    props.put("dbName", "bpairdb");
    doPerformanceTest(embeddedDb, props);
  }

  public void testMySqlDb() throws Exception {
    IEmbeddedDb embeddedDb = new MySqlDb();
    Map<String,String> props = new HashMap<String,String>();
    props.put("driverClassName", "com.mysql.jdbc.Driver");
    props.put("jdbcUrl", "jdbc:mysql://localhost:3306/test");
    props.put("username", "root");
    props.put("password", "mysql");
    doPerformanceTest(embeddedDb, props);
  }

  private void doPerformanceTest(IEmbeddedDb embeddedDb, Map<String,String> props)
      throws Exception {
    MemoryMXBean memoryMxBean = ManagementFactory.getMemoryMXBean();
    String testName = ClassUtils.getShortClassName(embeddedDb.getClass());
    if (props != null && "true".equals(props.get("useCachedTable"))) {
      testName += " [cached]";
    }
    embeddedDb.init(props);
    StopWatch watch = new StopWatch(testName);
    long startHeapUsage = memoryMxBean.getHeapMemoryUsage().getUsed();
    watch.start("load " + testName);
    embeddedDb.load(inputFile);
    watch.stop();
    long endHeapUsage = memoryMxBean.getHeapMemoryUsage().getUsed();
    long heapUsedByDb = endHeapUsage - startHeapUsage;
    watch.start("get " + testName);
    for (int i = 0; i < 1000; i++) {
      for (String term : terms) {
        String value = embeddedDb.get(term);
        assertNotNull(value);
      }
    }
    watch.stop();
    embeddedDb.destroy();
    System.out.println(watch.prettyPrint());
    System.out.println("Heap usage (bytes):" + heapUsedByDb);
  }
}
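One caveat about the heap measurement in the harness above: the delta between the two readings includes whatever garbage has accumulated in the meantime. A slightly more stable variant would request a collection before each reading. The sketch below illustrates the idea; HeapMeter and usedHeapAfterGc are hypothetical names, not part of the harness, and since MemoryMXBean.gc() is only advisory, the numbers remain approximate.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeter {

  private static final MemoryMXBean MEMORY_MX_BEAN = ManagementFactory.getMemoryMXBean();

  // Returns the current used heap, after asking the JVM to collect garbage
  // first. gc() is advisory only, so this is still an approximation.
  public static long usedHeapAfterGc() {
    MEMORY_MX_BEAN.gc();
    return MEMORY_MX_BEAN.getHeapMemoryUsage().getUsed();
  }

  public static void main(String[] args) {
    long before = usedHeapAfterGc();
    byte[] ballast = new byte[4 * 1024 * 1024]; // stand-in for loading data
    long after = usedHeapAfterGc();
    System.out.println("Approx. heap used by ballast (bytes): " + (after - before));
    // keep ballast reachable so it is counted in the second reading
    System.out.println("Ballast length: " + ballast.length);
  }
}
```

Swapping this into doPerformanceTest() would only change the two getHeapMemoryUsage() call sites; the StopWatch timings are unaffected.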

Baseline 1: In-memory HashMap

We first created an IEmbeddedDb.java interface, which all the implementations share. Here it is:

// IEmbeddedDb.java
public interface IEmbeddedDb {
  public void init(Map<String,String> props) throws Exception;
  public void load(File file) throws Exception;
  public String get(String key);
  public void destroy() throws Exception;
}

Our HashMap implementation is trivial, and exists only to provide a lower bound for our results. Ideally, nothing should perform better than this, unless we are doing something terribly wrong. The code for InMemoryEmbeddedDb is here:

// InMemoryEmbeddedDb.java
public class InMemoryEmbeddedDb implements IEmbeddedDb {

  private HashMap<String,String> hashMap;

  public void init(Map<String, String> props) throws Exception {
    hashMap = new HashMap<String,String>();
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) { // skip comment lines, as the other loaders do
        continue;
      }
      String[] nvp = line.split("=");
      hashMap.put(nvp[0], nvp[1]);
    }
    reader.close();
  }

  public String get(String key) {
    return hashMap.get(key);
  }

  public void destroy() throws Exception { /* nothing to clean up */ }
}

HSQLDB : Memory table mode

Tables created with a plain CREATE TABLE in HSQLDB are MEMORY tables, which means that the entire data set is held in memory. Obviously this does not buy us much, other than an SQL interface for reading and writing in-memory data, and arguably that is not worth buying in our simple case. We abstract the common boilerplate JDBC code into an abstract superclass; each SQL implementation then only needs to pass its Connection object to the superclass.

// AbstractSqlDb.java
public abstract class AbstractSqlDb {

  protected String createTableSql = "create table pairs(" +
      "keycol varchar(32) not null, " +
      "valcol varchar(32) not null, " +
      "primary key (keycol))";

  private Connection conn = null;
  private PreparedStatement psIns = null;
  private PreparedStatement psGet = null;

  protected void init(Connection conn) throws Exception {
    this.conn = conn;
    Statement ddlStmt = conn.createStatement();
    ddlStmt.execute(createTableSql);
    psIns = conn.prepareStatement("insert into pairs (keycol, valcol) values (?, ?)");
    psGet = conn.prepareStatement("select valcol from pairs where keycol = ?");
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    conn.setAutoCommit(false);
    try {
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#")) {
          continue;
        }
        String[] nvp = line.split("=");
        psIns.setString(1, nvp[0]);
        psIns.setString(2, nvp[1]);
        psIns.executeUpdate();
      }
      conn.commit();
    } catch (SQLException e) {
      conn.rollback();
    } finally {
      psIns.close();
      conn.setAutoCommit(true);
    }
    reader.close();
  }

  public String get(String key) {
    String value = null;
    try {
      psGet.setString(1, key);
      ResultSet rs = psGet.executeQuery();
      if (rs.next()) {
        value = rs.getString(1);
      }
      rs.close();
      return value;
    } catch (SQLException e) {
      return null;
    }
  }

  public void destroy() throws Exception {
    psGet.close();
    conn.close();
  }
}

// HsqlEmbeddedDb.java
public class HsqlEmbeddedDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String,String> props) throws Exception {
    new File(props.get("dbPropsFile")).delete();
    new File(props.get("dbLogFile")).delete();
    new File(props.get("dbScriptFile")).delete();
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"),
      props.get("username"), props.get("password"));
    if ("true".equals(props.get("useCachedTable"))) {
      super.createTableSql = "create cached table pairs(" +
        "keycol varchar(32) not null, " +
        "valcol varchar(32) not null, " +
        "primary key (keycol))";
    }
    super.init(conn);
  }
}

HSQLDB : Cached table mode

The code for this implementation is the same as the previous one, except that we override the protected createTableSql variable to create a CACHED table explicitly. The documentation states that this may result in a loss of performance, but surprisingly, in my case it actually performs better.
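The toggle is easier to see when isolated from the JDBC plumbing. The helper below is a hypothetical refactoring of the two duplicated DDL strings in HsqlEmbeddedDb.init(); PairsDdlBuilder and buildCreateTableSql are names I made up for illustration, and the only difference between the two modes is the CACHED keyword.

```java
public class PairsDdlBuilder {

  // Builds the pairs-table DDL used by the HSQLDB implementation,
  // optionally as a CACHED (disk-backed) table instead of the
  // default MEMORY table.
  public static String buildCreateTableSql(boolean useCachedTable) {
    return "create " + (useCachedTable ? "cached " : "") + "table pairs("
        + "keycol varchar(32) not null, "
        + "valcol varchar(32) not null, "
        + "primary key (keycol))";
  }

  public static void main(String[] args) {
    System.out.println(buildCreateTableSql(false));
    System.out.println(buildCreateTableSql(true));
  }
}
```

With a helper like this, HsqlEmbeddedDb.init() could set createTableSql in one line instead of repeating the column list.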

Apache Derby : Embedded mode

According to the Derby FAQ, Derby does not support an in-memory mode, so we simply provide a standard on-disk implementation of our IEmbeddedDb interface.

// DerbyEmbeddedDb.java
public class DerbyEmbeddedDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String,String> props) throws Exception {
    try {
      FileUtils.cleanDirectory(new File(props.get("dbDir")));
      new File(props.get("dbDir")).delete();
    } catch (IllegalArgumentException e) {
      // no directory exists, nothing to do here
    }
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"));
    super.init(conn);
  }
}

Berkeley DB

Berkeley DB has a proprietary interface for storing and retrieving data. William Grosso provides a nice tutorial in his "Berkeley DB, Java Edition 1: the basics" article on OnJava.

// BdbEmbeddedDb.java
public class BdbEmbeddedDb implements IEmbeddedDb {

  private Environment environment;
  private Database bdb;

  public void init(Map<String, String> props) throws Exception {
    EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    File file = new File(props.get("dirName"));
    environment = new Environment(file, environmentConfig);
    DatabaseConfig databaseConfig = new DatabaseConfig();
    databaseConfig.setAllowCreate(true);
    bdb = environment.openDatabase(null, props.get("dbName"), databaseConfig);
  }

  public void load(File file) throws Exception {
    BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String[] pair = line.split("=");
      DatabaseEntry key = new DatabaseEntry(pair[0].getBytes());
      DatabaseEntry value = new DatabaseEntry(pair[1].getBytes());
      bdb.put(null, key, value);
    }
    reader.close();
  }

  public String get(String key) {
    DatabaseEntry keyEntry = new DatabaseEntry(key.getBytes());
    DatabaseEntry valueEntry = new DatabaseEntry();
    try {
      OperationStatus status = bdb.get(null, keyEntry, valueEntry, LockMode.DEFAULT);
      if (status == OperationStatus.SUCCESS) {
        return new String(valueEntry.getData());
      } else {
        return null;
      }
    } catch (DatabaseException e) {
      return null;
    }
  }

  public void destroy() throws Exception {
    bdb.close();
    environment.close();
  }
}

Baseline 2: MySQL

My final implementation is based on MySQL, to provide a baseline at the other end. MySQL is widely regarded as one of the fastest relational databases around, so any embedded solution that performs worse than MySQL is probably not worth considering.

// MySqlDb.java
public class MySqlDb extends AbstractSqlDb implements IEmbeddedDb {

  public void init(Map<String, String> props) throws Exception {
    Class.forName(props.get("driverClassName"));
    Connection conn = DriverManager.getConnection(props.get("jdbcUrl"),
      props.get("username"), props.get("password"));
    Statement dropTableStmt = conn.createStatement();
    // IF EXISTS avoids an error on the very first run, when the table
    // has not been created yet
    dropTableStmt.execute("drop table if exists pairs");
    super.init(conn);
  }
}

Results

The results of running EmbeddedDbTest against each implementation are shown below, listed roughly in ascending order of query time, which is the total time to run the lookup loop (1,000 passes over the term list) against each database.

Implementation     Version    Load time (ms)  Query time (ms)  Approx. memory (bytes)
In-memory HashMap  N/A                    89                4                  567944
HSQLDB (memory)    1.8.0.7               167              156                  460984
HSQLDB (cached)    1.8.0.7               155               81                  485096
Berkeley DB JE     2.1.30                762              205                  255680
Apache Derby       10.2.2.0             1034             2169                  255624
MySQL              4.1.12                704             2755                  152096

Conclusion

I had originally thought that Berkeley DB would come in first, followed by Apache Derby, followed by HSQLDB. In some respects, this assumption was based on this blog entry by David Van Couvering. I mean, if Oracle benchmarks its product against Derby, Derby must be a worthy competitor, right? Derby turned out to be a dog, though, only narrowly beating out a standard MySQL database.

Based on my results, however, the best performer turned out to be HSQLDB. The first result is easy to understand: HSQLDB with a MEMORY table keeps all its data in memory, and, as expected, it compares very unfavorably with a HashMap because of the overhead of the SQL engine. In CACHED table mode, however, the behavior is similar to Berkeley DB: only part of the data is in memory, and more is pulled in from disk as needed. I had expected Berkeley DB to perform better because it does not have the overhead of an SQL engine, but my results proved otherwise. I do notice that Berkeley DB's memory usage is lower than HSQLDB's, so the results may be different for larger data sets.

Obviously, there are some important caveats. I am not using the latest versions of MySQL and Berkeley DB JE. There is no sinister reason for this other than pure laziness: I already happened to have MySQL 4.1.12 installed on my laptop, and Berkeley DB JE 2.1.30 was the latest version in the Maven2 repository. It is very possible that the performance would be much better with the latest versions of these databases, but this was a quick and dirty evaluation I did for my own knowledge; feel free to run the provided code against later versions and send me updates. The second caveat is that I don't know much about tuning Apache Derby, so it's possible that the out-of-the-box configuration is not tuned for performance. For what it's worth, though, I know nothing about tuning HSQLDB either, and its numbers are also based on the out-of-the-box configuration.

4 comments (moderated to prevent spam):

eddstar said...

Hello Sujit,
I am trying to accomplish a similar task except I am attempting to use Spring to connect to the MemoryMXBean.

My spring config file is:
<bean id="memProxy" class="org.springframework.jmx.access.MBeanProxyFactoryBean">
  <property name="objectName" value="java.lang:type=Memory"/>
  <property name="proxyInterface" value="java.lang.management.MemoryMXBean"/>
</bean>

My java code is then:
logger.info("**********Point A**********");

ApplicationContext context = new ClassPathXmlApplicationContext("..\\jmxTest.xml");

logger.info("**********Point B**********");

MemoryMXBean memoryMXBean = (MemoryMXBean) context.getBean("memProxy");

logger.info("***********Point C*********");

memoryMXBean.gc();

logger.info("***********Point D*********");

long memUsage = memoryMXBean.getHeapMemoryUsage().getUsed();

logger.info("***********Point E*********");


My application is blowing up after the Point D print statement. Do you have any ideas why this is happening? Am I using the correct interface when defining the Spring proxy? Any comments would be greatly appreciated, thanks.

Anonymous said...

You might want to check this blog entry - it will help in finding out why some embedded databases are faster than others. If durability is an issue for your application, then logging database transactions as soon as they occur is a logical and necessary thing for a database system.

Sujit Pal said...

Hi eddstar, not sure what happens when you say "blows up": does it die with an OutOfMemoryError, does it hang, or does it throw a NullPointerException? Just offhand, though, I would try it without the gc call, unless you are really hurting for memory across tests.

Sujit Pal said...

Hi Fracois, yes the blog page is very interesting, I had a similar conversation with an Oracle DBA friend when I told him I was using MySQL in production :-). However, my use case for embedded databases is always going to be read-only denormalized data access. It would typically be built in batch mode from data from another master database, so neither speed (as long as its not a complete dog) nor durability (since I can rebuild if needed) is an issue there. So the comparison in my blog post is based on what numbers are useful to me.