Thursday Jan 27, 2011

How to use Berkeley DB's non-SQL, Key/Value API to implement "SELECT * FROM table WHERE X = key ORDER BY Y"

In this post we explore how Berkeley DB's key/value API can support complex SQL-like queries without having to use the SQL API itself. For this example we will consider the query, "SELECT * FROM table WHERE X = key ORDER BY Y". To perform this query we will be using a composite secondary index along side the primary database table to construct the join.
First of all, it's important to think of Berkeley DB as nearly identical to the storage engine underneath any RDBMS. In fact for many years Berkeley DB was the first transactional data storage engine integrated with MySQL, we predate InnoDB in that regard. As such, Berkeley DB has API calls and access methods that can support any RDBMS query. This is further demonstrated by the new Berkeley DB SQL API. This new API is simply the marriage of the SQL processing layer of SQLite and the B-Tree storage of Berkeley DB.
In this example we will set aside the fact that Berkeley DB has a SQL API and show you how to use it as a customizable storage engine that executes the equivalent of the example query. In this case your application has to provide the code that accesses the data store with an appropriate sequence of steps that will implement the behavior that you want.

If you have two indices in SQL, each on a single column (call them X and Y), and you do: SELECT * FROM table WHERE X = key ORDER BY Y; then there are three plausible query plans:

  1. scan the whole table, ignore both indices, filter by X = key then sort by Y;

  2. use the index on Y to scan all rows in the required order, filter by X = key;

  3. use the index on X, find the matching rows, then sort by Y.

There are cases where (1) would be fastest, because it has all of the columns from one scan (the other query plans will do random lookups on the primary for each row). This assumes that the data can fit into memory and the sort is fast.
Query plan (2) will be fastest if the selectivity is moderate to high, looking up rows in the main table is fast, and sorting the rows is very slow for some reason (e.g., some complex collation).
Query plan (3) will be fastest if the selectivity is small (only a small percentage of the rows in the table matches). This should be the best case for us, making it the best choice in a Berkeley DB key/value application.
The optimal plan would result from having a composite index on (X, Y), which can return just the desired rows in the desired order. Of course, it does cost additional time and space to maintain that index. But note that you could have this index instead of a simple index on X: it can be used in any query the simple index could be used in.
Records in Berkeley DB are (key, value) pairs and Berkeley DB supports only a few logical operations on these records. They are:

  • Insert a record in a table.

  • Delete a record from a table.

  • Find a record in a table by looking up its key (or by positioning a cursor).

  • Update a record that has already been found.

Notice that Berkeley DB never operates on the value part of a record. Values are simply payload, to be stored with keys and reliably delivered back to the application on demand. Both keys and values can be arbitrary byte strings, either fixed-length or variable-length.
So, in case of a SELECT * FROM X WHERE id=Y ORDER BY Z query, our suggestion, from Berkeley DB's point of view, would be for you to use a composite index (as it would be in SQL), where a string created as X_Y should do the trick, as explained in the following scenario.

1 10 abc
2 10 aab
3 20 bbc
4 10 bba
5 20 bac
6 30 cba


10_aab 2
10_abc 1
10_bba 4
20_bac 5
20_bbc 3
30_cba 6

If the query looks like this:
SELECT * FROM primarydb WHERE X = 10 ORDER by Y
the application can run a cursor on the secondary and begin the loop with the DB_SET_RANGE flag on 10. When iterating with a cursor executing DB_NEXT, this will return:

2 10 aab
1 10 abc
4 10 bbc

The application must check for the end of the range inside the loop, in this case it should stop when it hits 20_bac.
As in SQL, retrieving by a secondary key is remarkably similar to retrieving by a primary key and the Berkeley DB call will look similar to its primary equivalent.


The approach should perform very well and is likely to be the fastest solution. Of course, you can try other methods and tune the database for performance at a later time, after you have the functionality in place. The first and most critical performance factor is almost always cache size. First increase cache size and measure your performance to find an optimal amount for your system and use case. Second, you might test with a bigger database page size that has a more direct relationship to the average size of your keys and values. Lastly, consider the various configuration options available within the transactional subsystem (log buffer size, trickle thread, etc.).

If you are not very familiar with how to implement the above in BDB, please read the Guide to Oracle Berkeley DB for SQL Developers.

You will also need to be familiar with secondary indexes, cursor operations such as DBcursor->get() and DB_SET_RANGE.


Information about Berkeley DB products directly from the people who build them.


« August 2016