Friday Dec 05, 2008

User-defined ordering of strings in Apache Derby

Apache Derby (and Java DB) has supported language-based ordering of strings (aka territory-based collation) since version 10.3.1.4. This means for instance that you can get an ORDER BY clause in an SQL query to sort the results alphabetically according to the rules of the language of your choice, instead of sorting the strings by the value of their Unicode codepoints, which is the default.

So when you're using the default codepoint-based ordering, your results would be ordered like this (note that 'a' comes after 'B'):

ij> connect 'jdbc:derby:TestDB;create=true';
ij> select \* from (values 'a','A','B') as s(x) order by x;
X
-
A
B
a

3 rows selected

Using language-based ordering for the en_US locale would give you a more natural ordering of the results (both 'a' and 'A' come before 'B'):

ij> connect 'jdbc:derby:EnglishDB;create=true;territory=en_US;collation=TERRITORY_BASED';
ij> select \* from (values 'A','B','a') as s(x) order by x;
X
-
a
A
B

3 rows selected

This feature is very useful as long as you're happy with the ordering provided by the locales supported by your Java Runtime Environment. But sometimes you may want to sort your strings based on some other rules than those in the languages supported by the JRE. The most common example is that users want case-insensitive string matching. Another example came up recently in this thread on derby-user, where a user wanted to modify the German collation rules to make Greek characters sort near their Latin equivalents ('α' near 'a', 'β' near 'b', and so on).

Derby currently (as of 10.4) doesn't allow you to define your own collation rules, it merely relies on the collators returned by java.text.Collator.getInstance(java.util.Locale). Fortunately, the JRE allows you to define new locales with their own collation rules. And it turns out that it's quite easy, just perform these three simple steps:

  • Create a class that extends java.text.spi.CollatorProvider and returns a collator that orders strings the way you want it to
  • Create a text file named META-INF/services/java.text.spi.CollatorProvider which contains one line with the name of your collator provider class
  • Put the compiled class file and the text file in a jar file which you drop into your JRE's lib/ext directory or in one of the directories specified by the java.ext.dirs property

Let's look at an example. Collators can have different strengths, which tells which differences to regard as significant when comparing strings. Case difference is a tertiary difference, so to get a case-insensitive collator we could set the strength to secondary and the case differences will be ignored. If we want to define a variant of the en_US locale, say en_US_caseinsensitive, whose collators follow the ordering rules of the English language and ignore case differences, we could write this small CollatorProvider:

public class MyCollatorProvider extends CollatorProvider {
    public Locale[] getAvailableLocales() {
        return new Locale[] {
            new Locale("en", "US", "caseinsensitive")
        };
    }
    public Collator getInstance(Locale locale) {
        Collator c = Collator.getInstance(Locale.US);
        // Ignore tertiary differences (case differences)
        c.setStrength(Collator.SECONDARY);
        return c;
    }
}

Now compile the class and package it in a jar, together with a text file in META-INF/services as described above. In the directory where the jar is stored, start ij (Derby's SQL client) with the following command:

$ java -Djava.ext.dirs=. -jar /path/to/derbyrun.jar ij
ij version 10.4
ij> 

Then create a database which uses the new locale that you just defined, and insert some test data and see that the string matching really is case-insensitive:

ij> connect 'jdbc:derby:NoCaseDB;territory=en_US_caseinsensitive;collation=TERRITORY_BASED;create=true';
ij> create table t (x varchar(12));
0 rows inserted/updated/deleted
ij> insert into t values 'hi!','Hi!','hello','Hello','HELLO','HeLlO';
6 rows inserted/updated/deleted
ij> select \* from t where x = 'hi!';
X           
------------
hi!         
Hi!         

2 rows selected
ij> select distinct \* from t order by x;
X           
------------
hello       
hi!         

2 rows selected
ij> select x, count(\*) from t group by x;
X           |2          
------------------------
hello       |4          
hi!         |2          

2 rows selected

If you want to define your own custom ordering, like having Greek characters in between Latin characters, you could define another locale with a CollatorProvider that returns a RuleBasedCollator with whichever rules you want. See its class javadoc for details about how you specify rules. In its simplest form, a set of rules might look like "a,A < b,B < c,C", which means more or less that a and A should be sorted before b and B, which should be sorted before c and C. So to get the Greek characters sorted near similar Latin characters, define a CollatorProvider that looks like this one:

public class MyCollatorProvider extends CollatorProvider {
    public Locale[] getAvailableLocales() {
        return new Locale[] {
            new Locale("en", "US", "greek")
        };
    }
    public Collator getInstance(Locale locale) {
        StringBuilder rules = new StringBuilder();
        // alpha should go between a and b
        rules.append("< a,A < \\u03b1 < b,B");
        // beta should go between b and c
        rules.append("& b,B < \\u03b2 < c,C");
        // add more rules here ....
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException pe) {
            throw new Error(pe);
        }
    }
}

Again, put the compiled class and META-INF/services/java.text.spi.CollatorProvider in a jar file, and start ij with -Djava.ext.dirs=. in the directory where the jar file is located. Create a database which uses the new locale and insert some data with both Greek and Latin characters:

ij> connect 'jdbc:derby:GreekDB;territory=en_US_greek;collation=TERRITORY_BASED;create=true';
ij> create table t (x varchar(12));
0 rows inserted/updated/deleted
ij> insert into t values 'a', 'b', 'c', 'α', 'β';
5 rows inserted/updated/deleted
ij> select \* from t order by x;
X           
------------
a           
α           
b           
β           
c           

5 rows selected

Look, the ordering was just like we wanted it to be, with the Greek characters in between the Latin ones, and not at the end where they would normally be located!

One final word of caution: If you ever update your custom CollatorProvider so that the ordering is changed after you have created a database, you will need to recreate the database. This is because the indexes in the database are ordered, and you may see strange results if the indexes are ordered with a different collator than the one your database is currently using.

Want to try this out yourself? Here's a Java class that implements the CollatorProvider for the two examples. Put it in a directory together with this Ant script, and execute ant to create a jar file with the required file in META-INF/services (requires Apache Ant 1.7.0 or later).

About

kah

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today