Sorting Strings

by John Zukowski

Sorting strings with the Java platform can seem like an easy task, but it deserves more thought when you develop programs for an international market. If you're stuck in an English-only mindset and your program correctly shows that the string tomorrow comes after today, you might think all is well. But once you have a Spanish user who wants mañana sorted properly, the default compareTo() method of String falls short: it places the ñ character after the z character rather than in its natural Spanish position, between the n character and the o character. That's where the Collator class of the java.text package comes into play.

Imagine a list of words:

  • first
  • mañana
  • man
  • many
  • maxi
  • next

Using the default sorting mechanism of String, its compareTo() method, this will result in a sorted list of:

  • first
  • man
  • many
  • maxi
  • mañana
  • next

Here, mañana comes between maxi and next. In Spanish, mañana should come between many and maxi, because the ñ character (pronounced eñe) is its own letter that comes after the n in that alphabet. You could write your own custom sort routine to handle the ñ, but what happens to your program when a German user comes along with their own diacritical marks? Or what about a list of design patterns that includes façade? Do you want façade before or after factory? (That is, do you treat the ç, with its little cedilla hook, the same as c or as a different letter?)
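The default ordering described above can be reproduced with a few lines. This is a minimal sketch; the class name DefaultSort and the helper sortByCodeUnit() are made up for illustration and are not part of the original tip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DefaultSort {
    // Hypothetical helper: returns a copy of the words sorted by
    // String.compareTo(), which compares UTF-16 code unit values.
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // The Unicode escape stands for n-tilde, so the file compiles
        // the same way under any source encoding.
        List<String> words = Arrays.asList(
            "first", "ma\u00F1ana", "man", "many", "maxi", "next");
        System.out.println(sortByCodeUnit(words));
        // 'x' is U+0078 and n-tilde is U+00F1, so maxi sorts first.
        System.out.println("maxi".compareTo("ma\u00F1ana") < 0);
    }
}
```

The comparison works on raw character values only, which is why no locale is involved anywhere in this code.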

That's where the Collator class comes in handy. The Collator class takes language-sensitive sorting issues into account and doesn't just sort words by their ASCII/Unicode character values. Using Collator fully requires understanding one additional property, called strength. The strength setting of the Collator determines which kinds of differences between characters count when ordering. There are four possible values for the property: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL. What actually happens with each depends on the locale, but typically it is as follows. PRIMARY considers only base letter differences, such as a vs. b, and so ignores both diacritical marks and case. SECONDARY also considers diacritical marks, like n vs. ñ, but still ignores case. TERTIARY also considers case differences, like a vs. A. IDENTICAL means just that: the characters must be identical to be treated the same, which adds differences such as control characters and precomposed vs. combining accents. See the Collator javadoc for more information on these differences and decomposition mode rules.
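To see how strength changes what counts as a difference, here is a small sketch that compares two pairs of strings at three strengths. It assumes the Collator for Locale.US and turns on canonical decomposition explicitly; the class name StrengthDemo and the helper sameAt() are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {
    // Hypothetical helper: true when the Locale.US collator treats
    // a and b as equal at the given strength.
    static boolean sameAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        // Split precomposed characters (n-tilde becomes n plus a
        // combining tilde) before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY"};
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        for (int i = 0; i < strengths.length; i++) {
            System.out.println(names[i]
                + ": n equals n-tilde? "
                + sameAt(strengths[i], "n", "\u00F1")
                + ", a equals A? "
                + sameAt(strengths[i], "a", "A"));
        }
    }
}
```

With this collator, both pairs should compare as equal at PRIMARY, only the case pair at SECONDARY, and neither at TERTIARY. Other locales may draw the lines differently.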

To work with Collator, you need to start by getting one. You can either call getInstance() to get one for the default locale, or pass a specific Locale to the getInstance() method to get a Collator for that locale. For instance, to get one for the Spanish language, you would create a Spanish Locale with new Locale("es") and then pass that into getInstance():

 Collator esCollator =
   Collator.getInstance(new Locale("es"));

Assuming the default Collator strength for the locale is sufficient (getStrength() reports it, and the JDK's collators typically default to TERTIARY), you would then pass the Collator, like any Comparator, into the sort() routine of Collections to get your sorted List:

 Collections.sort(list, esCollator);

Working with the earlier list, that now gives you a proper sorting with the Spanish alphabet:

  • first
  • man
  • many
  • mañana
  • maxi
  • next

Had you instead used the US Locale for the Collator, mañana would appear between man and many since the ñ is not its own letter.
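The two orderings can be compared side by side on the console. The following is a sketch under a few assumptions: the class name LocaleSort and the helper sortFor() are made up, Locale.forLanguageTag("es") is used as an equivalent way to obtain the Spanish Locale, and the Spanish result depends on the JDK shipping Spanish collation rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LocaleSort {
    // Hypothetical helper: returns a sorted copy of the words using
    // the Collator for the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        // Collator implements Comparator, so it plugs into List.sort().
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // The Unicode escape stands for n-tilde.
        List<String> words = Arrays.asList(
            "first", "ma\u00F1ana", "man", "many", "maxi", "next");
        System.out.println("es: "
            + sortFor(Locale.forLanguageTag("es"), words));
        System.out.println("US: " + sortFor(Locale.US, words));
    }
}
```

Sorting a copy keeps the original list untouched, which makes it easy to run the same words through several locales.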

Here's a quick example that shows off the differences.

import java.awt.*;
import java.text.*;
import java.util.*;
import java.util.List; // Explicit import required: java.awt also has a List
import javax.swing.*;

public class Sort {
 public static void main(String[] args) {
   Runnable runner = new Runnable() {
     public void run() {
       String[] words = {"first", "mañana", "man",
                         "many", "maxi", "next"};
       // Fixed-size list backed by the array; sorting reorders it in place
       List<String> list = Arrays.asList(words);
       JFrame frame = new JFrame("Sorting");
       frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
       Box box = Box.createVerticalBox();
       frame.setContentPane(box);
       // Original order
       JLabel label = new JLabel("Word List:");
       box.add(label);
       JTextArea textArea = new JTextArea(list.toString());
       box.add(textArea);
       // Natural String ordering, by character value
       Collections.sort(list);
       label = new JLabel("Sorted Word List:");
       box.add(label);
       textArea = new JTextArea(list.toString());
       box.add(textArea);
       // Language-sensitive ordering for Spanish
       Collator esCollator = Collator.getInstance(new Locale("es"));
       Collections.sort(list, esCollator);
       label = new JLabel("Collated Word List:");
       box.add(label);
       textArea = new JTextArea(list.toString());
       box.add(textArea);
       frame.setSize(400, 200);
       frame.setVisible(true);
     }
   };
   // Build the UI on the event dispatch thread
   EventQueue.invokeLater(runner);
 }
}

One last little bit of information about collation. The Collator returned by the getInstance() call is typically an instance of RuleBasedCollator for the supported languages. You can use RuleBasedCollator to define your own collation sequences. The javadoc for the class describes the rule syntax more completely, but let's say you had a four-character alphabet and wanted the order of the letters to be CAFE instead of ACEF. Your rule would look something like:

 String rule =
   "< c, C < a, A < f, F < e, E";
 RuleBasedCollator collator = new RuleBasedCollator(rule);

This defines the explicit order as cafe, with the different letter cases shown. Now, for a list of words of ace, cafe, ef, and face, the resultant sort order is cafe, ace, face, and ef with the new rules:

import java.text.*;
import java.util.*;

public class Rule {
 public static void main(String[] args) throws ParseException {
   String[] words = {"ace", "cafe", "ef", "face"};
   // "<" separates letters; "," separates the cases of one letter
   String rule = "< c, C < a, A < f, F < e, E";
   RuleBasedCollator collator = new RuleBasedCollator(rule);
   List<String> list = Arrays.asList(words);
   Collections.sort(list, collator);
   System.out.println(list);
 }
}

After compiling and running, you see the words sorted with the new rules:

> javac Rule.java
> java Rule
[cafe, ace, face, ef]
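The letter-case part of the rule can also be checked directly with compare(). The sketch below reuses the rule from the Rule class; the class name RuleCase and the helper cafeCollator() are hypothetical, and the comments reflect how the javadoc describes the "<" and "," relations.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

public class RuleCase {
    // Hypothetical helper: builds the CAFE collator from the tip at
    // the requested strength.
    static RuleBasedCollator cafeCollator(int strength) {
        try {
            RuleBasedCollator collator =
                new RuleBasedCollator("< c, C < a, A < f, F < e, E");
            collator.setStrength(strength);
            return collator;
        } catch (ParseException e) {
            // The rule is a fixed literal, so failing to parse it
            // would be a programming error.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RuleBasedCollator tertiary = cafeCollator(Collator.TERTIARY);
        RuleBasedCollator primary = cafeCollator(Collator.PRIMARY);
        // "<" separates letters: c sorts before a under this rule.
        System.out.println(tertiary.compare("c", "a") < 0);
        // "," marks a case-level difference: cafe sorts before Cafe.
        System.out.println(tertiary.compare("cafe", "Cafe") < 0);
        // At PRIMARY strength the case difference is ignored.
        System.out.println(primary.compare("cafe", "Cafe") == 0);
    }
}
```

All three lines should print true. Wrapping the ParseException keeps the helper easy to call, at the cost of hiding a checked exception that matters when the rule comes from user input.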

After reading the rule syntax some more in the javadocs, try to expand the alphabet and work with the different diacritical marks, too.

Now, when developing programs for a global audience, your programs can be better prepared to suit the local user. Be sure to keep strings in resource bundles, too, as shown in an earlier tip.


Comments:

Thanks by article i don't know this

Posted by jlinkamp on January 23, 2008 at 04:46 PM PST #

I'd like to see more tips likes this in the future. Useful classes that I was not aware of ;-)

Posted by Michael Rogers on January 23, 2008 at 09:32 PM PST #

Useful information for me as a Turkish software developer. Java is always helpful about this kind of differences. Thanks.

Posted by Özmen Adıbelli on January 24, 2008 at 07:29 PM PST #


hi
nice hint

Posted by Girish on February 07, 2008 at 07:53 PM PST #

i want to know about the strings expashally reverse strings plz give the notes of reverse strings

Posted by manohar on February 29, 2008 at 06:43 PM PST #
