
Sorting Strings

Guest Author
by John Zukowski

Sorting strings with the Java platform can seem like an easy
task, but it deserves more thought when you develop programs
for an international market. If you're stuck in an English-only
mindset and your program shows that the string tomorrow comes
after today, you might think all is well. But once a Spanish user
wants mañana sorted properly, the default compareTo() method of
String falls short: it places the ñ character after the z character
rather than in its natural Spanish position, between the n character
and the o character. That's where the Collator class of the java.text
package comes into play.

Imagine a list of words:

  • first
  • mañana
  • man
  • many
  • maxi
  • next

Using the default sorting mechanism of String, its compareTo() method,
this results in a sorted list of:

  • first
  • man
  • many
  • maxi
  • mañana
  • next

Here, mañana comes between maxi and next. In Spanish, mañana should
come between many and maxi, because the ñ character (pronounced eñe)
comes after the n in that alphabet. While you could write your own
custom sort routine to handle the ñ, what happens to your program
when a German user comes along with their own diacritical marks?
Or what about a list of design patterns that includes façade? Do
you want façade before or after factory? (In other words, should ç,
the c with the little cedilla hook, be treated the same as c or as
a different letter?)
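
To see why the default ordering behaves this way, here is a minimal
sketch. The class name NaturalOrderDemo and its helper method are made
up for illustration. It sorts the word list with String's natural
ordering, which compares character values, and checks where ñ and ç
land relative to unaccented letters. The accented letters are written
as Unicode escapes only so that the source compiles regardless of
file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NaturalOrderDemo {

    // Returns a copy of the words sorted with String's natural ordering
    // (compareTo), which compares UTF-16 character values.
    static List<String> sortNaturally(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // The escape in the second word stands for n with tilde.
        List<String> words = Arrays.asList(
            "first", "ma\u00f1ana", "man", "many", "maxi", "next");
        System.out.println(sortNaturally(words));

        // n with tilde is U+00F1 and z is U+007A, so the accented
        // letter compares as greater than z.
        System.out.println("\u00f1".compareTo("z") > 0);

        // c with cedilla is U+00E7, so the same effect places the
        // accented spelling of facade after factory.
        System.out.println("fa\u00e7ade".compareTo("factory") > 0);
    }
}
```

Both comparisons are decided purely by character values, which is why
no amount of tuning compareTo() will produce a language-aware order.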

That's where the Collator class comes in handy. The Collator class
takes into account language-sensitive sorting issues and doesn't just
try to sort words based upon their ASCII/Unicode character values.
Using Collator requires understanding one additional property before
you can fully utilize its features, and that is something called
strength. The strength setting of the Collator determines how
strong (or weak) a match is used for ordering. There are four possible
values for the property: PRIMARY, SECONDARY, TERTIARY, and IDENTICAL.
What actually happens with each depends on the locale, but typically
it works as follows. PRIMARY strength considers only base letter
differences, such as a vs. b, and ignores both diacritical marks and
case. SECONDARY additionally treats diacritical marks as significant,
like n vs. ñ, but still ignores case. TERTIARY additionally
distinguishes case differences, like a vs. A. IDENTICAL means just
that: the characters must be identical to be treated the same, which
also makes differences in control characters significant. See the
Collator javadoc for more information on these differences and
decomposition mode rules.
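
The effect of the strength setting is easiest to see by comparing a
few strings directly. The following sketch assumes a US English
collator and uses a hypothetical helper, compareAt(), that is not part
of any library. Results for other locales may differ.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo {

    // Compares two strings with a US English collator set to the
    // given strength. A result of 0 means "treated as the same".
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // The escape stands for n with tilde.
        String accented = "ma\u00f1ana";
        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY};
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY"};

        for (int i = 0; i < strengths.length; i++) {
            // Does a case-only difference count at this strength?
            boolean caseDiffers =
                compareAt(strengths[i], "abc", "ABC") != 0;
            // Does an accent-only difference count at this strength?
            boolean accentDiffers =
                compareAt(strengths[i], "manana", accented) != 0;
            System.out.println(names[i]
                + ": case differs=" + caseDiffers
                + ", accent differs=" + accentDiffers);
        }
    }
}
```

With the US collator on a standard JDK, neither difference should
count at PRIMARY, only the accent difference at SECONDARY, and both
at TERTIARY.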

To work with Collator, you need to start by getting one. You can either
call getInstance() to get one for the default locale, or pass a
specific Locale to the getInstance() method to get a collator for
that locale. For instance, to get one for the Spanish language,
you would create a Spanish Locale with new Locale("es") and then
pass that into getInstance():

 Collator esCollator =
     Collator.getInstance(new Locale("es"));

Assuming the default Collator strength for the locale is sufficient,
which happens to be SECONDARY for Spanish,
you would then pass the Collator like any Comparator into the
sort() routine of Collections to get your sorted List:

 Collections.sort(list, esCollator);

Working with the earlier list, that now gives you a proper
sorting with the Spanish alphabet:

  • first
  • man
  • many
  • mañana
  • maxi
  • next

Had you instead used the US Locale for the Collator, mañana would
appear between man and many since the ñ is not its own letter.
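
A small sketch of that US-locale case, assuming the same word list.
The class name UsCollationDemo and the sortFor() helper are
illustrative only.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class UsCollationDemo {

    // Returns a copy of the words sorted with the collator that
    // Collator.getInstance() supplies for the given locale.
    static List<String> sortFor(Locale locale, List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted, Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // The escape in the second word stands for n with tilde.
        List<String> words = Arrays.asList(
            "first", "ma\u00f1ana", "man", "many", "maxi", "next");
        // For US English the accented n is ordered as a variant of n,
        // so the fourth letters (a vs. y) decide against "many".
        System.out.println(sortFor(Locale.US, words));
    }
}
```

Passing new Locale("es") to the same helper is how you would compare
the two orderings side by side.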

Here's a quick example that shows off the differences.

import java.awt.*;
import java.text.*;
import java.util.*;
import java.util.List; // Explicit import required: java.awt also has a List
import javax.swing.*;

public class Sort {
  public static void main(String[] args) {
    Runnable runner = new Runnable() {
      public void run() {
        String[] words = {"first", "mañana", "man",
                          "many", "maxi", "next"};
        // The list is backed by the array, so each sort reorders it in place
        List<String> list = Arrays.asList(words);
        JFrame frame = new JFrame("Sorting");
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        Box box = Box.createVerticalBox();
        frame.setContentPane(box);
        // Original order
        JLabel label = new JLabel("Word List:");
        box.add(label);
        JTextArea textArea = new JTextArea(list.toString());
        box.add(textArea);
        // Natural String ordering (compareTo)
        Collections.sort(list);
        label = new JLabel("Sorted Word List:");
        box.add(label);
        textArea = new JTextArea(list.toString());
        box.add(textArea);
        // Locale-sensitive ordering with the Spanish Collator
        Collator esCollator = Collator.getInstance(new Locale("es"));
        Collections.sort(list, esCollator);
        label = new JLabel("Collated Word List:");
        box.add(label);
        textArea = new JTextArea(list.toString());
        box.add(textArea);
        frame.setSize(400, 200);
        frame.setVisible(true);
      }
    };
    // Build and show the UI on the event dispatch thread
    EventQueue.invokeLater(runner);
  }
}


One last little bit of information about collation. The
Collator returned by the getInstance() call is typically an instance
of RuleBasedCollator for the supported languages. You can use
RuleBasedCollator to define your own collation sequences. The javadoc
for the class describes the rule syntax more completely, but let's say
you had a four-character alphabet and wanted the order of the letters
to be CAFE instead of ACEF. Your rule would look something like:

 String rule =
     "< c, C < a, A < f, F < e, E";
 RuleBasedCollator collator = new RuleBasedCollator(rule);

This defines the explicit order as cafe, with the different letter
cases shown. Now, for a list of words of ace, cafe, ef, and face,
the resultant sort order is cafe, ace, face, and ef with the new rules:

import java.text.*;
import java.util.*;

public class Rule {
  public static void main(String[] args) throws ParseException {
    String[] words = {"ace", "cafe", "ef", "face"};
    // Custom letter order: c, then a, then f, then e
    String rule = "< c, C < a, A < f, F < e, E";
    // The constructor throws ParseException if the rule is malformed
    RuleBasedCollator collator = new RuleBasedCollator(rule);
    List<String> list = Arrays.asList(words);
    Collections.sort(list, collator);
    System.out.println(list);
  }
}

After compiling and running, you see the words sorted with the new
rules:

> javac Rule.java
> java Rule
[cafe, ace, face, ef]

After reading the rule syntax some more in the javadocs, try to expand the
alphabet and work with the different diacritical marks, too.

Now, when developing programs for a global audience, your programs
can be better prepared to suit the local user. Be sure to keep strings
in resource bundles, too, as shown in an earlier tip.
