Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)

String(byte[] bytes, String csn) and String.getBytes(String csn) pair (and their variants) have been widely used as the convenient methods for char[]/String<->byte[] encoding/decoding when

a) You don't want to get your hands "dirty" on the trivial decoding and/or encoding process (via java.nio.charset)  and
b) The default error handling action when there is/are un-mappable/malformed byte(s)/char(s) in the input array  is acceptable for your situation (though spec of some methods states the "behavior is unspecified", Sun JDK/JRE runtime always performs REPLACE in this case)

These methods are convenient, especially when dealing with "small" byte[] and char[].  However we have been hearing complains for years (including from our EE folks) that these methods are slow, especially when there is a SecurityManager installed, and also consume too much memory than "expected".

The complains are somewhat accurate. A simple new String(byte[] bb, String csn) takes a "long" journey down to the CharsetDecoder to do the real conversion as listed below


         ->at java.lang.String.<init>(String.java:523)
          ->at java.lang.String.<init>(String.java:451)
            ->at java.lang.StringCoding.decode(StringCoding.java:193)
              ->at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:160)
                ->at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:561)
                  ->at sun.nio.cs.SingleByte$Decoder.decodeLoop(SingleByte.java:106)


during the process, ByteBuffer and CharBuffer objects are created to wrap the source byte[] and the destinating char[],  decoder takes the byte[] and char[] in and out of the wrappers during decoding,  a long list of sanity checks, defensive copying operations have to be performed... regardless lots of performance tuning have been done already, they are truely slow and bloat, a sad reality.


Good news is we finally found ourself some time to look into the issue recently. With lots of trivial but effective tuning works here and there including a "fast pass/green lane" for "built-in" charsets (the charsets included in JDK/JRE's charset respository) we are able to  cut the "bloat" memory use during the de/coding and boost the coding speed quite lot for "small" size of byte[]/char[]/String objects. So far the results look pretty good (listed below). The change has been integrated into JDK7 b53.


(1)NetBeans Memory profiles of running StrCodingBenchMark against JDK7 b52 and b53 show the "bloat" 36%+ java.nio.HeapByte/CharBuffer  objects are completely gone.


b52 Result:



b53 result:





2)Speed: Out micro performance benchmark StrCodingBenchmark shows huge speed boost for small byte[]/String use scenario. Below is the result of using ISO-8859-1 charset for different de/encoding operations with different sizes of input and under different runtime configs.


[-server] server vm   [-client]  client vm   [+sm] with a SecurityManager installed
[String decode: csn] new String(byte[], "iso-8859-1")
[String decode: cs ] new String(byte[], Charset.forName("iso-8859-1"))
[String encode: csn] String.getBytes("iso-8859-1")
[String encode: cs ] String.getBytes(Charset.forName("iso-8859-1"))
\*The first 4 columns are JDKb53, the later 4 are results of b52, smaller the number is the faster

-server -client -server+sm -client+sm ------------b52/old-----------------
[len=4]
String decode: csn 29 52 27 51 61 140 117 215
String decode: cs 38 73 93 147 90 160 139 237
String encode: csn 23 43 24 43 64 116 135 188
String encode: cs 42 77 98 134 170 341 232 418
[len=8]
String decode: csn 31 60 32 59 65 148 118 225
String decode: cs 40 81 97 155 95 170 143 247
String encode: csn 27 51 27 49 65 125 136 197
String encode: cs 46 84 101 139 174 356 238 451
[len=16]
String decode: csn 36 72 35 71 68 164 122 241
String decode: cs 45 94 101 169 108 189 147 263
String encode: csn 31 63 31 63 68 142 141 214
String encode: cs 50 96 103 152 179 379 242 453
[len=32]
String decode: csn 47 101 46 101 76 197 131 274
String decode: cs 55 119 113 194 115 228 157 303
String encode: csn 40 89 40 89 78 173 149 248
String encode: cs 58 121 112 178 194 423 258 497
[len=64]
String decode: csn 68 152 66 159 97 262 149 339
String decode: cs 74 171 133 246 139 302 179 379
String encode: csn 56 140 56 139 94 243 165 316
String encode: cs 73 171 128 227 223 521 285 582
[len=128]
String decode: csn 108 256 104 254 130 389 189 465
String decode: cs 110 272 173 345 184 449 222 522
String encode: csn 91 238 88 237 126 371 195 458
String encode: cs 106 267 162 324 283 672 342 745
[len=256]
String decode: csn 187 462 179 459 197 645 242 721
String decode: cs 183 474 249 548 277 739 305 811
String encode: csn 171 437 150 435 190 642 257 712
String encode: cs 171 462 224 519 395 1021 447 1093
[len=512]
String decode: csn 349 877 332 874 335 1156 368 1229
String decode: cs 334 879 405 961 463 1349 476 1408
String encode: csn 284 830 277 828 317 1168 379 1240
String encode: cs 298 865 350 910 636 1698 673 1769
[len=1024]
String decode: csn 782 1708 742 1701 660 2176 665 2262
String decode: cs 741 1689 834 1785 865 2507 846 2568
String encode: csn 571 1631 555 1616 577 2228 631 2299
String encode: cs 559 1635 601 1691 1121 3051 1128 3118


Some more results are at benchmark.txt.


Two facts worth mentioning are
(1)The best gain achived when de/encoding "small" sizes of byte[]/char[]/String (when de/encoding on "big" chunk of data, the overhead we are trying to cut off becomes marginal)
(2)Using Charset as the parameter does not bring you any performance benefit (in fact it's slower and bloat) in most use scenarios, use it with caution.


For now all the SingleByte charsets (except JIS0201) in our repository are benefited from this change. I Hope I can find some time to work on the MultiByte charsets later, especially the UTF-8, yes, it's on my to-do list.



Comments:

Hi Xueming,

This is great work as it should benefit a lot of users. Agreed that UTF-8 is very important, so I hope you manage to find the time to work on that too. :)

Best,
Ismael

Posted by Ismael Juma on April 07, 2009 at 09:28 AM PDT #

Sweet!

Posted by Dane Foster on April 07, 2009 at 02:57 PM PDT #

I have a querry:

String enc = "some encoding"
byet[] byteArr = new byte[]{somebytes};
String s = new String(byteArr, enc);
byteResult = s.getBytes(enc);

byteArr and byteResult will be same if
(a) enc ="ISO-8859-1"
(b) enc ="MS932"
?
Ans is (a).
I believe that ISO-8859-1 does not replace the bytes if the characters are not found.
But I wanted to know the proof or spec of this phenomenon. Do you know anything about it?

Posted by Chandan on July 30, 2009 at 02:17 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

xuemingshen

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today