Thursday Mar 09, 2006

Java SE Tuning Tip: Large Pages on Windows and Linux

Enabling large page support on supported operating environments can give a significant performance boost, especially for applications with large datasets or large heap sizes. Below is a summary of how to enable large pages on Solaris, Windows, and Linux. The text is largely from the "HotSpot VM Options Page":http://java.sun.com/docs/hotspot/VMOptions.html, but I've had a lot of questions about this and thought it merited highlighting the information here. Stay tuned for a revamped "HotSpot VM Options Page":http://java.sun.com/docs/hotspot/VMOptions.html coming your way in the next few weeks.

Beginning with Java SE 5.0 there is now a cross-platform flag for requesting large memory pages: -XX:+UseLargePages (on by default for Solaris, off by default for Windows and Linux). The goal of large page support is to optimize use of the processor's Translation-Lookaside Buffer (TLB), a page translation cache that holds the most recently used virtual-to-physical address translations. The TLB is a scarce system resource, and a TLB miss can be costly: the processor must then read from the hierarchical page table, which may require multiple memory accesses. By using a bigger page size, a single TLB entry can represent a larger memory range. There will be less pressure on the TLB, and memory-intensive applications may have better performance.

However, please note that using large page memory can sometimes negatively affect system performance. For example, when a large amount of memory is pinned by an application, it may create a shortage of regular memory and cause excessive paging in other applications, slowing down the entire system. Also note that on a system that has been up for a long time, excessive fragmentation can make it impossible to reserve enough large page memory; when that happens, either the OS or the JVM will revert to using regular pages.

Operating system configuration changes to enable large pages:

Solaris

As of Solaris 9, which includes Multiple Page Size Support (MPSS), no additional configuration is necessary. If you're running a 32-bit J2SE version prior to J2SE 5.0 Update 5 on AMD Opteron hardware, additional tuning is necessary. Due to a bug in the HotSpot large page code, the default large page size for the 32-bit x86 binary is 4 MB. Since 4 MB pages are not supported on Opteron, the large page request fails and the page size defaults to 8k. To get around this, explicitly set the large page size to 2 MB with the following flag: -XX:LargePageSizeInBytes=2m
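As a concrete illustration, a launch line for an affected 32-bit 5.0 JVM on Opteron might look like the following. The heap size and application class name here are placeholders, not from the original post:

```shell
# Explicit 2 MB large pages for 32-bit J2SE 5.0 (< Update 5) on Solaris x86 / Opteron
java -server -Xms1g -Xmx1g \
     -XX:LargePageSizeInBytes=2m \
     MyServerApp
```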

Linux

Large page support is included in the 2.6 kernel, and some vendors have backported the code to their 2.4-based releases. To check whether your system can support large page memory, try the following:

# cat /proc/meminfo | grep Huge
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

If the output shows the three "Huge" variables, then your system can support large page memory, but it needs to be configured. If the command doesn't print out anything, large page support is not available. To configure the system to use large page memory, log in as root, then:

1. Increase the SHMMAX value. It must be larger than the Java heap size. On a system with 4 GB of physical RAM (or less) the following will make all the memory sharable:

# echo 4294967295 > /proc/sys/kernel/shmmax

2. Specify the number of large pages. In the following example 3 GB of a 4 GB system are reserved for large pages (assuming a large page size of 2048k, then 3g = 3 x 1024m = 3072m = 3072 x 1024k = 3145728k, and 3145728k / 2048k = 1536):

# echo 1536 > /proc/sys/vm/nr_hugepages

Note that the /proc values will reset after reboot, so you may want to set them in an init script (e.g. rc.local or sysctl.conf). Also, internal testing has shown that root permissions may be necessary to get large page support on various flavors of Linux, most notably SuSE SLES 9.
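The page-count arithmetic in step 2 generalizes: divide the memory you want to reserve by the Hugepagesize reported in /proc/meminfo. A small sketch, using the same 3 GB reservation as the example above:

```shell
# Compute nr_hugepages for a desired large-page reservation
RESERVE_MB=3072        # 3 GB, as in the example above
HUGEPAGE_KB=2048       # Hugepagesize value from /proc/meminfo

NR_HUGEPAGES=$(( RESERVE_MB * 1024 / HUGEPAGE_KB ))
echo "$NR_HUGEPAGES"   # prints 1536

# As root, you would then apply it with:
# echo "$NR_HUGEPAGES" > /proc/sys/vm/nr_hugepages
```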

Windows

Only Windows Server 2003 supports large page memory. In order to use it, the administrator must first assign an additional privilege to the user who will be running the application:

1. Select Control Panel -> Administrative Tools -> Local Security Policy
2. Select Local Policies -> User Rights Assignment
3. Double-click "Lock pages in memory", add users and/or groups
4. Reboot the machine

As always, every application is different and true performance is always defined by each individual running their own application. If you run into problems or have questions about Java performance, visit the java.net performance forum or feel free to send me a comment.
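Circling back to the TLB discussion at the top of this post, the arithmetic behind the benefit is easy to sketch: each TLB entry maps one page, so the address range a single entry covers scales with the page size. The 64-entry TLB size below is purely illustrative, not a figure from this post:

```shell
SMALL_PAGE_KB=4        # typical x86 base page
LARGE_PAGE_KB=2048     # a 2 MB large page

# With a hypothetical 64-entry TLB, total coverage in KB:
echo "small pages: $((64 * SMALL_PAGE_KB)) KB"    # 256 KB
echo "large pages: $((64 * LARGE_PAGE_KB)) KB"    # 131072 KB (128 MB)

# One large-page entry covers this many times the range of a small one:
echo $((LARGE_PAGE_KB / SMALL_PAGE_KB))           # 512
```

The same TLB covers 512 times more address space with 2 MB pages, which is where the reduced TLB pressure comes from.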

Friday Feb 17, 2006

Java Performance: Solaris 10 x86 vs. Linux

Solaris 10 screams running Java. Competitive benchmarks do a good job highlighting this; just take a look at the latest SPECjbb2005 and SPECjappserver2004 results. I have noticed some fundamental differences in "out of the box" tuning when comparing Solaris and Linux. When running Java server applications, Solaris 10 default tuning is general purpose and tuned for moderate thread counts, similar to a time-shared system. This in many ways is an indication of the maturity of the platform. Linux, on the other hand, is specifically tuned for high thread counts, and performance suffers when running low thread counts. A good example of this behavior can be seen comparing SPECjbb2005 results. Below are two results run on the exact same hardware, differing only in the OS and minor JVM tuning (the heap tuning has minimal performance impact).

SPECjbb2005 on Sun Fire X4200 running Solaris 10 Update 1: 49,097 SPECjbb2005 bops, 49,097 SPECjbb2005 bops/JVM
SPECjbb2005 on Sun Fire X4200 running Red Hat EL 4: 43,076 SPECjbb2005 bops, 43,076 SPECjbb2005 bops/JVM

Running SPECjbb2005 on identical hardware with optimal tuning parameters, Solaris 10 is 14% faster than Linux. SPECjbb2005 on small x64 hardware runs only a moderate number of threads; in the above example the peak application thread count is 8. What tuning can be applied when running high thread counts on Solaris 10 x86? Here are two quick tuning steps you can try with your application.

1. If you're running many threads and performing socket I/O, try libumem.so. When launching your application within a shell script, set the following environment variable:

LD_PRELOAD=/usr/lib/libumem.so; export LD_PRELOAD

2. Tune the Solaris scheduler. Simple scheduler tuning can yield significant performance gains, especially with highly threaded, short-lived applications.

Try the FX scheduling class: priocntl -c FX -e java class_name
Try the IA scheduling class: priocntl -c IA -e java class_name

Every application is different and true performance is always defined by each individual running their own application. If you run into problems or have questions about Java on Solaris performance, visit the java.net performance forum or feel free to send me a comment.

Fine print SPEC disclosure: SPECjbb2005 Sun Fire X4200 on Solaris 10 (2 chips, 4 cores, 4 threads) 49,097 bops, 49,097 bops/JVM; SPECjbb2005 Sun Fire X4200 on Red Hat EL 4 (2 chips, 2 cores, 2 threads) 43,076 bops, 43,076 bops/JVM. SPEC(TM) and the benchmark name SPECjbb2005(TM) are trademarks of the Standard Performance Evaluation Corporation. Competitive benchmark results stated above reflect results published on www.spec.org as of February 17, 2006. For the latest SPECjbb2005 benchmark results, visit http://www.spec.org/osg/jbb2005.
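The 14% figure quoted above can be checked directly from the two published bops scores:

```shell
# Relative speedup of the Solaris 10 result over the Red Hat EL 4 result
SOLARIS_BOPS=49097
LINUX_BOPS=43076

awk -v s="$SOLARIS_BOPS" -v l="$LINUX_BOPS" \
    'BEGIN { printf "%.0f%%\n", (s - l) / l * 100 }'
# prints 14%
```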

Tuesday Feb 14, 2006

Java SE Tuning Tip: Server Ergonomics on Windows

J2SE 5.0 Server Ergonomics is not on by default on Windows. The basic reasoning here is that Windows is largely a client platform and automatic server tuning may negatively impact startup performance. We are revisiting this for Mustang, but for now do the following to enable server ergonomics on Windows:

1. Specify JVM tuning options equivalent to server ergonomics:

java -server -Xmx1g -XX:+UseParallelGC

2. Check to make sure server ergonomics is enabled by checking the JVM version:

$ java -server -Xmx1g -XX:+UseParallelGC -version
java version "1.6.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-rc-b69)
Java HotSpot(TM) Server VM (build 1.6.0-rc-b69, mixed mode)

If you see "Server VM", you're ready to test.

Thursday Feb 02, 2006

SPECjbb2005: A Valid Representation of Java Server Workloads

I was reading some of the other blogs at Sun and noticed some entertaining comments on BMSeer's blog, in particular the comments on the entry titled Sun head-to-head wins again: SPECjbb2005. Specifically, the set of comments is from Robin (basspetersen@yahoo.com). Robin apparently works for, or has a close association with, HP. Hello Robin, I hope you are reading this. Robin doesn't feel that SPECjbb2005 represents real-world Java server applications and workloads, mostly because it doesn't stress the network or I/O subsystems. I strongly disagree: SPECjbb2005 is a valid representation of Java server workloads and has already had a significant impact on JVM and Java SE performance. Here are a few quotes from Robin's comments:

"It looks like HP is the only company smart enough to stay out of this benchmark game, with no relevance in the real world." ... "JBB pretends to measure the server-side performance of Java runtime environments but it is not at all representative of a real workload. Running unrealistic workloads to measure performance is a disservice to customers."

This statement is a bit naive. SPECjbb2005 has significant features that highlight its relevance to real-world workloads. First, garbage collection is part of the measurement interval. SPECjbb2000 called System.gc() before each measurement interval to ease the impact of GC on the score. This was somewhat necessary to have the benchmark scale back in 2000; that is not the case now. Garbage collection is fully a part of this benchmark, and large GC pauses significantly impact benchmark scores. Second, XML DOM Level 3 is part of the benchmark, with 20% of the workload in DOM tree creation and manipulation. Parsing is not included in order to avoid I/O bottlenecks. Third, the benchmark must run with thread counts (warehouses) at 2x the number of hardware threads on the system. A 4-way must run 8 warehouses; a 32-way must run 64 warehouses. When did managing 64 threads become trivial and not impacted by system performance? Fourth, many of the optimizations and performance work that started with SPECjbb2005 had a direct impact on customer and Java EE benchmark performance. Take a look at the latest SPECjappserver2004 world record: BEA WebLogic Server 9.0 on a Sun Fire T2000 cluster running Sun J2SE 5.0_06. Sun's HotSpot J2SE 5.0_06 was the JVM for this benchmark result, the same JVM which currently holds many, many major performance records on SPECjbb2005. If performance optimizations targeted at SPECjbb2005 have a direct impact on Java EE benchmarking, how again is SPECjbb2005 irrelevant?

"In my opinion HP does not want to give credit to a bad benchmark by publishing results. Why should they give you the satisfaction of jumping off the bridge after you? Clearly HP thinks the benchmark is not important."

HP was on the core development team of SPECjbb2005. Take a look at one of my first blog entries announcing SPECjbb2005. Why would HP think a benchmark was not important or irrelevant when they put resources into its development? Fifth, I/O and network were purposely left out of the benchmark to concentrate on JVM, OS, and hardware performance. The benchmark heavily stresses the memory subsystem with large Java heaps and high memory allocation rates. The OS needs to manage many threads, and possibly many processes, effectively for high performance. SPECjbb2005 stresses the JVM, OS, and memory; it is a complete system benchmark concentrating on Java server performance. Lastly, I would like to see HP submit SPECjbb2005 numbers; competition leads to innovation and performance optimization that benefits customers. Chances are HP is plugging away working to improve their HotSpot implementation, preparing for the day they will submit a result.

Wednesday Jan 25, 2006

Sun Fire E6900 and Hotspot dominate SPECjbb2005 under 32 CPUs

The Sun Fire E6900 (24 chips) takes the lead running SPECjbb2005 on configurations with 32 chips or less, with a score of 342,578 bops. This score surpasses the previous high score of 322,719 bops, run on a Fujitsu PRIMEQUEST 480 (32 chips!). Why is this result interesting? First, the Sun Fire E6900 surpasses all other competitors in this space, faster than the IBM p5 570 and the Fujitsu PRIMEQUEST 480. Second, and most importantly to me, this is the first of many results that highlight the performance of Sun Hotspot J2SE 5.0_06. Today's a good day for Sun Java performance.

SPEC footnote: SPECjbb2005 Sun Fire E6900 (24-way, 24 chips, 48 cores) 342,578 bops, 28,548 bops/JVM, submitted for review; Fujitsu PRIMEQUEST 480 (32 chips, 32 cores) 322,719 bops, 40,340 bops/JVM; IBM eServer p5 570 (8 chips, 16 cores, 16-way) 244,361 bops, 30,545 bops/JVM. SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 01/23/06 on www.spec.org.

SPECjbb2005: Single Instance vs. Multiple Instance Competitive Comparisons

SPECjbb2005 can be run in single and multiple-instance modes. Single instance is where one JVM runs the benchmark on a single system. Multiple instance is where n JVMs run in parallel, with the benchmark load distributed between the separate JVM processes. SPECjbb2005 also has two equally important metrics: SPECjbb2005 bops (business operations per second), a measure of overall system throughput, and SPECjbb2005 bops/JVM, a measure of JVM performance and scalability. Single-instance results target hardware and OS, and highlight JVM performance and scalability. Multiple-instance results target hardware, OS, and JVM performance and scalability, and highlight total system throughput. Both single- and multiple-instance configurations of SPECjbb2005 can provide a sense of hardware, OS, and JVM performance and scalability. However, single-instance configurations put more focus on the throughput delivered by the JVM, whereas multiple-instance configurations put more focus on the total throughput delivered by the system. When multiple-instance configurations demonstrate higher throughput than single-instance configurations, it's usually an indication that there's either a JVM limitation, such as maximum heap size or 64-bit JVM performance, or some hardware architectural aspect of the system that multiple JVMs can take advantage of, such as a NUMA memory architecture.

A SPECjbb2005 performance comparison between two hardware platforms is a comparison of the highest bops score as a measure of overall system throughput. When comparing hardware platforms the comparison can be made regardless of the benchmark configuration, but it's important that you choose a configuration type that matches the deployment characteristics of your system as deployed in production. Most large MP servers with greater than 16 hardware threads are deployed with many, many JVM (or OS) instances, and customers are concerned with complete system throughput and scalability. The comparison is system throughput, not necessarily software component performance, but often JVM scalability is a factor, considering each JVM must scale to 8 hardware threads or more. In this case the fastest results by hardware vendor A should be compared to the fastest results by hardware vendor B, with an eye to JVM scalability as measured by the bops/JVM metric.

Small x86 or x64 systems with 8 or fewer cores are not typically deployed with more than one JVM. Customers are concerned with total system throughput, but also with efficient system utilization by their Java server software and the JVM. The SPECjbb2005 single-instance configuration is a good match for small systems with fewer than 8 hardware threads. SPECjbb2005 multiple-instance results should not be used to compare systems with fewer than 8 hardware threads, simply because those systems are not typically deployed in production in that fashion. It's the responsibility of the hardware and JVM vendor, along with the benchmark submitter, to hold the line on SPECjbb2005 configuration types and to ensure that the configuration type matches the system under test and, more importantly, how such systems are deployed in production.

JVM performance comparisons using SPECjbb2005 are a bit different. In this case JVM performance and scalability are the concentration, and they are best demonstrated using the single-instance SPECjbb2005 configuration. When comparing JVMs, multiple-instance results can only be compared to other multiple-instance results, and it's best if each result was run with the same number of JVM instances. Single-instance SPECjbb2005 results on large SMP systems can help give insight into the performance capabilities of the JVM within a given instruction set and its potential scalability characteristics on other supported platforms. The latest SPECjbb2005 scores can be found at http://www.spec.org/jbb2005

Wednesday Jan 18, 2006

Sun Hotspot SPECjbb2000 World Record

The last results for SPECjbb2000 have been accepted at SPEC and it's official: Sun Hotspot running on a Fujitsu PrimePower 25000 holds the end-all SPECjbb2000 world record! As much as I personally disliked this benchmark (I've talked about it quite often), this result is more proof of the world-class performance and scalability of Sun J2SE 1.5.0_06. Congratulations to Fujitsu Limited and the Sun Hotspot development team!

Monday Jan 09, 2006

SPECjbb2000 has finally retired

SPECjbb2000 has finally retired! SPECjbb2005 has replaced SPECjbb2000 and the competitive landscape has changed drastically. Strangely, a particular JVM vendor who showed strong performance on SPECjbb2000 doesn't seem to do as well with SPECjbb2005. Hmmm. Gone are the days when a stunt JVM can make broad claims of world record performance based on a 5-year-old benchmark. No more risky lock elision optimizations for 30% gains, and no more special object ordering and prefetching because GC is outside the measurement intervals. Good riddance I say!

Tuesday Dec 06, 2005

New Java Performance Tuning Whitepaper

Check out our new Java performance tuning whitepaper on java.sun.com. This has been on the Java performance group's to-do list for a very long time; thanks to Tom Marble for making this happen. There's nothing like the kickin' performance of the new UltraSPARC T1 processor, the Sun Fire T1000 and T2000 servers, and our latest update release, J2SE 5.0_06, to give us the needed kick in the pants to put out a tuning guide. This is a work in progress, so your feedback is very much appreciated and needed. Thanks.

Wednesday Nov 30, 2005

New SPECjbb2005 World Record - Java SE and UltraSPARC IV+

Sun has again achieved a new world record on SPECjbb2005, beating IBM's latest result by 1.5%. Take a look at the press release. This result uses the newly released J2SE 5.0_06.

IBM's score: 244,361 bops, 30,545 bops/JVM
Sun's score: 248,075 bops, 31,009 bops/JVM

Both the IBM and Sun results are 4-instance SPECjbb2005 runs. Each result has 32 "processors" available to Java, as determined by the Java API java.lang.Runtime.availableProcessors(). The systems are equivalent from the view of Java; each has 32 hardware threads or "virtual" CPUs. How the 32 threads are implemented on each system is different; for example, take a look at the Sun Fire 6900 details.

I like that IBM is actively submitting SPECjbb2005 results. Competition spurs innovation, which is a great thing for everyone, especially customers with enterprise Java server applications. It would be great to see other large system vendors step up and compete on SPECjbb2005. HP? Fujitsu? Try running with Sun J2SE 5.0_06, it's rather easy.

Fine print SPEC disclosure: SPECjbb2005 Sun Fire E6900 (16-way, 16 chips, 32 cores) 248,075 bops, 31,009 bops/JVM, submitted for review; IBM eServer p5 570 (8 chips, 16 cores, 16-way) 244,361 bops, 30,545 bops/JVM. SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation. Results as of 11/30/05 on www.spec.org unless otherwise noted.

Tuesday Oct 18, 2005

SPECjbb2005 World Record - Hotspot and UltraSPARC IV+ Rock!

Sun has achieved a new world record on SPECjbb2005, beating IBM's latest result by 8%. Take a look at the press release here.

IBM's score: 224,200 bops
Sun's score: 241,560 bops

SPECjbb2005 is the next-generation Java server application benchmark. It replaces SPECjbb2000 on January 4th, 2006. IBM has stepped up and submitted several benchmark results. This is fantastic. Where are BEA, HP, and Fujitsu? Dig yourselves out of the quagmire of SPECjbb2000 and submit results on a viable benchmark for a change.

SPECjbb2000 = old benchmark, invalid results for today's applications
SPECjbb2005 = new benchmark, developed to model today's applications

Where do you want your JVM and hardware vendor concentrating their resources?

SPECjappserver2004 World Record - Hotspot Rocks!

Take a look at the latest SPECjappserver2004 world record result. Remember BEA's press release a few weeks ago? Take a look here. In that press release BEA referenced a SPECjappserver2004 result as an example of JRockit's high performance. Hmmmm. The latest SPECjappserver2004 world record is indeed using BEA WebLogic, but it's running on Sun Fire X4100s, Solaris 10, and dare I say Sun Hotspot J2SE 5.0_06. Now, JRockit doesn't support Solaris, and the hardware platforms are different, but this result certainly doesn't support BEA's marketing fluff. The reality is that most of BEA's performance messaging is just that, marketing fluff, and the rubber hits the road with the performance of your application in your lab. This is where Hotspot will truly shine. It will be a few more weeks until 5.0_06 is released. In the meantime, take a look at the latest Mustang builds on java.net if you're curious. Also, visit the performance project and forums on java.net. We want to hear your feedback!

Friday Sep 23, 2005

Why Does Sun Publish 64-bit SPECjbb2000?

Sun has chosen to submit only 64-bit SPECjbb2000 results because of limitations of the benchmark which have allowed our competition to make special-case optimizations in the 32-bit case. JVM optimizations for this benchmark can only really apply to customer workloads using 64-bit JVMs with huge heaps. A good example of the special-case optimizations is the SPECjbb2000 call for a full system GC during the initialization stage of the benchmark and before every measurement interval, effectively removing full GC performance from the benchmark. Notice the flags used by the JRockit and IBM JVMs for this benchmark:

BEA: -XXfullsystemgc
IBM: -Xcompactexplicitgc

(See http://www.spec.org for additional flag details.) These flags suggest special-case optimizations for the unrealistic use case of SPECjbb2000. The BEA flag suggests enabling "a full system gc", and the IBM flag suggests "compact at an explicit gc call". Both are special cases for this benchmark. SPECjbb2000 is an old, unrealistic benchmark that is ripe for "benchmark special" optimizations. Sun is not interested in optimizing for benchmarks that have no real effect on our customers. We publish with 64-bit because it is a better match for environments where full GCs will be avoided (by using huge heaps), and therefore optimizations made in that space yield benefits for our customers as well.

SPECjbb2005, on the other hand, has addressed many of the benchmark's limitations. Explicit GC calls have been removed, allowing full GCs to affect performance during the measurement interval. XML and J2SE APIs have replaced hand-rolled code. All in all the new benchmark has been successfully updated to model modern Java server applications. Please notice as well that BEA has only one public submission for SPECjbb2005 and IBM has none. Also notice that Sun Hotspot is the performance leader with SPECjbb2005. Why? Because the new benchmark actually attempts to model real customer applications through API use and GC behavior, and Sun's concentration has always been on customer performance, not stunt JVMs for SPECjbb2000.

Sunday Jun 26, 2005

SPECjbb2005: Multiple Instance Results

SPECjbb2005 can be run in a multiple-instance mode, where n JVMs run in parallel, with the benchmark load distributed between the separate JVM processes. Take a look at the first ever multiple-instance result here: http://www.spec.org/jbb2005/results/res2005q2/jbb2005-20050613-00007.html

The first question you might ask is: what are bops and bops/JVM? Here are the scores listed:

bops = 75,862
bops/JVM = 18,966

bops is the total benchmark score for the system; it is the sum of the work performed by each of the individual JVMs. bops/JVM is the average work performed by each individual JVM. Now, how does this compare to a single-instance result? Take a look at our Sun Fire V40z result: http://www.spec.org/jbb2005/results/res2005q2/jbb2005-20050613-00003.html

Here are the scores listed:

bops = 37,034
bops/JVM = 37,034

bops and bops/JVM are identical. That makes sense, since this is a single-instance result and therefore there is only one JVM. So how do you compare single- and multiple-instance results? First a bit of background. The single-instance results highlight hardware, OS, and JVM performance and scalability. The multiple-instance results highlight hardware, OS, JVM performance and scalability, and JVM interoperability. Java is deployed in the enterprise in many different ways, and a common production environment is a large SMP server running many (10s to 100s of) JVMs. The multiple-instance configuration allows JVM and hardware vendors to demonstrate this common deployment model in a public competitive benchmark, especially useful on large SMP servers. This was a great idea. So again, how do you compare single and multiple results? Use the bops score alone. A comparison between these two configurations ends up being a limited comparison of system scalability and performance: limited in that hardware and OS scalability is tested, but JVM scalability is diluted by the effect of many JVMs working together.

So here's a comparison of the Sun V40z 4-way result and the Sun V890 8-way result:

4-way V40z: 37,034 bops
8-way V890: 75,862 bops
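To make the relationship between the two metrics concrete: bops is the sum of the per-instance scores, and bops/JVM is their average. The per-instance numbers below are hypothetical (the published result reports only the totals), chosen so that the sum matches the V890 result:

```shell
# Hypothetical per-JVM scores for a 4-instance run
SCORES="19000 18900 19100 18862"

SUM=0
COUNT=0
for s in $SCORES; do
    SUM=$((SUM + s))
    COUNT=$((COUNT + 1))
done

echo "bops = $SUM"    # prints: bops = 75862
awk -v sum="$SUM" -v n="$COUNT" \
    'BEGIN { printf "bops/JVM = %.1f\n", sum / n }'
```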

Wednesday Jun 22, 2005

SPECjbb2005: First Submitted Results are Live!

I'm glad to put this one to bed. This benchmark was almost a year in the making, and publishing the first set of results is the last important milestone. http://www.spec.org/jbb2005/results/jbb2005.html

Summary of Sun's submissions:

First 4-way result, on a Sun Fire V40z (4 x 2.6 GHz Opteron 252): 37,034 bops
First 8-way result, on a Sun Fire V890 (8 x 1.35 GHz UltraSPARC IV): 75,862 bops

Tuesday Jun 21, 2005

JavaOne 2005: TS-7984 Java Platform Performance

Brian Doherty and I are giving the Java(TM) Platform Performance talk this year at JavaOne. Be sure to check it out if you'll be at the show this year; we've been very busy with performance work and have details on the optimizations, data, etc.

Java(TM) Platform Performance
Session ID: 7984
Room: Gateway 102/103
Date: 28-JUN-05
Start Time: 11:00 AM
Duration: 60

Thursday Jun 16, 2005

SPECjbb2005(TM): Next Generation JVM Server Performance Benchmark

SPECjbb2005(TM) was released today. This is the next-generation JVM server benchmark and a follow-on to the popular SPECjbb2000 benchmark. SPECjbb2005 was developed by SPEC's Java subcommittee, with contributions from BEA, Darmstadt University of Technology, HP, IBM, Intel and Sun. http://www.spec.org/jbb2005/press/release.html

I represent Sun at SPEC and was the contributing developer working on this benchmark from Sun. As you can imagine, I'm very happy to finally see this benchmark out the door. I would like to congratulate the other contributing developers from the companies listed above: great work Allan, Elena, Sam, Stefan and Veeru. With the release of SPECjbb2005 comes the inevitable demise of SPECjbb2000(TM). The dates when SPECjbb2000 will be retired have not been decided, but alas, the release of a new benchmark does make its predecessor's retirement imminent. Good riddance I say, I'm sick of optimizing System.arraycopy! Stay tuned for benchmark results.

Wednesday Jun 15, 2005

Java Performance Tip for Solaris x86 32-bit on AMD Opteron

When running 32-bit Sun J2SE 5.0_04 or earlier on Solaris x86 / Opteron systems, be sure to use -XX:LargePageSizeInBytes=2m. This flag should improve performance of Java server applications by 2-5%. Of course your mileage will vary. It may help client applications as well.

The largest page size that AMD Opteron supports is 2 MB. These versions of the JVM assume a 4 MB large page size, as it is on Intel Xeon processors. J2SE 5.0_05 and later will fix this, as do the latest Java SE 6 (Mustang) binaries on java.net: http://www.java.net/download/jdk6/binaries/

Here's a sample command line for Java server applications on Solaris 10 x86 32-bit with 5.0_04 or earlier, on a server-class machine with J2SE 5.0 (2 or more cores, 2 or more GB of RAM):

java -Xms3g -Xmx3g -XX:LargePageSizeInBytes=2m _JavaApp_

If you're running J2SE 1.4.2:

java -server -Xms3g -Xmx3g -XX:+UseParallelGC -XX:LargePageSizeInBytes=2m _JavaApp_

(You should download and test J2SE 5.0; you'll probably be pleasantly surprised.)

Tuesday Jun 14, 2005

Java Synchronization Optimizations

Tim Bray's recent blog, http://www.tbray.org/ongoing/When/200x/2005/06/12/Threads, has given me the kick in the ribs I needed to share some of what we've been working on in JVM performance at Sun. So Paul Hohensee and I came up with the following sneak peek.

Tim raises a good point about the use of Java(TM) synchronization in J2SE library code; synchronization is heavily used in the core collection classes to ensure threads don't step on shared data. Sure, there are alternative APIs that a developer can use when they know their application design ensures that won't happen, most notably StringBuilder instead of StringBuffer and ArrayList instead of Vector. But in many cases code changes are not an option, so we've been working on ways for the Hotspot(TM) JVM (Java Virtual Machine) to optimize synchronization.

First, a little background on object locking. When a thread attempts to lock a Java object, it will find that either no thread owns the lock or that another thread already does. In the latter case, we say that the lock is "contended". On the other hand, if every thread that locks an object finds it unowned, we say that the lock is "uncontended". As long as a lock is uncontended, it stays in what we call a "thin" state, in which the lock state is recorded only in the object header and thread-local storage (in the Hotspot JVM, the thread stack). The JVM executes fast path code to acquire and release thin lock ownership. The fast path typically involves one or two compare-exchange instructions (lock;cmpxchg on x86 and cas on SPARC); the Hotspot JVM uses two, one each for acquire and release. When a lock becomes contended, it's "inflated" into the "fat" state, where it's represented by a shared data structure built around an OS mutex. Threads add themselves to a waiters list and park on the mutex, to be woken up by the runtime system at some future time. We call this process the slow path because it's really, really expensive. It's possible for a contended lock to eventually become uncontended; when that happens, it may be "deflated" back into the "thin" state.

So much for the background. It turns out that most locks are uncontended in Java programs, so fast path performance is relatively more important than slow path. But the fast path can be slow, because a compare-exchange instruction is atomic and must wait for all incomplete instructions and memory operations to finish. On modern multi-issue and out-of-order processors with NUMA memory systems, that can take hundreds of cycles. This is really obvious in benchmarks like SPECjbb2000 (http://www.spec.org), where uncontended locking takes up to 20% of the total time on x86 boxes, and on the SciMark2 Monte Carlo sub-test (http://math.nist.gov/scimark2), where it can take more than half the total time.

We're working on various ways to reduce or eliminate uncontended locking overhead in the Hotspot JVM. There are basically three approaches. First, we can get rid of the lock entirely. If the JVM can determine that the object being locked will never be accessed by any thread but the one that allocates it, then its lock can never be contended, so the JVM never has to lock it. The analysis required to figure this out is called "escape" analysis, because it determines whether or not an object can "escape" from the prison of its allocating thread. I've pointed to some references on how this is done at the end of this note. Second, we can notice that an unlock of an object is immediately followed by a lock of the same object and eliminate both the unlock and the lock. This process is called "lock coarsening". It does enlarge the size of the locked region, which might become a problem if the lock ever becomes contended: a long locked region means threads might have to wait a while to acquire the lock. This is usually a non-issue though, because most locks are uncontended. Plus, locked regions are usually quite small to begin with, so making them a bit bigger is no big deal. Simple lock coarsening has been implemented in the latest Mustang builds, starting at build 38. See: http://www.java.net/download/jdk6/binaries/ Finally, we can make the fast path faster by eliminating the compare-exchange instruction(s) and replacing them with a compare-and-branch sequence. This is pretty tricky stuff: see the references for various ways of doing this safely. Combining these three approaches can basically eliminate most of the synchronization overhead in a typical Java application. Coming soon to a JVM near you.

References:

Escape analysis:
http://www.research.ibm.com/people/g/gupta/toplas03.pdf
http://www.research.ibm.com/people/j/jdchoi/escape-pointsto.html

Thin locks:
http://www.research.ibm.com/people/d/dfb/papers/Bacon98Thin.pdf

Faster fast path:
http://www.usenix.org/publications/library/proceedings/jvm01/full_papers/dice/dice.pdf
http://www.research.ibm.com/trl/projects/jit/paper/p020-kawachiya.ps
About

dagastine
