Friday Jan 22, 2010

Comparative data of ORACLE 10g on SPARC & SOLARIS 10

Oracle 10g OLTP performance on SPARC chips








A boring ratio

Customers would love to have their performance levels linked to their hardware. But more often than you would think, they migrate from System X (designed 10 years ago) to System Y (fresh from the oven) and are surprised by the performance improvements. In the past two years, we have completed many successful migrations from F15k/E25k servers to the new Enterprise Server M9000. Customers have reported great improvements in throughput and response time. But what can you really expect, and what percentage of the improvement is actually due to operating system enhancements ? Can the recent small frequency increase on our SPARC64 VII chipset be at all interesting ? The new SPARC64 VII 2.88Ghz available on our M8000 and M9000 flagships brings no architectural change, no additional features, and a modest frequency increase from 2.52 Ghz to 2.88 Ghz - a ratio of 1.14. We could stop our analysis there and label this change 'marginal' or 'not interesting'. But my initial testing showed a comparative OLTP peak throughput way higher than this frequency-based ratio.



What happened ?












A passion for Solaris

Most long-term Sun employees have a passion for Solaris. Solaris is the uncontested Unix leader and includes such a huge amount of features that, once you are a Solaris addict, it is difficult to fall in love with another operating system. And Oracle executives made no mistake : Sun has the best UNIX kernel & performance engineers in the world. Without them, Solaris would not scale today to a 512 hardware thread system (M9000-64).

But of course, Solaris is a moving target. Every release brings its truckload of features, bug fixes and other performance improvements. Here are the critical fixes made between Solaris 10 Update 4 and the brand new Solaris 10 Update 8 that influence Oracle performance on the M9000 :

  • In Solaris 10 Update 5 (05/08), we optimized interrupt management (cr=5017144) and math operations (cr=6491717). We also streamlined CPU yield (cr=6495392) and the cache hierarchy (cr=6495401).

  • In Solaris 10 Update 6 (10/08), we optimized libraries and implemented shared context for Jupiter (cr=6655597 & 6642758)

  • In Solaris 10 Update 7 (05/09), we enhanced MPXIO as well as the PCI framework (cr=6449810 and others) and improved thread scheduling (cr=6647538). We also enhanced Mutex operations (cr=6719447).

  • Finally, in Solaris 10 Update 8, after long customer escalations, we fixed the single-threaded nature of callout processing (cr=6565503-6311743). [This is critical for all calls made to nanosleep & usleep.] We also improved the throughput & latency of the very common e1000g driver (cr=6335837 + 5 more), optimized the mpt driver (cr=6784459), cleaned up interrupt management (cr=6799018), optimized bcopy and kcopy operations (cr=6292199) and improved some single-threaded operations (cr=6755069).

My initial SPARC64 VII iGenOLTP tests were done with Solaris 10 Update 4. But I could not test the new SPARC64 VII 2.88Ghz with this release because it was not supported ! Therefore, I compared the new chip to the SPARC64 VII 2.52Ghz running both S10U4 and S10U8. We will see below that most of the improvement comes not from the frequency increase but from Solaris itself.






Chips & Chassis

Please find below the key characteristics of the chips we have tested :



Chips          | UltraSPARC IV+ | SPARC64 VI  | SPARC64 VII | SPARC64 VII (+)
Manufacturing  | 90nm           | 90nm        | 65nm        | 65nm
Die size       | 356 sq mm      | 421 sq mm   | 421 sq mm   | 421 sq mm
Transistors    | 295 million    | 540 million | 600 million | 600 million
Cores          | 2              | 2           | 4           | 4
Threads/core   | 1              | 2           | 2           | 2
Total threads  | 2              | 4           | 8           | 8
Frequency      | 1.5 Ghz        | 2.28 Ghz    | 2.52 Ghz    | 2.88 Ghz
L1 I-cache     | 64 KB          | 128 KB/core | 512 KB      | 512 KB
L1 D-cache     | 64 KB          | 128 KB/core | 512 KB      | 512 KB
On-chip L2     | 2 MB           | 6 MB        | 6 MB        | 6 MB
Off-chip L3    | 32 MB          | None        | None        | None
Max Watts      | 56 W           | 120 W       | 135 W       | 140 W
Watts/thread   | 28 W           | 30 W        | 17 W        | 17 W



Note on (+): The new SPARC64 VII is not officially labeled with a plus sign in order to reflect the absence of new features.





Now, here is our hardware list. Note that, to avoid the need for a huge client system, we ran this iGenOLTP workload in a Console/Server mode : the Java processes sending SQL queries via JDBC run directly on the server under test. While this model was unusual ten years ago in the era of Client/Server, it is more and more commonly found in new customer deployments.



Servers          | E25k                | M9000-32            | M9000-32                | M9000-32
Chip             | UltraSPARC-IV+      | SPARC64 VI          | SPARC64 VII             | SPARC64 VII+
# chips          | 8                   | 8                   | 8                       | 8
Total cores      | 16                  | 16                  | 32                      | 32
Frequency        | 1.5 Ghz             | 2.28 Ghz            | 2.52 Ghz                | 2.88 Ghz
System Clock     | 150 Mhz             | 960 Mhz             | 960 Mhz                 | 960 Mhz (~)
RAM              | 64 GB               | 64 GB               | 64 GB (*)               | 64 GB (*)
Operating System | Solaris 10 Update 4 | Solaris 10 Update 4 | Solaris 10 Update 4 & 8 | Solaris 10 Update 8






Console system | X4240 : Opteron quad-core, 2x2.33Ghz
Storage        | SE9990V [shared] : 64 GB cache, 25 TB, 200 Hitachi HDD, 15k RPM, 8x2Gbit/s

Note on (~) : While the system clock has not changed, the new M9000 CMUs are equipped with an optimized Memory Access Controller labeled MAC+. The MAC+ chipset is critical for system reliability, in particular for the memory mirroring and memory patrolling features. We have not identified performance improvements linked to this new chipset.

Note on (*) : Those domains have 128GB of total memory. To compare apples to apples, 64GB of memory is allocated, populated and locked in place with my very own _shmalloc tool.






Chart

The iGenOLTP v4 workload is a Java-based lightweight OLTP database workload. Simulating a classic order entry system, it is run in stream mode (i.e. no wait time between transactions). For this particular exercise, we created a very large database of 8 terabytes total, stored on the SE9990V using Oracle ASM. We query 100 million customer identifiers on this very large database in order to create an I/O intensive (but not I/O bound) workload similar to the largest OLTP installations in the world (for example, the E25ks running the bulk of the load of Oracle internal applications). The exact throughput in transactions per second and the average response times are reported and coalesced for each scalability level. For this test, we used Solaris 10 Update 4 & 8, Java version 1.6 build 16, and Oracle Database Server 10.2.0.4.
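To make the stream-mode idea concrete, here is a minimal, hypothetical sketch of what one such client thread does : open a JDBC connection, fire transactions back to back with no think time, and record throughput and average response time. The table name, connection URL and credentials are placeholders (this is not the actual iGenOLTP code), and the Oracle JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Random;

// Hedged sketch of a single stream-mode OLTP client thread (not the actual iGenOLTP class).
public class OltpStreamSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials.
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "bench", "bench");
        con.setAutoCommit(false);
        // Hypothetical order-entry query ; the real workload mixes selects, inserts and deletes.
        PreparedStatement q = con.prepareStatement(
                "SELECT order_total FROM orders WHERE customer_id = ?");
        Random rnd = new Random();
        int transactions = 10000;
        long totalMs = 0;
        for (int tx = 0; tx < transactions; tx++) {       // stream mode : no wait time
            long t0 = System.currentTimeMillis();
            q.setInt(1, 1 + rnd.nextInt(100000000));      // 100 million customer identifiers
            ResultSet rs = q.executeQuery();
            while (rs.next()) { /* consume the rows */ }
            rs.close();
            con.commit();
            totalMs += System.currentTimeMillis() - t0;
        }
        System.out.printf("throughput = %.1f tps, average rt = %.1f ms%n",
                transactions * 1000.0 / totalMs, (double) totalMs / transactions);
        con.close();
    }
}
```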




Performance notes :

  • At peak, the new SPARC64 VII 2.88Ghz produces 1.10x the OLTP throughput of the 2.52Ghz chip on S10U8.

  • But compared to the 2.52Ghz chips on S10U4, the ratio is 1.54x, and compared to the SPARC64 VI it is 2.38x.

  • For a customer willing to upgrade an E25k equipped with 1.5Ghz chips, the throughput ratio is 4.125 ! It means that we can easily replace an 8-board E25k with a 2-board M8000 for better throughput and improved response times.

  • Average transaction response times at peak are 126 ms on the UltraSPARC IV+ domain, 87 ms on the SPARC64 VI, 82 ms on the SPARC64 VII 2.52Ghz (U4), 77 ms on the SPARC64 VII 2.52Ghz (U8) and 72 ms on the latest chip.





Conclusion

As expected, the Oracle OLTP improvements due to the new SPARC64 VII chip alone are modest when using the latest Solaris 10. However, all the customers already in production on previous releases of Solaris 10 will see throughput improvements of up to 1.54x. Most likely, this is enough to motivate a refresh of their systems. And all E25k customers now have a very interesting value proposition with our M8000 and M9000 chassis.


See you next time in the wonderful world of benchmarking....



Wednesday Dec 09, 2009

Inside the Sun Oracle Database machine : The F20 PCie cards

We are living through a drastic change, something close to a revolution, with the new Sun Oracle Database machine. Why ? With the critical use of enterprise flash memory, this architecture is no longer reserved for data warehouses but is also very well suited to Online Transaction Processing. We are preparing benchmark results on this platform and actively shipping systems to customers. In the meantime, in a suite of short entries, I will describe the key innovations at the heart of this environment.

Let's start with the Sun Flash Accelerator F20 PCIe card.


 Each Exadata cell (or Sun Oracle Exadata Storage Server) includes 384 GB of flash storage in total, producing up to 75,000 IOPS [8k block]. This capacity is obtained via four 96GB F20 PCIe cards, detailed below. A full rack configuration comes equipped with 5.3TB of flash storage and can produce an amazing 1 million IOPS [8k block]. This huge cache is not only used smartly and automatically by the Database machine, but can also be user-managed via the ALTER TABLE STORAGE command and the CELL_FLASH_CACHE argument.
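As an illustration of the user-managed side, here is a hedged sketch of pinning a table in the cell flash cache from a JDBC client. The HOT_ORDERS table name, connection string and credentials are invented for the example ; only the ALTER TABLE STORAGE / CELL_FLASH_CACHE clause itself is the mechanism mentioned above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hedged sketch : ask the storage cells to keep one table's blocks in the flash cache.
public class FlashCachePinSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string and credentials.
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "app", "app");
        Statement st = con.createStatement();
        // HOT_ORDERS is a hypothetical table ; KEEP requests flash caching,
        // while DEFAULT restores the automatic behavior.
        st.execute("ALTER TABLE hot_orders STORAGE (CELL_FLASH_CACHE KEEP)");
        st.close();
        con.close();
    }
}
```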

Here is the detailed architecture of the F20 PCIe card : 

[Diagram : F20 PCIe card architecture]

 As you can see, we obtain the total 96GB capacity via four Disk on Modules (DOMs), each contributing 24GB of addressable flash (4 x 24GB = 96GB). Each DOM carries four 4GB SLC NAND components on the front side and four on the back side, for a raw capacity of 32GB of which 24GB is addressable. To further accelerate flash performance, 64MB of DDR-400 DRAM per DOM provides a local buffer cache.

Finally, each DOM needs to manage all of its components, track faulty blocks, handle load balancing and communicate with the outside world using standard SATA protocols. This is achieved with a Marvell SATA2 flash memory controller.

Outside of the four DOMs, a supercapacitor module provides enough backup power to flush data from DRAM to the non-volatile flash devices, thereby maintaining data integrity during a power outage. With a 5-year lifespan in a well-cooled chassis, these modules are superior to classic batteries. Finally, an LSI eight-port SAS controller connects the four DOMs to a 12-port SAS expander for external connectivity.

 We measured in the Sun labs a 16.5W power consumption per F20 card. We were able to produce 1GB/s in sequential read (1MB I/O) and 100,110 IOPS (4k random read) for each card. In addition, one card can replace about two hundred latest-generation 15k RPM HDDs, at a power cost of roughly 0.165 milliwatts per 4K I/O per second (16.5 W divided by 100,110 IOPS) and with an estimated MTBF of 227 years. Amazing !

See you next time in the wonderful world of benchmarking. 


Monday Jul 06, 2009

What processor will fuel your first private Cloud : INTEL Nehalem or AMD Istanbul ?


Where IT is going ...
You may have observed the big trend of the moment : take your old slide decks, banners and marketing brochures and try to plug in the word cloud as many times as possible. A Google search for the words Cloud Computing today yields more than 31 million results ! Even if you search only on Cloud (getting 175 million+ results), the first entry in the list (discounting the sponsored results) is this one. Amazing fashion of the moment !

As we recently described in this white paper, there is not one cloud but many. I had recent conversations on this topic with customers in our Menlo Park Executive Briefing Center. While they all say that they will not be able to host their entire IT department in a Public Cloud, they are interested in the notion of combining a Public Cloud service with multiple Private Clouds - this is the notion of a Hybrid Cloud.








Private clouds
The Sun Solution Centers and Sun Professional Services are now starting to build the first private cloud architectures based on Sun Open Source products. The most common building block for those is the versatile Sun Blade 6000. Why ? Because of the capacity of this chassis to host many different types of CPUs (x86 & SPARC) and operating systems (Windows, Linux, OpenSolaris, Solaris or even VMware vSphere). At the same time, INTEL and AMD have released two exceptional chips : the INTEL XEON 5500 (code name Nehalem) and the six-core AMD Opteron (code name Istanbul). I had the opportunity to test these chips recently and will give you here a few data points.





Cloud benchmarks

We may not have any Cloud-related standard benchmarks today. However, if I look at the different software components of a private cloud, it seems that computing capabilities (in integer and floating point) and memory performance are the two key dimensions to explore. You may argue that your cloud needs a database component... but improved caching mechanisms (memcached for example) and the commoditization of Solid State Disks (see this market analysis and also here) are moving database performance profiles toward memory or CPU intensive workloads. Additionally, the exceptional power of 10-Gbit based hybrid storage appliances (like the Sun Storage 7410 Unified Storage System) makes us less concerned about I/O & network bound situations. It is good to know that these new storage appliances are a key element of our public cloud infrastructure.








Nehalem & Istanbul Executive summary

Both AMD & INTEL had customer investments in mind, as their new chips use the same sockets as before... so they can be used in previously released chassis. What you will typically have to do after upgrading to the new processors is download the latest platform BIOS. Another good idea is to check your OS level... the latest OS releases include upgraded libraries and drivers. Those are critical if performance is near the top of your shopping list. See here for example.

For other features, please refer to the key characteristics below :

Feature            | INTEL Xeon X5500 (Nehalem)          | AMD Opteron 2435 (Istanbul)
Release date       | March 29, 2009                      | June 1st, 2009
Manufacturing      | 45 nm                               | 45 nm
Frequency (tested) | 2.8Ghz                              | 2.6Ghz
Cores              | 4                                   | 6
Strands/core       | 2 [if NUMA on]                      | 1
Total #strands     | 8                                   | 6
L1 cache           | 256 KB [32KB I. + 32KB D. per core] | 768 KB [128 KB per core]
L2 cache           | 1 MB [256KB per core]               | 3 MB [512KB per core]
L3 cache           | 2 MB shared                         | 6 MB shared
Memory type        | DDR3 1333Mhz max. (*)               | DDR2 800 Mhz
Nom. Power         | 95 W                                | 75 W
Major innovations  | Second level branch predictor & TLB | Power savings and HW virtualization

Note (*) : For this test, we used DDR3 1066Mhz.

Now, here is our hardware list :

Role                   | Model  | Blade | Sockets@freq | RAM
AMD Opteron 'Istanbul' | SB6000 | X6260 | 2@2.6Ghz     | 24 GB
INTEL XEON 'Nehalem'   | SB6000 | X6270 | 2@2.8Ghz     | 24 GB
Console                | X4150  | N/A   | 2@2.8Ghz     | 16 GB




Calculation performance : iGenCPU

iGenCPU is a calculation benchmark written in Java. It computes Benoit Mandelbrot's fractals using a custom imaginary-number library. The main benefit of this workload is that it naturally creates a 50% floating point / 50% integer calculation mix. As the number of floating point operations produced by commercial software increases every year, this type of performance profile is getting closer and closer to what modern web servers (like Apache) and application servers (like Glassfish) produce.
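For readers who want a feel for the kind of arithmetic involved, here is a minimal sketch of a Mandelbrot-style kernel mixing floating-point iteration with integer bookkeeping. It is not the iGenCPU source (which uses a custom imaginary-number library), just an illustration of the computation pattern ; the image size and iteration count are arbitrary.

```java
// Minimal Mandelbrot-style kernel : floating-point iteration plus integer bookkeeping.
public class FractalSketch {

    // Iteration count for one point (cr, ci) of the complex plane.
    static int iterate(double cr, double ci, int maxIter) {
        double zr = 0.0, zi = 0.0;
        int n = 0;
        while (n < maxIter && zr * zr + zi * zi <= 4.0) {   // escape test
            double t = zr * zr - zi * zi + cr;
            zi = 2.0 * zr * zi + ci;
            zr = t;
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        int width = 256, height = 256, maxIter = 1000;
        long checksum = 0;
        long t0 = System.currentTimeMillis();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                double cr = -2.0 + 3.0 * x / width;          // map pixel to the plane
                double ci = -1.5 + 3.0 * y / height;
                checksum += iterate(cr, ci, maxIter);
            }
        }
        long ms = System.currentTimeMillis() - t0;
        System.out.println("one fractal in " + ms + " ms (checksum " + checksum + ")");
    }
}
```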


Here are the results (AMD Istanbul in Blue, INTEL Nehalem in Red) :




Observations :

  1. Very similar peak throughput (984 fractals/s on INTEL, 1008 fractals/s on AMD).

  2. The AMD chip produces superior throughput at any level of concurrency. At 8 threads, which is a very common scalability limit for commercial virtualization products, it produces 28% more throughput than Nehalem.

  3. This shows the superiority of the Opteron calculation co-processors, as we had already observed on the previous quad-core generation.

  4. For calculation, it is more important to have a larger L1/L2 cache than a faster L1/L2 cache. The Opteron micro-architecture is naturally a better fit for this workload.




Memory performance : iGenRAM

It is a classic brain exercise when you cannot sleep : imagine what you would do with $94 million in your bank account. The iGenRAM benchmark was initially developed in C to produce an accurate simulation of the California Lotto winner determination. It is highly memory intensive, using 1 gigabyte of memory per thread. Memory allocation time as well as memory search performance produce a combined throughput number, plotted below :
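As a rough illustration of the allocate-then-search pattern (not the actual iGenRAM code, and sized well below the 1 gigabyte per thread used in the benchmark), a single-threaded sketch could look like this :

```java
import java.util.Random;

// Sketch of the lotto pattern : allocate a block of tickets, then scan them for winners.
// Deliberately small ; raise -Xmx before scaling the ticket count up.
public class LottoSketch {
    public static void main(String[] args) {
        final int tickets = 1000000;
        final int numbersPerTicket = 6;
        Random rnd = new Random(42);

        long t0 = System.currentTimeMillis();
        int[][] pool = new int[tickets][numbersPerTicket];   // allocation phase
        for (int t = 0; t < tickets; t++)
            for (int n = 0; n < numbersPerTicket; n++)
                pool[t][n] = 1 + rnd.nextInt(51);
        long allocMs = System.currentTimeMillis() - t0;

        int[] winning = {3, 11, 19, 27, 35, 44};
        t0 = System.currentTimeMillis();
        int winners = 0;
        for (int t = 0; t < tickets; t++) {                   // search phase
            int matches = 0;
            for (int n = 0; n < numbersPerTicket; n++)
                for (int w = 0; w < winning.length; w++)
                    if (pool[t][n] == winning[w]) matches++;
            if (matches == numbersPerTicket) winners++;
        }
        long searchMs = System.currentTimeMillis() - t0;
        System.out.println("alloc " + allocMs + " ms, search " + searchMs
                + " ms, winners " + winners);
    }
}
```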



Observations :

  1. The faster DDR3 memory and higher frequency of the INTEL chip make it a better fit for memory intensive workloads. At peak, the Nehalem-based system produces 23% more throughput than its competitor.

  2. For a small number of threads (1 to 4), both systems produce very similar numbers.

  3. The second-level branch predictor most likely helps the Nehalem-based system improve its scalability curve past four threads on this repetitive workload.

  4. As noted, we used DDR3 1066Mhz for this Nehalem test. DDR3 1333Mhz is also available and would increase the INTEL chip's advantage on this workload.








Conclusion

Complex question, complex answer... As you have noted, these benchmarks show the AMD Istanbul to be better suited for calculation intensive workloads, but also show better memory performance for the INTEL Nehalem. Therefore, the different layers within your private cloud will need to be profiled if you want to determine your best choice. And guess which operating system comes equipped with the right set of tools (i.e. dynamic tracing) to make that determination : Solaris or OpenSolaris.

[Last minute note: I also performed Oracle 10g database benchmarks on these blades. Maybe for another article..]





See you next time in the wonderful world of benchmarking....



Monday May 11, 2009

Running your Oracle database on internal Solid State Disks : a good idea ?





Solid State Disks : a 2009 fashion

This technology is not new : it originates in 1874 when a German physicist named Karl Braun (pictured above) discovered that he could rectify alternating current with a point-contact semiconductor. Three years later, he had built the first CRT oscilloscope and four years later, he had built the first prototype of a Cat's whisker diode, later optimized by G. Marconi and G. Pickard. In 1909, K. Braun shared the Nobel Prize for physics with G. Marconi.

The Cat's whisker diodes are considered the first solid state devices. But it was only in the 1970s that solid state storage appeared in high-end mainframes produced by Amdahl and Cray Research. However, its high cost of fabrication limited its industrialization. Several companies later attempted to introduce the technology to the mass market, including StorageTek, Sharp and M-Systems. But the market was not ready.

Nowadays, SSDs are built on one of two technologies : DRAM volatile memory or NAND-flash non-volatile memory. Key recent announcements from Sun (Amber Road and ZFS), HP (IO Accelerator) and Texas Memory Systems (RamSan-620), as well as lower fabrication costs and larger capacities, are making the NAND-based technology a must-try for every company this year.

This article looks at the Oracle database performance of our new 32GB SSDs, OEM'd from Intel. These new devices have improved their I/O capacity and MTBF with an architecture featuring 10 parallel NAND flash channels. See this announcement for more.

If you dig a little bit into the question, you will find this whitepaper. However, the 35% performance boost that they measured seems insufficient to justify trashing HDDs for SSDs. In addition, as they compare different numbers of HDDs and SSDs, it is very hard to determine the impact of a one-to-one replacement. Let's make our own observations.



Here is a picture of the SSD tested – thanks to Emie for the shot !





Goals

As any DBA knows, it is very difficult to characterize a database workload in general. We are all very familiar with the famous “Your mileage may vary” or “All customer database workloads are different”. And we cannot trust marketing departments on SSD performance claims, because nobody runs a synthetic I/O generator for a living. What we need to determine is the impact for end users (response time, anyone ?) and how capacity planners can benefit from the technology (how about peak throughput ?).

My plan is to perform two tests on a Sun Blade X6270 (Nehalem-based) equipped with two Xeon chips and 32GB of RAM, on one SSD and one HDD, with different expectations :

  1. Create a 16 Gigabyte database that will be entirely cached in the Oracle SGA. Will we observe any difference ?

  2. Create a 50 Gigabyte database of which only about half can be cached. We expect a significant performance impact. But how much ?


SLAMD and iGenOLTP
The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.

iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read/10% write workload simulating a global order system.



Software and Hardware summary

This study uses Solaris 10 Update 6 (released October 31st, 2008), Java 1.7 build 38 (released October 23rd, 2008), SLAMD 1.8.2, iGenOLTP v4 for Oracle and Oracle 10.2.0.2. The hardware tested is a Sun Blade X6270 with 2x INTEL XEON X5560 2.8Ghz and 32 GB of DDR3 RAM. This blade has four standard 2.5 inch disk slots, in which we installed one 32GB Sun/Intel SSD and one 146GB 10k RPM SEAGATE ST914602SS drive with read cache and write cache enabled.

Test 1 – Database mostly in memory

We are creating a 16 Gigabyte database (4k block size) on the Solid State Disk and on the Seagate HDD, each drive configured in its own ZFS pool with the default block size. We are limiting the ZFS buffer cache to 1 Gigabyte and allowing an Oracle SGA of 24 Gigabytes. The entire database will be cached, so we will feel the SSD impact only on random writes (about 10% of the I/O operations) and sequential writes (Oracle redo log). The test will become CPU bound as we increase concurrency. We are testing from 1 to 20 client threads (i.e. database connections) in streams.


In this case, for throughput [in transactions per second], the difference between HDD and SSD evolves from significant to modest as concurrency increases. Interestingly, it is in the midrange of the scalability curve that we observe the largest gain : 71% more throughput on the SSD (at 4 threads). At 20 threads, we are mostly CPU bound, so the impact of the storage type is minimal and the SSD advantage in throughput is only 9%.






Response times [in milliseconds] follow a similar pattern : 42% better on the SSD at 4 threads and 8% better at 20 threads.






Test 2 – Database mostly on disk

This time, we are creating a 50 Gigabyte database on one SSD and on one HDD, each configured in its own dedicated ZFS pool. Memory usage will be sliced the same way as in test 1, but we will not be able to cache more than 50% of the entire database. As a result, we will become I/O bound before we become CPU bound. Please remember that the X6270 is equipped with two eight-thread X5560s - a very decent 16-way database server !

Here are the results :



The largest difference is observed at 12 threads, with more than twice the transactional throughput on the SSD. In response times (below), we observe the SSD to be 57% faster at peak and 59% faster at 8 threads.




In a nutshell

My intent with this test was to show you, for a classic lightweight Oracle OLTP workload,

the good news :

When I/O bound, we can replace two Seagate 10k RPM HDDs with one INTEL/SUN SSD for similar throughput and response times twice as fast

On a one-for-one basis, the response time difference by itself (up to 67%) will make your end users love you instantly !

Peak throughput with the database in memory is very close to peak throughput on the SSD : at peak, we observed 821 TPS (24ms RT) in memory and 685 TPS (30ms RT) on the SSD. Very nice !


and the bad news :

When the workload is CPU bound, the impact of replacing your HDD with an SSD is moderate, while you lose a lot of capacity.

The cost per gigabyte needs to be carefully calculated to justify the investment. Ask your Sales rep for more...


See you next time in the wonderful world of benchmarking....

Friday Apr 17, 2009

Sun Blade X6270 & INTEL XEON X5560 on OpenSolaris create the ultimate Directory Server

Sun Microsystems Directory Server Enterprise Edition 6.3 performance on X6270 (Nehalem)

Tuesday Mar 24, 2009

Creating a new blog : MrCloud

As a new medium for publishing about Sun and my personal activities around the Sun Cloud initiative, I have today created a new blog : MrCloud.

See this entry...

I saw two clouds at morning

Tinged by the rising sun,

And in the dawn they floated on

And mingled into one.

-John Brainard

Tuesday Mar 17, 2009

Improving MySQL scalability blueprint

My previous blog entry on MySQL scalability on the T5440 is now complemented by a Sun BluePrint that you can find here.



See you next time in the wonderful world of benchmarking....
<script SRC="http://www.google-analytics.com/urchin.js"></script><script> _uacct="UA-917120-1"; urchinTracker(); </script>

Monday Nov 10, 2008

Scaling MySQL on a 256-way T5440 server using Solaris ZFS and Java 1.7


A new era

In the past few years, I published many articles using Oracle as a database server. As a former Sybase system administrator and former Informix employee, it was obviously not a matter of personal choice. It was just because the large majority of Sun's customers running databases were also Oracle customers.

This summer, in our 26 Sun Solution Centers worldwide, I observed a shift. Yes, we were still seeing older solutions based on DB2, Oracle, Sybase or Informix being evaluated on new Sun hardware. But every customer project manager, every partner, every software engineer working on a new information system design asked us : Can we architect this solution with MySQL ?

In many cases, if you dared to reply YES to this question, the next interrogation would be about the scalability of the MySQL engine.

This is why I decided to write this article.



Goals

Please find below my initial goals :

  1. Reach a high throughput of SQL queries on a 256-way Sun SPARC Enterprise T5440

  2. Do it 21st century style, i.e. with MySQL and ZFS, not 20th century style, i.e. with OraSybInf... and VxFS

  3. Do it with minimal tuning, i.e. as close as possible to out-of-the-box



This article describes how I achieved these goals. It has two main parts : a short description of the technologies used, followed by the results obtained.





Sun SPARC Enterprise T5440 server
The T5440 server is the first quad-socket server offering 256 hardware threads in just four rack units. Each socket hosts an UltraSPARC T2 Plus processor, which provides eight cores and 64 simultaneous threads in a single piece of silicon. While a lot of customers are interested in the capacity of this system to be divided into 128 two-way domains, this article explores the database capacity of a single 256-way Solaris 10 domain.




The Zettabyte file system
Announced in 2004 and introduced as part of OpenSolaris build 27 in November 2005, ZFS is the one and only 128-bit file system. It includes many innovative features such as a copy-on-write transactional model, snapshots and clones, dynamic striping and variable block sizes. Since July 2006, ZFS has also been a key part of the Solaris operating system. A key difference between UFS and ZFS is the use of the ARC [Adaptive Replacement Cache] instead of the traditional virtual memory page cache. To obtain the performance level shown in this article, we only had to tune the size of the ARC and turn off atime management on the file systems to optimize ZIL I/O latency. The default ZFS recordsize is commonly changed for database workloads ; for this study, we kept the default value of 128k.






MySQL 5.1
The MySQL database server is the leading Open Source database for Web 2.0 environments. MySQL was introduced in May 1995 and has never stopped being enriched with features. The 5.1 release is an important milestone, as it introduces support for partitioning, event scheduling, XML functions and row-based replication. While Sun is actively working on a highly scalable single-instance storage engine, this article shows how one can reach a very high level of SQL query throughput using MySQL 5.1.29 64-bit on a 256-way server.







SLAMD and iGenOLTP
The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.

iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read/10% write workload simulating a global order system. For this exercise, we are using a maximum of 24 million customers and 240 million orders in the databases. The database is divided (“sharded”) into as many pieces as there are MySQL instances on the system [see this great article on database sharding]. For example, with 24 database instances, database 1 stores customers 1 to 1 million, database 2 stores customers 1 million to 2 million, and so on. The Java threads simulating the workload are aware of the database partitioning scheme and direct their traffic accordingly.

This approach can be called “application partitioning”, as opposed to “database partitioning”. Because it is based on a shared-nothing architecture, it is natively more scalable than a shared-everything approach (as in Oracle RAC). A sketch of this client-side routing follows below.
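Here is a hedged sketch of that routing idea : map a customer id to the MySQL instance that owns its range and connect to it. Host name, ports, schema names and credentials are invented for the example, and MySQL Connector/J is assumed to be on the classpath ; the real iGenOLTP class does the equivalent inside its SLAMD job.

```java
import java.sql.Connection;
import java.sql.DriverManager;

// Sketch of application partitioning : the client picks the MySQL instance from the customer id.
// Host, ports, schema names and credentials are illustrative only.
public class ShardRouterSketch {
    static final long CUSTOMERS_PER_SHARD = 1000000L;   // 1 million customers per instance
    static final int  BASE_PORT = 3306;                 // assume instance i listens on 3306 + i

    static Connection connectionFor(long customerId) throws Exception {
        int shard = (int) ((customerId - 1) / CUSTOMERS_PER_SHARD);
        String url = "jdbc:mysql://dbhost:" + (BASE_PORT + shard) + "/orders_" + shard;
        return DriverManager.getConnection(url, "bench", "bench");
    }

    public static void main(String[] args) throws Exception {
        // Customer 3,500,001 falls in shard 3 (customers 3,000,001 to 4,000,000).
        Connection con = connectionFor(3500001L);
        System.out.println("routed to " + con.getMetaData().getURL());
        con.close();
    }
}
```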




Java Platform Standard Edition 7

Initially released in 1995, the Java programming language started a revolution in computer languages with the concept of the Java Virtual Machine, bringing instant portability across computer architectures. While the 1.7 JVM is still a beta release, it is the base of my iGenOltpMysql Java class performing the workload shown in this article. The key enhancement of the 1.6 JVM was the introduction of native DTrace probes. The 1.7 JDK is an update packed with performance-related enhancements, including an improved adaptive compiler, optimized rapid memory allocation, finely tuned garbage collector algorithms and, finally, lighter thread synchronization for better scalability. For this article we used JDK 7 build 38.


Software and Hardware summary

This study uses Solaris 10 Update 6 (released October 31st, 2008), Java 1.7 build 38 (released October 23rd, 2008), SLAMD 1.8.2, iGenOLTP v4.2 for MySQL and MySQL 5.1.29. The hardware tested is a T5440 with 4x UltraSPARC T2 Plus 1.2Ghz and 64 GB of RAM. A Sun Blade 8000 with 10 blades, each with 2x AMD Opteron 8220 2.8Ghz and 8GB of RAM, is used as the client system. Finally, a Sun ST6140 storage array [with 10x 146GB 15k RPM drives] is configured in RAID-1 [2 hot spares], with two physical volumes, and connected to the T5440 with two 4Gb/s controllers.





Scaling vertically first

This is a matter of methodology. The first step is to determine the peak throughput of a single MySQL instance running iGenOLTP with InnoDB, then use approximately 75% of this throughput as the basis for the horizontal scalability test. Current ZFS and MySQL best practices guided the choice of all the tunables used [available upon request]. The test is done under stabilized load, with each simulation thread executing 10 transactions per second. Please find below the throughput and response time scalability curves :






Note that the peak throughput is 725 transactions per second, which corresponds to 4,350 SQL statements per second (each iGenOLTP transaction issues six statements : four selects, one insert and one delete). We are caching the entire 1 GByte database. The only I/Os happening are due to the delete/insert statements, the MySQL log and the ZFS Intent Log. We will be using 75% of the peak workload simulation as the base workload per instance for the horizontal scalability exercise. Why 75% ? Our preliminary tests showed that it was the best compromise to reach maximum multi-instance throughput.



Scaling horizontally

The next step was to increase the number of instances while proportionally increasing the database size (the number of customer ids). We request the same 600 TPS workload on each instance, but querying a different range within the global data set. The beauty of the setup is that we do not have to install the MySQL binaries multiple times : we can just use soft links. The main thing to do was to configure 32 ZFS file systems on our ZFS pool and then create & load the databases. This was easily automated with ksh scripts. Finally, we had to customize the Java workload to query all the database instances accurately...

Here are the results :






As you can see, we were able to reach a peak of more than 79,000 SQL queries per second on a single 4 RU server. The transaction throughput is still increasing past 28 instances, but 28 is the sweet spot for this benchmark on the T5440, as indicated by the average transaction response time. At 28 instances, we observed less than 30ms average response time ; at 32 instances, response times jumped to an average of 95ms.



The main trick to achieve horizontal scalability: Optimize thread scheduling

Solaris uses the timeshare (TS) class as the default scheduling class. The scheduler constantly needs to make sure that thread priorities are adequately balanced. For this test, we are running thousands of threads, and we can reclaim critical CPU user time by avoiding unnecessary work in the scheduler. To achieve this, we run the MySQL engines and the Java processes in the Fixed Priority (FX) class. This is easily done with the Solaris priocntl command.


Conclusion
As I mentioned in the introduction, an architecture shift is happening. Database sharding and application partitioning are the foundation of future information systems, as pioneered by companies like Facebook [see this interesting blog entry]. This article proves that Sun Microsystems servers with CoolThreads technology are an exceptional foundation for this change. And they will also considerably lower your Total Cost of Ownership, as illustrated in this customer success story.



A very special thank you to the following experts who helped in the process or reviewed this article : Huon Sok, Allan Packer, Phil Morris, Mark Mulligan, Linda Kateley, Kevin Figiel and Patrick Cyril.

See you next time in the wonderful world of benchmarking....




Tuesday May 20, 2008

The Hare and the Tortoise [X6250 vs T6320] or [INTEL XEON E5410 vs SUN UltraSPARC-T2 ]

The Hare and The Tortoise


"To win a race the swiftness of a dart ... Availeth not without a timely start"


The tree on yonder hill we spy [Sun Blade 6000 Modular Systems]
The Sun Blade 6000 chassis supports up to ten blades in a ten rack-unit chassis and is extremely popular due to its versatility. In fact, you can test your application today on four different chips within the same chassis (UltraSPARC-T1 [T6300], UltraSPARC-T2 [T6320], dual-core AMD Opteron [X6220], and dual-core or quad-core INTEL Xeon [X6250]). While the Opteron and T1 blades have performance characteristics that are well defined by now, I was really curious to see how the new T2 blade would perform compared to the quad-core Xeon.

A grain or two of hellebore [Chips & Systems]
In terms of chip details, the T2 and the Xeon diverge. The three key differences are the total number of strands [16 times more for the T2], the CPU frequency [1.66 times higher for the Xeon] and the L2 cache size [3 times larger for the Xeon].

This simple table illustrates their key characteristics :

Feature        | INTEL Xeon E5410  | SUN UltraSPARC-T2
Process        | 45 nm             | 65 nm
Transistors    | 820 million       | 500 million
Cores          | 4                 | 8
Strands/core   | 1                 | 8
Total #strands | 4                 | 64
Frequency      | 2.33Ghz           | 1.4Ghz
L1 cache       | 16KB I. + 16KB D. | 16KB I. + 8KB D.
L2 cache       | 12 MB             | 4 MB
Nominal Power  | 80 W              | 95 W

This table makes it clear that predicting the response time or throughput delta between these two chips is a risky endeavor !

[Photos : X6250 and T6320 blades]


Following these two pictures [X6250 and T6320], here is our hardware list :

Role       | Model | System clock | Sockets@freq | RAM
T2 blade   | T6320 | N/A          | 1@1.4Ghz     | 32 GB
Xeon blade | X6250 | 1333 Mhz     | 2@2.33Ghz    | 32 GB
Console    | X4200 | 1000 Mhz     | 2@2.4Ghz     | 8 GB


I dare you to the wager still [Benchmarks]
I ran several benchmarks (including Oracle workloads) on all types of blades, but for the purpose of this article I will present only the two simple micro-benchmarks iGenCPU and iGenRAM.

The iGenCPU benchmark is a JavaTM-based CPU micro-benchmark used to compare the CPU performance of different systems. Based on a customized Java complex number library, the code is computing Benoit Mandelbrot's highly dense fractal structure using integer and floating-point calculations. (50%/50%) The simplicity of the code as well as its non-recursivity allow a very scalable behavior using less than 128 Kb of memory per thread. The exact throughput in number of fractals per second and average response times are reported and coalesced for each scalability level.

The iGenRAM benchmark is based on the California lotto requirements. The main purpose of this workload is to measure multi-threaded memory allocation and multi-threaded memory searches in Java. The first step of the benchmark is for each thread to allocate 512 Megabytes of memory in 3-dimensional integer arrays. The second step is to search through this memory to determine the winning tickets. The exact throughput in lotto tickets per millisecond as well as the average allocation and search times are reported and coalesced for each scalability level.

 For this test, we used Solaris 10 Update 4 and Java version 1.6.1.

And list wich way the zephyr blows [Results]

Here are the iGenCPU throughput & response time :

[Chart : iGenCPU throughput & response time, X6250 vs T6320]

Notes :

1- The Hare [X6250] starts very fast but gets tired at 8 threads and really slows down at 12 threads.
2- The Tortoise [T6320] reaches more than twice the throughput of the Hare at 60 threads.
3- The single-threaded average transaction response time is two times better on the Hare.

Now let's look at the iGenRAM results :

[Chart : iGenRAM throughput & response time, X6250 vs T6320]


Notes :

1- Phenomenal memory throughput of the Hare [X6250] at low thread counts. But at peak, the Tortoise [T6320] achieves 11% more throughput.
2- When the Hare is giving up (~7 threads), the Tortoise is just warming up, reaching its peak throughput at about 40 threads.
3- Single-threaded, it takes 9 ms to allocate 512 MB on the Hare and 33 ms to do the same thing on the Tortoise.
4- Single-threaded, it takes 5 ms to search through 512 MB on the Hare and 34 ms to do the same thing on the Tortoise.


Conclusion

The race is by the tortoise won.
Cries she, "My senses do I lack ?
What boots your boasted swiftness now ?
You're beat ! and yet you must allow,
I bore my house upon my back."

See you next time in the wonderful world of benchmarking....
Special thanks to Mr Jean De La Fontaine [1621-1695]



Wednesday Nov 14, 2007

OLTP performance of the Sun SPARC Enterprise M9000 on Solaris 10 08/07

I recently published a performance comparison of the Sun Fire E25k and the new Sun SPARC Enterprise M9000. In that article, a lot of my readers noticed the following note :
"Oracle OLTP is disappointing on the M9000 with an increase in response time at peak throughput. Upcoming release of Solaris and Oracle 10g should improve this result"

Critical bug fixes

 The reason why I wrote this is that I knew Sun engineering was working hard at fixing three key performance bugs specific to database performance on the M-series systems. Here is the list of these bugs, which were successfully fixed in Solaris 10 08/07 (Update 4) :

1. Bug 6451741
SPARC64 VI prefetch tuning needs to be completed
Impact : L2 cache efficiency is key to database memory performance. Corrected prefetch values improve memory read and write performance.

2. Bug 6486343
Mutex performance on large M-series systems needs improvement
Impact : The mutex retry and backoff algorithm needed to be retuned for M-series systems due to out-of-order execution and platform-specific branch prediction routines. It also improves lock concurrency on hot memory pages.

3. Bug 6487440
Memory copy operations need tuning on M-series systems
Impact : The least important fix, but important for Oracle stored procedures, triggers and constraints.

The big question was : how much of an improvement would this have on OLTP performance ?
Well, one thing is sure : your mileage may vary. But on my workload I measured a whopping 1.33x lower response times together with 1.38x faster throughput (compared to Solaris 10 Update 3). It is also interesting to notice that all the other workloads tested have not moved significantly, as they are not really sensitive to the issues tackled here.

Please find below the corrected comparative charts in throughput and response time after a reminder on the workloads :


Java workloads

So let's try to be a little bit more specific, using four different 100% Java (1.6) workloads :
  1. iGenCPU v3 - Fractal simulation, 50% integer / 50% floating point
  2. iGenRAM v3 - Lotto simulation (memory allocation and search)
  3. iGenBATCH v2 - Oracle 10g batch using partitioning, triggers, stored procedures and sequences
  4. iGenOLTP v4 - Heavyweight OLTP

Datapoints

The values shown here are the peak results obtained by building the complete scalability curve. The response times mentioned are averages at peak throughput, in milliseconds.



Workload     | E25k throughput     | E25k RT (ms) | M9000 throughput    | M9000 RT (ms)
iGenCPU v3   | 303 fractals/second | 105          | 728 fractals/second | 44
iGenRAM v3   | 2865 lottos/ms      | 55           | 4881 lottos/ms      | 17
iGenBATCH v2 | 35 TPS              | 907          | 50 TPS              | 626
iGenOLTP v4  | 3938 TPM            | 271          | 6194 TPM            | 264

As we are trying to compare against the 1.267 frequency factor, let's look at those results by giving a factor of 1 to the E25k.

First, here is throughput :

Throughput   | E25k | M9000
iGenCPU v3   | 1    | 2.403
iGenRAM v3   | 1    | 1.704
iGenBATCH v2 | 1    | 1.450
iGenOLTP v4  | 1    | 1.573
Frequency    | 1    | 1.267

Which would be this chart :


[Chart : throughput ratios, E25k vs M9000]


And here is the average response time at peak throughput (still using a base of 1 for the E25k) :

RT           | E25k | M9000
iGenCPU v3   | 1    | 0.419
iGenRAM v3   | 1    | 0.301
iGenBATCH v2 | 1    | 0.690
iGenOLTP v4  | 1    | 0.970


And the chart :

[Chart : response time ratios, E25k vs M9000]




These new numbers illustrate how well placed the M-series servers are to replace the current UltraSPARC-IV servers, from the smallest Sun Fire V490 to the largest Sun Fire E25k... as long as you use at least Solaris 10 08/07.

See you next time in the wonderful world of benchmarking...



Monday Sep 17, 2007

Solaris Vista dual-boot : No problem !

I am glad to report that I just successfully & flawlessly installed Solaris Nevada b72 and Vista Ultimate on a Ferrari 5000 laptop.

Summary of the operations :

1. The laptop had already Vista installed in C: (70G) with a D: partition (70G)
2. Using the Vista Disk Partitioner (default System tool in Vista Ultimate), I removed the D: partition
3. I downloaded Solaris Nevada build 72 and burned a DVD-R
4. I went in the Setup menu of the Ferrari 5000 and allowed boot only from the DVD
5. I booted Solaris b72 and chose the option (3) Terminal
6. I partitioned my disk to create a single Solaris partition with :
fdisk /dev/rdsk/c0d0p0
7. Rebooted and installed Solaris. The installation took about 50 minutes.
8. Booted again from the DVD and chose option (3).
9. Modified /a/boot/grub/menu.lst by adding :
title Windows Vista
rootnoverify (hd0,1)
chainloader +1
10. Went back into the boot menu (F2) and re-enabled disk booting.
11. Rebooted and verified that I could use both Solaris & Vista.
12. Booted Solaris, installed SLAMD and the iGen benchmark suite
13. Ran the iGenCPU benchmark to compare the system to others. Got 27 fractals/second at 4 threads. Nice for a laptop !

Additional note : Wireless configuration is now very easy, as the wificonfig tool is part of the Nevada distribution.
The only thing needed is update_drv -a -i '"pciex168,1c"' ath . No reboot is necessary.
Then you can do wificonfig -i ath0 plumb ; wificonfig -i ath0 scan

Final note : All the tricks that you can find in other blogs are now irrelevant, as the MBR Solaris bug was fixed in build 70.




Monday Aug 20, 2007

Sun SPARC Enterprise M9000 vs Sun Fire E25k - Datapoints

A performance comparison of two high-end UNIX servers using the iGen benchmark suite

Wednesday Nov 08, 2006

Unbreakable Oracle 10g Release 2 : What if you have ORA-600 kcratr1_lastbwr ?

This is an interesting story that happened yesterday at one of our customer sites. An engineer powered off the wrong rack of equipment, which contained a Sun Fire X4600 running Oracle 10g Release 2. Almost no transactions were being performed at the time, so when the system came back up the customer expected the database to be up and running very quickly.

In reality this is what happened :

Completed: ALTER DATABASE   MOUNT
Tue Nov  7 11:19:42 2006
ALTER DATABASE OPEN
Tue Nov  7 11:19:42 2006
Beginning crash recovery of 1 threads
 parallel recovery started with 16 processes
Tue Nov  7 11:19:44 2006
Started redo scan
Tue Nov  7 11:19:44 2006
Errors in file /xxx/oracle/oracle/product/10.2.0/db_1/admin/xxx/udump/xxx_ora_947.trc:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], []
Tue Nov  7 11:19:44 2006
Aborting crash recovery due to error 600
Tue Nov  7 11:19:44 2006
Errors in file /xxx/oracle/oracle/product/10.2.0/db_1/admin/xxxtest/udump/xxxtest_ora_947.trc:
ORA-00600: internal error code, arguments: [kcratr1_lastbwr], [], [], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN...

Not too pretty ! Checking the ASM configuration and the IO subsystem showed nothing wrong. So what to do if you do not have a backup handy ?

Well, here is the idea .... what would we do if we had a backup that was inconsistent ?
The recover database command will start an Oracle process which will roll forward all transactions stored in the restored archived logs necessary to make the database consistent again. The recovery process must run up to a point that corresponds with the time just before the error occurred after which the log sequence must be reset to prevent any further system changes from being applied to the database.

So we tried :

startup mount

ALTER DATABASE   MOUNT
Tue Nov  7 11:54:03 2006
Starting background process ASMB
ASMB started with pid=61, OS id=1070
Starting background process RBAL
RBAL started with pid=67, OS id=1074
Tue Nov  7 11:54:13 2006
SUCCESS: diskgroup xxxTESTDATA was mounted
Tue Nov  7 11:54:17 2006
Setting recovery target incarnation to 2
Tue Nov  7 11:54:17 2006
Successful mount of redo thread 1, with mount id 2364224219
Tue Nov  7 11:54:17 2006
Database mounted in Exclusive Mode
Completed: ALTER DATABASE   MOUNT
Tue Nov  7 11:54:32 2006

recover database


ALTER DATABASE RECOVER  database 
Tue Nov  7 11:54:32 2006
Media Recovery Start
 parallel recovery started with 16 processes
Tue Nov  7 11:54:33 2006
Recovery of Online Redo Log: Thread 1 Group 3 Seq 4 Reading mem 0
  Mem# 0 errs 0: +xxxTESTDATA/xxxtest/onlinelog/group_3.263.605819131
Tue Nov  7 11:59:25 2006
Media Recovery Complete (xxxtest)
Tue Nov  7 11:59:27 2006
Completed: ALTER DATABASE RECOVER  database 


alter database open

Tue Nov  7 12:03:01 2006
alter database open
Tue Nov  7 12:03:01 2006
Beginning crash recovery of 1 threads
 parallel recovery started with 16 processes
Tue Nov  7 12:03:01 2006
Started redo scan
Tue Nov  7 12:03:01 2006
Completed redo scan
 273 redo blocks read, 0 data blocks need recovery
Tue Nov  7 12:03:01 2006
Started redo application at
 Thread 1: logseq 4, block 12858574
Tue Nov  7 12:03:01 2006
Recovery of Online Redo Log: Thread 1 Group 3 Seq 4 Reading mem 0
  Mem# 0 errs 0: +xxxTESTDATA/xxxtest/onlinelog/group_3.263.605819131
Tue Nov  7 12:03:01 2006
Completed redo application
Tue Nov  7 12:03:01 2006
Completed crash recovery at
 Thread 1: logseq 4, block 12858847, scn 824040
 0 data blocks read, 0 data blocks written, 273 redo blocks read
Tue Nov  7 12:03:02 2006
Thread 1 advanced to log sequence 5
Thread 1 opened at log sequence 5
  Current log# 1 seq# 5 mem# 0: +xxxTESTDATA/xxxtest/onlinelog/group_1.261.605819081
Successful open of redo thread 1
Tue Nov  7 12:03:02 2006
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Tue Nov  7 12:03:02 2006
SMON: enabling cache recovery
Tue Nov  7 12:03:03 2006
Successfully onlined Undo Tablespace 1.
Tue Nov  7 12:03:03 2006
SMON: enabling tx recovery
Tue Nov  7 12:03:03 2006
Database Characterset is UTF8
replication_dependency_tracking turned off (no async multimaster replication found)
Starting background process QMNC
QMNC started with pid=56, OS id=1128
Tue Nov  7 12:03:05 2006
Completed: alter database open



And we are up and running ! The real thing that Oracle should work on is the quality and clarity of their error messages.
At this point, they are quite poor...

Unbreakable database, maybe. Automatic (and simple), not yet.

Thursday Oct 19, 2006

Do you need an OLTP benchmark for ANSI v2 databases ?

Very simple question for you today.

Do you need a new OLTP benchmark ?

A benchmark that could be lightweight as well as heavyweight ?
That could be IO intensive or CPU intensive ?
That could run on any database and any operating system ?
That you could run on your laptop but also scale up to a 144 cpus Sun Fire E25k+ ?
That could run standalone, in client/server or in a 3 tier model ?
That would produce instantly color charts and comprehensive PDF or HTML reports ?

Well, send me an email or a comment ( I am sure you are smart enough to find my email address somewhere)
if you want it. If my mailbox becomes full - I'll see what I can do....

MrBenchmark


Monday Jan 02, 2006

A second benchmark to compare V40z, V490 and T2000 : iGenRAM v1.2

After our performance investigation of integer & floating-point calculations with iGenCPU, let me present today a second microbenchmark called iGenRAM.

The iGenRAM v1.2 benchmark is a Java-based memory application useful for comparing the memory performance of different systems. Based on the functional requirements of the California Lotto, this application simulates a California lotto play consisting of :
  1. Players play Lotto tickets by choosing series of six numbers. Each thread simulates 6 million tickets played. To store the tickets, memory is allocated in Java in the form of multi-dimensional integer arrays.

  2. The system generates a list of six winning numbers.

  3. The system has to determine which ticket won and for what amount. For this, it needs to browse all tickets and compare to the winning combination.

  4. The total duration of these tasks produces the throughput in Lottos computed per second, or iGenRAM_Thp, and the iGenRAM_RT average response time.

The pressure on the system is mainly on the memory subsystem, using only a minimal amount of CPU power.

Systems with low memory latency and a scalable memory interconnect will succeed. We expect good things from the Sun Fire T2000.


Next up :
iGenRAM v1.2 results on V490, V40z and T2000


Friday Dec 16, 2005

What is SWaP and iGenCPU SWaP values

As you may have noticed, Sun created SWaP - the Space, Watts and Performance metric. The SWaP components are :
  • Performance: Using industry-standard benchmarks or your own benchmark !

  • Space: Measuring the height of the server in rack units (RUs).

  • Power: Determining the watts consumed by the system, using data from actual benchmark runs or vendor site planning guides

The SWaP metric is calculated this way :

SWaP = Performance / (Space x Power)

I recently provided iGenCPU v2.1 benchmark results for various platforms (see previous entries). I started with this benchmark because it is the absolute worst case for the T2000. Let's see how it translates into SWaP numbers :

[Table : SWaP values for the iGenCPU v2.1 results]

Or as expressed in a chart :

[Chart : SWaP values for the iGenCPU v2.1 results]

So what is the message here ?

If you are running floating point intensive applications (the immense majority of commercial applications are not), and you need a small form factor, the AMD equipped V20Z/V40z or better the Galaxy line ( Sun Fire™ X4100 and Sun Fire™ X4200 ) are the right answer.

I hope I convinced you here that the SWaP metric was not designed to make the UltraSPARC-T1 a winner every time  but to provide a critical metric for modern datacenters !

In 2006, we will look at a second microbenchmark called iGenRAM , and then explore LDAP performance (iGenLDAP), database performance (iGenOLTP) and even web performance (iGenWEB).

I am leaving for Tahoe, so see you on the Heavenly slopes or next year on this forum...







Thursday Dec 15, 2005

MrBenchmark iGen benchmarks : A clarification

MrBenchmark iGen benchmarks : A little clarification
Thank you for your comments on my previous post.
Walter's guidelines are good and applicable to all standard benchmarks. Please note that my benchmarks are not standard benchmarks : all results are provided for your information and without any type of performance guarantee....

If you are searching for standard benchmarks results , do not hesitate to go there  (BM seer blog)

Also, an apples-to-apples comparison is only a dream from my perspective. Why ? Because all of these systems (and processors) are different from the bottom up...

I see my comparison data as a way to provide an approximate idea of how systems rank against each other for a specific workload. No benchmark is universal... so you can never say System X is better/faster than System Y without being specific about the application tested.

Next up : SWaP values for iGenCPU v2.1

Wednesday Dec 14, 2005

iGenCPU results on UltraSPARC T1 (T2000) and UltraSPARC IV+ (V490+)

Hi all;

This is an update on my previous entry, adding the UltraSPARC T1 CoolThreads server T2000 and the V490+. The throughput increase between the UltraSPARC IV+ and the UltraSPARC IV is fairly proportional to the clock frequency. No surprise here. The simplicity of this microbenchmark does not allow it to take advantage of many of the UltraSPARC IV+ innovations.

Regarding the T2000, please note that I am not afraid to publish the results. This is not a marketing blog...Indeed, this benchmark is not recommended for the UltraSPARC T1 due to the fact that 25% of the instructions are floating point operations (see two previous blog entries).

It does not prevent us from collecting the data. Please remember that we want our customers to run the right platform for their workload. So, if your workload generates more than 2% floating-point operations, the UltraSPARC T1 is probably not what you should choose....

Note : The column threads is the number of threads used to observe the highest throughput with a response time (RT) less than 100 ms.


Processor           | Frequency | # CPU | # Cores | RAM  | OS        | Threads | Fractals/s | RT (ms)
Intel XEON          | 3 Ghz     | 2     | 4 (HT)  | 4 GB | S10 03/05 | 3       | 31.16      | 96.26
Sun UltraSPARC IIIi | 1.2Ghz    | 4     | 4       | 8 GB | S10 03/05 | 4       | 53.01      | 75.4
AMD OPTERON         | 2.4Ghz    | 4     | 4       | 8 GB | S10 HW2   | 5       | 90.18      | 55.4
Sun UltraSPARC IV   | 1.2Ghz    | 4     | 8       | 8 GB | S10 03/05 | 8       | 98.88      | 80.8
Sun UltraSPARC IV+  | 1.5Ghz    | 4     | 8       | 8 GB | S10 HW1   | 8       | 123.48     | 78.36
Sun UltraSPARC T1   | 1.2Ghz    | 1     | 8       | 8 GB | S10 HW2   | 9       | 18.62      | 93.08





[Chart : iGenCPU v2.1 results]






Let me know your thoughts....
Next, I will publish a description of the iGenRAM 1.6 and related benchmark results. We will see that the T1 is pretty good at this....

Monday Dec 12, 2005

iGenCPU 2.1 results - V40z, V65x, V490 and V440

Hi all;

As promised, here are our first iGenCPU v2.1 benchmark results. See the previous blog entry for the benchmark description. As reported by my pfp tool, this benchmark produces about 25% floating-point operations and 75% other instructions... and is therefore absolutely not recommended for the UltraSPARC T1-based T1000 and T2000...

This table shows the performance obtained on this benchmark for four popular Sun Microsystems servers : the V40z (single core), the V65x, the V440 and the V490, all using Solaris 10. Please note that these servers may be available today at higher frequencies.

Note : The column threads is the number of threads used to observe the highest throughput with a response time (RT) less than 100 ms.


Server | Processor           | Frequency | # CPU | # Cores | RAM  | OS        | Threads | Fractals/s | RT (ms)
V65x   | Intel XEON          | 3 Ghz     | 2     | 4 (HT)  | 4 GB | S10 03/05 | 3       | 31.16      | 96.26
V440   | Sun UltraSPARC IIIi | 1.2Ghz    | 4     | 4       | 8 GB | S10 03/05 | 4       | 53.01      | 75.4
V40z   | AMD OPTERON         | 2.4Ghz    | 4     | 4       | 8 GB | S10 HW2   | 5       | 90.18      | 55.4
V490   | Sun UltraSPARC IV   | 1.2Ghz    | 4     | 8       | 8 GB | S10 03/05 | 8       | 98.88      | 80.8






[Chart : iGenCPU v2.1 results]






Please use  the comments section for your observations, I am sure you will have plenty...

Next, I will publish iGenCPU 2.1 results for UltraSPARC T1 and UltraSPARC IV+ and provide my observations....

Friday Dec 09, 2005

MrBenchmark benchmarks : Opteron vs UltraSPARC IV vs UltraSPARC T1

Hi all;

Whiners came to me saying : "MrBenchmark : Enough theory, please give us some benchmark results..." And I said, fine... so here we are. I will publish some informal benchmark results in this forum. And yes, I will compare UltraSPARC IV, IV+, Opteron, UltraSPARC T1 and even Xeon !

Let me present the first microbenchmark of my series. It is called iGenCPU, it is written in 100% pure Java, and I am using Java 1.5.

The iGenCPU benchmark is a Java-based CPU micro-benchmark used to compare the CPU performance of different systems. Based on a customized Java complex number library, the code computes Benoit Mandelbrot's highly dense fractal structure using integer and floating-point calculations. The simplicity of the code as well as its non-recursivity allows very scalable behavior, using less than 64 MB of memory per thread.

iGenCPU reports multiple statistics. We are mostly interested in analyzing iGenCPU_Thp (how many fractals per second can we compute with this number of threads ?) and iGenCPU_RT (what is the average time needed to compute a complete fractal with this number of threads ?).

iGenCPU uses the system this way, as represented by my iTarget chart :

[Chart : iTarget profile for iGenCPU]


Next to come : our first iGenCPU benchmark result (table & diagram ): V40z (4xAMD Opteron @2.4Ghz  with 8GB RAM ) vs V490 (4xUltraSPARC IV 1.2Ghz 8GB RAM)


