Monday Jul 06, 2009

What processor will fuel your first private Cloud : INTEL Nehalem or AMD Istanbul ?

>What processor will fuel your first private Cloud : INTEL Nehalem or AMD Istanbul ?

Where IT is going ...
You may have observed the big trend of the moment : Take your old slide decks, banners and marketing brochures and try to plug in the word cloud as many times as possible. A current Google search of the words Cloud Computing yield today more than 31 million results ! Even if you search only on Cloud (getting 175 Million+ results), the first entry in the list (discounting the Sponsored results) is this one. Amazing fashion of the moment !

As we recently described in this white paper, there are not one but many clouds. I had recent conversations on this topic with customers in our Menlo Park Executive Briefing Center . While they all say that they will not be able to host their entire IT department in a Public Cloud. , they are interested in the notion of combining a Public cloud service with multiple Private Clouds - this is the notion of Hybrid Cloud.








Private clouds
The Sun Solution Centers and SUN Professional Services are starting now to build the first private clouds architectures based on Sun Open Source products. The most common building block for those is the versatile Sun Blade 6000. Why ? Because of the capacity of this chassis to host many different type of CPU's (x86 & SPARC) and operating systems (Windows, Linux, OpenSolaris, Solaris or even Vmware vSphere). At the same time, INTEL and AMD have released two exceptional chips : the INTEL XEON 5500 (code name Nehalem) and the six-core AMD Opteron (code name Istanbul). I had the opportunity to test these chips recently and will give you here a few data points.





Cloud benchmarks

We may not have today any Cloud related standard benchmarks. However, if I look at the different software components of a private cloud, it seems that Computing capabilities (in integer and floating point) and Memory Performance are the two key dimensions to explore. You may argue that your cloud need a database component ...but improved caching mechanism (memcached for example) and the commoditization of Solid State Disks (see this market analysis and also here) are moving database performance profiles toward memory or cpu intensive workloads. Additionally, the exceptional power of 10-Gbit based Hybrid storage appliances (like the Sun Storage 7410 Unified Storage System) makes us less concerned by I/O & network bound situations. It is good to know that this new storage appliances are a key element of our public cloud infrastructure.








Nehalem & Istanbul Executive summary

Both AMD & INTEL had customer investments in mind as their new chips use the same sockets than before ... so they can be used in previously released chassis. What you will typically have to do after upgrading to the new processors is to download the latest platform BIOS. Another good idea is also to check on your OS level ... the latest OS releases include upgraded libraries and drivers. Those are critical if performance is near the top of your shopping list. See here for example.

For other features, please refer to the key characteristics below :

Feature

INTEL Xeon X5500 (Nehalem)

AMD Opteron 2435 (Istanbul)

Release date

March 29, 2009

June 1st, 2009

Manufacturing

45 nm

45 nm

Frequency (tested)

2.8Ghz

2.6Ghz

Cores

4

6

Strands/core

2 [if NUMA on]

1

Total #strands

8

6

L1 cache

256 KB [32KB I. + 32KB D. per core]

768 KB [128 KB per core]

L2 Cache

1 MB [256KB per core]

3 MB [512KB per core]

L3 cache

2 MB shared

6 MB shared

Memory type

DDR3 1333Mhz max. \*

DDR2 800 Mhz

Nom. Power

95 W

75W

Major Innovations

Second level branch predictor & TLB

Power savings and HW virtualization

Note : For this test, we used DDR3 1066Mhz.

Now, here is our hardware list :

Role

Model

Blade

Sockets@freq

RAM

AMD Opteron 'Istanbul'

SB6000

X6260

2@2.6Ghz

24 GB

INTEL XEON 'Nehalem'

SB6000

X6270

2@2.8Ghz

24 GB

Console

X4150

N/A

2@2.8Ghz

16 GB




Calculation performance : iGenCPU

iGenCPU is a calculation benchmark written in Java. It calculates Benoit Mandelbrot's fractals using a custom Imaginary Numbers library. The main benefit of this workload is that it naturally creates a 50% floating point and 50% integer calculation. As the number of floating operations produced by commercial software increase every year, this type of performance profile is getting closer and closer to what modern web servers (like Apache) and application servers (like Glassfish) will produce.


Here are the results (AMD Istanbul in Blue, INTEL Nehalem in Red) :




Observations :

  1. Very similar peak throughput (984 fractals/s on INTEL, 1008 fractals/s on AMD)

  2. The AMD chip produce superior throughput at any level of concurrency. At 8 threads, which is a very common scalability limit for commercial virtualization products, it produces 28% more throughput than Nehalem.

  3. It shows the superiority of the Opteron calculation co-processors as we had already observed on previous quad-core generation.

  4. It is more important for calculation to have larger L1/L2 cache then faster L1/L2 cache. The Opteron micro-architecture is naturally a better fit for this workload.




Memory performance : iGenRAM

It is a classic brain exercise when you can not sleep : imagine what you would do with $94 million in your bank account. The iGenRAM benchmark was initially developed in C to produce an accurate simulation of the California Lotto winner determination. It is highly memory intensive using 1Gigabyte of memory per thread. Memory allocation time as well as memory search performance produce a combined throughput number plotted below :



Observations :

  1. The faster DDR3 memory and higher frequency of the INTEL chip make it a better fit for memory intensive workloads. In peak, the Nehalem based system produce 23% more throughput than its competitor.

  2. For a small number of threads (1 to 4), both system produce very similar numbers.

  3. Second level predictor on this repetitive workload most likely help the Nehalem-based system to improve its scalability curve tangent past four threads

  4. As noted, we used DDR3 1066Mhz for this Nehalem test. DDR3 1333Mhz is also available and will increase the INTEL chip advantage on this workload.








Conclusion

At complex question, complex answer... As you have noted, these benchmarks show the AMD Istanbul better suited for calculation intensive workloads but also show better memory performance of the INTEL Nehalem. Therefore, different layers within your private cloud will need to be profiled if you want to determine what is your best choice. And guess which Operating System comes equipped with the right set of tools (I.e Dynamic Tracing) to make the determination : Solaris or OpenSolaris .

[Last minute note: I also performed Oracle 10g database benchmarks on these blades. Maybe for another article..]





See you next time in the wonderful world of benchmarking....



Monday May 11, 2009

Running your Oracle database on internal Solid State Disks : a good idea ?

Scaling MySQL and ZFS on T5440




Solid State Disks : a 2009 fashion

This technology is not new : it originates in 1874 when a German physicist named Karl Braun (pictured above) discovered that he could rectify alternating current with a point-contact semiconductor. Three years later, he had built the first CRT oscilloscope and four years later, he had built the first prototype of a Cat's whisker diode, later optimized by G. Marconi and G. Pickard. In 1909, K. Braun shared the Nobel Prize for physics with G. Marconi.

The Cat's whisker diodes are considered the first solid state devices. But it is only in the 1970s that they appeared in high-end mainframes produced by Amdahl and Cray Research. However, their high-cost of fabrication limited their industrialization. Several companies attempted later to introduce the technology to the mass market including StorageTek, Sharp and M-systems. But the market was not ready.

Nowadays, SSDs are composed of one of two technologies : DRAM volatile memory or NAND-flash non-volatile memory. Key recent announcements from Sun (Amber road and ZFS), HP (IO Accelerator) and Texas Instruments (Ram San 620) as well as lower cost of fabrication and larger capacities are making the NAND based technology a must-try for every company this year.

This article is looking at the Oracle database performance of our new 32Gbytes SSDs OEM'd from Intel. This new devices have improved their I/O capacity and MTBF with an architecture featuring 10 parallel NAND flash channels. See this announcement for more.

If you dig a little bit on the question, you will find this whitepaper . However, the 35% boost in performance that they measured seems insufficient to justify trashing HDDs for SSDs. In addition, as they compare a different number of HDDs and SSDs, it is very hard to determine the impact of a one-to-one replacement. Let's make our own observations.



Here is a picture of the SSD tested – thanks to Emie for the shot !





Goals

As any DBA knows, it is very difficult to characterize a database workload in general. We are all very familiar with the famous “Your mileage may vary” or “All customer database workloads are different”. And we can not trust Marketing department on SSDs performance claims because nobody is running a synthetic I/O generator for a living. What we need to determine is the impact for End-Users (Response time anyone ?) and how the Capacity Planners can benefit from the technology (How about Peak Throughput ?).

My plan is to perform two tests on a Sun Blade X6270 (Nehalem-based) equipped with two Xeon chips and 32Gb of RAM on one SSD and one HDD- with different expectations.

  1. Create a 16 Gigabytes database that will be entirely cached in the Oracle SGA. Will we observe any difference ?

  2. Create a 50 Gigabytes database that can only be cached about 50% of the time. We expect a significant performance impact. But how much ?


SLAMD and iGenOLTP
The SLAMD Distributed Load Generation Engine (SLAMD) is a Java-based application designed for stress testing and performance analysis of network-based applications. It was originally developed by Sun Microsystems, Inc., but it has been released as an open source application under the Sun Public License, which is an OSI-approved open source license. The main site for obtaining information about SLAMD is http://www.slamd.com/. It is also available as a java.net project.

iGenOLTP is a multi-processed and multi-threaded database benchmark. As a custom Java class for SLAMD, it is a lightweight workload composed of four select statements, one insert and one delete. It produces a 90% read/10% write workload simulating a global order system.



Software and Hardware summary

This study is using Solaris 10 Update 6 (released October 31st,2008), Java 1.7 build 38 (released Otober 23rd,2008), SLAMD 1.8.2, iGenOLTP v4 for Oracle and Oracle 10.2.0.2. The hardware tested is a Sun Blade X6270 with 2xINTEL XEON X5560 2.8Ghz and 32 GB of DDR3 RAM . This blade has four standard 2.5 inches disks slots in which we are installing 1x32 Gbytes Sun/Intel SSD and 1x146Gb 10k RPM SEAGATE-ST914602SS drive with read-cache and write-cache enabled.

Test 1 – Database mostly in memory

We are creating a 16 Gigabytes database (4k block size) on one Solid State Disk and on one Seagate HDD configured in one ZFS pool with the default block size. We are limiting the ZFS buffer cache to 1 Gigabytes and allow an Oracle SGA of 24 Gigabytes. All the database will be cached. We will feel the SSD impact only on random writes (about 10% of the I/O operations) and sequential writes (Oracle redo log). The test will become CPU bound as we increase concurrency. We are testing from 1 to 20 client threads (I.e database connections) in streams.


In this case and for Throughput [in Transactions per second], the difference between HDD and SSD are evoluting from significant to modest when concurrency increase. In fact, this is interestingly in the midrange of the scalability curve that we observe a peak of 71% more throughput on the SSD (at 4 threads). At 20 threads, we are mostly CPU bound, therefore the impact of the storage type is minimal and the SSD impact on throughput is only 9%.






For response times [in milliseconds], it is slightly lower with 42% better response times at 4 threads and 8% better at 20 threads.






Test 2 – Database mostly on disk

This time, we are creating a 50 Gigabytes database on one SSD and on one HDD configured in their dedicated ZFS pool. Memory usage will be sliced the same way than test 1 but will not be able to cache more than 50% of the entire database. As a result, we will become I/O bound before we become CPU bound. Please remember that the X6270 is equipped with two eight-threads X5560 - a very decent 16-way database server !

Here are the results :



The largest difference is observed at 12 threads with more than twice the transactional throughput on the SSD. In response times (below), we observe the SSD to be 57% faster in peak and 59% faster at 8 threads.




In a nutshell

My intent for this test was to show you (for a classic Oracle lightweight OLTP workload)

the good news :

When I/O bound, we can replace two Seagate 10k RPM HDDs with one INTEL/SUN SSD for a similar throughput and twice faster response times

On a one for one basis, the response time difference by itself (up to 67%) will make your end users love you instantly !

Peak throughput in memory compared to the SSD is very close : in peak, we observed 821 TPS (24ms RT) in memory and 685 TPS (30ms RT) on the SSD. Very nice !


and the bad news :

When the workload is CPU bound, the impact of replacing your HDD by a SSD is moderate while losing a lot of capacity.

The cost per gigabyte need to be carefully calculated to justify the investment. Ask you Sales rep for more...


See you next time in the wonderful world of benchmarking....

Friday Apr 17, 2009

Sun Blade X6270 & INTEL XEON X5560 on OpenSolaris create the ultimate Directory Server

Sun Microsystems Directory Server Enterprise Edition 6.3 performance on X6270 (Nehalem)[Read More]

Wednesday Nov 14, 2007

OLTP performance of the Sun SPARC Enterprise M9000 on Solaris 10 08/07

M5000_E2900_X4600
I recently published a performance comparison of the Sun Fire E25k and the new Sun SPARC Enterprise M9000.
 In this article, a lot of my readers noticed the following note :
"Oracle OLTP is disappointing on the M9000 with an increase in response time at peak throughput. Upcoming release of Solaris and Oracle 10g should improve this result"

Critical bug fixes

 The reason why I wrote this is because I knew that Sun engineering was working hard at fixing three key performance bugs specific to database performance on any of the M-serie systems. Here is a list of this bugs that were successfully fixed in Solaris 10 08/07 (Update 4) :

1. Bug 6451741
SPARC64 VI prefetch tuning needs to be completed
Impact : L2 cache efficiency is key to database memory performance. Corrected preferch values improve memory read and write performance.

2. Bug 6486343
Mutex performance on large M-serie system need improvement
Impact : The mutex retry and backoff algorithm needed to be retuned for M-series system due to out-of-order execution and platform specific branch prediction routines. Also improve lock concurrency on hot mermory pages

3. Bug 6487440
Memory copy operations needs tuning on M-serie systems
Impact : The least important fix but important for Oracle stored procedures , triggers and constraints

The big question was : How much of an improvement it would have on OLTP performance ?
Well, one thing is sure is that your mileage may vary but I measured on my workload a whooping 1.33
 lower response times for 1.38 faster throughput (compared to Solaris 10 Update 3) . It is also interesting to notice that all the other workloads tested have not moved significantly as they are not really sensitive to the issues tackled there.

Please find below the corrected comparative charts in throughput and response time after a reminder on the workloads :


Java workloads

Not exactly.So let's try to be a little bit more specific using five different 100% Java (1.6) workloads :
  1. iGenCPU v3 - Fractal simulation 50% Integer / 50% floating point
  2. iGenRAM v3 - Lotto simulation (Memory allocation and search
  3. iGenBATCH v2 - Oracle 10g batch using partionning, triggers, stored procedures and sequences
  4. iGenOLTP v4 -(Heavy-weight OLTP

Datapoints

The values showed hare are peak results obtained by building the complete scalability curve. The response times mentioned are average, at peak and in Milliseconds.



E25k
M9000

Throughput RT (ms) Throughput RT (ms)
iGenCPU v3 303 fractals/second 105 728 fractals/second 44
iGenRAM v3 2865 lottos/ms 55 4881 lottos/ms 17
iGenBatch v2 35 TPS 907 50 TPS 626
iGenOLTP v4 3938 TPM 271 6194 TPM 264

As we are trying to compare to the frequency 1.267 factor, let's look at those results  by giving a factor 1 to the E25k.

First, here is throughput :

Throughput E25k M9000
'iGenCPU v3 1 2.403
'iGenRAM v3 1 1.704
'iGenBATCH v2 1 1.450
'iGenOLTP v4 1 1.573
Frequency 1 1.267

Which would be this chart :


tp2


And here is the average  reponse time at peak throughput (still using a base 1 for the E25k) :

RT E25k M9000
iGenCPU v3 1 0.419
iGenRAM v3 1 0.301
iGenBATCH v2 1 0.690
iGenOLTP v4 1 0.970


And the chart :

rt2




This new numbers are illustrating how well placed are the M-serie servers to replace the current UltraSPARC-IV servers, from the smallest Sun Fire V490 to the largest Sun Fire E25k...As long as you use at least Solaris 10 08/07 .

See you next time in the wonderful world of benchmarking...



About

mrbenchmark

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today
News
Blogroll
deepdive

No bookmarks in folder