Monday Jul 06, 2009

What processor will fuel your first private Cloud : INTEL Nehalem or AMD Istanbul ?

Where IT is going ...
You may have observed the big trend of the moment : take your old slide decks, banners and marketing brochures and plug in the word cloud as many times as possible. A Google search for the words Cloud Computing yields more than 31 million results today ! Even if you search only on Cloud (175 million+ results), the first entry in the list (discounting the Sponsored results) is this one. Amazing fashion of the moment !

As we recently described in this white paper, there is not one cloud but many. I recently discussed this topic with customers in our Menlo Park Executive Briefing Center. While they all say that they will not be able to host their entire IT department in a Public Cloud, they are interested in combining a Public Cloud service with multiple Private Clouds : this is the notion of Hybrid Cloud.








Private clouds
The Sun Solution Centers and Sun Professional Services are now starting to build the first private cloud architectures based on Sun Open Source products. The most common building block for those is the versatile Sun Blade 6000. Why ? Because of the capacity of this chassis to host many different types of CPUs (x86 & SPARC) and operating systems (Windows, Linux, OpenSolaris, Solaris or even VMware vSphere). At the same time, INTEL and AMD have released two exceptional chips : the INTEL XEON 5500 (code name Nehalem) and the six-core AMD Opteron (code name Istanbul). I had the opportunity to test these chips recently and will give you a few data points here.





Cloud benchmarks

We may not have any Cloud-related standard benchmarks today. However, if I look at the different software components of a private cloud, it seems that Computing capabilities (integer and floating point) and Memory Performance are the two key dimensions to explore. You may argue that your cloud needs a database component ... but improved caching mechanisms (memcached for example) and the commoditization of Solid State Disks (see this market analysis and also here) are moving database performance profiles toward memory- or CPU-intensive workloads. Additionally, the exceptional power of 10-Gbit based Hybrid storage appliances (like the Sun Storage 7410 Unified Storage System) makes us less concerned by I/O & network bound situations. It is good to know that these new storage appliances are a key element of our public cloud infrastructure.








Nehalem & Istanbul Executive summary

Both AMD & INTEL had customer investments in mind, as their new chips use the same sockets as before ... so they can be used in previously released chassis. What you will typically have to do after upgrading to the new processors is download the latest platform BIOS. It is also a good idea to check your OS level ... the latest OS releases include upgraded libraries and drivers. Those are critical if performance is near the top of your shopping list. See here for example.

For other features, please refer to the key characteristics below :

| Feature | INTEL Xeon X5500 (Nehalem) | AMD Opteron 2435 (Istanbul) |
|---|---|---|
| Release date | March 29, 2009 | June 1st, 2009 |
| Manufacturing | 45 nm | 45 nm |
| Frequency (tested) | 2.8 GHz | 2.6 GHz |
| Cores | 4 | 6 |
| Strands/core | 2 [if Hyper-Threading on] | 1 |
| Total #strands | 8 | 6 |
| L1 cache | 256 KB [32 KB I. + 32 KB D. per core] | 768 KB [128 KB per core] |
| L2 cache | 1 MB [256 KB per core] | 3 MB [512 KB per core] |
| L3 cache | 8 MB shared | 6 MB shared |
| Memory type | DDR3 1333 MHz max. \* | DDR2 800 MHz |
| Nom. Power | 95 W | 75 W |
| Major Innovations | Second level branch predictor & TLB | Power savings and HW virtualization |

\* Note : For this test, we used DDR3 1066 MHz.

Now, here is our hardware list :

| Role | Model | Blade | Sockets@freq | RAM |
|---|---|---|---|---|
| AMD Opteron 'Istanbul' | SB6000 | X6260 | 2@2.6 GHz | 24 GB |
| INTEL XEON 'Nehalem' | SB6000 | X6270 | 2@2.8 GHz | 24 GB |
| Console | X4150 | N/A | 2@2.8 GHz | 16 GB |




Calculation performance : iGenCPU

iGenCPU is a calculation benchmark written in Java. It calculates Benoit Mandelbrot's fractals using a custom complex number library. The main benefit of this workload is that it naturally creates a mix of 50% floating-point and 50% integer calculation. As the number of floating-point operations produced by commercial software increases every year, this type of performance profile is getting closer and closer to what modern web servers (like Apache) and application servers (like Glassfish) produce.
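To make the workload more concrete, here is a minimal sketch of a Mandelbrot escape-time calculation built on a small hand-written complex number class. It is not the iGenCPU source : the class names, method names, grid bounds and iteration cap are all assumptions chosen for illustration, and the sketch makes no attempt to reproduce the exact 50%/50% operation mix.

```java
// Hypothetical sketch: names and parameters are illustrative, not taken from iGenCPU.
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    Complex times(Complex o) {
        return new Complex(re * o.re - im * o.im, re * o.im + im * o.re);
    }

    Complex plus(Complex o) {
        return new Complex(re + o.re, im + o.im);
    }

    double squaredModulus() {
        return re * re + im * im;
    }
}

final class MandelbrotSketch {
    /** Iterations of z = z*z + c performed before |z| exceeds 2, capped at maxIter. */
    static int escapeTime(Complex c, int maxIter) {
        Complex z = new Complex(0.0, 0.0);
        int n = 0;                                   // integer work: counter and comparisons
        while (n < maxIter && z.squaredModulus() <= 4.0) {
            z = z.times(z).plus(c);                  // floating-point work
            n++;
        }
        return n;
    }

    /** Counts the grid points that never escape within maxIter iterations. */
    static int pointsInside(int width, int height, int maxIter) {
        int inside = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                double re = -2.0 + 3.0 * x / (width - 1);   // real axis from -2 to 1
                double im = -1.0 + 2.0 * y / (height - 1);  // imaginary axis from -1 to 1
                if (escapeTime(new Complex(re, im), maxIter) == maxIter) {
                    inside++;
                }
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        System.out.println("points inside the set: " + pointsInside(60, 40, 200));
    }
}
```

Each grid point is independent of the others, which is why this kind of calculation spreads so easily across threads.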


Here are the results (AMD Istanbul in Blue, INTEL Nehalem in Red) :




Observations :

  1. Very similar peak throughput (984 fractals/s on INTEL, 1008 fractals/s on AMD).

  2. The AMD chip produces superior throughput at every level of concurrency. At 8 threads, which is a very common scalability limit for commercial virtualization products, it produces 28% more throughput than Nehalem.

  3. This shows the superiority of the Opteron calculation co-processors, as we had already observed on the previous quad-core generation.

  4. For calculation, it is more important to have larger L1/L2 caches than faster L1/L2 caches. The Opteron micro-architecture is naturally a better fit for this workload.
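For readers who want to quantify "very similar", the two peak figures from observation 1 work out to an AMD advantage of roughly 2.4%. The helper below is only a worked-arithmetic sketch; the class and method names are made up for this example.

```java
// Worked arithmetic on the peak figures quoted in observation 1.
final class ThroughputDelta {
    /** Percentage by which throughput a exceeds throughput b. */
    static double percentMore(double a, double b) {
        return (a - b) / b * 100.0;
    }

    public static void main(String[] args) {
        double amdPeak = 1008.0;   // fractals/s, from observation 1
        double intelPeak = 984.0;  // fractals/s, from observation 1
        System.out.println("AMD peak advantage in percent: " + percentMore(amdPeak, intelPeak));
    }
}
```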




Memory performance : iGenRAM

It is a classic brain exercise when you cannot sleep : imagine what you would do with $94 million in your bank account. The iGenRAM benchmark was initially developed in C to produce an accurate simulation of the California Lotto winner determination. It is highly memory-intensive, using 1 gigabyte of memory per thread. Memory allocation time and memory search performance produce a combined throughput number, plotted below :



Observations :

  1. The faster DDR3 memory and higher frequency of the INTEL chip make it a better fit for memory-intensive workloads. At peak, the Nehalem-based system produces 23% more throughput than its competitor.

  2. For a small number of threads (1 to 4), both systems produce very similar numbers.

  3. The second-level branch predictor most likely helps the Nehalem-based system on this repetitive workload, steepening its scalability curve past four threads.

  4. As noted, we used DDR3 1066 MHz for this Nehalem test. DDR3 1333 MHz is also available and should increase the INTEL chip's advantage on this workload.








Conclusion

A complex question calls for a complex answer... As you have noted, these benchmarks show the AMD Istanbul to be better suited for calculation-intensive workloads, but they also show better memory performance from the INTEL Nehalem. Therefore, the different layers within your private cloud will need to be profiled if you want to determine your best choice. And guess which Operating System comes equipped with the right set of tools (i.e. Dynamic Tracing) to make the determination : Solaris or OpenSolaris.

[Last minute note: I also performed Oracle 10g database benchmarks on these blades. Maybe for another article..]





See you next time in the wonderful world of benchmarking....



Friday Apr 17, 2009

Sun Blade X6270 & INTEL XEON X5560 on OpenSolaris create the ultimate Directory Server

Sun Microsystems Directory Server Enterprise Edition 6.3 performance on X6270 (Nehalem)

Tuesday May 20, 2008

The Hare and the Tortoise [X6250 vs T6320] or [INTEL XEON E5410 vs SUN UltraSPARC-T2 ]

The Hare and The Tortoise


"To win a race the swiftness of a dart ... Availeth not without a timely start"

[Illustration: Le Lièvre et la Tortue]

The tree on yonder hill we spy [Sun Blade 6000 Modular Systems]
The Sun Blade 6000 chassis supports up to ten blades in a ten rack-unit enclosure and is extremely popular due to its versatility. In fact, you can test your application today on four different chips within the same chassis : UltraSPARC-T1 [T6300], UltraSPARC-T2 [T6320], AMD Opteron dual-core [X6220] and INTEL Xeon dual-core and quad-core [X6250]. While the Opteron and T1 blades have performance characteristics that are well defined by now, I was really curious to see how the new T2 blade would perform when compared to the Xeon quad-core.

A grain or two of hellebore [Chips & Systems]
In terms of chip details, the T2 and the Xeon diverge. The three key differences are the total number of strands [16 times more on the T2], the CPU frequency [1.66 times higher on the Xeon] and the L2 cache size [3 times larger on the Xeon].
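The three ratios quoted above follow directly from the figures in this article's comparison table (cores, strands per core, frequency and L2 size). Here is a small worked-arithmetic sketch; the class and method names are invented for the example.

```java
// Worked arithmetic on the chip figures listed in this article's comparison table.
final class ChipRatios {
    static int totalStrands(int cores, int strandsPerCore) {
        return cores * strandsPerCore;
    }

    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        int xeonStrands = totalStrands(4, 1);   // Xeon E5410: 4 cores x 1 strand
        int t2Strands = totalStrands(8, 8);     // UltraSPARC-T2: 8 cores x 8 strands

        System.out.println("strands, T2 vs Xeon     : " + ratio(t2Strands, xeonStrands));
        System.out.println("frequency, Xeon vs T2   : " + ratio(2.33, 1.4));   // GHz
        System.out.println("L2 cache, Xeon vs T2    : " + ratio(12.0, 4.0));   // MB
    }
}
```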

This simple table illustrates their key characteristics :

| Feature | INTEL Xeon E5410 | SUN UltraSPARC-T2 |
|---|---|---|
| Process | 45 nm | 65 nm |
| Transistors | 820 million | 500 million |
| Cores | 4 | 8 |
| Strands/core | 1 | 8 |
| Total #strands | 4 | 64 |
| Frequency | 2.33 GHz | 1.4 GHz |
| L1 cache | 16 KB I. + 16 KB D. | 16 KB I. + 8 KB D. |
| L2 cache | 12 MB | 4 MB |
| Nominal Power | 80 W | 95 W |

This table makes it clear that predicting the response time or throughput delta between these two chips is a risky endeavor !

[Pictures: X6250 and T6320]


Following these two pictures [X6250 and T6320], here is our hardware list :

| Role | Model | System clock | Sockets@freq | RAM |
|---|---|---|---|---|
| T2 blade | T6320 | N/A | 1@1.4 GHz | 32 GB |
| Xeon blade | X6250 | 1333 MHz | 2@2.33 GHz | 32 GB |
| Console | X4200 | 1000 MHz | 2@2.4 GHz | 8 GB |


I dare you to the wager still [Benchmarks]
I ran several benchmarks (including Oracle workloads) on all types of blades, but for the purpose of this article I will present only the two simple micro-benchmarks iGenCPU and iGenRAM.

The iGenCPU benchmark is a Java-based CPU micro-benchmark used to compare the CPU performance of different systems. Based on a customized Java complex number library, the code computes Benoit Mandelbrot's highly dense fractal structure using integer and floating-point calculations (50%/50%). The simplicity of the code and its non-recursive design allow very scalable behavior, using less than 128 KB of memory per thread. The exact throughput in fractals per second and the average response times are reported and coalesced for each scalability level.
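As an illustration of what "reported for each scalability level" can look like, here is a hedged sketch of a harness that runs a fixed number of tasks on N threads and derives throughput and average response time. It is not the iGenCPU harness : the class names, the stand-in unit of work and the thread counts in main are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical harness: names and the stand-in workload are illustrative only.
final class ScalabilityHarness {
    /** Throughput and average response time measured at one thread count. */
    static final class Level {
        final int threads;
        final double throughputPerSec;
        final double avgResponseMs;

        Level(int threads, double throughputPerSec, double avgResponseMs) {
            this.threads = threads;
            this.throughputPerSec = throughputPerSec;
            this.avgResponseMs = avgResponseMs;
        }
    }

    // Results are written here only so the JIT cannot discard the work; the value is not meaningful.
    static volatile long sink;

    /** Stand-in unit of work mixing floating-point and integer operations. */
    static long unitOfWork() {
        double x = 0.3;
        long hits = 0;
        for (int i = 0; i < 200_000; i++) {
            x = x * x - 1.0;            // floating-point step, stays within [-1, 0]
            if (x > -0.5) hits++;       // integer step
        }
        return hits;
    }

    static Level runLevel(int threads, int tasksPerThread) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Each worker times its own tasks and returns the summed task duration.
            Callable<Long> worker = () -> {
                long busyNanos = 0;
                for (int i = 0; i < tasksPerThread; i++) {
                    long t0 = System.nanoTime();
                    sink += unitOfWork();
                    busyNanos += System.nanoTime() - t0;
                }
                return busyNanos;
            };
            List<Future<Long>> running = new ArrayList<>();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                running.add(pool.submit(worker));
            }
            long totalBusyNanos = 0;
            for (Future<Long> f : running) {
                totalBusyNanos += f.get();
            }
            long elapsedNanos = System.nanoTime() - start;
            long tasks = (long) threads * tasksPerThread;
            double throughput = tasks / (elapsedNanos / 1e9);      // tasks per wall-clock second
            double avgResponseMs = totalBusyNanos / 1e6 / tasks;   // mean task duration
            return new Level(threads, throughput, avgResponseMs);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("benchmark run failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (int threads = 1; threads <= 8; threads *= 2) {
            Level l = runLevel(threads, 20);
            System.out.println(l.threads + " threads: " + l.throughputPerSec
                    + " tasks/s, " + l.avgResponseMs + " ms average");
        }
    }
}
```

A real measurement would add warm-up runs before timing, since the JVM compiles hot code lazily; the sketch leaves that out for brevity.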

The iGenRAM benchmark is based on the California lotto requirements. The main purpose of this workload is to measure multi-threaded memory allocation and multi-threaded memory searches in Java. The first step of the benchmark is for each thread to allocate 512 megabytes of memory in three-dimensional integer arrays. The second step is to search through this memory to determine the winning tickets. The exact throughput in lotto tickets per millisecond as well as the average allocation and search times are reported and coalesced for each scalability level.
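The two steps described above (allocate, then search) can be sketched for a single thread as follows. This is an illustration under stated assumptions and not the iGenRAM code : the six-numbers-per-ticket layout, the 1 to 49 number range, the array sizes (a few megabytes instead of 512) and every name are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical single-thread sketch of an allocate-then-search memory workload.
final class LottoMemorySketch {
    static final int NUMBERS_PER_TICKET = 6;   // assumption, not the real lotto rules

    /** Step 1: allocate a three-dimensional int array and fill it with pseudo-random tickets. */
    static int[][][] allocateTickets(int blocks, int ticketsPerBlock, long seed) {
        Random rnd = new Random(seed);
        int[][][] tickets = new int[blocks][ticketsPerBlock][NUMBERS_PER_TICKET];
        for (int[][] block : tickets) {
            for (int[] ticket : block) {
                for (int i = 0; i < ticket.length; i++) {
                    ticket[i] = 1 + rnd.nextInt(49);   // values 1..49
                }
            }
        }
        return tickets;
    }

    /** Step 2: scan every ticket and count exact, order-sensitive matches with the draw. */
    static int countWinners(int[][][] tickets, int[] draw) {
        int winners = 0;
        for (int[][] block : tickets) {
            for (int[] ticket : block) {
                if (Arrays.equals(ticket, draw)) {
                    winners++;
                }
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        int[][][] tickets = allocateTickets(64, 4096, 2009L);
        long t1 = System.nanoTime();

        int[] draw = tickets[10][123].clone();   // pick an existing ticket as the winning draw
        int winners = countWinners(tickets, draw);
        long t2 = System.nanoTime();

        System.out.println("allocation ms: " + (t1 - t0) / 1e6);
        System.out.println("search ms    : " + (t2 - t1) / 1e6);
        System.out.println("winners      : " + winners);
    }
}
```

Timing the allocation and the search separately mirrors the two averages the benchmark reports.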

 For this test, we used Solaris 10 Update 4 and Java version 1.6.1.

And list which way the zephyr blows [Results]

Here are the iGenCPU throughput & response time :

[Chart: iGenCPU throughput and response time]

Notes :

1-The Hare [X6250] starts very fast but gets tired at 8 threads and really slows down at 12 threads.
2-The Tortoise [T6320] reaches more than twice the throughput of the Hare at 60 threads.
3-Single-threaded average transaction response time is two times better on the Hare.

Now let's look at the iGenRAM results :

[Chart: iGenRAM results]


Notes :

1-Phenomenal memory throughput of the Hare [X6250] at low thread counts. But at peak, the Tortoise [T6320] achieves 11% more throughput.
2-When the Hare is giving up (~7 threads), the Tortoise is just warming up, reaching its peak throughput at about 40 threads.
3-Single-threaded, it takes 9 ms to allocate 512 MB on the Hare, 33 ms to do the same thing on the Tortoise.
4-Single-threaded, it takes 5 ms to search through 512 MB on the Hare, 34 ms to do the same thing on the Tortoise.


Conclusion

The race is by the tortoise won.
Cries she, "My senses do I lack ?
What boots your boasted swiftness now ?
You're beat ! and yet you must allow,
I bore my house upon my back."

See you next time in the wonderful world of benchmarking....
Special thanks to Mr Jean De La Fontaine [1621-1695]

