Configuring and Optimizing Intel® Xeon Processor 5500 & 3500 Series (Nehalem) Systems Memory

The Memory Subsystem
An integrated memory controller and multiple DDR3 memory channels help the Intel® Xeon® 5500 and Intel® Xeon® 3500 processors provide high bandwidth for memory-intensive applications. DDR3 memory components offer greater density and run at higher speeds, yet at lower voltage, than previous-generation DDR2 memory.

A typical DDR3 DIMM (with heat spreader)

Each processor has a three-channel, direct-connect memory interface and supports DDR3 memory from Sun in two speeds: 1066MT/s and 1333MT/s. When configuring system memory, it's important to note that DIMMs may run slower than their individually rated speeds depending on a number of factors, including the CPU type, the number of DIMMs per channel, and the type of memory (speed, number of ranks, etc.). The speed at which memory will ultimately run is set by the system BIOS at startup, and all memory channels will run at the highest common frequency.

The maximum theoretical memory bandwidth per processor socket for each supported data rate is:

  • 1333 MT/s: 32 GB/s (10.6 GB/s per channel)
  • 1066 MT/s: 25.5 GB/s (8.5 GB/s per channel)
  • 800 MT/s: 19.2 GB/s (6.4 GB/s per channel)
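
These peak numbers follow directly from the data rate and the 64-bit (8-byte) width of each DDR3 channel, times three channels per socket. A minimal sketch of the arithmetic (Python; the article's per-channel figures are rounded slightly differently):

```python
# Theoretical peak DDR3 bandwidth: data rate (MT/s) x 8 bytes per transfer per channel.
# Three memory channels per Xeon 5500/3500 socket, as described above.

def channel_bandwidth_gb_s(data_rate_mt_s: float) -> float:
    """Theoretical peak bandwidth of one 64-bit DDR3 channel, in GB/s."""
    return data_rate_mt_s * 8 / 1000.0

def socket_bandwidth_gb_s(data_rate_mt_s: float, channels: int = 3) -> float:
    """Theoretical peak bandwidth per processor socket."""
    return channels * channel_bandwidth_gb_s(data_rate_mt_s)

for rate in (1333, 1066, 800):
    print(f"{rate} MT/s: {channel_bandwidth_gb_s(rate):.1f} GB/s per channel, "
          f"{socket_bandwidth_gb_s(rate):.1f} GB/s per socket")
```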

Depending on the specific platform, Sun's servers support either two or three registered ECC DDR3 DIMMs per channel in 2GB, 4GB, or 8GB capacities, with 8GB RDIMMs available shortly after initial product release. The Sun Ultra 27 workstation can accommodate up to two unbuffered DDR3 ECC 1GB or 2GB DIMMs per channel to support densities ranging from 2GB (2x 1GB) to 12GB (6x 2GB) of memory.

Memory Population Guidelines
Each of the processor's three memory channels is capable of supporting either two or three DIMM slots, enabling 6 or 9 DIMMs per processor respectively. Memory slots in each channel are color-coded to simplify identification: on server platforms, blue for slot 0, white for slot 1, and black for slot 2 on systems supporting 3 DIMMs per channel; on the Ultra 27 workstation, black for slot 1 and blue for slot 0 (see Figure 1). As a general rule, to optimize memory performance DIMMs should be populated in sets of three, one per channel per CPU.

  Figure 1 – DIMM & Channel Layout for 3 and 2 DIMMs per Channel Memory Configurations


A basic rule shared by all Intel® Xeon® Processor 5500 and Intel® Xeon® Processor 3500 platforms is that the DIMM slots farthest from the CPU in each individual DDR3 channel need to be populated first, starting with the slot furthest from the CPU socket (i.e. the blue slot on servers, the black slot on a workstation). Ideally each channel should be populated with equal-capacity DIMMs, and if possible with the same number of identical DIMMs, which helps make memory performance more consistent. However, DIMMs of different sizes or rank counts (e.g. single rank vs. dual rank) can be installed in different slots within the same channel.
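
To make the population order concrete, here is a small illustrative helper (a sketch with hypothetical names; slot colors and physical numbering vary by platform) that fills DIMMs in sets of three, one per channel, starting with the slot farthest from the CPU in every channel:

```python
# Hypothetical helper illustrating the population rule above:
# fill one DIMM per channel per pass, farthest slot (index 0 here) first.

def population_order(num_dimms: int, channels: int = 3, slots_per_channel: int = 3):
    """Return (channel, slot) pairs in the order DIMMs should be installed."""
    order = [(ch, slot)
             for slot in range(slots_per_channel)  # farthest slot first in every channel
             for ch in range(channels)]            # spread each set of DIMMs across all channels
    if num_dimms > len(order):
        raise ValueError("more DIMMs than available slots")
    return order[:num_dimms]

# Example: six DIMMs on one 3-channel CPU -> two complete sets of three
print(population_order(6))
# [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```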

In a server with a single processor, the DIMM slots next to the empty CPU socket should not be populated.

Optimizing for Capacity
To design a configuration optimized for capacity, it is recommended that all slots be populated with the highest-density DDR3 dual-rank DIMMs available for that specific system. Memory bus speed will be reduced to 1066MT/s when two DIMMs per channel are installed, and to 800MT/s when three DIMMs per channel are populated, regardless of whether DDR3-1066 or DDR3-1333 DIMMs are used.
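
The resulting bus speed can be captured as a simple rule of thumb (a sketch under the assumptions described above; it covers registered single- and dual-rank DIMMs only and does not model quad-rank modules, which are discussed in the comments below):

```python
# Effective DDR3 bus speed as a function of DIMMs per channel (DPC),
# per the capacity-vs-speed trade-off described above. Quad-rank DIMMs
# and later platform updates (2 DPC at 1333 on some systems, see the
# comments) are not modeled here.

def effective_speed_mt_s(dimm_rated_mt_s: int, dimms_per_channel: int) -> int:
    if dimms_per_channel >= 3:
        cap = 800
    elif dimms_per_channel == 2:
        cap = 1066
    else:
        cap = 1333
    return min(dimm_rated_mt_s, cap)

print(effective_speed_mt_s(1333, 1))  # 1333
print(effective_speed_mt_s(1333, 2))  # 1066
print(effective_speed_mt_s(1066, 3))  # 800
```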

Optimizing for Performance
Server configurations with optimal memory bandwidth can be achieved using the “Performance” class of Intel® Xeon® Processor 5500 & 3500 Series processors (see Tables 1 & 2 below) and memory components that run at 1333MT/s. Similarly, workstations will achieve the highest memory performance with Intel® Xeon® 3500 processors that support DDR3-1333. A balanced DDR3 DIMM population is a key factor in achieving optimal performance.

Table 1 - Intel® Xeon® Processor 5500 Classes

Table 2 - Intel® Xeon® Processor 3500 Classes

To optimize a configuration for bandwidth, populate one identical dual-rank DDR3 1333MT/s DIMM per channel. Single-rank DIMMs will provide lower performance than dual-rank modules because there are not enough banks per channel available to the memory controller, leaving the available bus bandwidth underutilized. Other factors that result in less-than-optimal memory performance include the following (a small configuration check is sketched after this list):

  • installing more than one DIMM per channel, which restricts the maximum memory access speed to 1066MT/s (two DIMMs per channel) or 800MT/s (three DIMMs per channel)
  • an unbalanced DIMM population (i.e. one channel has more capacity than the others)
  • an odd number of DIMM ranks per channel (e.g. mixing a 2GB single-rank DIMM and a 4GB dual-rank DIMM in a channel)
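
Here is that configuration check (an illustrative sketch, not a validated tool) that flags the conditions listed above for a proposed per-channel DIMM population on one socket:

```python
# Illustrative bandwidth-optimization check for one CPU socket.
# Each channel is a list of (capacity_gb, ranks) tuples describing its DIMMs.
# The warnings mirror the bullet list above.

def bandwidth_warnings(channels):
    warnings = []
    if any(len(ch) > 1 for ch in channels):
        warnings.append("More than one DIMM per channel limits the bus to 1066 or 800 MT/s.")
    capacities = [sum(cap for cap, _ in ch) for ch in channels]
    if len(set(capacities)) > 1:
        warnings.append("Unbalanced capacity across channels: %s GB." % capacities)
    for i, ch in enumerate(channels):
        ranks = sum(r for _, r in ch)
        if len(ch) > 1 and ranks % 2 == 1:
            warnings.append("Channel %d mixes DIMMs for an odd number of ranks." % i)
        if ch and all(r == 1 for _, r in ch):
            warnings.append("Channel %d uses only single-rank DIMMs (lower bandwidth)." % i)
    return warnings or ["No bandwidth-limiting conditions found."]

# Example: a 2GB single-rank plus a 4GB dual-rank DIMM in each of the three channels
for w in bandwidth_warnings([[(2, 1), (4, 2)]] * 3):
    print(w)
```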

The Numbers
Below is a table showing how different DIMM configurations compare from a bandwidth perspective. The numbers provided are all relative to a DDR3-1333 capable processor configured with one dual-rank DIMM per channel. A homogeneous DIMM population is used in every case presented. SR = Single Rank DIMM, DR = Dual Rank DIMM

Table 3 - Relative Bandwidth Comparisons

Key takeaways from the above are:

  • for DIMM configurations that support both speeds, memory bandwidth is 5-8% higher with 1333 DIMMs than with 1066 DIMMs
  • for a given capacity, one dual rank DIMM per channel provides higher bandwidth performance than two single rank DIMMs per channel
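
Table 3 itself is not reproduced here, but several of its data points are quoted in the comment thread below; for convenience, they can be captured in a small lookup (a sketch; percentages are relative to one dual-rank DDR3-1333 DIMM per channel):

```python
# Relative memory bandwidth figures quoted in the comments below
# (baseline: one dual-rank DDR3-1333 DIMM per channel = 100%).
RELATIVE_BW = {
    ("1x DR per channel", 1333): 100,
    ("1x QR per channel", 1066): 91,
    ("2x DR per channel", 1066): 92,
    ("3x DR per channel", 800): 74,
    ("2x QR per channel", 800): 71,
}

for (config, speed), pct in sorted(RELATIVE_BW.items(), key=lambda kv: -kv[1]):
    print(f"{config} @ {speed} MT/s: {pct}% of baseline")
```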

Optimizing for Power
Following is an example of how different DIMM configurations compare from a memory power perspective. The power numbers provided are relative to a DDR3-1333 capable processor configured with one dual-rank DIMM per channel (bold). In each case the DIMMs are comprised of the same DRAM technology and density. It’s important to keep in mind that DRAM power requirements typically drop as silicon technology and process change and mature, so the table is only applicable when comparing like technologies. As above, homogeneous DIMM populations are evaluated in every case, and SR = Single Rank DIMM, DR = Dual Rank DIMM.

Table 4 - Relative DIMM Power Comparisons

From the table it can be determined that:

  • for a particular DIMM configuration and bus speed, DDR3-1333 DIMMs consume up to 6% less power than DDR3-1066 modules
  • for a given DIMM configuration, the incremental power required to operate DDR3-1333 DIMMs at 1333MT/s data rate vs. 1066MT/s is 4% or less
and somewhat less obvious but equally important:
  • a dual rank DIMM operating at 1333MT/s consumes less power than two single rank DIMMs at 1066MT/s

The data presented indicates there are trade-offs to be made in how best to configure a system's memory. Each application has unique requirements, so processor memory bus speed capability, capacity, performance, and power are all factors that must be considered as part of the process. Sun's Intel® Xeon® Processor 5500 Series and Intel® Xeon® Processor 3500 Series platforms offer tremendous capability and flexibility to optimize performance and capacity tailored to a specific need.



Comments:

This is an excellent article. It summarizes neatly and graphically the bits & pieces I have been finding about Nehalem. Thank you.

What I have not found is what is the limit for memory ranking of each controller?
In large memory configurations, it might become a limiting factor.
What is the effect of, say, 4GB, quad-ranked, RDIMM modules using six RDIMMs per CPU(3/channel)? Can the memory controller handle 24 ranks at DDR3-800 or will it force the speed down to 533(or lower?) ...or maybe just fail with random memory errors and|or CPU shutdown?

Inquiring minds want to know(but Intel ain't saying ...), :).

Posted by Ric on May 06, 2009 at 04:51 PM EDT #

Ric, thanks for your comment!
Each of the three memory channels per processor is capable of selecting up to eight ranks. So your example of two quad rank DIMMs per channel is, per Intel spec, limited to DDR3-800 speed. A single quad rank DIMM has a DDR3-1066 speed limit.

Here's how the bandwidth numbers work out with quad rank (QR) DIMMs in the picture:
1x DR 1333 - 100% (baseline)
1x QR 1066 - 91%
2x DR 1066 - 90%
3x DR 800 - 74%
2x QR 800 - 71%

So you end up with a mere 1% boost by going with 1 QR over 2 DR, and as you can see for some reason 3x DR is slightly better than 2x QR, even though there are more ranks available (for interleaving, etc.).

john

Posted by John Nerl on May 08, 2009 at 01:10 AM EDT #

Excellent performance and power guideline. Thank you.

Posted by Mark on May 12, 2009 at 03:01 PM EDT #

If it is not too much trouble, can you give the power costs of these QR configurations?
Are those numbers derived from tests?

Thanks again.

Mark ( again )

Posted by Mark on May 12, 2009 at 03:38 PM EDT #

In your reply to Ric you wrote "2x DR 1066 - 90%", but in the table you gave above it is 92% of baseline. Is it a typo, or something else?

Thanks

Mark (again)

Posted by Mark on May 12, 2009 at 03:54 PM EDT #

Hi Mark - thanks for your comments.
You are correct - I incorrectly transposed the 2x DR 1066 relative performance number from the table into my comment. The comparison should look like this:

Here's how the bandwidth numbers work out with quad rank (QR) DIMMs in the picture:

1x DR 1333 - 100% (baseline)
1x QR 1066 - 91%
2x DR 1066 - 92%
3x DR 800 - 74%
2x QR 800 - 71%

What you might then ask is "why is 1x QR 1066 lower than 2x DR 1066?", to which I would reply that they're essentially the same, accountable by run to run variation.

And, to address your question about power comparisons for the same configs (theoretical):

1x DR 1333 - 100% (baseline)
1x QR 1066 - 129%
2x DR 1066 - 145-147%
3x DR 800 - 149%
2x QR 800 - 165%

Posted by John Nerl on May 13, 2009 at 03:38 AM EDT #

Hi John

Thanks again.

I have bookmarked here, wish to see more of your work.

Best

Mark

Posted by Mark on May 13, 2009 at 12:07 PM EDT #

hi there,

i'm a newbie n i have a one question:
i have a blade server using 1 processor E5430
each blade have 8 slots for memory which is dual channel.

which configuration gives better performance:
4 x 4 GB FBD PC2-5300 (2 x 2 GB) Kit populated on all slots (8).
2 x 8 GB FBD PC2-5300 (2 x 4 GB) Kit only populated on 4 slots.

thanks before

Posted by Steve on May 20, 2009 at 02:06 PM EDT #

Steve,

The processor you refer to is not a member of the Nehalem family and has a completely different memory architecture, therefore the direct applicability of the information in this blog is likely to be near zero.

Since the configurations you list contradict themselves, it's not clear exactly what you're trying to compare (i.e. 4x 4GB populated on all slots (8) - is that 4 or 8?). For the record, you also need to specify how the DIMMs are constructed (# of ranks, etc.) for the comparison.

Sorry I can't be of more help. Why not upgrade to a Nehalem-based system?

john

Posted by John Nerl on May 21, 2009 at 02:09 AM EDT #

Wonderful explanation John. Much better than Intel's, in my opinion.
One thing I'm still wondering about: why bandwidth isn't linear with
clock speed, i.e. why "only" a 9% (or so) drop when going from 1333 to 1066
instead of 20% ?

Also, if I'm only interested in performance, and I have a dual processor Nehalem board (let's say 12 DIMM slots total), then what is the best way to implement 24GB, 36GB, or 48GB ? By "best way" I mean how many RDIMMs of each capacity, rank, and speed, placed in which slots, and what is the effective bandwidth ?

What seems frustrating is that after reading your illuminating explanation, I've come to the conclusion that adding more RAM (when you can live without it) is actually detrimental to performance, right ? We'll see from your answer to my above question if this is true.... maybe you have some clever "trick" ?

Thanks again.

Posted by Michelle on May 21, 2009 at 05:45 PM EDT #

...I forgot to mention: I don't think I can afford 8GB RDIMMs, so we have to limit the max capacity per DIMM to 4GB.

Thanks again.

Posted by Michelle on May 21, 2009 at 05:47 PM EDT #

Thanks Michelle,
There are many contributing factors that explain the bandwidth/frequency non-linearity. For instance, there are frequency-agnostic switching penalties (bus turn-around time, rank to rank, etc.) as well as things like DRAM access time and latencies that remain constant. A 1333 DIMM runs with a latency of 9-9-9 while at 1066 the latencies are only 7-7-7. Each of these numbers represent the number of clocks required, and when you do the math they work out to be nearly the same, thus contributing to the non-linearity. In fact, they end up being slightly higher for 1333 than for 1066!
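To make that arithmetic concrete, a quick check using nominal command clock rates (half the transfer rate) shows the absolute CAS latencies come out nearly identical:

```python
# CAS latency in ns = CAS clocks / command clock; the command clock is
# half the DDR3 transfer rate (e.g. ~667MHz for DDR3-1333).
for data_rate, cas in ((1333, 9), (1066, 7)):
    clock_mhz = data_rate / 2
    print(f"DDR3-{data_rate} CL{cas}: {cas / clock_mhz * 1000:.1f} ns")  # ~13.5 vs ~13.1 ns
```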
Now to your performance/config question. A basic rule to keep in mind is that you always want to spread out your DIMMs across as many channels as possible and also keep a homogeneous population. If you meet that rule, then for a given bus speed and DIMM density, DIMMs with the most number of ranks are preferred.
For 24GB, populate one dual-rank 1333 4GB module per channel for both processors. BW will be ~35GB/s.
For 36GB, populate one dual-rank 1333 4GB DIMM per channel on one processor, and two dual-rank 1333 or 1066 4GB modules per channel on the other. It's a lopsided config but best from a memory bandwidth perspective. Achievable BW should be in the range of 31-35GB/s.
For 48GB, populate two dual-rank 4GB DIMMs per channel for both processors and you’ll get about 31-32GB/s BW.
And you're right, adding more RAM can be detrimental, but not in all cases: for instance if you have the $ to buy 8GB DIMMs, are starting from a non-optimal configuration, or are going from one 2GB 1333 DIMM per channel to one 4GB 1333 DIMM per channel.
I hope this helps!

Posted by John Nerl on May 22, 2009 at 01:23 AM EDT #

Hi John

Do you have the figures for relative BW of dual-channel DDR3-1333, dual rank and single rank?
I could not get this myself, as I do not have the Processor/1333 yet, only have Processors/1066.

Thanks

Mark

Posted by Mark on May 24, 2009 at 01:32 PM EDT #

Mark - here's how the numbers work out:

1x SR DPC @1333 on two channels - 69%
2x SR DPC @1066 on two channels - 66%
1x DR DPC @1333 on two channels - 82%
2x DR DPC @1066 on two channels - 64%

I would have expected that the 2x DR DPC @1066 bandwidth would be higher than the 2x SR DPC @1066 value but that's not the way it worked out. They're very close so perhaps chalk it up to run to run variation?

john

Posted by John Nerl on May 26, 2009 at 01:09 AM EDT #

Hi John

Thanks for giving the data.

I did a couple of runs this morning; the BW of 2x DR DPC @1066 is higher than 2x SR DPC @1066, by about +2% in my run environment. So it could be run-to-run variation, as you pointed out.

But what if:
1x SR N/A N/A 1066 1066 33%
1x DR N/A N/A 1066 1066 36%
DR gains over SR: 9.1%

2x SR 2x SR N/A 1066 1066 64%
2x DR 2x DR N/A 1066 1066 66%
DR gains over SR: 3.1%

2x SR 2x SR 2x SR 1066 1066 91%
2x DR 2x DR 2x DR 1066 1066 92%
DR gains over SR: 1.1%

By just swapping 64% and 66%, the relative gains come out to 1.1%, 3.1%, and 9.1%, which would explain the single-, dual-, and triple-channel impacts as well. I could not verify these numbers; even if I had a Nehalem EP/1333, I would not have your run environment, so it is not possible for me to get those exact numbers. But I swapped them anyway and accept it as “2x DR DPC @1066 is higher than 2x SR DPC @1066, at 66% and 64% relative to the baseline”. I hope you do not mind me doing this (your web page will always be referenced).

Thanks.

Mark

Posted by Mark on May 26, 2009 at 05:23 PM EDT #

great! great! great!

Posted by JHIN on June 03, 2009 at 10:30 PM EDT #

For a 72GB situation, if we use a 2 DPC population pattern of "one DR 1066 4GB + one QR 1066 8GB" DIMMs on all six channels, will these 1066 DIMMs run at their peak rate of 1066MT/s or at 800MT/s?

Posted by jsjs on June 24, 2009 at 10:43 PM EDT #

Populating a quad rank DIMM along with a single, dual or another quad rank module will result in the memory bus operating at 800MT/s.

Posted by John Nerl on June 25, 2009 at 12:40 AM EDT #

John, very useful info. Does Sun have anything unique in this space?

Posted by albie on July 09, 2009 at 07:50 AM EDT #

Not necessarily unique, but certainly not mainstream: most of Sun's Nehalem products now support two DIMMs per channel at DDR3-1333. For more information, please see my blog update at: http://blogs.sun.com/jnerl/entry/update_to_configuring_and_optimizing

Posted by John Nerl on July 13, 2009 at 01:08 AM EDT #

I have the Dell T7500, with dual Xeon 5580s, dual FX5800 graphics, and 96GB of quad-rank 8GB DIMMs at 1066MHz.
I find the RAM operates at 800MHz; however, when I had 48GB of 1333MHz memory, it operated at 1333.
I am told that dual-rank 8GB 1333 DIMMs, fully populated, will operate at 1333.
Is this true? Is dual rank more technologically advanced than quad rank, or is it the other way around?

Posted by Ahhmed on July 19, 2009 at 03:44 AM EDT #

An unfortunate consequence of quad-rank DIMMs is that they need to be clocked down to 800MHz when two are populated in a channel in order to maintain signal integrity. Think of it like this: quad-rank DIMMs present 4 loads each to the memory data bus and dual-rank only 2 loads each, so you have double the bus loading with quad-rank modules installed.

You could say that dual-rank modules are technologically advanced over quad-ranks at a given density, because the dual-rank modules are built with DRAMs that are twice as dense (newer technology). Also, the dual-rank configuration will provide about 30% higher memory bandwidth than the quad-rank solution. The trade-off is price.

Posted by John Nerl on July 20, 2009 at 12:30 AM EDT #

Please note that an update to the original blog entry has been posted at:
http://blogs.sun.com/jnerl/entry/update_to_configuring_and_optimizing

Posted by John Nerl on July 20, 2009 at 12:32 AM EDT #

Thanks John for your excellent guidance.

In your answer to Michelle, you mention having 12 GB in one cpu (1 DR 4GB per channel) and 24 GB in the other (2 DR 4GB per ch) , to obtain 36 GB with good performance.
Do you think that the OS (let's say Solaris & Linux) and apps in general will handle this unbalanced memory config properly ? This is a bit mindbreaking to me.

Another question: What about mixing DIMM sizes in a channel ?
I'm considering having 1x DR 4GB (1333) + 1x DR 8GB (1033) per ch , to obtain 72 GB (pretty close to the traditional magic number of 64 GB).
Will this be better than 3x DR 4GB 1333 per channel (74% BW according to your table3)

Thanks!

Posted by Angel on September 10, 2009 at 02:58 AM EDT #

Hi Angel,

I'm not an OS expert but would expect that since Solaris and Linux are both NUMA-aware that should help with balancing things out. It's likely that some amount of performance loss will result in a heavily utilized system scenario, where remote latency will start coming into play. My comments to Michelle were more focused on local memory bus performance.

Mixing 1x DR 4GB (1333) and 1x DR 8GB (1333) per channel will result in the memory subsystem clocking down to 1066, while 3x DR 4GB 1333 per channel will run at 800. I'd expect the mixed configuration to provide somewhere around 85% relative bandwidth which is higher than the 3x DR per channel config.

Posted by John Nerl on September 10, 2009 at 04:17 AM EDT #

Hi John - thanks for this excellent in-depth article. 2 1/2 years on it still ranks top, as no one seems to have come up with anything like it yet.

A quick question for you - do you know of any way to get a Sun Ultra 27 to accept 4GB DIMMs as well? What specs would those DIMMs have to be? 1066MHz ECC Registered dual-rank?

Thanks so much!

Posted by Christian Kroker on September 09, 2011 at 05:19 AM EDT #

Hi Christian, thanks for your kind words.

I can't recommend installing 4GB DIMMs into an Ultra 27 as I've never tested that configuration, but if you were to try them they should be unbuffered (not registered) ECC dual-rank, and either 1066 or 1333 speed rated.

Posted by John Nerl on September 13, 2011 at 04:05 AM EDT #
