Monday Oct 13, 2008

Zambezi Architecture

Today we are launching our newest CMT server, the T5440. This server is a monster of performance and scalability. It can have up to four T2 Plus processors, at 1.2GHz or 1.4GHz, in a 4 Rack Unit (RU) chassis.

The glue for this system is a new ASIC from Sun code-named Zambezi. The Zambezi is a coherency hub that enables the UltraSPARC T2 Plus processor to scale from 1-socket to 4-socket systems. The official name of the ASIC is the UltraSPARC T2 Plus XBR, but this is a bit of a mouthful, so we will stick with Zambezi.

The main functions of the Zambezi are:
  • Broadcasting snoop requests to all sockets

  • Serializing requests to the same address

  • Consolidating all snoop responses

The Zambezi uses the same coherence protocol as the T5140/T5240, which I have described in a previous blog. Communication is over point-to-point serial links implemented on top of an FBDIMM low-level protocol. The full implementation was described in a paper at the 2008 IEEE Symposium on High-Performance Interconnects.

CMT systems require large amounts of memory bandwidth in order to scale. This was one of the main rationales for moving to FBDIMM with the UltraSPARC T2, as it provides 3x the memory bandwidth of the DDR2 interface used in the UltraSPARC T1. The biggest challenge for Zambezi was to deliver this bandwidth across 4 sockets without becoming the bottleneck to scaling.

Coming out of each UltraSPARC T2 Plus processor are 4 independent coherence planes. The T2 Plus has 8 banks of L2 cache, and each plane is responsible for the traffic from two of these banks. The plane is identified by two bits (12 and 13) of the physical address. There are 4 Zambezi hubs in the system, each handling a single coherence plane. Each Zambezi is connected to each of the four T2 Plus processors over four separate point-to-point serial coherence links. Because the planes are independent, there are no connections between the Zambezi chips.
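To make the plane selection concrete, here is a minimal sketch (Python, purely illustrative; the function name is my own) of how the coherence plane could be derived from a physical address using bits 12 and 13:

```python
def coherence_plane(phys_addr: int) -> int:
    # Bits 12 and 13 of the physical address select one of the
    # 4 coherence planes (and hence one of the 4 Zambezi hubs).
    return (phys_addr >> 12) & 0x3

# Consecutive 4KB-aligned addresses cycle through the 4 planes:
print([coherence_plane(n * 0x1000) for n in range(5)])  # → [0, 1, 2, 3, 0]
```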

A diagram of the architecture is shown on the right.

The primary goal of Zambezi was minimum latency for data crossing the ASIC. The final Zambezi latency achieved was a mere 33.1ns, which exceeded our expectations. Approximately a third of the 33ns is serialization/deserialization overhead, and another third is link-layer overhead (frame composition/decomposition, CRC generation/checking, etc.).

Note that the Zambezi is involved even for local access to memory, in order to resolve conflicts. The local memory latency for the T5440 is therefore 229ns, slightly higher than on the T5140 and T5240. The remote memory latency for the T5440 is 311ns. This makes the T5440 a NUMA machine, but not highly so.
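As a quick back-of-the-envelope check on that "not highly so" claim (Python, just arithmetic on the latencies above), the remote-to-local NUMA ratio works out to roughly 1.36:

```python
local_ns = 229   # T5440 local memory latency (ns)
remote_ns = 311  # T5440 remote memory latency (ns)

numa_ratio = remote_ns / local_ns
print(round(numa_ratio, 2))  # → 1.36
```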

The bandwidth achieved by the Zambezi is extremely impressive. The theoretical bandwidth from our simulations is 84GB/s read, 42GB/s write and 126GB/s combined.

Translating these speeds and feeds to the real world, the T5440 has achieved many world-record benchmark results. These results all demonstrate the highest levels of throughput and scalability. Zambezi is THE key component in this scalability.

The ASIC is implemented in 65nm technology from Texas Instruments, has about 3.6 million gates and a die size of about 79 square mm, and runs at 800MHz. The photo on the right shows the Zambezi floorplan.

T5440 Architecture

Today we are launching our newest CMT server, the T5440. This server is a monster of performance and scalability. It can have up to four T2 Plus processors, at 1.2GHz or 1.4GHz, in a 4 Rack Unit (RU) chassis. The design is modular, enabling 1-, 2-, 3-, and 4-processor configurations. The system scales to 32 cores, 256 threads, and 512GB of memory.
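The core and thread counts follow directly from the UltraSPARC T2 Plus topology (8 cores per socket, 8 threads per core); a quick sketch of the arithmetic (Python, illustrative only):

```python
sockets = 4
cores_per_socket = 8   # UltraSPARC T2 Plus
threads_per_core = 8

cores = sockets * cores_per_socket
threads = cores * threads_per_core
print(cores, threads)  # → 32 256
```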

The diagram on the right shows the architecture. Each UltraSPARC T2 Plus processor has 4 coherency planes.

On a T5440 there are four high-speed hub chips, called Zambezis, running at 800MHz, which connect each of the four coherency planes. Each UltraSPARC T2 Plus processor communicates with all others in the system via the Zambezis. Each processor has its own local memory and can access all the remote memory of the other processors via the Zambezis. Thus the memory in the system scales as we add processors. Each processor also has its own integrated PCI-Express x8 link, so the I/O bandwidth also scales as we add processors.

The scalability advantage of this architecture was highlighted at the 2008 IEEE Symposium on High-Performance Interconnects. A set of slides is also available.

The design of the T5440 system itself is very different from our previous systems, which have all been traditional 1U and 2U servers.

The T5440 is 4 RU, and the processors and memory are on daughter cards that plug into a motherboard. The motherboard contains the Zambezis and the I/O subsystem. The daughter cards are configured in groups of two: one CPU card and one memory card per group. The memory cards are optional, however, as the CPU cards have enough DIMM slots for a minimum memory configuration.

The photo on the right shows a CPU daughter card, which contains one UltraSPARC T2 Plus processor and slots for 4 FBDIMMs. The UltraSPARC T2 Plus has 2 FBDIMM channels with 2 branches per channel, and the DIMM slots on the CPU daughter card are directly connected to these 4 branches.

The memory board, shown in the photo on the left, contains 12 slots for FBDIMM memory. The memory daughter card extends the 4 FBDIMM branches from the associated UltraSPARC T2 Plus processor. Each branch can be extended by up to 3 more DIMMs.

DIMMs come in 2GB, 4GB and 8GB sizes and run at 667MHz, just like in the T5140 and T5240 servers.

The minimum requirement is 4 DIMMs per processor. Other options are 8 or 16 DIMMs per processor.

Currently all processors must have the same amount of memory, but each processor can achieve this with different-sized DIMMs.
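Putting the population rules together, here is a small illustrative checker (Python; the helper name and structure are my own, not a Sun tool) for a proposed per-processor DIMM layout:

```python
VALID_DIMM_COUNTS = {4, 8, 16}    # DIMMs per processor
VALID_DIMM_SIZES_GB = {2, 4, 8}   # supported DIMM capacities

def config_valid(per_cpu_dimms):
    """per_cpu_dimms: one list of DIMM sizes (GB) per populated CPU."""
    totals = set()
    for dimms in per_cpu_dimms:
        if len(dimms) not in VALID_DIMM_COUNTS:
            return False
        if any(size not in VALID_DIMM_SIZES_GB for size in dimms):
            return False
        totals.add(sum(dimms))  # every CPU must reach the same total
    return len(totals) == 1

# Equal totals via different DIMM sizes is allowed:
print(config_valid([[4] * 8, [2] * 16]))  # → True  (32GB each)
print(config_valid([[4] * 8, [8] * 8]))   # → False (32GB vs 64GB)
```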

From these photos you can see that the connectors on the CPU and memory daughter cards are different, so the cards can only go in preassigned slots. This can be seen from the photo of the motherboard: the CPUs plug into the longer slots and the memory into the shorter ones.

Slots must be filled in the following order: CPU/MEM pair 0 and pair 1, followed by pair 2 and pair 3. Standing in front of the server, the slots are numbered, from front left to front right: 3 - 1 - 2 - 0.

The motherboard photograph also shows the 8 PCI-E slots. Although all are x8 electrically, two of the slots (the longer ones) have x16 physical connectors.

Note that PCI-E cards on the T5440 plug in vertically. This is a change from the T5120, T5220, T5140 and T5240 servers, which used risers and installed PCI-E cards horizontally.

Note also that two of the PCI-E slots have a second, smaller slot next to them. These are the XAUI slots. Sun-supplied XAUI cards plug into these slots and provide 10Gig networking.

The diagram on the right shows a fully configured T5440 with all its daughter cards and how these are connected to the Zambezis and the I/O system. The beauty of this physical design is upgradability: you can start with a 1-CPU configuration and add extra compute and memory as required.

Blanks must be installed in place of empty slots to ensure appropriate airflow.

A top view of a fully loaded T5440 is shown in the photo at the right. From bottom to top are the 4 large fans for cooling, the row of CPU and memory daughter cards, and the 8 PCI-E I/O slots. The separate circuit board is the Service Processor for the system, which also plugs in as a daughter card.

In the photo on the left we see the T5440 from the rear. Here you see the 4 power supplies, each of which is 1,120 watts. Powering a fully loaded T5440 with 4 CPUs, 4 disks and 512GB of memory requires 2 supplies. Four supplies are required for full redundancy in this configuration.

Also note the metal fillers that cover the 8 PCI-E slots, the built-in 4x1Gig copper network connections, and the 100Mb and serial links to the Service Processor, USB, etc. Also in this photo you can see one of the 4 fans pulled out of the system.

Finally, looking at the front of the system, we see the 4 built-in disks. These are SAS and can be either 73GB or 146GB. There is also a DVD drive if required.


