By denissheahan on Oct 13, 2008
The glue for this system is a new ASIC from Sun, code-named Zambezi. The Zambezi is a coherency hub that enables the UltraSPARC T2 Plus processor to scale from 1-socket to 4-socket systems. The official name of the ASIC is the UltraSPARC T2 Plus XBR, but this is a bit of a mouthful, so we will stick with Zambezi.
The main functions of the Zambezi are:
It broadcasts snoop requests to all sockets
It serializes requests to the same address
It consolidates all snoop responses
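To make the three functions above concrete, here is a toy Python sketch of a hub for a single coherence plane. It is only an illustration under loud assumptions: the class names `ToySocket` and `ToyHub`, the per-address FIFO, and the boolean hit/miss snoop response are all invented for this example and are not how the real ASIC or its protocol is implemented.

```python
from collections import defaultdict, deque


class ToySocket:
    """Hypothetical stand-in for one processor socket."""

    def __init__(self, name, cached=()):
        self.name = name
        self.cached = set(cached)  # addresses this socket currently holds

    def snoop(self, address):
        # Assumed response format: True if the line is cached here.
        return address in self.cached


class ToyHub:
    """Toy coherency hub: broadcast, serialize per address, consolidate."""

    def __init__(self, sockets):
        self.sockets = sockets
        self.pending = defaultdict(deque)  # one FIFO of requesters per address

    def submit(self, requester, address):
        # Requests to the same address queue up in arrival order.
        self.pending[address].append(requester)

    def drain(self, address):
        """Handle queued requests for one address strictly one at a time."""
        results = []
        queue = self.pending.pop(address, deque())
        while queue:
            requester = queue.popleft()
            # Broadcast the snoop to all sockets.
            responses = [s.snoop(address) for s in self.sockets]
            # Consolidate the responses into a single answer for the requester.
            holders = [s.name for s, hit in zip(self.sockets, responses) if hit]
            results.append((requester, holders))
        return results


sockets = [ToySocket(f"cpu{i}") for i in range(4)]
sockets[2].cached.add(0x1000)
hub = ToyHub(sockets)
hub.submit("cpu0", 0x1000)
hub.submit("cpu1", 0x1000)
demo = hub.drain(0x1000)
print(demo)
```

The two requests to address 0x1000 are answered in the order they arrived, and each gets one consolidated reply naming the socket that holds the line.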
The Zambezi uses the same coherence protocol as the T5140/T5240, which I have described in a previous blog post http://blogs.sun.com/deniss/. Communication is over point-to-point serial links that are implemented on top of an FBDIMM low-level protocol. The full implementation was described in a paper at the 2008 IEEE Symposium on High-Performance Interconnects http://www.hoti.org/hoti16/program/2008slides/Session_Presentation/Feehrer_CoherencyHub_2008-08-27-09-51.pdf
CMT systems require a large amount of memory bandwidth in order to scale. This was one of the main rationales for moving to FBDIMM with the UltraSPARC T2, as it provides 3x the memory bandwidth of the DDR2 interface used in the UltraSPARC T1. The biggest challenge for Zambezi was to deliver this bandwidth across 4 sockets without becoming the bottleneck to scaling.
Coming out of each UltraSPARC T2 Plus processor are 4 independent coherence planes. The T2 Plus has 8 banks of L2 cache and each plane is responsible for the traffic from two of these banks. The plane is identified by two bits (12 and 13) of the physical address. There are 4 Zambezi hubs in the system, each handling a single coherence plane. Each Zambezi is connected to each of the four T2 Plus processors over four separate point-to-point serial coherence links. Because the planes are independent, there are no connections between the Zambezi chips.
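As a small illustration of the plane selection described above, the following Python sketch extracts bits 12 and 13 from a physical address. Treating bit 12 as the low bit of the plane number, and numbering the planes 0 to 3, are assumptions made for this example; the function name `coherence_plane` is likewise hypothetical.

```python
PLANE_SHIFT = 12   # lowest of the two plane-select bits
PLANE_MASK = 0b11  # two bits: bits 12 and 13


def coherence_plane(physical_address: int) -> int:
    """Return the assumed plane number (0-3) from bits 12 and 13."""
    return (physical_address >> PLANE_SHIFT) & PLANE_MASK


# Consecutive 4KB-aligned regions rotate through the four planes.
for addr in (0x0000, 0x1000, 0x2000, 0x3000, 0x4000):
    print(hex(addr), "-> plane", coherence_plane(addr))
```

Under this numbering, addresses that differ only below bit 12 or above bit 13 always land on the same plane, and therefore on the same Zambezi.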
A diagram of the architecture is shown on the right
The primary goal of Zambezi was to minimize the latency for data crossing the ASIC. The final Zambezi latency achieved was a mere 33.1ns, which exceeded our expectations. Approximately a third of the 33ns is serialization/deserialization overhead, and another third is link-layer overhead (framing composition/decomposition, CRC check/generate, etc.).
Note that the Zambezi is involved even for local memory accesses, in order to resolve conflicts. The local memory latency of the T5440, at 229ns, is therefore slightly higher than that of the T5140 and T5240. The remote memory latency for the T5440 is 311ns. This makes the T5440 a NUMA machine, but not highly so.
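To put "not highly so" into numbers, here is a quick Python sketch using only the two latencies quoted above. The blended average assumes that a given fraction of accesses hit local memory; the 25% figure used in the example corresponds to memory spread evenly over four sockets, which is an assumption for illustration and not a measured access pattern. The function names are hypothetical.

```python
LOCAL_NS = 229   # local memory latency quoted above
REMOTE_NS = 311  # remote memory latency quoted above


def numa_ratio(local_ns=LOCAL_NS, remote_ns=REMOTE_NS):
    """Remote-to-local latency ratio; 1.0 would mean uniform access."""
    return remote_ns / local_ns


def average_latency_ns(local_fraction, local_ns=LOCAL_NS, remote_ns=REMOTE_NS):
    """Blended latency for an assumed fraction of local accesses."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


print(f"remote/local ratio: {numa_ratio():.2f}")
print(f"average at 25% local: {average_latency_ns(0.25):.1f} ns")
```

The ratio comes out at roughly 1.36, so a remote access costs about a third more than a local one.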
The bandwidth achieved by the Zambezi is extremely impressive. The theoretical bandwidth from our simulations is 84GB/s read, 42GB/s write and 126GB/s combined.
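The figures above can be sanity-checked with a few lines of Python. The per-plane split below rests on an assumption that the post does not state: that the quoted numbers are system-wide totals shared evenly by the four Zambezi hubs. Treat the per-plane values as a back-of-the-envelope illustration only; `per_plane` is a hypothetical helper.

```python
READ_GBPS = 84.0       # theoretical read bandwidth quoted above
WRITE_GBPS = 42.0      # theoretical write bandwidth quoted above
COMBINED_GBPS = 126.0  # theoretical combined bandwidth quoted above
NUM_PLANES = 4         # one Zambezi per coherence plane


def per_plane(total_gbps, planes=NUM_PLANES):
    """Even split across planes (assumption: totals are system-wide)."""
    return total_gbps / planes


# The combined figure is simply read plus write, at a 2:1 ratio.
assert READ_GBPS + WRITE_GBPS == COMBINED_GBPS
print("read:write ratio:", READ_GBPS / WRITE_GBPS)
print("assumed per-plane read GB/s:", per_plane(READ_GBPS))
print("assumed per-plane write GB/s:", per_plane(WRITE_GBPS))
print("assumed per-plane combined GB/s:", per_plane(COMBINED_GBPS))
```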
Translating these speeds and feeds to the real world, the T5440 has achieved many world record benchmark results. These results all demonstrate the highest levels of throughput and scalability. Zambezi is THE key component in this scalability.
The ASIC is built in 65nm technology from Texas Instruments, has about 3.6 million gates and a die size of about 79 square mm, and runs at 800MHz. The photo on the right shows the Zambezi floorplan.