Diagnosing FB-DIMM channel errors

The T5120/T5220 systems Sun recently announced are some of the first systems to come out of Sun using Fully Buffered DIMMs (FB-DIMMs). This new technology presented some new diagnosis challenges for us, but very recently, CR 6536482 - "diagnose FBR and FBU errors to branch" was putback into Solaris 10 (it's been in OpenSolaris for a few weeks now). A patch will be coming soon, and while I don't know the patch number, it'll almost definitely be an update to 119578.

First, some very quick background on FB-DIMMs in layman's terms. An FB-DIMM contains an Advanced Memory Buffer (AMB) that sits between the memory controller and the memory module itself. Multiple FB-DIMMs are connected in a serial fashion, AMB-to-AMB.

In other words, a daisy chain. In T5120/T5220, the memory controller is in the UltraSPARC-T2 chip and we have up to four FB-DIMMs per branch.

So back to 6536482. On T5120/T5220, detection and reporting of FB-DIMM channel errors has been in place since day one. But because this is new technology to Sun, no diagnosis was issued for these errors, particularly the recoverable ones. We had to be careful about when to issue a diagnosis against the channel. It's cool that FB-DIMMs can tell us there's a problem with the channel, but the downside is that there is insufficient information available to determine where in the channel the problem exists. All we know is that something in the channel is broken. This is par for the course with uncorrectable errors (an FBU ereport), but something new for us with correctable events (an FBR ereport). Care was taken to select the appropriate SERD values for these correctable FBR events, as the resulting diagnosis implicates everything in the FB-DIMM channel. On a fully loaded T5120/T5220, that equates to 4 DIMMs plus the motherboard (the T2 chip).

Anyway, let's see an example of a diagnosis of an FBR. As I usually do in my blogs, we'll look at this top-down, from the point of view of what you'd see on your system if such a diagnosis occurred. On the console, the following is reported:

SUNW-MSG-ID: SUN4V-8000-FJ, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Oct 16 15:10:53 EDT 2007
PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-100
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: d505971c-f3dd-e27e-9e05-faed155c39bd
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller.
  Refer to http://sun.com/msg/SUN4V-8000-FJ for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.

Looking at the fmdump for the event, we see:

# fmdump -v -u d505971c-f3dd-e27e-9e05-faed155c39bd
TIME                 UUID                                 SUNW-MSG-ID
Oct 16 15:10:53.2312 d505971c-f3dd-e27e-9e05-faed155c39bd SUN4V-8000-FJ
  70%  fault.memory.link-c

        Problem in: mem:///unum=MB/CMP0/BR0/CH0/D0
           Affects: mem:///unum=MB/CMP0/BR0/CH0/D0
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053
                    :server-id=wgs48-100:serial=22ab471//motherboard=0/chip=0/branch=0/dram-channel=0/dimm=0
          Location: MB/CMP0/BR0/CH0/D0

  30%  fault.memory.link-c

        Problem in: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053
                    :server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
           Affects: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053
                    :server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053
                    :server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
          Location: MB

In my particular system, I only have a single DIMM in the branch, so only a single DIMM is listed as a suspect. But in general, all DIMMs in the affected branch are listed. Also, notice the certainty of the fault is weighted toward the DIMMs. The 70%/30% ratio always holds, irrespective of the number of DIMMs in the branch. When multiple DIMMs are indicted, the 70% certainty is evenly distributed across the DIMMs.
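To make the 70%/30% split concrete, here is a minimal Python sketch of how the certainty might be divided across suspects. The function name `suspect_certainties` is made up for illustration, and the DIMM names other than MB/CMP0/BR0/CH0/D0 are hypothetical; this only mirrors the rule described above and is not the actual cpumem-diagnosis code.

```python
def suspect_certainties(dimms, dimm_share=70.0, board_share=30.0):
    """Split dimm_share evenly across the DIMMs; the motherboard keeps board_share."""
    if not dimms:
        raise ValueError("a branch needs at least one DIMM to diagnose")
    per_dimm = dimm_share / len(dimms)
    suspects = [(name, per_dimm) for name in dimms]
    suspects.append(("MB", board_share))
    return suspects


# Single DIMM in the branch, as on the system shown above.
single = suspect_certainties(["MB/CMP0/BR0/CH0/D0"])

# Fully loaded branch; these extra DIMM names are hypothetical.
full = suspect_certainties([
    "MB/CMP0/BR0/CH0/D0",
    "MB/CMP0/BR0/CH0/D1",
    "MB/CMP0/BR0/CH1/D0",
    "MB/CMP0/BR0/CH1/D1",
])

for name, pct in full:
    print(f"{pct:5.1f}%  {name}")
```

With four DIMMs this sketch gives each DIMM 17.5% and leaves the motherboard at 30%. How the real diagnosis engine rounds or displays fractional certainties is not something this sketch tries to reproduce.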

Finally, looking at the fmstat output and another fmdump command, there were multiple FBR events leading to the crossing of the SERD threshold and the diagnosis.

# fmstat -s -m cpumem-diagnosis
NAME                     >N   T    CNT  DELTA           STAT
branch_MB/CMP0/BR0_serd  >14  30m  15   36131188301ns   fire

# fmdump -e -u d505971c-f3dd-e27e-9e05-faed155c39bd
TIME                 CLASS
Oct 16 15:09:49.5552 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:46.8111 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:46.3707 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.9953 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.6104 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:45.2361 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:44.7160 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.7703 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.4599 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:31.0751 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:30.6986 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:30.0915 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:29.5263 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:28.7693 ereport.cpu.ultraSPARC-T2plus.fbr
Oct 16 15:09:16.8732 ereport.cpu.ultraSPARC-T2plus.fbr
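For readers who haven't met SERD engines before, the idea can be sketched in a few lines of Python. This is a simplified illustration, not the fmd implementation: the `SerdEngine` class is hypothetical, and it assumes the ">14 30m" columns mean "fire once more than 14 events land within a 30 minute window". The timestamps are the 15 FBR ereports listed above, oldest first.

```python
from collections import deque
from datetime import datetime, timedelta


class SerdEngine:
    """Toy SERD engine: fires once when more than n events fall within the window."""

    def __init__(self, n, window):
        self.n = n
        self.window = window
        self.events = deque()
        self.fired = False

    def record(self, when):
        """Record one event; return True only on the event that trips the engine."""
        self.events.append(when)
        # Discard events that have aged out of the window ending at `when`.
        while when - self.events[0] > self.window:
            self.events.popleft()
        if not self.fired and len(self.events) > self.n:
            self.fired = True
            return True
        return False


# Time-of-day of each FBR ereport from the fmdump -e output, oldest first.
FBR_TIMES = [
    "15:09:16.8732", "15:09:28.7693", "15:09:29.5263", "15:09:30.0915",
    "15:09:30.6986", "15:09:31.0751", "15:09:31.4599", "15:09:31.7703",
    "15:09:44.7160", "15:09:45.2361", "15:09:45.6104", "15:09:45.9953",
    "15:09:46.3707", "15:09:46.8111", "15:09:49.5552",
]

engine = SerdEngine(n=14, window=timedelta(minutes=30))
fired_on = None
for count, stamp in enumerate(FBR_TIMES, start=1):
    if engine.record(datetime.strptime(stamp, "%H:%M:%S.%f")):
        fired_on = count

print("engine fired on event", fired_on)
```

Under those assumptions the engine stays quiet for the first 14 events and fires on the 15th, which matches the CNT of 15 and the "fire" status shown by fmstat.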

A final note is that on T5120/T5220, the reporting of FBR events is filtered. If one examines the expected bit error rate (BER) for FB-DIMMs, the bandwidth of the memory channels, and so forth, the math tells us that we can expect an FBR roughly once per second (yes, once per second) in worst-case conditions. We filter to avoid inundating the error logs with FBR events. The SERD thresholds we see in the fmstat output are the result of worst-case calculations coupled with empirical testing results, adjusted to account for the filtering done in the reporting.
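The shape of that calculation is simply bit error rate multiplied by the number of bits moved per second. Here is a rough Python sketch; every number in it is an illustrative placeholder chosen by way of example (BER, lane rate, lane count, channel count), not a figure taken from the actual analysis, and the function name is made up.

```python
def expected_errors_per_second(ber, lane_rate_bps, lanes_per_channel, channels):
    """Expected link bit errors per second = BER x total bits transferred per second."""
    total_bits_per_second = lane_rate_bps * lanes_per_channel * channels
    return ber * total_bits_per_second


# Placeholder worst-case inputs; all four values are assumptions for illustration.
rate = expected_errors_per_second(
    ber=1e-12,             # assumed worst-case bit error rate
    lane_rate_bps=4e9,     # assumed per-lane signalling rate
    lanes_per_channel=24,  # assumed lanes per FB-DIMM channel (both directions)
    channels=8,            # assumed number of channels on the system
)
print(f"about {rate:.2f} errors per second")
```

With these placeholder inputs the estimate comes out a little under one error per second, the same order of magnitude as the figure quoted above. Different inputs will move the number, which is why the thresholds were also checked against empirical testing.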


