By relling on Jul 09, 2008
This morning, we announced the newest Just a Bunch of Disks (JBOD) storage arrays. These are actually very interesting products from many perspectives. One thing for sure, these are not like any previous JBOD arrays we've ever made. The simple, elegant design and high reliability, availability, and serviceability (RAS) features are truly innovative. Let's take a closer look...
Your Daddy's JBOD
In the bad old days, JBOD arrays were designed around bus or loop architectures. Sun has been selling JBOD enclosures using parallel SCSI busses for more than 20 years. There were even a few years when fibre channel JBOD enclosures were sold. Many first-generation systems were not what I would call high-RAS designs. Often there was only one power supply or fan set, but that really wasn't what caused many outages. Placing a number of devices on a bus or loop exposes them all to bus or loop failures: if you hang a parallel SCSI bus, you stop access to every device on the bus. The vast majority of parallel SCSI implementations also used single-port disks.

The A5000 fibre channel JBOD attempted to fix some of these deficiencies: redundant power supplies, dual-port disks, two fibre channel loops, and two fibre channel hubs. When it was introduced, fibre channel disks were expected to rule the enterprise. The reality was not quite so rosy. Loops still represent a shared, single point of failure in the system, and a misbehaving host or disk could still lock up both loops, rendering this "redundant" system useless. Fortunately, the power of the market drove system designs to the point where the fibre channel loops containing disks are usually hidden behind a special-purpose controller. By doing this, the design space can be limited and the failure modes more easily managed. In other words, instead of allowing a large variety of hosts to be directly connected to a large variety of disks in a system where a host or disk could hang the shared bus or loop, array controllers divide and conquer the problem by reducing the possible permutations.
This is basically where the market was, yesterday. Array controllers are very common, but they represent a significant cost, and the cost increases for high-RAS designs because you need redundant controllers with multiple host ports.
Suppose we could revisit the venerable JBOD, but using modern technology? What would it look like?
SCSI, FC, ATA, SATA, and SAS
When we use the term SCSI, most people think of the old, parallel Small Computer System Interface bus. This was a parallel bus implemented in the bad old days with the technology available then. Wiggling-wire speed was relatively slow, so bandwidth increases were achieved by using more wires. It is generally faster to push photons through an optical fiber than to wiggle a wire, and thus fibre channel (FC) was born. In order to leverage some of the previous software work, the FC designers expanded the SCSI protocol to include a serial interface (and they tried to add a bunch of other stuff too, but that is another blog). When people say "fibre channel" they actually mean "serial SCSI protocol over optical fiber transport." Around this time, the venerable Advanced Technology Attachment (ATA) disk interface used in many PCs was also feeling the strain of performance improvement and cost reduction.
Cost reductions? Well, if you have more wires, then the costs will go up. Connectors and cables get larger. From a RAS perspective, the number of failure opportunities increases. A (parallel) UltraSCSI implementation needs 68-pin connectors. Bigger connectors with big cables must be stronger and are often designed with lots of structural metal. Using fewer wires means that the connector and cables get smaller, reducing the opportunities for failure, reducing the strength requirements, and thus reducing costs.
Back to the story: the clever engineers said, well, if we can use a protocol like SCSI or ATA over a fast, serial link, then we can improve performance, improve RAS, and reduce costs -- a good thing. A better thing was the realization that the same low-cost physical interface (PHY) can be used for both serial attached SCSI (SAS) and serial ATA (SATA). Today, you will find many host bus adapters (HBAs) which support both SAS and SATA disks. After all, the physical connections are the same; it is just a difference in the protocol running over the wires.
One of the more interesting differences between SAS and SATA is that the SAS guys spent more effort on making disks dual-ported. If you look around, you will find single- and dual-port SAS disks for sale, but rarely will you see a dual-port SATA disk (let me know if you find one). More on that later...
Now that we've got a serial protocol, we can begin to think of implementing switches. In the bad old days, Ethernet was often implemented using coaxial cable, basically a 2-wire shared bus. All Ethernet nodes shared the same coax, and if there was a failure in the coax, everybody was affected. The next Ethernet evolution replaced the coax with point-to-point wires and hubs to act as a collection point for the point-to-point connections. From a RAS perspective, hubs acted similar to the old coax in that a misbehaving node on a hub could interfere with or take down all of the nodes connected to the hub. With the improvements in IC technology over time, hubs were replaced with more intelligent switches. Today, almost all Gigabit Ethernet implementations use switches -- I doubt you could find a Gigabit Ethernet hub for sale. Switches provide fault isolation and allow traffic to flow only between interested parties. SCSI and SATA followed a similar evolution. Once it became serial, like Ethernet, then it was feasible to implement switching. RAS guys really like switches because in addition to the point-to-point isolation features, smart switches can manage access and diagnose connection faults.
J4200 and J4400 JBOD Arrays
Fast forward to 2008. We now have fast, switchable, redundant host-disk interconnect technology. Let's build a modern JBOD. The usual stuff is already taken care of: redundant power supplies, redundant fans, hot-pluggable disks, rack-mount enclosure... done. The connection magic is implemented by a pair of redundant SAS switches. These switches contain an ARM processor and have intelligent management. They also permit the SATA Tunneling Protocol (STP) to move SATA protocol over SAS connections. These are often called SAS Expanders, and the LSISASx36 provides 36 ports for us to play with. SAS connections can be aggregated to increase effective bandwidth. For the J4200 and J4400, we pass 4 ports each to two hosts and 4 ports "downstream" to another J4200 or J4400. For all intents and purposes, this works like most other network switches. The result is that each host has a logically direct connection to each disk. Each disk is dual ported, so each disk connects to both SAS expanders. We can remotely manage the switches and disks, so replacing failed components is easy, even while maintaining service. RAS goodness.
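To put the wide-port aggregation in concrete terms, here is a back-of-the-envelope sketch in Python. The 3 Gb/s per-PHY link rate and the 8b/10b encoding overhead are my assumptions about first-generation SAS, not figures taken from the product specifications:

```python
# Back-of-the-envelope bandwidth for a 4-wide SAS port, as used for each
# host connection above. Link rate and encoding overhead are assumptions.
LINK_RATE_GBPS = 3.0          # raw line rate per SAS PHY, Gb/s (assumed SAS-1)
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding: 8 data bits per 10 line bits
LANES = 4                     # four PHYs aggregated into one wide port

payload_gbps = LINK_RATE_GBPS * ENCODING_EFFICIENCY * LANES
payload_mb_s = payload_gbps * 1000 / 8  # convert Gb/s to MB/s

print(f"{payload_gbps:.1f} Gb/s payload, roughly {payload_mb_s:.0f} MB/s per 4-wide port")
```

The point is simply that a wide port scales with the number of lanes, so a 4-wide connection behaves like one fat pipe rather than four slow ones.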
As I mentioned above, SATA disks tend to be single-ported. How do we connect one to two different expanders? The answer is in the bill of materials (BOM) for SATA disks, where you will notice a SATA interposer card. This fits in the disk's carrier and provides a multiplexor which connects one SATA disk to two SAS ports -- in effect, what is built into a dual-port SAS disk. From a RAS perspective, this has little impact on the overall system RAS because the field replaceable unit (FRU) is the disk+carrier. We don't really care whether that FRU is a single-port SATA disk with an interposer or a dual-port SAS disk; if it breaks, the corrective action is to replace the disk+carrier. Since each disk slot has point-to-point connections to the two expanders, replacing a disk+carrier is electrically isolated from the other disks.
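The dual-path wiring can be sketched as a toy model. Everything here (the names, the slot count, the `reachable` helper) is illustrative, not an actual management API; the only claim is the topology itself, where every disk slot has a point-to-point link to each expander:

```python
# Toy model of the dual-path topology: each disk slot connects to both
# expanders, via either a dual-port SAS disk or a SATA interposer card.
# Names and slot count are illustrative assumptions.
expanders = ["expander_a", "expander_b"]
slots = [f"slot_{n}" for n in range(12)]

# Each slot has point-to-point links to both expanders.
paths = {slot: set(expanders) for slot in slots}

def reachable(slot, failed_expander):
    """A slot stays reachable as long as at least one expander survives."""
    return bool(paths[slot] - {failed_expander})

# Losing either expander still leaves every disk reachable via the other.
assert all(reachable(s, "expander_a") for s in slots)
assert all(reachable(s, "expander_b") for s in slots)
```

This is why replacing a disk+carrier, or even losing a whole expander, does not disturb the other disks: there is no shared bus left to hang.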
What About Performance?
Another reason that array controllers are more popular than JBODs is that they often contain some non-volatile memory used for a write cache. This can significantly improve write-latency-sensitive applications. When Sun attempted to replace the venerable SPARCStorage Array 100 (SSA-100) with the A5000 JBOD, one of the biggest complaints was that performance was reduced. This is because the SSA-100 had a non-volatile write cache while the A5000 did not. The per-write latency difference was an order of magnitude. OK, so this time around does anyone really expect that we can replace an array controller with a JBOD? The answer is yes, but I'll let you read about that plan from the big-wigs...
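To see why a non-volatile write cache matters so much, here is a rough sketch of synchronous-write throughput. The latency figures are illustrative assumptions chosen to show an order-of-magnitude gap, not measured SSA-100 or A5000 numbers:

```python
# Rough effect of a non-volatile write cache on synchronous write IOPS.
# Latencies below are illustrative assumptions, not measured values.
CACHED_WRITE_MS = 0.5  # write acknowledged from controller NVRAM
DIRECT_WRITE_MS = 5.0  # write acknowledged only after reaching the platters

def sync_iops(latency_ms):
    """Single-threaded synchronous writes: one outstanding I/O at a time."""
    return 1000.0 / latency_ms

print(f"cached: {sync_iops(CACHED_WRITE_MS):.0f} IOPS, "
      f"direct: {sync_iops(DIRECT_WRITE_MS):.0f} IOPS")
```

With one outstanding I/O, throughput is simply the inverse of per-write latency, so a 10x latency gap is a 10x throughput gap for latency-sensitive synchronous workloads.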
I had intended to show some block diagrams here, but couldn't find any I liked that weren't tagged for internal use only. If I find something later, I'll blog about it.