By Greg Price on Nov 13, 2008
My role at Fishworks has been to enhance the kernel to enable features like hot plug, multipathing, clustering and device management in the Sun Storage 7000 series, particularly the 7410. These features are the sort of thing you should just expect in an enterprise grade storage product. You need to be able to dynamically change your storage hardware and withstand component failures – transparently. Ultimately these kernel changes will go back into generic Solaris to support Sun's J4x00 storage products, but for the cool features you'll want a 7000!
The past couple of years have been an amazing time for me. I've worked with the most talented, motivated and passionate team of engineers creating a great product line. There have been many long days and nights, with all the thrills and spills you'd expect, so I need to thank a few people: the Fishworks team, our testers (particularly Michael Harsch), Javen Wu, and most of all - our families.
Now that I can tell you what I've been working on, here are some of the details.
Talking to the disks:
Fishworks appliances (as we call them) utilise a combination of SAS and SATA disks, both traditional rotating magnetic disks and Solid State Disks, with the hosts using LSI 106xE series HBAs and onboard controllers. These controllers can communicate with both SAS and SATA disks, and can provide both internal and external storage connections. This line of controllers is also capable of performing on-chip hardware RAID and even multipathing, but we don't use either of those features in our product (for a number of good reasons); it's all done in Solaris.
A heavily modified version of the mpt(7d) driver is used for these controllers and the associated parallel SCSI controller, the 1030. The design goal was to enrich the driver and frameworks to handle cascaded JBODs in a multipathed, hot-pluggable environment. mpt already handled parallel SCSI, hardware RAID, SAS and SATA, and was recently modified to support multipathing for the 2530 SAS arrays, so it has a long history.
Apart from the groundbreaking use of SSDs, these appliances provide bulk data storage using large numbers of disks. The important factor is that the disks are connected as JBODs and directly attached disks, rather than through complicated hardware RAID and storage controllers. This reverses the trend in which operating systems communicate with a small number of storage targets (with a number of disks hidden behind enclosure controllers); instead, the operating system directly manages all of the individual disks. From a driver perspective, this creates extra challenges in maintaining performance, particularly when using commodity parts.
J4x00 JBOD architecture:
While SAS disks are capable of supporting multiple paths, SATA disks are not. To overcome this, a transparent active/active SATA multiplexer is attached to each disk, providing two SATA connections per disk. While the mux is technically a single point of failure for access to its disk, there is an individual mux for each disk, and the use of ZFS provides protection against individual disk and mux failures. The internal connection of disks in the JBOD is a star topology with an LSI SAS expander at the centre; these expanders can be thought of as the functional equivalent of an Ethernet switch. The topology is duplicated in the JBOD to provide multiple paths, with the two host-side connections of each disk mux connecting to different expanders. The external SAS ports also connect to these expanders. The connections between the muxes and expanders are one phy wide, while each external JBOD connection is 4 phys wide.
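To make the duplicated star topology concrete, here is a minimal Python sketch of a JBOD with two expanders and one mux per disk. It is only a toy model for reasoning about which failures remove which paths; the class name `Jbod`, the expander labels and the `mux:<disk>` naming are all made up for illustration and have nothing to do with the real driver or enclosure firmware.

```python
from dataclasses import dataclass, field


@dataclass
class Jbod:
    """Toy model: every disk sits behind its own mux, and every mux
    connects to both expanders (hypothetical names throughout)."""
    disks: list
    expanders: tuple = ("expander-A", "expander-B")
    failed: set = field(default_factory=set)  # names of failed components

    def paths(self, disk):
        """Return the expanders through which `disk` is still reachable."""
        if disk not in self.disks:
            raise ValueError(f"unknown disk: {disk}")
        # A failed disk or a failed mux removes every path to that one disk.
        if disk in self.failed or f"mux:{disk}" in self.failed:
            return []
        # A failed expander removes one path to every disk in the JBOD.
        return [e for e in self.expanders if e not in self.failed]


jbod = Jbod(disks=["disk0", "disk1"])
jbod.failed.add("expander-A")   # lose one expander: each disk keeps one path
jbod.failed.add("mux:disk0")    # lose one mux: only disk0 becomes unreachable
```

In this model an expander failure costs every disk one path but no disk becomes unreachable, while a mux failure takes out exactly one disk, which is the case left to ZFS redundancy.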
A “novel” aspect of SAS is STP, the SATA Tunnelling Protocol, which allows a SATA device to be connected to a SAS port. To the operating system, SATA disks look similar to SAS disks. The HBA translates SCSI commands (used for communicating with SAS disks) to SATA commands and tunnels them over the SAS fabric. The expander removes the tunnel and speaks native SATA to the mux. Another role of the expander is to co-ordinate requests from multiple initiators (typical of clustered environments). The current generation of LSI expander only provides single affiliations. Simplistically, this means only one initiator is allowed to communicate with each disk on a given path; attempts to access the disk from another initiator over that path, even one in the same host, will be rejected. Different hosts can still access the same disk via different paths, so, as with regular SAS disks, access needs to be co-ordinated. Keith's clustering does this for you!
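The single-affiliation behaviour can be sketched as a tiny state machine. This is a simplified illustration in Python, not the expander's actual logic: the class `StpPort`, its methods and the initiator names are hypothetical, and real affiliation handling has more states than "held" and "free".

```python
class StpPort:
    """Toy model of one expander port fronting a SATA disk, holding at
    most one affiliation at a time (hypothetical names throughout)."""

    def __init__(self):
        self.affiliated = None  # initiator currently holding the affiliation

    def open(self, initiator):
        """Return True if `initiator` may talk to the disk via this port."""
        if self.affiliated is None:
            self.affiliated = initiator  # first initiator in takes the affiliation
        return self.affiliated == initiator

    def clear(self, initiator):
        """Release the affiliation, but only if `initiator` holds it."""
        if self.affiliated == initiator:
            self.affiliated = None


# One disk, reachable through a port on each of the two expanders.
path_a, path_b = StpPort(), StpPort()

first = path_a.open("host1-hba0")       # accepted: takes the affiliation
same_host = path_a.open("host1-hba1")   # rejected, even though it is the same host
other_path = path_b.open("host2-hba0")  # accepted: a different path to the same disk
```

The last line is the interesting one: both hosts now hold an affiliation to the same disk, each on its own path, so nothing at this level stops them writing to it concurrently. That is the gap the clustering software has to close.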
Not only do the expanders provide access to the disks, they also provide SMP and SES services to interrogate and control the JBOD enclosure. This all happens in-band, on the same SAS channel used for data.