Welcome to Fishworks!


My role at Fishworks has been to enhance the kernel to enable features like hot plug, multipathing, clustering and device management in the Sun Storage 7000 series, particularly the 7410. These features are the sort of thing you should just expect in an enterprise grade storage product. You need to be able to dynamically change your storage hardware and withstand component failures – transparently. Ultimately these kernel changes will go back into generic Solaris to support Sun's J4x00 storage products, but for the cool features you'll want a 7000!

The past couple of years have been an amazing time for me. I've worked with the most talented, motivated and passionate team of engineers creating a great product line. There have been many long days and nights, with all the thrills and spills you'd expect, so I need to thank a few people: the Fishworks team, our testers (particularly Michael Harsch), Javen Wu, and most of all - our families.

Now that I can tell you what I've been working on, here are some of the details.

Talking to the disks:

Fishworks appliances (as we call them) utilise a combination of SAS and SATA disks, both traditional rotating magnetic disks and Solid State Disks, with the hosts using LSI 106xE series HBAs and onboard controllers. These controllers can communicate with both SAS and SATA disks, and can provide both internal and external storage connections. This line of controllers is also capable of performing on-chip hardware RAID and even multipathing, but we don't use either of those features in our product (for a number of good reasons); it's all done in Solaris.
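To illustrate why doing the redundancy in software is attractive, here's a toy Python sketch of mirrored writes in the spirit of "it's all done in Solaris": writes go to both sides, and reads survive one side failing. This is purely illustrative - it is not how ZFS is actually implemented, and all the names are made up.

```python
# Toy model of a software mirror (illustrative only, not ZFS internals).
class MirrorVdev:
    def __init__(self):
        self.sides = [{}, {}]   # two disks, each mapping block -> data
        self.failed = set()     # indices of failed sides

    def write(self, block, data):
        # Write to every healthy side of the mirror.
        for i, side in enumerate(self.sides):
            if i not in self.failed:
                side[block] = data

    def read(self, block):
        # Read from the first healthy side that holds the block.
        for i, side in enumerate(self.sides):
            if i not in self.failed and block in side:
                return side[block]
        raise IOError("no healthy side holds this block")

m = MirrorVdev()
m.write(0, b"metadata")
m.failed.add(0)                  # lose one disk (or its mux)
assert m.read(0) == b"metadata"  # data still available via the other side
```

Because the operating system owns the redundancy, it can make end-to-end decisions (checksums, self-healing, path selection) that an opaque hardware RAID controller cannot.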

A heavily modified version of the mpt(7d) driver is used for these controllers and the associated parallel SCSI controller - the 1030. The design goal was to enrich the driver and frameworks to handle cascaded JBODs in a multipathed, hot-pluggable environment. mpt already handled parallel SCSI, hardware RAID, SAS and SATA, and was recently modified to support multipathing for the 2530 SAS arrays, so it has a long history.

Apart from the groundbreaking use of SSDs, these appliances provide bulk data storage using large numbers of disks. The important factor is that the disks are connected as JBODs and directly attached disks, rather than through complicated hardware RAID and storage controllers. The significance is a reversal of the trend where operating systems communicate with a small number of storage targets (with a number of disks hidden behind enclosure controllers), towards an environment where the operating system directly manages all of the individual disks. From a driver perspective, this creates extra challenges in maintaining performance, particularly when using commodity parts.

J4x00 JBOD architecture:

While SAS disks are capable of supporting multiple paths, SATA disks are not. To overcome this, transparent active/active SATA multiplexers are attached to each disk, providing two SATA connections per disk. While the mux is technically a single point of failure for access to the disk, there are individual muxes for each disk, and the use of ZFS provides protection against individual disk and mux failures. The internal connection of disks in the JBOD is a star topology with an LSI SAS expander at the centre; these expanders can be thought of as the functional equivalent of an ethernet switch. The topology is duplicated in the JBOD to provide multiple paths, with the two host-side connections of each disk mux connecting to different expanders. The external SAS ports also connect to these expanders. The connections between the muxes and expanders are one phy wide, while each external JBOD connection is 4 phys wide.
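The path redundancy described above can be sketched in a few lines of Python. This is a hypothetical model (the disk count and names are illustrative, not taken from any real J4x00 configuration): each disk sits behind its own two-port mux, and each mux port connects to a different expander, so every disk ends up with exactly two independent paths.

```python
# Hypothetical sketch of the dual-expander JBOD topology (illustrative names).
def build_topology(n_disks=12):
    """Return (disk, mux, expander) path tuples for a duplicated star topology."""
    paths = []
    for disk in range(n_disks):
        # Each SATA disk sits behind its own 2-port active/active mux;
        # each mux port connects to a different expander (1 phy wide).
        for expander in ("expander_a", "expander_b"):
            paths.append((f"disk{disk}", f"mux{disk}", expander))
    return paths

paths = build_topology()
assert len(paths) == 24  # 12 disks x 2 paths each

# Every disk reaches the host through both expanders, so losing one
# expander (or one external cable) leaves every disk reachable.
by_disk = {}
for disk, mux, expander in paths:
    by_disk.setdefault(disk, set()).add(expander)
assert all(e == {"expander_a", "expander_b"} for e in by_disk.values())
```

The per-disk mux remains the one shared element on both paths, which is why, as noted above, ZFS-level redundancy across disks is still needed to cover a mux failure.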

A “novel” aspect of SAS is STP, the SATA Tunnelling Protocol – it allows a SATA device connection to a SAS port. To the operating system, SATA disks look similar to SAS disks. The HBA translates SCSI commands (used for communicating with SAS disks) to SATA commands and tunnels them over the SAS fabric. The expander removes the tunnel and speaks native SATA to the mux. Another role of the expander is to co-ordinate requests from multiple initiators (typical of clustered environments). The current generation of LSI expander only provides single affiliations – simplistically, only one initiator is allowed to communicate with each disk, and attempts to access the disk from another initiator, even from the same host, will be rejected. This is per path: different hosts can still access the same disk via different paths, hence, like regular SAS disks, access needs to be co-ordinated. Keith's clustering does this for you!
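A toy model may make the per-path affiliation rule clearer. This is a deliberately simplistic sketch (class and host names are invented): the first initiator to open an STP connection on a given path claims the affiliation for that path, a second initiator on the same path is rejected, but the second path is still free for another host.

```python
# Toy model of STP single-affiliation semantics, per expander path.
class StpPath:
    """One path to a SATA disk through one expander (simplified)."""
    def __init__(self):
        self.affiliation = None  # identity of the owning initiator, if any

    def open_connection(self, initiator):
        # First initiator to open the path claims the affiliation.
        if self.affiliation is None:
            self.affiliation = initiator
            return True
        # Only the affiliation owner may keep using this path;
        # everyone else gets the equivalent of an open reject.
        return self.affiliation == initiator

# Two paths to one SATA disk (one per expander), two clustered hosts.
path_a, path_b = StpPath(), StpPath()

assert path_a.open_connection("hostA")        # hostA claims path A
assert not path_a.open_connection("hostB")    # same path: rejected
assert path_b.open_connection("hostB")        # hostB still gets path B
assert path_a.open_connection("hostA")        # owner can keep going
```

This is why access co-ordination has to happen at a higher layer: nothing in the fabric stops both hosts writing to the same disk via different paths.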

Not only do the expanders provide access to the disks, they also provide SMP and SES services to interrogate and control the JBOD enclosure; this all happens in-band on the SAS channel used for data.


Regarding your paragraph on the J4x00 series, does that mean that a J4200 with dual controllers can be used in a Solaris Cluster (two node) configuration with SATA disks? Based on http://opensolaris.org/jive/thread.jspa?messageID=308495&tstart=0, I gather that SAS disks work, but it's unclear if SATA disks will also work.

Posted by William Yang on November 29, 2008 at 04:53 AM EST #

Hi William,

I'm not part of the Solaris Cluster team, but my guess is that it wouldn't work, or not yet anyway! There are a lot of changes to mpt and some related kernel pieces in the 7000 product that allow us to do 7410 clustering with SATA disks, but these changes aren't in the main Solaris source base yet. After those changes go back, it would depend on the Solaris Cluster product team. i.e. it would be non-trivial.

The issue is that SATA disks don't provide the SCSI PGR support for disk reservation, and the impact of affiliations causes issues for shared storage. If a single JBOD controller (and by implication a single path) were dedicated per cluster node, it would be possible to get past the affiliation issue - but that still leaves you to emulate the PGR support, and you don't have the benefits of multiple paths.

Posted by Greg Price on December 01, 2008 at 04:26 AM EST #

Hi Greg,

Thanks for the info. So if each node in a two node cluster has just one connection to the JBOD (node A with one connection to controller A and node B with one connection to controller B), do you know if the J4000 hardware or the SATA disks can support SCSI-2 (non-PGR) reservations? It might not be multipath, but we're trying to replace an older SCSI array without multipath anyways.

Do you know where I might be able to get some help from the Solaris Cluster team?

Posted by William Yang on December 01, 2008 at 05:13 AM EST #

Hi William,

I don't think it supports reservations. I couldn't see any mention in a quick scan of the SAT-2 spec I had handy. It's a tall order because the SATA disk only thinks it has one connection, but there is a mux front-ending that to provide the dual paths, then the expanders front-ending that.

To be honest, I think your choices are going to be either go with a supported array (I assume the 25xx arrays would be supported), or go with the 7410 product - but I guess it depends on what sort of clustering you're trying to do. i.e. local application or data service.

I don't have a specific contact in the Solaris Cluster team - I'd need to hunt around.

Posted by Greg Price on December 01, 2008 at 09:14 AM EST #

thanks for the interesting info.

Posted by unixfoo on December 02, 2008 at 11:20 AM EST #


Since I work in the Sun Cluster organization, I can answer the recent questions. The Sun Cluster 3.2 update 2 release that will ship in early 2009 contains a new feature called "Software Quorum". We are aware that there are storage devices, such as SATA and SSD, that do not support SCSI-2 or SCSI-3 reservation-related operations. "Software Quorum" emulates SCSI-3 Persistent Group Reservations entirely in software. All storage devices supported by a Solaris disk driver have a small area reserved for cluster use, so we can do persistent reservation emulation. This gives Sun Cluster the ability to support any shared disk (or shared storage that acts like a disk).

Sun Cluster provides two main features to support shared storage.
1) The shared storage device can be a quorum device. The "Software Quorum" protocol enables any shared disk to be a quorum device.
2) Sun Cluster provides fencing. SCSI-3 provides strong fencing, while SCSI-2 provides a less robust form. An important part of the SCSI-2 support is a thread that probes the device periodically and panics the node if the probe fails; the probe fails when the node has been fenced from the device. With "Software Quorum" we have a similar thread that also probes the device and panics the node if the probe fails. This provides a similar form of data integrity.

We have successfully tested "Software Quorum" with Sun Cluster and SATA disks. We have a group that qualifies various storage products, so you would have to check with them about which storage products have been qualified and about requests to do so.

Posted by Ellard Roush on December 11, 2008 at 04:00 AM EST #

You mention that "transparent active/active SATA multiplexers are attached to each disk". Does that mean your JBOD units need special nonstandard disks with these attached or that these multiplexers are onboard, sitting on each sata port, so standard off-the-shelf sata disks can be used?

Posted by Jure Pecar on January 25, 2009 at 04:40 PM EST #

No, the disks are standard. Like most OEMs, we'll have Sun firmware images that have settings and labels set to Sun specification - which is the same for all disks from Sun, so we don't recommend or support putting your own disks in. This is because the product is essentially a bundle where other components may be expecting specific behavior based on settings we have, etc.

Depending on the JBOD product, the muxes are either on the JBOD system board, or part of the sled - they're not on the actual disk. The mux creates the illusion to the disk of only being connected to a single host controller.

Posted by Greg Price on January 26, 2009 at 12:48 AM EST #
