Wednesday Aug 31, 2005

More on Blocks

A few weeks ago I was blogging about how block protocols like SCSI were designed around the on-disk sector format and limited intelligence of 1980s disk drives. Clearly, if we were starting from a clean sheet of paper today to design a storage interconnect for modern intelligent storage devices, this is NOT the protocol we would create.

The real problem though isn't just that the physical sector size doesn't apply to today's disk arrays. The problem today has more to do with the separation of the storage and the application/compute server. Storage in today's data-centers sits in storage servers, typically in the form of disk arrays or tape libraries which are available as services on a network, the SAN, and used by some number of clients - the application/compute servers. As with any server, you would like some guarantee of the level of service it provides. This includes things like availability of the data, response time, security, failure and disaster tolerance, and a variety of other service levels needed to ensure compliance with laws for data retention and to avoid over-provisioning.

The block protocol was not designed with the notion of service levels. When a data client writes a collection of data, there is no way to specify to the storage server what storage service level is required for that particular data. Furthermore, all data gets broken into 512-byte blocks so there isn't even a way to identify how to group blocks that require a common service level. The workaround today is to use a management interface to apply service levels at the LUN level which is at too high a level and leads to over-provisioning. This gets really complicated when you factor in Information Lifecycle Management (ILM) where data migrates and gets replicated to different classes of storage. This leads to highly complex management software and administrative processes that must tie together management APIs from a variety of storage servers, operating systems, and database and backup applications.

If we were starting from a clean sheet of paper today to design a storage interconnect we would do a couple of things. One, we would use the concept of a variable-sized data Object that allows the data client to group related data at a much finer granularity than the LUN. This could be an individual file, or a database record, or any unit of data that requires a consistent storage service level and that migrates through the information lifecycle as a unit. Second, each data object would include metadata - the information about the object that identifies what service levels, access rights, etc. are required for this piece of data. This metadata stays with the data object as it migrates through its lifecycle and gets accessed by multiple data clients.
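As a sketch of the idea, here is a minimal Python model of a data object that carries its own service-level metadata. All the names and fields here are my own illustration, not from any standard:

```python
from dataclasses import dataclass, field

@dataclass
class ServiceLevel:
    availability: str = "standard"   # e.g. "standard", "mirrored", "disaster-tolerant"
    retention_days: int = 0          # minimum retention for compliance
    encrypted: bool = False

@dataclass
class DataObject:
    object_id: int
    payload: bytes                   # variable-sized, unlike a fixed 512-byte block
    metadata: ServiceLevel = field(default_factory=ServiceLevel)

# The metadata travels with the object through its lifecycle, so a storage
# server can honor the required service level without a separate
# LUN-level management step.
obj = DataObject(object_id=42, payload=b"customer record",
                 metadata=ServiceLevel(availability="mirrored",
                                       retention_days=2555))
```

Because the service requirements live inside the object itself, they survive migration and replication across storage tiers without the management-software gymnastics described above.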

Of course there are some things about today's block protocols we would retain such as the separation of command and data. This allows block storage devices and HBAs to quickly communicate the necessary command information to set up DMA engines and memory buffers to subsequently move data very efficiently.
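To make the command/data separation concrete, here is a toy Python sketch (the classes are invented for illustration): the command phase carries just opcode, address, and length, so the target can stage its buffers - the way an HBA programs a DMA engine - before any data moves.

```python
from dataclasses import dataclass

@dataclass
class Command:
    opcode: str   # e.g. "WRITE"
    lba: int      # starting logical block address
    length: int   # bytes that will follow in the data phase

class Target:
    def __init__(self):
        self.buffers = {}

    def command_phase(self, cmd):
        # allocate the receive buffer up front, before any data arrives
        self.buffers[cmd.lba] = bytearray(cmd.length)
        return cmd.lba          # handle for the data phase

    def data_phase(self, handle, data):
        # data now streams straight into the pre-staged buffer
        self.buffers[handle][:len(data)] = data

t = Target()
h = t.command_phase(Command("WRITE", lba=100, length=4))
t.data_phase(h, b"abcd")
```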

Key players in the storage industry have created just such a protocol in the ANSI standards group that governs the SCSI protocol. The new protocol is called OSD, for Object-based Storage Device. OSD is based on variable-sized data objects which include metadata, and it can run on all the same physical interconnects as SCSI, including parallel SCSI, Fibre Channel, and ethernet. With the OSD protocol, we now have huge potential to enable data clients to specify service levels in the metadata of each data object and to design storage servers to support those service level agreements.

I could go on for many pages about potential service levels that can be specified for data objects. They cover performance, availability, security (including access rights and access logs), compliance with data retention laws, and any storage SLAs a storage administrator may have. I'll talk more about these in future blogs.

Wednesday Aug 10, 2005

Sun Makes Strong Showing at iSCSI Plugfest

SAN JOSE - Sun made a strong showing at last week's iSCSI Plugfest in San Jose. Sun brought along Sparc and x64 servers running advance copies of Solaris 10 Update 1 which includes their new iSCSI initiator driver. Although update 1 is not released yet, it's clear that the iSCSI stack in update 1 is a mature driver ready for the most demanding workloads. Sun also brought its automated Java Interoperable Storage Test Suite (JIST). JIST runs an extensive suite of hundreds of protocol compliance tests that fully exercise Fibre Channel and iSCSI storage devices.

Sun successfully ran its iSCSI driver and test suite against arrays from various vendors. Solaris ran well and the test suite found a variety of incompatibilities in some of the arrays during boundary and error-case testing. There were several requests for copies of JIST that array engineers could use to continue verifying iSCSI protocol compliance back in their own labs.

Although Sun has been criticized for being late to the iSCSI market, it's clear they have been working hard and when Solaris 10 update 1 releases, they will have an iSCSI stack with all the availability, reliability, and open-standards compliance you would expect from a Solaris server. Now it looks like it's the array vendors that need to catch up.

Monday Aug 01, 2005

Why blocks?

We've been doing a lot of thinking lately about the blocks in block storage. At some level blocks make sense. It makes sense to break the disk media into fixed-size sectors. Disks have done this for years and up until the early 1990s, disk drives had very little intelligence and could only store and retrieve data that was pre-formatted into their native sector size. The industry standardized on 512-byte sectors and file systems and I/O stacks were all designed to operate on these fixed blocks.

Now fast-forward to today. Disk drives have powerful embedded processors, with spare real-estate in their integrated circuits where even more intelligence could be added. Servers use RAID arrays with very powerful embedded computers that internally operate on RAID volumes with data partitioning much larger than 512-byte blocks. These arrays use their embedded processors to emulate the 512-byte block interface of a late-1980s disk drive. Then, over on the server side, we still have file systems mapping files down to these small blocks as if they were talking to an old drive.

This is what I'm wondering about. Is it time to stop designing storage subsystems that pretend to look like an antique disk drive? And is it time to stop writing file systems and I/O stacks designed to spoon-feed data to these outdated disks?

Sunday May 01, 2005

Does what works with what work?

Today I want to ask a question of those of you managing SANs in your datacenter. My question is: how useful are interoperability support matrices, or what-works-with-what (WWWW) matrices, for storage network components? Do they help?

I don't manage a SAN. As a traveling engineering manager, my day-to-day exposure to networking usually involves the two built-in network ports on my notebook computer: one a traditional ethernet NIC, and the other an 802.11 wireless port.

This notebook didn't come with a WWWW for either network port, but I wonder, what would I do if it had? What if, for example, it said the ethernet port only supported talking to web servers running particular NICs under certain versions of Linux? Would I have to send a note to Google to check if their hardware is on my matrix? If it wasn't, would I have to buy a separate NIC just for visiting Google? What if the documentation for the wireless port said it only worked with certain models of Linksys routers with specific versions of firmware? Would I have to ask the person at the counter of my local coffee shop what firmware was running in their wireless hotspot?

I recently talked to the storage administrators at a large financial company. Their SAN covered at least two floors of a large building. Their storage network included a variety of servers running most of the major OS's in use today, storage from almost every major storage vendor, and three brands of switches. When they installed a new server, its SNIC had to just plug into this storage network and use whatever storage the application demanded. In practice, they couldn't use the WWWW either.

So, let me know. Does the WWWW help? My email address is in the upper right corner of the page.

Monday Apr 25, 2005

What's a SNIC?

A SNIC is a Storage Network Interface Card. It's how you connect a compute server to the storage network.

Direct-attached storage is connected to compute servers through Host Bus Adapters (HBAs), so called because they adapt one bus to another. Servers have an internal bus, such as PCI, designed with one set of characteristics - short distance, word addressable, short quick data bursts, etc. On the other hand, external storage buses such as SCSI are designed for longer cable lengths, longer bursts of streamed data, etc. The HBA adapts the characteristics of the external storage bus to the internal one.

Now, go to many of today's large data centers that have over a hundred individual storage devices connected by a complex network with over a thousand ports, dozens of switches and miles of FC or ether network cabling. These are true networks running network protocols. How do you connect compute servers/data clients to these networks? With a SNIC. What else?

Friday Apr 22, 2005

Great technical insights on storage network technology

For great technical insights on storage networking and related technology trends, see Richard McDougall's weblog.

Richard is co-author of the Solaris Internals book and has recently been working on storage network technology. Richard is much smarter than I am about how the technology really works.

This is only a test

This is a test of MarsEdit, a weblog editor for Macs. It is available HERE. MarsEdit simplifies creation of HTML for weblog postings from OS X.

We now return to our regularly scheduled programming....

Wednesday Apr 20, 2005

Are we really open?

Yes,we are very serious about open standards and we are actively driving creation of new open storage management standards. The forum for this is the Storage Networking Industry Association (SNIA) at

When we created native multipathing for Solaris, we recognized that there was no standard way to monitor and manage multipathing so we initiated creation of a SNIA MP-API as part of the OS-Attach Technical Workgroup in SNIA. Sun engineer Paul von Behren has served as editor for this API which is now nearing final draft. You can download a copy from the SNIA website here:

In addition to putting an iSCSI driver into Solaris, we have also been participating in SNIA's IP Storage Technical Workgroup since its inception. This workgroup is chartered to define standard solutions for management of IP based storage networks. Sun storage engineer John Forte is helping edit the iSCSI Management API (IMA), a set of C programmatic interfaces for host based management of iSCSI initiators and discovery of iSCSI storage devices. Sun will be providing the Solaris version of IMA later this year. This same SNIA workgroup has also played a key role in defining the SMI-S iSCSI profiles that enable CIM based management of iSCSI initiators and storage. A draft of this spec is also available at:

In addition to creating the specs for the open APIs, we have gone the next step to make our API implementation available as open source projects. See the Source Forge project for the MP-API here:

See the sourceforge project for IMA here:

Sunday Apr 17, 2005

More on Multipathing

A reader asked if Solaris native multipathing can do multiple types of multipath/failover simultaneously, allowing connection to multiple types of storage arrays at the same time. The answer is definitely YES.

This highlights another reason why operating systems need to include multipathing for block storage. Compute servers connect into networks with a variety of storage, from a variety of vendors. One driver stack, and one set of HBAs, has to talk to all of them. The old model of one type of HBA, with one type of driver, for each storage array works no better than having a different NIC for each type of Web server you connect to on the Internet.
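Conceptually, a single multipath stack can do this by consulting a per-array failover "personality". Here is a toy Python sketch of that idea; the array names and behaviors are made up for illustration:

```python
# One multipath driver consults a registry of per-array failover modules,
# so a single stack can serve several array types on the same SAN at once.
FAILOVER_MODULES = {}

def register(vendor):
    def wrap(cls):
        FAILOVER_MODULES[vendor] = cls()
        return cls
    return wrap

@register("symmetric-array")
class SymmetricFailover:
    # all paths are active; just pick any live one
    def select_path(self, paths):
        return next(p for p in paths if p["up"])

@register("asymmetric-array")
class AsymmetricFailover:
    # prefer the primary controller; fail over to standby when it is down
    def select_path(self, paths):
        live = [p for p in paths if p["up"]]
        primary = [p for p in live if p["role"] == "primary"]
        return primary[0] if primary else live[0]

def route_io(device):
    # one code path in the OS, many array types behind it
    return FAILOVER_MODULES[device["vendor"]].select_path(device["paths"])

dev = {"vendor": "asymmetric-array",
       "paths": [{"up": False, "role": "primary"},
                 {"up": True,  "role": "standby"}]}
path = route_io(dev)   # primary is down, so the standby path is chosen
```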

Monday Apr 11, 2005

Multipathing. How did we get here?

I saw statistics from one of the computer market research firms (Gartner, I think) showing that 90% of mid-range servers go into data-centers that use networked storage. Although not mentioned in the report, I'd bet the vast majority of those servers have high-availability requirements that demand that all storage components be redundant - including array controllers, switches, cabling, and host adapters - and that failover happens automatically.

Given this fact, selling a server and OS without multipathing in the storage stack seems like selling cars without drive shafts. Administrators buy a new server, THEN have to buy the missing multipath piece from a third party and install it themselves. And since these drivers weren't originally designed as part of the operating system's I/O framework, they don't integrate with the OS as well as they should.

The industry got here because of the way the technology evolved. It was the array vendors who originally invented dual controllers with failover. Some early redundant controllers used ID aliasing for failover. Say controller 1 appeared on SCSI target ID 1 and controller 2 at ID 2. If controller 1 died, controller 2 detected that its alternate was no longer there and began responding to requests for both SCSI IDs 1 and 2. The problem with this approach was that the interconnect and the HBA were still single points of failure, so the array vendors started developing add-on multipath drivers for each OS they supported that could fail over across redundant HBAs and cables as well. Administrators got used to buying and installing these drivers, and that's the way the industry worked for several years.
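The ID aliasing scheme is simple enough to sketch in a few lines of Python (purely illustrative):

```python
# Simulation of SCSI target-ID aliasing: when controller 1 dies,
# controller 2 starts answering for both IDs.
class Controller:
    def __init__(self, scsi_id):
        self.scsi_id = scsi_id
        self.alive = True
        self.answers_for = {scsi_id}   # IDs this controller responds to

    def take_over(self, peer):
        # peer has failed; respond to its ID as well as our own
        self.answers_for.add(peer.scsi_id)

def responder(controllers, target_id):
    # which live controller answers requests for this target ID?
    for c in controllers:
        if c.alive and target_id in c.answers_for:
            return c
    return None

c1, c2 = Controller(1), Controller(2)
c1.alive = False     # controller 1 dies
c2.take_over(c1)     # controller 2 detects this and aliases ID 1
```

The sketch also shows the limitation the post describes: the host still reaches both IDs through the same HBA and cable, so those remain single points of failure.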

Since then, the OS vendors and the standards bodies have finally caught up. The ANSI T10 SCSI committee now has a standard for asymmetric controller failover. Solaris 10 has the concept of virtualized paths built natively into the device tree and it knows how to do path virtualization, as well as controller failover for most common storage arrays. This is available as a patch for Solaris 8 and 9 as well.

On the storage side at Sun, we are moving to use the native multipathing in every OS as well. Most Sun arrays work with the native 'md' driver in Linux. See a paper on how to configure it HERE

Windows is adding a native multipathing framework called MPIO and we are working with Microsoft to support Sun storage. (in the meantime, we have a traditional add-on multipath driver available – yes, we know how to write Windows drivers). HP is now bundling the Veritas VM with DMP into HP-UX as their multipath solution so we are negotiating with Veritas to make sure that 'native' solution supports our arrays.

So, stop buying third-party drive shafts and use an OS that has everything needed to start working with your storage.

Friday Apr 08, 2005

More on iSCSI

Two additional comments on iSCSI:

A reader asked if we recommend IPMP (IP Multipathing) or Trunking. We tested and recommend IP Multipathing. We have a Sun Blueprints doc in the works that explains this better, including how it works with embedded storage multipathing (Traffic Manager) in the Solaris OS. Stay tuned for the Blueprints doc. I will post a pointer to it when it's published.

Another alert reader pointed out that there ARE iSCSI HBAs (Host Bus Adapters) available that offload iSCSI and TCP/IP protocols with the ability to DMA data directly to/from the user's buffer. These offload the CPU just like Fibre Channel and SCSI HBAs, as well as providing an embedded BIOS for boot support. In my original posting when I said iSCSI uses the main processor, I was referring to the software driver integrated in Solaris 10 Update 1.

Wednesday Apr 06, 2005

Scaling, Fault Management, and Open Standards

I'm having a scaling problem at my house that has taught me the value of Fault Management and Open Standards. My wife and I have two teenagers who drive so we now have four cars that need to be maintained.

Cars have gotten more complex over the years, but I've been happy to learn that they have also developed a great fault management architecture built into every new car. They have sensors all over the engine, transmission, and other critical components that send a stream of telemetry through an event bus to embedded service processors running fault management tasks. These tasks continuously analyze the event streams looking for a failure or sensor readings consistently out of range, turn on the check-engine light, and provide a specific problem code without requiring you, or the mechanic, to search through logs of raw events.

Now, what's really made this nice for me is combining it with open standards. Our garage is heterogeneous. All four of our cars come from different auto companies but, it turns out that every car sold in the U.S. since 1996 provides a standard interface to the fault management system - a connector usually located under the left side of the dashboard. For about $80 I bought an interface adapter that connects this to a standard serial port on my notebook and I downloaded an open-source program that retrieves and interprets the diagnostic codes. So, I can plug my notebook PC into any of our cars and communicate with the service processor.
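For the curious, decoding the diagnostic trouble codes is straightforward; here is a small Python sketch of the two-byte OBD-II code format, of the kind that open-source program interprets (the example code value is illustrative):

```python
# Each OBD-II diagnostic trouble code (DTC) is two bytes. The top two bits
# of the first byte select the subsystem letter; the remaining nibbles
# spell out the digits of the code.
SYSTEM = {0: "P", 1: "C", 2: "B", 3: "U"}  # Powertrain, Chassis, Body, network(U)

def decode_dtc(hi, lo):
    letter = SYSTEM[hi >> 6]          # top two bits: subsystem
    return f"{letter}{(hi >> 4) & 0x3}{hi & 0xF:X}{lo >> 4:X}{lo & 0xF:X}"

# A misfire on one cylinder, like the loose spark-plug wire above,
# typically reports as P030x, where x is the cylinder number.
print(decode_dtc(0x03, 0x02))  # → P0302 (cylinder 2 misfire)
```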

I've used it a few times. Once it basically told me that we had to take our daughter's Honda in for a new catalytic converter. In other cases though, it saved trips to the shop. Once, a couple days after bringing my car home from the dealer after routine maintenance, it started running rough. The diagnostic code showed that one spark plug wasn't working and after a quick check under the hood, I found that a spark plug wire wasn't seated properly and reconnected it.

We've built some of these same concepts into Sun's Fault Management Architecture. Through our participation in the SNIA storage management standards group we are helping create similar open standards. Even better, since the vast majority of our servers are connected to the Internet, we have the ability to send codes back to a Sun support center. In many cases, we can send a support engineer to the customer's site before they even know they have a problem.

Monday Apr 04, 2005

Another choice for building storage networks

Although Fibre Channel is the most common and highest-performance interconnect for storage networks, storage networks can also be built using traditional ethernet, or a combination of ethernet and Fibre Channel using bridging devices or switches that bridge both interconnects.

Solaris 10 Update 1 will have an integrated iSCSI driver that sends SCSI commands and data over standard ethernet NICs. To applications, file systems, and volume managers, the iSCSI driver looks like the existing SCSI drivers so these layers will run the same on iSCSI/Ethernet as on parallel SCSI or Fibre Channel.

The disadvantage of the iSCSI driver is that more protocol execution is done by the processor, and, more significantly, data is copied between user memory and ethernet buffers. Fibre Channel, on the other hand, can DMA directly in and out of application memory as well as offload protocol processing to the HBA.

iSCSI offers at least two benefits though. One is lower cost. It runs on low cost NICs and ethernet switches. Second, the storage network can be managed, in part, using familiar ethernet protocols and tools.

We have customers who are planning to use Sun's rack-mount x64 servers in their data center and who plan to use iSCSI in the Solaris OS on these servers. They have evaluated the performance of these x64 servers and decided they have enough processing power to spare. They like that Solaris iSCSI is integrated (no adding extra patches) and have found it to perform very reliably when faced with a variety of hardware failures and storage network error scenarios. Most plan to bridge iSCSI to FC to access their existing consolidated SAN storage, and some are looking to use native iSCSI arrays.

So, data center administrators have another choice for building storage networks. Either way, the drivers are built into the Solaris OS; they are highly available, extensively tested, and, as always, based on open standards.

Thursday Mar 31, 2005

What is Storage Network Engineering?

Trace the evolution of storage. In the old days, storage was directly connected as a peripheral to the server. The server booted up, discovered it had storage attached, and took control of it for its exclusive use. This approach had several limitations. The amount of storage you could attach was limited by the number of I/O slots and storage bus connections (e.g. SCSI IDs). Storage couldn't be easily shared, so if two servers needed to use the same data, multiple copies were required, and it was hard to pool storage for consolidated management.

In the 90's, new interconnects like Fibre Channel (FC) were developed that solved some of these problems. Early FC allowed much higher expandability and performance. It also allowed some amount of pooling and central management of physical storage assets. The term Storage Area Network (SAN) was used to describe these limited storage networks. A problem, though, was that operating systems still operated under the direct-attached, storage-as-peripheral model. They still tried to scan every storage device that was visible to them, then assumed exclusive control of that storage. OS's also assumed that the configuration couldn't change unless the server was powered down, or that the number of devices was small enough that the system administrator could list them all in a configuration file. This forced administrators to partition up their SANs through zoning: each OS got only a small view of the SAN that appeared like a small direct-attached configuration, multiple servers were prevented from seeing the same storage, and each server could still claim exclusive ownership of its storage across the FC interconnect.

Today we have entered the age of true storage networks. Data centers now have storage networks with well over a thousand nodes. These are true networks on which the disk arrays and tape libraries have become Servers, providing block storage or tape archiving services. The Compute Servers are now Data Clients that discover and use these shared storage services across a network. The old direct-attach assumptions built into operating systems don't apply anymore. It would take hours to boot a compute server if the OS tried to scan the network for every device visible to it. You can't reboot the compute server every time the network is reconfigured or a new device is added. Also, you can't keep configuration files to describe the physical address of every storage device. Imagine if your web browser required you to maintain a file with the ethernet MAC addresses of every web server you visited.

In Storage Network Engineering we have designed a storage networking protocol stack into the Solaris OS, making it the best Data Client on the storage network. Storage is discovered and exposed to applications as needed, so boot times are fast. There is no editing of config files with physical addresses. Multipathing for redundancy, routing through the network, and failover are built in. New storage can be dynamically detected, and path changes through the network are detected and adapted to without needing to reboot the Solaris Data Client.
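The contrast with boot-time scanning can be sketched in a few lines of Python; the "SAN" here is just a stand-in dictionary, and all names are illustrative:

```python
# Toy sketch of on-demand discovery versus boot-time scanning.
class DataClient:
    def __init__(self, network):
        self.network = network    # storage services visible on the SAN
        self.attached = {}        # lazily populated; nothing scanned at boot

    def open(self, service_name):
        # discover a device only when an application first asks for it,
        # so boot time does not grow with the size of the network
        if service_name not in self.attached:
            self.attached[service_name] = self.network[service_name]
        return self.attached[service_name]

    def on_path_change(self, service_name, new_address):
        # adapt to network reconfiguration without a reboot
        if service_name in self.attached:
            self.attached[service_name] = new_address

san = {"array1-lun0": "fc-port-7"}    # hypothetical service directory
client = DataClient(san)
```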

For a more thorough description of the Solaris Storage Network stack see the Sun Blueprints document "Increasing Storage Area Network Productivity" located here:

Ken Gibson's Storage Networking Blog

Weblog for Ken Gibson, Director of Storage Network Engineering in Sun's Network Storage division.


