One Size Doesn't Fit All

Larry McIntosh, Systems Technical Marketing

One size doesn't fit all today, at least when it comes to data access within the datacenter.  As much as we would like a single file system to cover every need, a number of issues stand in the way.  We have to weigh many different things, such as scaling requirements, retention of information, keeping data safe, file system availability, and sharing data in heterogeneous environments.

Today's datacenter provides services for data-intensive applications that can run on different machine types, connected through multiple networks.  Using Sun's Magnum InfiniBand technology, Constellation System Blades, Sun's StorageTek offerings, and of course Sun's Sun Fire X4500 Series Storage Servers, we can now provide solutions that service very disparate networks and platforms.  We can also focus on the care and feeding of data, so that it is kept safely and multiple instances of it remain available for recall.  This can be achieved through a combination of architecture and technology within Sun's file systems, deployed in conjunction with an appropriate business continuance model.

Various combinations of software and hardware can address different problems, depending upon requirements for performance and availability, not to mention the budgets and goals an organization has for its data. The following picture summarizes five file systems from Sun Microsystems that provide a great deal of flexibility in building solutions. (I am aware that a “file system purist” may have difficulty in classifying all of these as “file systems”, but in the spirit of this discussion I would like to suggest that “file system” can also refer to the software service that provides access to data!) Let me try to put some perspective on these.

Solaris ZFS
ZFS can be used for excellent local I/O within a Solaris-based platform.  ZFS's architecture can support very large amounts of storage.  It is extremely simple to administer and focuses strongly on data integrity.  Every block is checksummed to prevent silent data corruption.  ZFS's data is self-healing in mirrored configurations: if one copy is damaged, ZFS detects it and uses another copy to repair it.  ZFS has fast software-based RAID, using a new model called RAID-Z that is similar to RAID-5.  Unlike RAID-5, it uses a variable stripe width, which eliminates the stripe corruption (the RAID-5 “write hole”) that can occur when power is lost between data and parity updates.  The file system also implements dynamic disk scrubbing, which enhances reliability by reading the data to detect latent errors while they are still correctable.  This activity traverses the ZFS storage pool, reads all data, and verifies it against its 256-bit checksum.  Where necessary, ZFS repairs damaged data as it finds it.  All of this happens under the covers while the file system is up and actively serving clients.
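
To make the checksum, self-healing, and scrubbing ideas concrete, here is a minimal toy sketch in Python. It is not ZFS code and does not reflect ZFS's on-disk format: the `MirroredPool` class and its methods are hypothetical names, SHA-256 stands in for a 256-bit checksum, and every copy is checked on each read purely to keep the example short.

```python
import hashlib


class MirroredPool:
    """Toy mirror that checksums every block (illustration only, not ZFS)."""

    def __init__(self, copies=2):
        # One dict per simulated disk: block id -> bytes.
        self.mirrors = [dict() for _ in range(copies)]
        # Checksums are kept apart from the data they protect.
        self.checksums = {}

    def write(self, block_id, data):
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for disk in self.mirrors:
            disk[block_id] = data

    def read(self, block_id):
        """Return a verified copy and repair any copy that fails its checksum."""
        expected = self.checksums[block_id]
        good = None
        bad_disks = []
        for disk in self.mirrors:
            data = disk[block_id]
            if hashlib.sha256(data).digest() == expected:
                if good is None:
                    good = data
            else:
                bad_disks.append(disk)
        if good is None:
            raise IOError(f"block {block_id}: no valid copy left")
        for disk in bad_disks:
            disk[block_id] = good  # self-heal from the good copy
        return good

    def scrub(self):
        """Walk every block, verify it, and return how many copies were repaired."""
        repaired = 0
        for block_id in list(self.checksums):
            before = [disk[block_id] for disk in self.mirrors]
            good = self.read(block_id)
            repaired += sum(1 for copy in before if copy != good)
        return repaired
```

Corrupting one simulated disk by hand (for example `pool.mirrors[0][1] = b"garbage"`) and then calling `pool.read(1)` or `pool.scrub()` shows the damaged copy being replaced from the healthy one.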

ZFS-based storage servers have been implemented in both direct-attached and SAN-based architectures.  We have deployed ZFS at a number of customer sites across industry sectors, and it has been very successful with streaming video and mail stores.

ZFS is well designed for low latency.  A separate ZFS Intent Log (ZIL) also contributes to file system stability.  In addition, ZFS has an Adaptive Replacement Cache (ARC), which keeps pages in memory to improve performance through buffering.

We are all pretty familiar with NFS as a vehicle for sharing data across a network, since it has been around forever (“forever” in computer technology terms, that is...).  Data on ZFS Solaris-based servers can be accessed via NFS by both Linux and Solaris clients.  Combining ZFS with NFS is very successful where users are already happy with the performance of NFS.  In combination with NFS, ZFS can support the same client counts as other NFS/NAS implementations, but with the added bonus of all the underlying data integrity and care of data one acquires by using ZFS as the local file system store on the NFS server.

Sun StorageTek QFS
Sun StorageTek QFS is a SAN-based shared file system.  It can service hundreds of clients with petabytes of storage.  It is very fast, and I have personally experienced near “wire speed” with large implementations across the globe.  That said, the underlying block storage architecture of shared LUNs has trouble supporting both increased SCSI command queue depths and increased session counts per individual HBA.  This is really a LUN block storage issue with shared-access SAN storage arrays: each Fibre Channel HBA on the storage array eventually cannot scale beyond around 128 nodes.  So one must size the SAN correctly to be successful any time a SAN architecture is used for file sharing with any file system, not only QFS.  There is also a feature one can add to support sharing among heterogeneous SAN-attached systems, discussed below.  QFS has been very successful at supporting heterogeneous clients where there were problems scaling NFS as a single mount point.  QFS has supported these implementations at much higher data throughput and, sized correctly, works very well.
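
As a back-of-the-envelope aid for that sizing step, the sketch below turns the rough figure of about 128 nodes per array-side Fibre Channel HBA into a minimum port count. The function name and the assumption that clients spread evenly across ports are mine for illustration; real sizing also has to account for SCSI queue depth and throughput, which this ignores.

```python
import math


def array_hbas_needed(client_count, nodes_per_hba=128):
    """Lower bound on array-side HBAs, assuming clients spread evenly.

    nodes_per_hba defaults to the rough ~128-node ceiling mentioned in the
    text; treat it as an assumption to be checked against the actual array.
    """
    if client_count < 0 or nodes_per_hba <= 0:
        raise ValueError("client_count must be >= 0 and nodes_per_hba > 0")
    return math.ceil(client_count / nodes_per_hba)
```

Under these assumptions, 300 SAN clients would need at least three array-side HBAs, before any allowance for redundancy or bandwidth.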

Sun has another powerful solution with the Sun SAM file system, which is fully integrated with QFS.  In fact, folks will often refer to this as “SAM-QFS.”  SAM is used for data backup and archiving.  One can devise clear, very granular policies for hierarchical storage management (HSM) to ensure copies and versions of data are kept either online or on tape.  Data can be staged in and out of given storage pools, based upon the levels of data access one would like to manage and control.  SAM is also able to extend itself to other file systems beyond what I have described.  For example, I have worked on one case with IBM GPFS in which SAM was the HSM of choice, providing long-term care of GPFS data via backup/archive through SAM.  We have built a production environment that takes advantage of this: data is staged in and out of GPFS, and to and from SAM-QFS, for backup and archive data care.  There is heterogeneous support for data accessed by Linux clusters, IBM Power Systems, and Sun systems.  This is accomplished by copying data into SAM from the GPFS file storage pools.  Once in the SAM storage pool, the data is kept according to the policies of the HSM.  Data sharing can also occur between QFS and the IBM systems via IBM's Tivoli SANergy software, which supports QFS for an even more direct data access path.
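
The policy idea can be sketched in a few lines of Python. This is only an illustration of how an HSM decides between archiving, releasing disk space, and staging data back; it is not SAM's configuration syntax, and the `FileState` and `Policy` classes, their field names, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileState:
    path: str
    days_since_access: int
    tape_copies: int      # archive copies already written
    online: bool = True   # True while the data still sits on disk


@dataclass
class Policy:
    min_tape_copies: int = 2      # hypothetical threshold
    release_after_days: int = 30  # hypothetical threshold


def next_action(f, policy):
    """Decide what the HSM should do with one file under a simple policy."""
    if f.tape_copies < policy.min_tape_copies:
        return "archive"   # make the missing archive copies first
    if f.online and f.days_since_access >= policy.release_after_days:
        return "release"   # free disk space; archive copies remain
    return "keep"


def on_access(f):
    """A released file has to be staged back to disk before it can be read."""
    return "read" if f.online else "stage"
```

The ordering matters in this sketch: disk space is only released once enough archive copies exist, so a file is never left without a recoverable copy.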

The Lustre File System
Lustre is another file service provided by Sun.  Its strength in networking comes from the protocols it uses to reach data across many different types of interconnects, such as Quadrics, Ethernet, Myrinet, and InfiniBand.  The more common forms of this type of data access today are Ethernet and InfiniBand.  Lustre also scales very well and has been successful at servicing tens of thousands of clients concurrently.  It uses an object-based storage method to stripe data across object storage servers for very fast, highly intensive I/O on Linux-based clusters.  It can service many petabytes of data, and this capacity grows continuously.  So one can focus on Lustre where NFS does not scale well with Linux-based clusters.  Traditionally, HPC clusters have very demanding data bandwidth requirements, up to hundreds of gigabytes of data throughput per second.  Lustre can serve clients very well in this area.  Failover scenarios can be implemented so that data remains accessible should an object storage server go down or metadata services be impacted, allowing Lustre to provide continuous access to data.  We have been very successful at implementing Lustre with Sun Fire X4500 Series Servers through InfiniBand HCAs and drivers.
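
Striping across object storage servers can be pictured with a simple round-robin (RAID-0 style) mapping. The sketch below is a simplified model, not Lustre's client code: the function names are hypothetical, and the 1 MiB stripe size and stripe count of 4 are example values only.

```python
def locate(offset, stripe_size=1 << 20, stripe_count=4):
    """Map a file byte offset to (object index, offset inside that object)."""
    if offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; sizes and counts must be > 0")
    stripe_number = offset // stripe_size
    object_index = stripe_number % stripe_count
    object_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return object_index, object_offset


def split_io(offset, length, stripe_size=1 << 20, stripe_count=4):
    """Break one contiguous request into (object index, object offset, bytes) chunks."""
    chunks = []
    end = offset + length
    while offset < end:
        object_index, object_offset = locate(offset, stripe_size, stripe_count)
        room = stripe_size - (offset % stripe_size)  # bytes left in this stripe
        n = min(room, end - offset)
        chunks.append((object_index, object_offset, n))
        offset += n
    return chunks
```

Because consecutive stripes land on different objects, a large sequential read or write is spread over several servers at once, which is where the aggregate bandwidth comes from.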

In summary, one can use a simple implementation of ZFS along with NFS for file sharing at smaller client counts and be extremely successful, thanks to the overall data integrity and stability that ZFS offers.

In addition, QFS is a good mid-sized offering with the fully integrated SAM features discussed above.  On another note, QFS also has a Sun clustering feature that supports business continuance operations requiring high availability of data.

Finally, Lustre scales to the very highest client counts and bandwidth requirements, and offers scalable storage as well.  Sun has also implemented Lustre on smaller and mid-sized clusters with great success, so don't get the idea that Lustre is only for very large deployments.  I have been personally involved with customer-accepted systems, based upon customer-verified data write performance with RAID 5, on systems ranging in size from 6 Sun Fire X4500s up to 72 Sun Fire X4500s that scale very well with Lustre.

The Texas Advanced Computing Center (TACC) and Sun deployed the aforementioned 72 Sun Fire X4500s along with Sun's Constellation System and InfiniBand technologies for TACC's Ranger System, which services National Science Foundation researchers' data-intensive computing requirements.  Further details of TACC's Ranger system can be found here:

So, that's fine, but Sun and TACC also combined the services of Lustre with SAM's ability to perform backups and archiving when we deployed the Sun Constellation TACC Ranger System.  Through the use of Data Movers we were able to have both file systems working in unison with one another, servicing the NSF researchers.

Once again we are staging data into and out of Lustre from SAM for the data care aspects associated with business continuance.

So, even though one size doesn't fit all, there really is more than one way to peel this onion as we successfully have shown via TACC as well as other deployments we have done.

Sun has had success to date in implementing combinations of these file systems to meet differing demands.  We have done this, as described, with the combination of ZFS and NFS.  When one needs very high I/O bandwidth, at gigabytes per second, for InfiniBand access to data from Linux clusters, I have implemented Lustre.  (It is worth mentioning that there is also a Linux client for QFS.)  When one runs this Linux client for QFS on a Linux Lustre client, one has a powerful data mover that can be used between file systems.  In addition, there are implementations that use both file systems through such software as GridFTP, a high-performance, secure, reliable data transfer protocol optimized for high-bandwidth wide-area networks.  This has also been implemented in conjunction with Lustre to provide shared access to SAM's HSM infrastructure.  Data can be copied from Lustre into and out of SAM where, once placed, the automated policy engines of SAM kick in for business continuance and longer-term care of data.  So you see, combinations of these file systems can provide for extreme care of data.
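
On a node that mounts both file systems, the core of a data mover is just a verified copy from one mount point to the other. The sketch below assumes exactly that and nothing more: the function names are hypothetical, any paths passed in are placeholders, and it is not the data mover software used in the deployments described here.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(src, dst_dir):
    """Copy src into dst_dir and confirm the copy matches before trusting it."""
    src = Path(src)
    dst = Path(dst_dir) / src.name
    shutil.copyfile(src, dst)
    if sha256_of(src) != sha256_of(dst):
        dst.unlink()  # do not leave a bad copy for the archive policy to pick up
        raise IOError(f"checksum mismatch copying {src} to {dst}")
    return dst
```

Once the verified copy lands in the archive-managed directory, the HSM policies take over from there.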

On a final note, Sun is working to combine these file system services.  It has been discussed publicly that Lustre will use ZFS as its underlying file system of choice, so that going forward Lustre runs on top of ZFS.  Why? To provide all of the really great features of both ZFS and Lustre, combined to enhance data access, data integrity, data performance, etc. under a single file system.  In addition, similar work is underway to extend the HSM services described here to that same combined ZFS and Lustre offering.  Why again? Well, as we started this dialog: one size does not fit all. Or will it, sooner rather than later?

Until Later –
Cheers -- Larry


How about positioning the various file systems based on different application workload? eg, OLTP database, HPC, data warehouse, SAP and etc. This would be more practical and make it easy for folks to relate to. Thanks!

Posted by Anon on August 30, 2008 at 11:52 AM PDT #
