Saturday Oct 10, 2009

Infiniband Performance Limits: Streaming Disk Read and new Summary

 Updated Performance Limit Summary

I was able to squeak out a few more bytes/second in the streaming DRAM test for IPoIB and have achieved a respectable upper bound for RDMA streaming disk reads for this Sun Storage 7410 configuration.  The updated summary is below with links to the relevant Analytics screenshots.  I'll update this summary as I gather more data.

 Test                             RDMA \*\*              IPoIB \*
 NFSv3 Streaming DRAM Read        2.93 GBytes/second     ~2.40 GBytes/second
 NFSv3 Streaming Disk Read        2.11 GBytes/second     1.47 GBytes/second
 NFSv3 Streaming Write            984 MBytes/second      752 MBytes/second
 NFSv3 Max IOPS - 1 byte reads
 NFSv3 Max IOPS - 4K reads
 NFSv3 Max IOPS - 8K reads

    \* IPoIB

    The IPoIB numbers do not represent the maximum limits I expect to ultimately achieve.  On the 7410, we are well under full resource utilization for CPU and disk.  In the I/O path, we are nowhere close to saturating the IB transport, and the HyperTransport and PCIe root complexes have plenty of headroom.  The problem is the number of clients.  As I develop a better client fabric, expect these values to change.

    \*\* RDMA

    With NFSv3/RDMA, I am able to hit maximum limits with the current client configuration (10 clients), with the exception of max IOPS.  In the streaming read from DRAM test, I was able to hit the limit imposed by the PCIe generation 1 root complexes and downstream bus.  For the streaming read/write from/to disk, I am able to reach the maximum we can expect from this storage configuration.  The throughput numbers are given in GBytes/second for the transport.  While throughput numbers observed on the subnet manager were higher, I took a conservative approach to reporting streaming write and DRAM read limits.  For these tests, I took the IOPS and multiplied by the data transfer size (128K).  For example, we see 24041 (iops) x 128K (read size) = 3.00 GBytes/second for the streaming read from DRAM test.  Once we have 64-bit port performance counters, I can be more confident in the throughput I observed through them.  For streaming read from disk, I used the reported disk throughput.
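    The IOPS-times-transfer-size calculation is easy to reproduce.  The sketch below is only an illustration: it assumes "128K" means 128 x 1024 bytes, and the function names are made up for this example.  Under that assumption, the result depends noticeably on which "giga" is used, which may account for the spread between the 2.93, 3.00 and 3.15 GBytes/second figures quoted for the same test.

```python
# Convert an IOPS figure and a transfer size into payload throughput.
# The inputs come from the post; the helper names are illustrative only.

KIB = 1024  # assumption: "128K" means 128 * 1024 bytes


def throughput_bytes_per_sec(iops: int, transfer_bytes: int) -> int:
    """Payload bytes moved per second (no protocol or transport headers)."""
    return iops * transfer_bytes


def as_decimal_gb(nbytes: float) -> float:
    """Gigabytes using 10^9 bytes per GB."""
    return nbytes / 1e9


def as_binary_gib(nbytes: float) -> float:
    """Gigabytes using 2^30 bytes per GB."""
    return nbytes / 2**30


def as_mixed_gb(nbytes: float) -> float:
    """Binary megabytes (2^20) divided by 1000, a common mixed convention."""
    return nbytes / 2**20 / 1000


bps = throughput_bytes_per_sec(24041, 128 * KIB)
print(f"decimal: {as_decimal_gb(bps):.2f} GB/s")
print(f"binary:  {as_binary_gib(bps):.2f} GiB/s")
print(f"mixed:   {as_mixed_gb(bps):.3f} (MiB/s divided by 1000)")
```

    With these inputs the decimal figure comes out near 3.15, the binary figure near 2.93, and the mixed figure at about 3.0.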

    Fabric Configuration

    Filer: Sun Storage 7410, with the following config:

    • 256 Gbytes DRAM
    • 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
    • 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
    • 2 Sun DDR Dual Port Infiniband HCA
    • 3  HBA cards
    • noatime on shares, and database record size left at 128 Kbytes
    Clients: 10 blades, each:
    • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
    • 3 Gbytes of DRAM
    • 1 Sun DDR Dual Port Infiniband HCA Express Module
    • mount options:
      • read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
      • write tests: default mount options

    Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)

      Subnet manager:

      • CentOS 5.2
      • Sun HPC Software, Linux Edition
      • 2 Sun DDR Dual Port Infiniband HCA

      NFSv3 Streaming Disk Reads

      I was able to achieve a maximum read limit for NFSv3 streaming read from disk for RDMA.  As with my previous tests, I have a 10 client fabric connected to the same Sun Storage 7410.  The clients are split equally between two subnets and connected to two separate HCA ports on the 7410.  Each client has a separate share mounted.  For the read from disk tests, I'm using all 10 clients each running 10 threads to read 1 MB of data (see Brendan's script) from its own 2GB file.  The shares are mounted with rsize=128K.

      Update on Maximum IOPS

      I'm still waiting to run this set of tests with a larger number of clients.  But in the interim, I wanted to make sure that adding those clients would indeed push me to the limits of the 7410.  To validate my thinking, I ran a step test for the 4k maximum IOPS test.  Here, we can see the stepwise function of adding two clients at a time plus one at the end for a maximum of 9 clients.

      We're scaling nicely: every two clients adds roughly 42000 IOPS per step and the last client adds another 20000.  We're starting to reach a CPU limit but if I add just 5 more clients, I can match Brendan's IOPS max of 400K.  I think I can do it!  Stay tuned...

      Tuesday Oct 06, 2009

      Infiniband Performance Limits: Take 1

      As promised in my last post, I upgraded my Infiniband test fabric to include a more powerful Sun Storage 7410.  As luck would have it, Brendan just finished up his tests for the 7410 with the Istanbul processor upgrade and the system was available for IB testing.     In my last set of experiments,  I quickly exhausted my CPU, memory, and disk capabilities with the 8 clients connected to my IB fabric. Here, I've significantly upgraded my filer and added two more clients. 

       Fabric Configuration

      Filer: Sun Storage 7410, with the following config:

      • 256 Gbytes DRAM
      • 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
      • 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
      • 2 Sun DDR Dual Port Infiniband HCA
      • 3  HBA cards
      • noatime on shares, and database record size left at 128 Kbytes
      Clients: 10 blades, each:
      • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
      • 3 Gbytes of DRAM
      • 1 Sun DDR Dual Port Infiniband HCA Express Module
      • mount options:
        • read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
        • write tests: default mount options

      Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)

      Subnet manager:

      • CentOS 5.2
      • Sun HPC Software, Linux Edition
      • 2 Sun DDR Dual Port Infiniband HCA

      Most of the performance results you'll find reported for Infiniband (RDMA or IPoIB) are limited to cached workloads.  While these types of tests help to evaluate the raw capabilities of the transport, they don't necessarily show how a storage system behaves or what the possible benefits are.  Brendan chose these tests and his workloads to demonstrate the 7410 maximum capabilities.  The goal of the following experiments is to duplicate what Brendan demonstrated for ethernet and point out where the bottlenecks or problem spots are for Infiniband.


      NFS over the RDMA protocol is available in the 2009.Q3 software release for clients that support it.  RDMA (Remote Direct Memory Access) moves data from memory on one host to memory on another host.  The details of moving data between hosts are left to hardware, in our case the Infiniband HCAs.  The advantage is that we can bypass the network and device software stacks and eliminate many of the data copies performed by the CPU.  We should see a reduction in CPU utilization and an increase in the amount of data we can transfer between clients and NFS server.

      Max NFSv3 streaming cached read

      This test demonstrates the maximum read throughput we can achieve  over NFSv3/RDMA.  The test reads a 1GByte file cached entirely in DRAM from the SS7410 filer to 10 clients.  Each client is running 10 threads that are each performing 128KB read accesses from the filer and dumping the data into their DRAM.  This test is effectively the same test used to publish typical results for the IB transport.

      I am able to reach a bit beyond Brendan's 3.06 GBytes/sec with half the number of clients and reduce my CPU utilization to just 30%.  In the graph above, we can calculate the throughput by multiplying the number of read IOPS (24041) by the read size (128KB), which gives 3.15 GBytes/sec.  For confirmation, I can observe the throughput for both IB ports on the subnet manager where we reach 3.18 GBytes/sec.  The 3.18 GBytes/sec at the port level includes additional header information imposed by the transport.

      mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
                  5/2                         3729230           1500731797
      mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
                  3/2                         2682860           1580370155

      The bottleneck, however, is the PCI Express 1.0 I/O interconnects.  The PCI Express 1.0 root complexes can (in practice) reach only 1.4-1.5 GBytes/sec.  Using Brendan's amd64htcpu script, we can see that the PCIe interconnects are at or near their maximums:

      Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
           0      5011.33      1374.05      4594.51         0.00
           1      6982.65      6366.86      1890.57         0.00
           2      5392.58      4343.28      5773.35         0.00
           3      5228.30      5664.78      4247.36         0.00
      Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
           0      4852.97      1329.00      4442.26         0.00
           1      7011.03      6385.20      1893.62         0.00
           2      5361.24      4331.55      5741.79         0.00
           3      5201.71      5643.10      4244.37         0.00
      Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
           0      6257.99      1705.76      5716.17         0.00
           1      6036.49      5462.77      1614.62         0.00
           2      5380.50      4360.88      5827.16         0.00
           3      5207.53      5586.87      4231.31         0.00

      Max NFSv3 streaming disk read

      As much as I tried, I could not achieve a workload confined strictly to disk reads.  The problem is not with the SS7410 but rather the number of clients in my fabric.  In order to obtain results for this test, I will have to increase the number or capabilities of my clients.

      Max NFSv3 streaming disk write

       Using the same 10 IB clients I used in my read experiments, I will drive 2 streaming write threads per client.  Each thread uses a 32KB block size to stream to a separate file residing on a separate share.  

      I was pleasantly surprised to see that we can indeed break the 1 GByte/sec maximum Brendan saw with ethernet.  The 1 GBytes/sec result is obtained by multiplying the NFS write IOPS by the write size.  I am unable to sanity check this result with the network throughput in Analytics as we are bypassing the TCP/IP stack.  I can, though, confirm the throughput on the fabric subnet manager using the port counters exported by each HCA port.  According to the port counters, I am seeing roughly 1 GBytes/second receive rate.  Using the port counters is not precise as the time it takes to collect the information varies and the counters (being 32-bit in length) can wrap.  But the counters do provide a way to confirm our transport throughput in the absence of Analytics for RDMA.  On the subnet manager, mlx4_0 (LID/Port 5/2) is attached to switch A and mthca0 (LID/Port 3/2) is attached to switch C in the IB fabric topology.

      mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
                  5/2                         333697            518640843
      mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
                  3/2                           3030            518400821
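
      To make the port-counter caveat concrete, here is a minimal sketch of turning two samples of a 32-bit counter into a rate and of summing the two receive rates shown above.  It is not the script used for these measurements.  The function names are hypothetical, and it assumes the counter wraps modulo 2^32 (as described above) at most once between samples.

```python
# Illustrative only: derive a rate from two samples of a 32-bit port
# counter, then add up the per-port receive rates quoted in the post.

COUNTER_BITS = 32
COUNTER_MOD = 1 << COUNTER_BITS


def counter_delta(prev: int, curr: int) -> int:
    """Counter increase between two samples, assuming at most one wrap."""
    return (curr - prev) % COUNTER_MOD


def rate_per_sec(prev: int, curr: int, t_prev: float, t_curr: float) -> float:
    """Average counter units per second between two timestamped samples."""
    elapsed = t_curr - t_prev
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return counter_delta(prev, curr) / elapsed


# RECV bytes/second per port, copied from the table above.
recv_rates = {
    "mlx4_0 5/2": 518640843,
    "mthca0 3/2": 518400821,
}
total_recv = sum(recv_rates.values())
print(f"aggregate receive rate: {total_recv / 1e9:.2f} GB/s")
```

      The two ports add up to a little over 1 GB/s, in line with the "roughly 1 GBytes/second" above.  If the counter counted single bytes, it would wrap about every 8 seconds at this per-port rate, so samples would need to be taken more often than that for the single-wrap assumption to hold.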

      Max  NFSv3 read ops/sec

      As was the case with streaming reads from disk, my clients are insufficiently configured to push a maximum workload.  I will need to increase the number of clients and try again. 


      The IPoIB protocol uses the TCP/IP network stack to transmit and receive network packets.  Unlike RDMA, which bypasses the network stack, IPoIB suffers from some of the performance implications inherent in the traditional TCP/IP software stack.

      I re-ran the tests described above and summarize the results here. 

                                    RDMA                  IPoIB
       NFSv3 Streaming DRAM Read    3.18 GBytes/second    2.24 GBytes/second
       NFSv3 Streaming Disk Read                          Not Available
       NFSv3 Streaming Write        1.00 GBytes/second    753 MBytes/second
       NFSv3 Max IOPS

      As I build up my IB fabric with more or better clients, I'll update the results that I was unable to capture this time around. The next step is to build out and attach the 7410 to a QDR-based fabric with at least 20 clients.  This should provide a client workload large enough to push the 7410 to its maximum potential.

      Tuesday Sep 22, 2009

      New Image for Old Blog

      Yesterday, I posted the wrong image for the sequential read experiment.  That's been corrected and the words now match the image. :)

      Infiniband for Q3.2009

      The latest release of the SS7000 software includes support for Infiniband HCAs.  Each controller may be configured with a Sun Dual Port Quad Rate HCA (Sun option x4237) in designated slots.  The slot configuration varies by product with up to three HCAs on the Sun Storage 7410.  The initial Infiniband upper level protocols (ULPs) include IPoIB and early adopter access for NFS/RDMA.  The same file and block based data protocols (NFS, CIFS, FTP, SFTP, HTTP/WebDav, and iSCSI) we support over ethernet are also supported over the IPoIB ULP.  OpenSolaris, Linux, and Windows clients with support for the OpenFabrics Enterprise Distribution (OFED 1.2, OFED 1.3, and OFED 1.4) have been tested and validated for IPoIB.  NFS/RDMA is offered for early adopters of the technology for Linux distributions that run the 2.6.30 kernel or later.

      Infiniband Configuration

      Infiniband IPoIB datalink partition and IP interface configuration is easy and painless using the same network BUI page or CLI contexts as ethernet. Port GUID information is available for configured partitions on the network page as shown below. This makes it easy to add SS7000 HCA ports to a partition table on a subnet manager.  Once a port has been added to a partition on the subnet manager, the IPoIB device will automatically appear in the network configuration.  At this point, the device may be used to configure partition datalinks and then interfaces.  If desired, IP network multi-pathing (IPMP) can be employed by creating multi-pathing groups for the IPoIB interfaces.

      HCA and port GUID and status information may also be found on the hardware slot location.  Navigate to Maintenance->Hardware->Slots for the controller and click on the slot information icon to see the firmware, GUID, status and speeds associated with the HCA and ports.

      Performance Preview

      So how does Infiniband perform in the SS7000 family?  Well, it really depends upon the workload and an adequately configured system.  Here, I'll demonstrate two simple workloads on a base SS7410.


      • Sun Storage 7410
      • Software release Q3.2009
      • 2 x quad core AMD Opteron 2.3 GHz CPUs
      • 64GBytes DRAM
      • 1 JBOD (23 disk, each 750 GB) configured for mirroring
      • 2 logzillas configured for striping
      • 2 Sun Dual Port DDR Infiniband HCA, one port each configured across two separate subnets


      • 8 x blade servers, each containing:
      • 2 x quad core Intel Xeon 1.6 GHz CPUs
      • 3GBytes DRAM
      • 1 Sun Dual Port Infiniband HCA EM
      • Filesystems mounted using NFSv4, forcedirectio
      • Solaris Nevada build 118
      • 8 Solaris Nevada (build 118) clients: 4 clients connected to subnet 1 and 4 clients connected to subnet 2


      • Sun 3x24 Infiniband Data Switch, switches 0 (subnet 1) and 2 (subnet 2) configured across server and clients
      • 2 OFED 1.4 OpenSM subnet managers operating as masters for switch 1 and 2

      The SS7410 is really under-powered (2 CPUs, 64G memory, 1 JBOD, 2 logzillas, DDR HCAs) and nowhere near its operational limitations.

      Sequential Reads

      In this experiment, I used a total of 8 clients with up to 5 threads each performing sequential reads from a 10GB file in a slightly modified version of Brendan's script.  The clients are evenly assigned to each of the HCA ports.  More than 5 threads per-client did not yield any significant gain as I hit the maximum amount I could get from the CPU.  I ran the experiment twice: once for NFSv4 over IPoIB and once for NFSv4/RDMA. As expected, IPoIB yields better results with smaller block sizes but I was surprised to see IPoIB outperform NFS/RDMA with 64K transfer block sizes and stay in the running with every size in between.

      I'm using the default quality of service (QOS) on the subnet manager and clients that are evenly assigned to each of the HCA ports.  As a result, we can see a nice even distribution of network throughput across each of the devices and IOPS per-client. 

      Synchronous Writes

      In the read experiment, I was able to hit an upper bound on CPU utilization at about 8 clients x 5 threads.  What will it take to reach a maximum limit for synchronous writes?  To help answer that question, I'll use a stepping approach to the single synchronous write experiment above.  Looping through my 8 clients at one minute intervals, I'll add a 4K synchronous write thread every second until the number of IOPS levels off.  At about 10 threads per client, we start to see the number of IOPS reach a maximum.  This time CPU utilization is below its maximum (35%) but latency turns into a lake-effect ice storm.  We eventually top out at 38961 write IOPS for our 80 client threads.

      As a sanity check, I also captured the per-device network throughput.  If I account for the additional NFSv4 operations and packet overhead, 93.1MB/sec seems reasonable.  I ran this experiment with NFS/RDMA and discovered a marked drop-off (30%) in IOPS when run for a long period.  Until then, NFS/RDMA performed as well as IPoIB.  Something to investigate.
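
      That sanity check can be sketched as a back-of-the-envelope calculation.  Every constant marked "assumed" below is a guess made for illustration: that "4K" is 4096 bytes, that the load splits evenly across two IPoIB devices, and that 93.1 is decimal megabytes per second per device.

```python
# Back-of-the-envelope check of per-device throughput for the 4K
# synchronous write test. Constants marked "assumed" are illustrative guesses.

WRITE_IOPS = 38961               # peak write IOPS reported for the step test
WRITE_SIZE = 4 * 1024            # assumed: "4K" means 4096 bytes
DEVICES = 2                      # assumed: load split evenly over two devices
OBSERVED_MB_PER_DEVICE = 93.1    # assumed: decimal MB/s seen on each device

payload_total = WRITE_IOPS * WRITE_SIZE                   # bytes/s of write data
payload_per_device_mb = payload_total / DEVICES / 1e6     # MB/s per device
overhead = OBSERVED_MB_PER_DEVICE / payload_per_device_mb - 1

print(f"payload per device: {payload_per_device_mb:.1f} MB/s")
print(f"implied overhead:   {overhead:.1%}")
```

      Under those assumptions the write payload alone is just under 80 MB/s per device, leaving roughly one sixth on top for the other NFSv4 operations and packet headers.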


      I have a baseline for my woefully underpowered SS7410.  For sequential reads, I quickly bumped into CPU utilization limits at 40 client threads.  With the synchronous write workload, I top out at 38691 IOPS due to increased disk latency.  But all is not lost: the SS7410 is far from its configurable hardware limitations.  The next round of experiments will include:

      • Buff up my 7410: give it two more CPUs and double the memory to help with reads
      • Add more JBODs and logzillas to help with writes
      • Configure system into a QDR fabric to help the overall throughput

      Tuesday Sep 08, 2009

      Better Late, Than Never

      I've been remiss in posting and completely missed reporting on a couple of new Q2 features for the Fishworks software stack. In Q2, we introduced three new secure data protocols: HTTPS, SFTP, and FTPS.  Dave Pacheco covers HTTPS in his blog so I'll highlight SFTP and FTPS here.


      Our FTP server is built from the proFTPD server software stack.  In Q2, we updated the server to version 1.3.2 to take in a number of critical bug fixes and add support for FTP over SSL/TLS (FTPS).  The proFTPD server implements FTP over SSL/TLS in accordance with the FTP Security Extensions defined by RFC 2228.  Not all FTP clients support the FTP security extensions but a list of clients that do may be found here.

      Enabling FTPS on a Fishworks appliance is very simple.  From the FTP service configuration BUI page or CLI context, an administrator may optionally select to turn on FTPS for the default port or an alternate port.  If FTPS is enabled for a port, the FTP server will use TLSv1 for its authentication and data channels. 
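
      From the client side, an upload over FTPS can be sketched with Python's standard ftplib module.  This is a minimal sketch rather than a tested recipe for the appliance: the host, credentials and file names are placeholders supplied by the caller, and it assumes the server accepts explicit FTPS (AUTH TLS) on the configured port.

```python
import ftplib


def upload_via_ftps(host, user, password, local_path, remote_name, port=21):
    """Upload one file over explicit FTPS, encrypting control and data channels.

    All arguments are supplied by the caller; nothing here is specific to a
    particular server.
    """
    ftps = ftplib.FTP_TLS()
    ftps.connect(host, port)
    try:
        # FTP_TLS.login() negotiates TLS on the control channel before
        # sending the credentials.
        ftps.login(user, password)
        # Switch the data channel to TLS as well.
        ftps.prot_p()
        with open(local_path, "rb") as fh:
            ftps.storbinary(f"STOR {remote_name}", fh)
    finally:
        ftps.close()
```

      A call would look like upload_via_ftps("filer.example.com", "user", "secret", "big.dat", "big.dat"), where the hostname is made up.  A current TLS library may refuse to negotiate TLSv1, so an older client or an adjusted SSL context might be needed against a server limited to that version.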


      The SSH File Transfer Protocol (SFTP) is a network protocol that provides file transfer over SSH.  SFTP is a protocol designed by the IETF SECSH working group.  SFTP does not itself provide authentication and security but rather delegates this to the underlying protocol.  For example, the SFTP server used on the SS7000 is implemented as a subsystem of the OpenSSH software suite.  The SFTP server is responsible for interpreting and implementing the SFTP command set, but authenticating users and securing the communication channels over which the server transfers data are the responsibility of the underlying OpenSSH software.  SFTP should not be confused with:

      • FTP over SSL/TLS (FTPS)
      • FTP over SSH
      • Simple File Transfer Protocol
      • Secure Copy (SCP)

      Configuration of the SFTP service is very similar to SSH.  The default port for SFTP is set to 218.  This port was selected as it does not conflict with any other ports in a Fishworks appliance and does not interfere with SSH communication for administration (port 22).  As with FTP and HTTP, shares may be exported for SFTP access by selecting read-only or read-write access from the Shares->Protocol BUI page or CLI context.

      Battle of the SSL All-Stars

      If you're an administrator pondering which secure protocol (HTTPS, FTPS, or SFTP) to choose, your main consideration will be client support.  Not all clients support all protocols.  FTPS is limited in client adoption and the SFTP IETF specification has yet to be finalized.  Secondary to client support will likely be performance.  Using a simple file upload workload of a 10GB file, we can easily compare the three protocols.  All three protocols use OpenSSL for encryption and decryption, so we would expect each protocol to be impacted pretty much the same for secure transfers.  In the following image, we see from top to bottom, the raw network throughput for SFTP, FTPS, and HTTPS.

      To be fair to FTPS and HTTPS, I used curl(1) and the native version of sftp(1) on the Solaris client (the Solaris version of curl did not support the SFTP protocol).  Even so, HTTPS transfer rates are clearly lagging at almost 50% of FTPS.

      Not surprisingly, CPU utilization increases by as much as 50% on the client and 10% on the server for an FTPS upload as compared to its non-SSL counterpart.  Thanks to Bryan and the new FTP (and SFTP) analytics, we can see the difference in FTP (top pane) vs FTPS (bottom pane) data rates.  FTPS can be as much as 84% slower than FTP.  Ouch!

      Your mileage may vary with your workload but it's nice to have the tools at hand to get an accurate assessment of each protocol and SSL implementation.

      Thursday Mar 12, 2009

      1.0.4 Software Update Released Today

      Today's release of the Series 7000 software (1.0.4) update includes significant fixes for disk hotplug and cluster rejoin and restart.  See the Fishworks wiki or the Sun Download Center for more details on the specific issues addressed.

      Friday Mar 06, 2009

      1.0.3 Software Update Released Today

      Continuing on our path to perfection, the next software update for the Sun Storage 7000 series was released today.  The update is applicable to all Series 7000 platforms and contains a critical fix for the CIFS server.  Customers using the CIFS feature are strongly encouraged to update their 7110, 7210, and 7410 systems to ak-2008.

      Downloads are available from the Sun Download Center.  A matrix of our software update releases for the Series 7000 may be found on the Fishworks wiki.  The matrix includes additional release information and a list of bugs fixed in each release.

      Wednesday Feb 11, 2009

      1.0.2 Software Update Released Today

      The second software update for the Sun Storage 7000 series was released today.  The update is applicable to all Series 7000 platforms and contains a critical fix for the IPv6 network stack and addresses some problems in the NDMP back-up service.  Customers using IPv6 for network connectivity or NDMP for back-up are strongly encouraged to update their 7110, 7210, and 7410 systems to ak-2008.

      Downloads are available from the Sun Download Center.  A matrix of our software update releases for the Series 7000 may be found on the Fishworks wiki.  The matrix includes additional release information and a list of bugs fixed in each release.

      Tuesday Jan 13, 2009

      First Software Update Available for Sun Storage Series 7000

      The first software update for the Series 7000 was released yesterday.  The update is applicable to all Series 7000 platforms and contains a critical fix for the CIFS server.  Customers using CIFS are strongly encouraged to update their 7110, 7210, and 7410 systems to ak-2008.

      Downloads are available from the Sun Download Center.  A matrix of our software update releases for the Series 7000 may be found on the Fishworks wiki.  The matrix includes additional release information and a list of bugs fixed in each release.

      Tuesday Dec 02, 2008

      A Visual Look at Fishworks

      So what does a project like Fishworks look like? 

      I put together a visual representation of the Fishworks project using Code Swarm that has been captured in a video.  The video shows how the Fishworks team and project evolved based on changes made to the source code over the course of two and a half years.  The code swarm tool uses organic visualization techniques to model the history of a project based on source code files and their relationship to the developers that create and modify them.  It's a very cool tool and a bit addictive.

      There are a number of code swarm project visualizations available online.  The OpenSolaris and Image Packaging System (IPS) projects have been represented by Code Swarm based on raw commit data made to  each project. The OpenSolaris Code Swarm pulses as a single blob of source as developers come and go within the orbit of a vast code base.  In contrast, the Fishworks project shows well-defined orbits surrounding each developer.  This is a testament to the almost constant activity of a small number of developers on a well-partitioned source base.  I have elided gate re-synchronizations to better represent the project and the contributions of each developer.  This avoids single bursts of activity by what seems to be one developer as seen in the IPS Code Swarm.

      Fishworks CodeSwarm from John Danielson on Vimeo

      Code Swarm runs natively in Subversion and Mercurial repositories.  The Fishworks project source base was controlled by SCCS with logs created by Teamware.  I converted the Teamware 'putback' logs to the Code Swarm input XML format.  I do wonder how the visualization would change if I accounted for lines of code changed per file.  I might be able to use Eric's code tracking script to generate suitable input.
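
      For anyone attempting a similar conversion, the output side can be sketched as below.  This is not the converter used for the video.  It assumes the commit history has already been parsed into (timestamp, author, files) tuples, since the Teamware log parsing is not shown, and it uses the element and attribute names (file_events, event, date in epoch milliseconds, filename, author) as I recall them from the Code Swarm sample input, so check them against the tool before relying on this.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def to_codeswarm_xml(commits):
    """Render commits as Code Swarm style XML.

    commits: iterable of (timezone-aware datetime, author, [filenames]).
    The element and attribute names are assumptions based on the tool's
    sample input, not taken from the original post.
    """
    root = ET.Element("file_events")
    for when, author, files in sorted(commits, key=lambda c: c[0]):
        millis = str(int(when.timestamp() * 1000))  # epoch milliseconds
        for name in files:
            ET.SubElement(root, "event", date=millis, filename=name, author=author)
    return ET.tostring(root, encoding="unicode")


# Made-up example history with two putbacks.
history = [
    (datetime(2008, 1, 2, tzinfo=timezone.utc), "bob", ["usr/src/b.c"]),
    (datetime(2008, 1, 1, tzinfo=timezone.utc), "alice", ["usr/src/a.c", "usr/src/a.h"]),
]
print(to_codeswarm_xml(history))
```

      Each changed file becomes one event, so a putback touching many files shows up as a burst of events from the same author at the same timestamp.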

      In the meantime, enjoy the show.


      Sunday Nov 09, 2008

      Fishworks: A Brief Introduction

      Fishworks is the name of a team of engineers at Sun Microsystems.  The FISH in Fishworks is an acronym for Fully Integrated Software and Hardware and is the underlying software that unites operating system functionality, a pleasing user interface, and hardware capabilities to create a plug-it-in-and-it-just-works experience for appliances such as that found in the new Sun Storage 7XXX product line.

      At the top of the Fishworks appliance stack is a new user interface.  A single AJAX based development environment supports a web browser UI and scriptable CLI. To the extent possible, functions available in the BUI are mirrored in the CLI and vice-versa. In many cases, the same Javascript is shared between the two.  I think of the UI as the Little Black Dress (LBD) of the user interface world; it's simple, elegant, and looks absolutely fabulous.

      In some sense, Fishworks was born some 8 years ago with some key innovations that went into Solaris 10 for storage (ZFS), observability (DTrace),  management (SMF), and RAS (FMA). These technologies delivered the right set of abstractions and capabilities necessary to build our appliance software. The Fishworks software stack uses these technologies and other operating system libraries to create the environment in which users and administrators interact via the UI. The control point for appliance operation, configuration, and management is not the operating system but rather the Fishworks appliance software. The operating system can be thought of as the "micro-code" in our software stack and the appliance software controls base operating system functions and hardware for a simple, just-works experience.

      As an example, all distinct functionality is expressed as an SMF service in Solaris 10.  The appliance software uses SMF libraries to monitor, configure and restart all services in a system when changes are directed from our UI.  The appliance software also permits other information to be integrated with SMF manifests and methods for a fully integrated experience.  The NFS service, for example, is controlled by traditional SMF methods for starting and stopping NFS daemons but we also integrate NFS specific configuration properties that may be updated on-the-fly.  The Fishworks software takes care of updating the new configuration properties and restarting the NFS service.  In Solaris, this task would require the contents of /etc/default/nfs to be modified and the NFS SMF service to be restarted in a multi-step process.  In a Fishworks appliance, the same set of tasks is accomplished from a single UI dialog.  SMF is but one of the Solaris 10 technologies we leveraged for the Fishworks appliance software.  ZFS, DTrace, and FMA play key roles in storage, analytics, and system health monitoring.
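
      To give a feel for the manual Solaris procedure that the appliance hides, here is a rough sketch of its two steps: edit a KEY=value setting in the NFS defaults file, then restart the service.  The helper names are hypothetical, NFSD_SERVERS is used only as an example key, and this is not how the appliance software itself is implemented.

```python
import re

NFS_DEFAULTS = "/etc/default/nfs"
NFS_SERVER_FMRI = "svc:/network/nfs/server:default"


def set_default(text: str, key: str, value: str) -> str:
    """Return the file text with KEY=value set.

    Replaces an existing line for the key (even if commented out with '#'),
    or appends a new line when the key is absent.
    """
    pattern = re.compile(rf"^#?[ \t]*{re.escape(key)}=.*$", re.MULTILINE)
    line = f"{key}={value}"
    if pattern.search(text):
        return pattern.sub(lambda _m: line, text, count=1)
    sep = "" if (not text or text.endswith("\n")) else "\n"
    return f"{text}{sep}{line}\n"


def restart_command(fmri: str = NFS_SERVER_FMRI) -> list:
    """Command line that asks SMF to restart the given service."""
    return ["svcadm", "restart", fmri]


sample = "# NFS defaults\n#NFS_SERVER_VERSMAX=4\nNFSD_SERVERS=16\n"
print(set_default(sample, "NFSD_SERVERS", "64"))
print(" ".join(restart_command()))

# On a real Solaris host (as root) the two steps would be roughly:
#   1. read NFS_DEFAULTS, pass it through set_default(), write it back
#   2. subprocess.run(restart_command(), check=True)
```

      Even in this simplified form, an administrator has to know the file, the key, and the service name, which is the multi-step burden the single UI dialog removes.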

      The Fishworks software is designed to be extensible and applicable to other types of appliances.  New appliance prototypes may be created by simply adding a new "class" and the necessary metadata to describe features and purpose. Over the last few months, I've prototyped a couple of non-storage appliances.  I was amazed by how quickly I had a functioning appliance up and running.  You can imagine how powerful this is going forward.  We now have the foundation to rapidly build new fully integrated systems and I'm really excited to continue work on some new prototypes and look for ways to build on the current developer environment.  If that wasn't enough, I get to work with an incredibly talented bunch of people.


      Friday Jun 08, 2007


      Just today, I posted the first draft of the Sensor Abstraction Layer design document. The project addresses the problem of aggregating and analyzing telemetry exported by disparate sources such that the results may be observed via standard interfaces. The basic design is composed of three distinct sub-layers: a provider layer, a collection layer and an analyzer layer. At the lowest level, the provider layer exports interfaces to read sensor or statistical values without having to understand the implementation details of the subsystem exporting the telemetry.

      Telemetry data is logged according to collection parameters established for a collector. Sensor telemetry is passed from collectors to the analyzer layer for the purpose of online analysis. For example, we may want to collect telemetry for our network subsystem based upon GLD-aware NIC driver kstats, protocol-specific errors and memory usage as seen in netstat(1M) to help predict unhealthy hardware or software or to ensure QOS guarantees.
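
      The three layers can be pictured with a small sketch.  It is purely illustrative and is not taken from the design document: the class names, the threshold-on-mean analysis, and the "nic0.ierrors" sensor name are all invented for this example.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


class Provider:
    """Provider layer: reads one sensor or statistic, hiding how it is obtained."""

    def __init__(self, name: str, read: Callable[[], float]):
        self.name = name
        self._read = read

    def read(self) -> float:
        return self._read()


@dataclass
class Collector:
    """Collection layer: samples its providers and logs the telemetry."""

    providers: List[Provider]
    log: Dict[str, List[float]] = field(default_factory=dict)

    def sample(self) -> None:
        for provider in self.providers:
            self.log.setdefault(provider.name, []).append(provider.read())


class Analyzer:
    """Analyzer layer: flags sensors whose mean value exceeds a threshold."""

    def __init__(self, thresholds: Dict[str, float]):
        self.thresholds = thresholds

    def analyze(self, log: Dict[str, List[float]]) -> List[str]:
        return [
            name
            for name, values in log.items()
            if name in self.thresholds and values and mean(values) > self.thresholds[name]
        ]


# Fake NIC error counter standing in for a real kstat reader.
readings = iter([0.0, 3.0, 9.0])
collector = Collector([Provider("nic0.ierrors", lambda: next(readings))])
for _ in range(3):
    collector.sample()
print(Analyzer({"nic0.ierrors": 2.0}).analyze(collector.log))
```

      The point of the separation is that the analyzer only ever sees named series of values, so a new telemetry source needs a new provider and nothing else.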

      We can use many of the concepts and the infrastructure developed for the Solaris Fault Manager. For example, telemetry data can be passed as FMA standard events and logged using the Extended Accounting format developed for the errlog and fltlog. We can also leverage the fmd(1M) tool set to observe telemetry logs and analysis results.




      Hope to have more details soon...




      Friday May 04, 2007

      Solaris Fault Management: A Look Back and Looking Forward

      The Solaris Fault Management Architecture has come a long way since Mike Shapiro and I started talking about it way back in 2001. We started out with a bang as the industry leader in fault management technology:

    • August 10, 2001: First discussions of a new approach to fault management begin at Sun.

    • January 15, 2002: First internal presentation of plans for a Solaris Fault Management Architecture

    • March 18, 2004: FMA integrates into Solaris 10 Build 56, providing CPU/Mem for US-III and IV

    • March 7, 2005: FMA ships to customers as part of Solaris 10 G/A


      The members of our original development team have changed along the way, but our commitment to improving the architecture and adding new content remains steadfast. Since the introduction of FMA in Solaris 10, additional content has been added to support new platforms and extend FMA concepts into other subsystems. Just look at what we've delivered since S10 was released a short 2 years ago:

      • New for SPARC: US-IV+, US-T1, Niagara & Niagara-2, Fire PCI-E I/O

      • New for x64: CPU/Memory error handling and diagnosis for AMD Opteron and Athlon 64

      Enables all detector banks and sets all documented MCi_CTL bits

      Full machine-check and error-poller handling for all error types documented in the BKDG

      Diagnosis engine rules for all error types

      Response agent: core offline, page retire

      • New for x64: PCI-Express

      Diagnostic correlation based on transmit/receiver error information

      Connections to platform machine-check error handling

      Connections to FMA-aware leaf drivers for increased availability and diagnosability

      Diagnosis engine rules for all error described in PCI-E Base Specification

      Generates SNMP traps (notifications) for FMA diagnosis

      FM MIB permits additional details by UUID

      Web browsable interface to view

      3730 FMA Events

      338 FMA Knowledge Articles

      CLIs to extract event payload and message content

      • New for Developers: Public interfaces for IO FMA

      Updated WDD chapter for writing FMA-aware drivers

      • Deployment: FMA Demo Package

      Infrastructure to inject errors in a simulation environment

      What's best is that Solaris FMA is getting noticed and showing real benefits. The Sun Service organization estimates that platforms shipping without FMA support can cost $252 per-unit per-year. Let's do the math...if Sun sells 100,000 units per year that means after 3 years, Solaris with FMA is saving Sun $75,600,000.

      100000 units per year x $252 per unit x 3 years = $75,600,000

      I don't know about you, but I wouldn't mind saving $75,000,000 over three years. A paper presented by Mike Shapiro and Dong Tang at the Dependable Systems and Networks (DSN) 2006 conference demonstrated a decrease in annual system downtime by 37-54% using quantitative analysis of the FMA memory retirement capabilities. InfoWorld gave Solaris FMA a nod by awarding our team members its 2005 Innovation of the Year Award.

      So, what are we working on now? Well, we are continuing to deliver on the promise of Predictive Self-Healing. Work is on-going to support out-the-door fault management capabilities for new processors, platforms and I/O subsystems. With the announced support for Intel on Solaris (or is it Solaris on Intel?), we are busily working on an FMA implementation for Intel processors. Solaris will be the first OS to take full advantage of industry-leading x86 processor error handling features. In the I/O space, we are beefing up leaf drivers, adding FMA error handling and diagnosis for SCSI problems and using SMART disk data to actively predict impending disk failures for all platforms. The Xen project gives us an opportunity to deploy FMA in a virtualized environment. We'll take some of the infrastructure we delivered for LDOMs and use it to connect hypervisor error handling to a DOM0 diagnosis environment. But that's not all...we are looking at ways to use sensor telemetry to offer better fault prediction, manage resource guarantees and power budgeting. On the software front, we are modifying the techniques we've used to diagnose hardware problems to be useful for software diagnosis. This is a huge under-explored area that will keep Solaris in the forefront with leading-edge availability and serviceability.

      Stay tuned, we're not done with FMA just yet.




