Saturday Oct 10, 2009

Infiniband Performance Limits: Streaming Disk Read and new Summary

 Updated Performance Limit Summary

I was able to squeak out a few more bytes/second in the streaming DRAM test for IPoIB and have achieved a respectable upper bound for RDMA streaming disk reads for this Sun Storage 7410 configuration.  The updated summary is below with links to the relevant Analytics screenshots.  I'll update this summary as I gather more data.


                                  RDMA                      IPoIB
 NFSv3 Streaming DRAM Read        2.93 GBytes/second **     ~2.40 GBytes/second *
 NFSv3 Streaming Disk Read        2.11 GBytes/second **     1.47 GBytes/second *
 NFSv3 Streaming Write            984 MBytes/second **      752 MBytes/second *
 NFSv3 Max IOPS - 1 byte reads
 NFSv3 Max IOPS - 4K reads
 NFSv3 Max IOPS - 8K reads
    * IPoIB

    The IPoIB numbers do not represent the maximum limits I expect to ultimately achieve.  On the 7410, CPU and disk utilization are well below their limits.  In the I/O path, we are nowhere close to saturating the IB transport, and the HyperTransport and PCIe root complexes have plenty of headroom.  The limiting factor is the number of clients.  As I build out a better client fabric, expect these values to change.

    ** RDMA

    With NFSv3/RDMA, I am able to hit maximum limits with the current client configuration (10 clients), with the exception of max IOPS.  In the streaming read from DRAM test, I was able to hit the limit imposed by the PCIe generation 1 root complexes and downstream bus.  For the streaming read/write from/to disk, I am able to reach the maximum we can expect from this storage configuration.  The throughput numbers are given in GBytes/second for the transport.  While the throughput numbers observed on the subnet manager were higher, I took a conservative approach to reporting the streaming write and DRAM read limits: I used the IOPS multiplied by the data transfer size (128K).  For example, 24041 (IOPS) x 128K (read size) = 2.93 GBytes/second for the streaming read from DRAM test.  Once we have 64-bit port performance counters, I can be more confident in the throughput I observe through them.  For streaming read from disk, I used the reported disk throughput.
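    For reference, here is that arithmetic as a quick sketch (assuming 128K means 128 KiB):

    # Conservative throughput figure: observed NFS read IOPS times the
    # per-op transfer size, reported in GBytes (2^30 bytes) per second.
    def throughput_gbytes(iops, xfer_kbytes=128):
        return iops * xfer_kbytes * 1024 / (1024 ** 3)

    print(throughput_gbytes(24041))   # ~2.93 GBytes/second for the DRAM read test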

    Fabric Configuration

    Filer: Sun Storage 7410, with the following config:

    • 256 Gbytes DRAM
    • 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
    • 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
    • 2 Sun DDR Dual Port Infiniband HCA
    • 3 HBA cards
    • noatime on shares, and database size left at 128 Kbytes
    Clients: 10 blades, each:
    • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
    • 3 Gbytes of DRAM
    • 1 Sun DDR Dual Port Infiniband HCA Express Module
    • mount options:
      • read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
      • write tests: default mount options

    Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)

      Subnet manager:

      • CentOS 5.2
      • Sun HPC Software, Linux Edition
      • 2 Sun DDR Dual Port Infiniband HCA

      NFSv3 Streaming Disk Reads

      I was able to reach a maximum limit for NFSv3 streaming reads from disk over RDMA.  As with my previous tests, I have a 10 client fabric connected to the same Sun Storage 7410.  The clients are split equally between two subnets and connected to two separate HCA ports on the 7410.  Each client has a separate share mounted.  For the read from disk tests, I'm using all 10 clients, each running 10 threads that issue 1 MB reads (see Brendan's seqread.pl script) against its own 2GB file.  The shares are mounted with rsize=128K.
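      If you want to approximate this workload yourself, here is a rough Python stand-in (this is not Brendan's seqread.pl, and the mount point and file name are placeholders):

      # Rough stand-in for the streaming read workload: 10 threads per client,
      # each issuing sequential 1 MB reads against the client's 2 GB file on
      # its NFS-mounted share, looping until the test is stopped.
      import threading

      PATH = "/mnt/ib_share/stream.dat"   # hypothetical share and 2 GB file
      NTHREADS = 10
      READ_SIZE = 1024 * 1024             # 1 MB per read call

      def stream_read(path):
          while True:
              with open(path, "rb") as f:          # forcedirectio on the mount keeps
                  while f.read(READ_SIZE):         # the client cache out of the picture
                      pass

      threads = [threading.Thread(target=stream_read, args=(PATH,))
                 for _ in range(NTHREADS)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()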


      Update on Maximum IOPS

      I'm still waiting to run this set of tests with a larger number of clients.  But in the interim, I wanted to make sure that adding those clients would indeed push me to the limits of the 7410.  To validate my thinking, I ran a step test for the 4K maximum IOPS test.  Here, we can see the stepwise effect of adding two clients at a time, plus one at the end, for a maximum of 9 clients.

      We're scaling nicely: every two clients add roughly 42000 IOPS per step, and the last client adds another 20000.  We're starting to reach a CPU limit, but if I add just 5 more clients, I should be able to match Brendan's 400K IOPS maximum.  I think I can do it!  Stay tuned...
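      As a back-of-the-envelope check on that claim, here is a tiny projection helper.  The per-client increment comes from the step test above; the absolute IOPS level at 9 clients isn't quoted here, so it is an input rather than a result:

      # Projection for the 4K IOPS step test: roughly 42000 IOPS per pair of
      # clients, i.e. about 21000 per client.  Assumes roughly linear scaling
      # continues until the filer's CPU limit is reached.
      def project_iops(iops_at_9_clients, extra_clients, per_client=21000):
          return iops_at_9_clients + extra_clients * per_client

      # e.g. project_iops(observed_9_client_iops, 5) to see whether 14 clients
      # would plausibly land near Brendan's 400K Ethernet result.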

      Tuesday Oct 06, 2009

      Infiniband Performance Limits: Take 1

      As promised in my last post, I upgraded my Infiniband test fabric to include a more powerful Sun Storage 7410.  As luck would have it, Brendan had just finished his tests of the 7410 with the Istanbul processor upgrade, and the system was available for IB testing.  In my last set of experiments, I quickly exhausted the filer's CPU, memory, and disk capabilities with the 8 clients connected to my IB fabric.  Here, I've significantly upgraded my filer and added two more clients.


       Fabric Configuration

      Filer: Sun Storage 7410, with the following config:

      • 256 Gbytes DRAM
      • 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
      • 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
      • 2 Sun DDR Dual Port Infiniband HCA
      • 3 HBA cards
      • noatime on shares, and database size left at 128 Kbytes
      Clients: 10 blades, each:
      • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
      • 3 Gbytes of DRAM
      • 1 Sun DDR Dual Port Infiniband HCA Express Module
      • mount options:
        • read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
        • write tests: default mount options

      Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)

      Subnet manager:

      • CentOS 5.2
      • Sun HPC Software, Linux Edition
      • 2 Sun DDR Dual Port Infiniband HCA

      Most of the performance results you'll find reported for Infiniband (RDMA or IPoIB) are limited to cached workloads.  While these types of tests help to evaluate the raw capabilities of the transport, they don't necessarily show how a storage system behaves or what the possible benefits are.  Brendan chose his tests and workloads to demonstrate the 7410's maximum capabilities.  The goal of the following experiments is to duplicate what Brendan demonstrated for ethernet and point out where the bottlenecks or problem spots are for Infiniband.

      RDMA

      NFS over the RDMA protocol is available in the 2009.Q3 software release for clients that support it.  RDMA (Remote Direct Memory Access) moves data directly between the memory of one host and that of another.  The details of moving the data are left to hardware, in our case the Infiniband HCAs.  The advantage is that we can bypass the network and device software stacks and eliminate many of the data copies performed by the CPU.  We should see a reduction in CPU utilization and an increase in the amount of data we can transfer between the clients and the NFS server.

      Max NFSv3 streaming cached read

      This test demonstrates the maximum read throughput we can achieve over NFSv3/RDMA.  The test streams a 1 GByte file, cached entirely in DRAM on the SS7410 filer, to 10 clients.  Each client runs 10 threads, each performing 128KB read accesses from the filer and dumping the data into its own DRAM.  This is effectively the same test used to publish typical results for the IB transport.


      I am able to reach a bit beyond Brendan's 3.06 GBytes/sec with half the number of clients and reduce my CPU utilization to just 30%.  In the graph above, we can calculate the throughput by multiplying the number of read IOPS (24041) by the read size (128KB), or 3.15 GBytes/sec.  For confirmation, I can observe the throughput for both IB ports on the subnet manager, where we reach 3.18 GBytes/sec.  The 3.18 GBytes/sec at the port level includes additional header information imposed by the transport.

      mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
                  5/2                         3729230           1500731797
      mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
                  3/2                         2682860           1580370155
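      Because RDMA bypasses the TCP/IP stack, Analytics' network graphs can't confirm this rate; the cross-check is simply the sum of the receive rates on the two ports.  A trivial sketch using the snapshot above (any single sample bounces around the peak, so treat this as a sanity check rather than a precise measurement):

      # Aggregate RDMA throughput estimated as the sum of the two ports'
      # receive rates from the snapshot printed above.
      recv_bytes_per_sec = {
          "mlx4_0 5/2": 1500731797,
          "mthca0 3/2": 1580370155,
      }
      total = sum(recv_bytes_per_sec.values())
      print(f"aggregate receive rate: {total / 1e9:.2f} GB/s")   # ~3.08 GB/s this sample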

      The bottleneck, however, is the PCI Express 1.0 I/O interconnect.  The PCI Express 1.0 root complexes can (in practice) reach only 1.4-1.5 GBytes/sec.  Using Brendan's amd64htcpu script, we can see that the PCIe interconnects are at or near their maximums:

      Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
                0      5011.33      1374.05      4594.51         0.00
                1      6982.65      6366.86      1890.57         0.00
                2      5392.58      4343.28      5773.35         0.00
                3      5228.30      5664.78      4247.36         0.00
           Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
                0      4852.97      1329.00      4442.26         0.00
                1      7011.03      6385.20      1893.62         0.00
                2      5361.24      4331.55      5741.79         0.00
                3      5201.71      5643.10      4244.37         0.00
           Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
                0      6257.99      1705.76      5716.17         0.00
                1      6036.49      5462.77      1614.62         0.00
                2      5380.50      4360.88      5827.16         0.00
                3      5207.53      5586.87      4231.31         0.00
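      For context on that 1.4-1.5 GBytes/sec figure, here is the raw math for a PCIe 1.0 link; the x8 link width is my assumption for these HCA and root-complex slots:

      # Raw PCIe 1.0 bandwidth per direction, before packet/protocol overhead.
      LANE_RATE_GT_S = 2.5      # GT/s per lane for PCIe generation 1
      ENCODING = 8 / 10         # 8b/10b line encoding
      LANES = 8                 # assumed x8 link

      gbytes_per_sec = LANE_RATE_GT_S * ENCODING * LANES / 8   # bits to bytes
      print(gbytes_per_sec)     # 2.0 GBytes/sec raw; TLP and flow-control overhead
                                # bring real transfers down to roughly 1.4-1.5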

      Max NFSv3 streaming disk read

      As much as I tried, I could not achieve a workload confined strictly to disk reads.  The problem is not with the SS7410 but rather with the number of clients in my fabric.  In order to obtain results for this test, I will have to increase the number or capabilities of my clients.

      Max NFSv3 streaming disk write

      Using the same 10 IB clients from my read experiments, I drive 2 streaming write threads per client.  Each thread uses a 32KB block size to stream to a separate file residing on a separate share.
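      Here is a rough Python stand-in for that write workload (not the actual load generator I used; the share mount points are placeholders):

      # 2 threads per client, each streaming 32 KB blocks to its own file on
      # its own share, until the test is stopped.
      import threading

      SHARES = ["/mnt/ib_share_a", "/mnt/ib_share_b"]   # one hypothetical share per thread
      BLOCK = b"\0" * 32 * 1024                         # 32 KB write size

      def stream_write(path):
          with open(path, "wb") as f:
              while True:
                  f.write(BLOCK)

      threads = [threading.Thread(target=stream_write, args=(share + "/stream.dat",))
                 for share in SHARES]
      for t in threads:
          t.start()
      for t in threads:
          t.join()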


      I was pleasantly surprised to see that we can indeed break the 1 GByte/sec maximum Brendan saw with ethernet.  The 1 GByte/sec result is obtained by multiplying the NFS write IOPS by the write size.  I am unable to sanity check this result against the network throughput in Analytics, as we are bypassing the TCP/IP stack.  I can, though, confirm the throughput on the fabric subnet manager using the port counters exported by each HCA port.  According to the port counters, I am seeing roughly a 1 GByte/second receive rate.  Using the port counters is not precise, as the time it takes to collect the information varies and the counters (being 32-bit in length) can wrap.  But the counters do provide a way to confirm our transport throughput in the absence of Analytics for RDMA.  On the subnet manager, mlx4_0 (LID/Port 5/2) is attached to switch A and mthca0 (LID/Port 3/2) is attached to switch C in the IB fabric topology.

      mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
                  5/2                         333697            518640843
      mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
                  3/2                           3030            518400821
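      For completeness, here is roughly how a bytes/second figure falls out of two samples of a 32-bit port counter; the sample values and interval below are placeholders rather than numbers from this run, and the standard IB data counters (PortXmitData/PortRcvData) count 4-byte words, hence the word_bytes factor:

      # Turn two samples of a 32-bit counter into bytes/second, tolerating a
      # single wrap between samples.
      WRAP = 1 << 32

      def bytes_per_second(prev_count, curr_count, interval_sec, word_bytes=4):
          delta = (curr_count - prev_count) % WRAP    # modulo handles one wrap
          return delta * word_bytes / interval_sec

      # e.g. two samples of a receive counter taken 10 seconds apart:
      print(bytes_per_second(prev_count=3000000000, curr_count=296300000,
                             interval_sec=10))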

      Max  NFSv3 read ops/sec

      As was the case with streaming reads from disk, my clients are insufficiently configured to push a maximum workload.  I will need to increase the number of clients and try again. 

      IPoIB

      The IPoIB protocol uses the TCP/IP network stack to transmit and receive network packets.  Unlike RDMA, which bypasses the network stack, IPoIB suffers from some of the performance penalties inherent in the traditional TCP/IP software stack.



      I re-ran the tests described above and summarize the results here. 


                                       RDMA                    IPoIB
       NFSv3 Streaming DRAM Read       3.18 GBytes/second      2.24 GBytes/second
       NFSv3 Streaming Disk Read       Not Available
       NFSv3 Streaming Write           1.00 GBytes/second      753 MBytes/second
       NFSv3 Max IOPS
      As I build up my IB fabric with more or better clients, I'll update the results that I was unable to capture this time around. The next step is to build out and attach the 7410 to a QDR-based fabric with at least 20 clients.  This should provide a client workload large enough to push the 7410 to its maximum potential.
