Infiniband Performance Limits: Take 1

As promised in my last post, I upgraded my Infiniband test fabric to include a more powerful Sun Storage 7410.  As luck would have it, Brendan just finished up his tests for the 7410 with the Istanbul processor upgrade and the system was available for IB testing.     In my last set of experiments,  I quickly exhausted my CPU, memory, and disk capabilities with the 8 clients connected to my IB fabric. Here, I've significantly upgraded my filer and added two more clients. 


Fabric Configuration

Filer: Sun Storage 7410, with the following config:

  • 256 Gbytes DRAM
  • 8 JBODs, each with 24 x 1 Tbyte disks, configured with mirroring
  • 4 sockets of six-core AMD Opteron 2600 MHz CPUs (Istanbul)
  • 2 Sun DDR Dual Port Infiniband HCA
  • 3 HBA cards
  • noatime on shares, and database record size left at 128 Kbytes

Clients: 10 blades, each with:
  • 2 sockets of Intel Xeon quad-core 1600 MHz CPUs
  • 3 Gbytes of DRAM
  • 1 Sun DDR Dual Port Infiniband HCA Express Module
  • mount options:
    • read tests: mounted forcedirectio (to skip client caching), and rsize to match the workload
    • write tests: default mount options

Switches: 2 internal Sun DataCenter 3x24 Infiniband switches (A and C)

Subnet manager:

  • Centos 5.2
  • Sun HPC Software, Linux Edition
  • 2 Sun DDR Dual Port Infiniband HCA

Most of the performance results you'll find reported for Infiniband (RDMA or IPoIB) are limited to cached workloads. While these types of tests help to evaluate the raw capabilities of the transport, they don't necessarily show how a storage system behaves or what the possible benefits are. Brendan chose his tests and workloads to demonstrate the 7410's maximum capabilities. The goal of the following experiments is to duplicate what Brendan demonstrated for ethernet and point out where the bottlenecks or problem spots are for Infiniband.

RDMA

NFS over the RDMA protocol is available in the 2009.Q3 software release for clients that support it.  RDMA (Remote Direct Memory Access) moves data directly from memory on one host to memory on another.  The details of moving the data are left to the hardware, in our case the Infiniband HCAs.  The advantage is that we can bypass the network and device software stacks and eliminate many of the data copies performed by the CPU.  We should see a reduction in CPU utilization and an increase in the amount of data we can transfer between the clients and the NFS server.

Max NFSv3 streaming cached read

This test demonstrates the maximum read throughput we can achieve over NFSv3/RDMA.  Ten clients read a 1 GByte file that is cached entirely in DRAM on the 7410 filer.  Each client runs 10 threads, each performing 128 KByte reads from the filer and dumping the data into its own DRAM.  This is effectively the same test used to publish typical results for the IB transport.
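
For reference, here is a rough sketch in Python of what each client's read load looks like.  The mount point and file name are placeholders, and the actual load generator may well differ; the point is simply 10 threads per client issuing 128 KByte reads against a file that fits in the filer's DRAM.

import threading
import time

MOUNT_POINT = "/mnt/ib_share"        # placeholder: forcedirectio-mounted share
FILE_NAME = MOUNT_POINT + "/1g.dat"  # placeholder: the 1 GByte cached file
READ_SIZE = 128 * 1024               # matches the 128 KByte request size
THREADS = 10                         # read threads per client
DURATION = 60                        # seconds to run the load

def read_loop(path, stop_at):
    # Stream the file repeatedly in READ_SIZE chunks until time is up.
    with open(path, "rb", 0) as f:
        while time.time() < stop_at:
            if not f.read(READ_SIZE):
                f.seek(0)            # wrap at EOF and keep reading

stop_at = time.time() + DURATION
workers = [threading.Thread(target=read_loop, args=(FILE_NAME, stop_at))
           for _ in range(THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()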


I am able to reach a bit beyond Brendan's 3.06 GBytes/sec with half the number of clients, and reduce my CPU utilization to just 30%.  From the NFSv3 read IOPS observed in Analytics (24041), we can calculate the throughput by multiplying by the read size (128 KBytes): 3.15 GBytes/sec.  For confirmation, I can observe the throughput for both IB ports on the subnet manager, where we reach 3.18 GBytes/sec.  The 3.18 GBytes/sec at the port level includes the additional header information imposed by the transport.

mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
            5/2                         3729230           1500731797
mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
            3/2                         2682860           1580370155
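
To make the arithmetic explicit, here is a quick sketch in Python of the NFS-level calculation and the port-level sum.  The figures are the ones reported above; note that the particular counter snapshot shown sums to a little under the 3.18 GBytes/sec peak, since the port counters are only approximate.

# NFS-level throughput: read IOPS times the 128 KByte request size.
read_iops = 24041
read_size = 128 * 1024
print("NFS level : %.2f GBytes/sec" % (read_iops * read_size / 1e9))   # ~3.15

# Port-level throughput: the sum of the RECV rates on the two HCA ports.
# This includes transport headers, and the snapshot above is a single,
# imprecise sample.
recv_rates = [1500731797, 1580370155]
print("Port level: %.2f GBytes/sec" % (sum(recv_rates) / 1e9))          # ~3.08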

The bottleneck, however, is the PCI Express 1.0 I/O interconnect.  The PCI Express 1.0 root complexes can (in practice) reach only 1.4-1.5 GBytes/sec.  Using Brendan's amd64htcpu script, we can see that the interconnects are at or near their maximums:

     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      5011.33      1374.05      4594.51         0.00
          1      6982.65      6366.86      1890.57         0.00
          2      5392.58      4343.28      5773.35         0.00
          3      5228.30      5664.78      4247.36         0.00
     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      4852.97      1329.00      4442.26         0.00
          1      7011.03      6385.20      1893.62         0.00
          2      5361.24      4331.55      5741.79         0.00
          3      5201.71      5643.10      4244.37         0.00
     Socket  HT0 TX MB/s  HT1 TX MB/s  HT2 TX MB/s  HT3 TX MB/s
          0      6257.99      1705.76      5716.17         0.00
          1      6036.49      5462.77      1614.62         0.00
          2      5380.50      4360.88      5827.16         0.00
          3      5207.53      5586.87      4231.31         0.00
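
As a sanity check on that 1.4-1.5 GBytes/sec figure, here is a rough calculation of the theoretical ceiling for a PCI Express 1.0 x8 link (x8 is assumed here, which is typical for these HCAs; the practical number depends on payload size and chipset).

# PCIe 1.0 signals at 2.5 GT/s per lane with 8b/10b encoding, so each lane
# carries at most 250 MBytes/sec of data in each direction.
lanes = 8                                    # assumed link width
bytes_per_lane = 2.5e9 * 8.0 / 10.0 / 8.0    # 250e6 bytes/sec per lane
theoretical = lanes * bytes_per_lane         # 2.0 GBytes/sec per direction

# Packet headers, flow control and completion overhead reduce this further,
# which is where the practical 1.4-1.5 GBytes/sec figure comes from.
print("PCIe 1.0 x8 theoretical ceiling: %.1f GBytes/sec" % (theoretical / 1e9))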

Max NFSv3 streaming disk read

As much as I tried, I could not achieve a workload confined strictly to disk reads.  The problem is not with the 7410 but rather with the number of clients in my fabric.  To obtain results for this test, I will have to increase the number or the capabilities of my clients.

Max NFSv3 streaming disk write

Using the same 10 IB clients from my read experiments, I will drive two streaming write threads per client.  Each thread uses a 32 KByte block size to stream to a separate file residing on a separate share.
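
For reference, a minimal sketch in Python of each client's write load follows; the share mount points and file name below are placeholders, and the actual load generator may differ.

import os
import threading
import time

WRITE_SIZE = 32 * 1024                    # 32 KByte writes
DURATION = 60                             # seconds to run the load
SHARES = ["/mnt/share0", "/mnt/share1"]   # placeholders: one share per thread

def write_loop(share, stop_at):
    # Stream zero-filled WRITE_SIZE blocks to a file on this share.
    buf = b"\0" * WRITE_SIZE
    with open(os.path.join(share, "stream.dat"), "wb") as f:
        while time.time() < stop_at:
            f.write(buf)

stop_at = time.time() + DURATION
workers = [threading.Thread(target=write_loop, args=(s, stop_at)) for s in SHARES]
for w in workers:
    w.start()
for w in workers:
    w.join()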


I was pleasantly surprised to see that we can indeed break the 1 GByte/sec maximum Brendan saw with ethernet.  The 1 GByte/sec result is obtained by multiplying the NFS write IOPS by the write size.  I am unable to sanity-check this result against the network throughput in Analytics, as we are bypassing the TCP/IP stack.  I can, though, confirm the throughput on the fabric subnet manager using the port counters exported by each HCA port.  According to the port counters, I am seeing roughly a 1 GByte/second receive rate.  Using the port counters is not precise, as the time it takes to collect the information varies and the counters (being 32 bits in length) can wrap.  But the counters do provide a way to confirm the transport throughput in the absence of Analytics for RDMA.  On the subnet manager, mlx4_0 (LID/Port 5/2) is attached to switch A and mthca0 (LID/Port 3/2) is attached to switch C in the IB fabric topology.

mlx4_0 LID/Port              XMIT bytes/second    RECV bytes/second
            5/2                         333697            518640843
mthca0 LID/Port              XMIT bytes/second    RECV bytes/second
            3/2                           3030            518400821
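
To make the counter caveat concrete, here is a sketch of deriving a receive rate from two samples of a 32-bit port counter while tolerating a single wrap.  read_port_counter() is a placeholder for however the raw PortRcvData value is fetched (for example, by parsing perfquery output).

import time

WRAP = 1 << 32                       # the port counters are 32 bits wide

def read_port_counter():
    # Placeholder: return the raw PortRcvData value for the port of interest.
    raise NotImplementedError

def recv_bytes_per_second(interval=1.0):
    first = read_port_counter()
    time.sleep(interval)
    second = read_port_counter()
    delta = (second - first) % WRAP  # tolerates a single counter wrap
    return delta * 4 / interval      # PortRcvData counts in 4-byte units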

Max NFSv3 read ops/sec

As was the case with streaming reads from disk, my clients are insufficiently configured to push a maximum workload.  I will need to increase the number of clients and try again. 

IPoIB

The IPoIB protocol uses the TCP/IP network stack to transmit and receive network packets.  Unlike RDMA, which bypasses the network stack, IPoIB suffers from some of the performance implications inherent in the traditional TCP/IP software stack.



I re-ran the tests described above and summarize the results here. 


                             RDMA                  IPoIB
NFSv3 Streaming DRAM Read    3.18 GBytes/second    2.24 GBytes/second
NFSv3 Streaming Disk Read    Not Available         Not Available
NFSv3 Streaming Write        1.00 GBytes/second    753 MBytes/second
NFSv3 Max IOPS               Not Available         Not Available


As I build up my IB fabric with more or better clients, I'll update the results that I was unable to capture this time around. The next step is to build out and attach the 7410 to a QDR-based fabric with at least 20 clients.  This should provide a client workload large enough to push the 7410 to its maximum potential.
