SPARC T4-1 Server Outperforms Intel (Westmere AES-NI) on IPsec Encryption Tests

Oracle's SPARC T4 processor delivers significantly higher performance than the Intel Xeon X5690 processor when both run Oracle Solaris 11 secure IP networking (IPsec). Using IPsec AES-256-CCM mode, the SPARC T4 processor achieves line speed over a 10 GbE network.

  • On IPsec, the SPARC T4 processor is 23% faster than the 3.46 GHz Intel Xeon X5690 processor (with Intel AES-NI).

  • The SPARC T4 processor is at only 23% utilization when running at its maximum throughput, making it 3.6 times more efficient at secure networking than the 3.46 GHz Intel Xeon X5690 processor.

  • The 3.46 GHz Intel Xeon X5690 processor is nearly fully utilized at its maximum throughput, leaving little CPU headroom for application processing.

  • The SPARC T4 processor using IPsec AES-256-CCM mode achieves line speed over a 10 GbE network.

  • The SPARC T4 processor approaches line speed with fewer than one-quarter the number of IPsec streams required for the Intel Xeon X5690 processor to achieve its peak throughput. The SPARC T4 processor supports the additional streams with minimal extra CPU utilization.

IPsec provides general-purpose network security that is transparent to applications, making it ideal for adding encryption to networking applications that do not have cryptography built in. IPsec extends well beyond the Virtual Private Network (VPN) deployments where the technology is often first encountered.

Performance Landscape

Performance was measured using the AES-256-CCM cipher, in megabits per second (Mb/sec) aggregated over enough TCP/IP streams either to reach the line-rate threshold (SPARC T4 processor) or to drive peak throughput (Intel Xeon X5690 processor).

                                 AES Decrypt                      AES Encrypt
Processor          GHz    B/W (Mb/sec)  CPU Util  Streams  B/W (Mb/sec)  CPU Util  Streams

Peak performance
SPARC T4           2.85   9,800         23%       96       9,800         20%       78
Intel Xeon X5690   3.46   8,000         83%                4,700         81%

Load at which SPARC T4 processor performance crosses 9000 Mb/sec
SPARC T4           2.85   9,300         19%       17       9,200         15%       17
Intel Xeon X5690   3.46   4,700         41%                3,200         47%
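The headline "23% faster" and "3.6 times more efficient" figures follow directly from the peak-performance rows above; as a quick arithmetic sketch (using the AES decrypt column, nothing beyond the table's own numbers):

```python
# Peak-throughput figures taken from the table above (AES decrypt column).
sparc_bw, sparc_util = 9800, 0.23   # Mb/sec, fraction of CPU used
intel_bw, intel_util = 8000, 0.83

# "23% faster": ratio of peak bandwidths (9,800 / 8,000 = 1.225, i.e. ~23%).
speedup = sparc_bw / intel_bw - 1

# "3.6 times more efficient": ratio of CPU utilization at peak throughput
# (0.83 / 0.23 is approximately 3.6).
efficiency = intel_util / sparc_util

print(f"~{speedup * 100:.1f}% faster, {efficiency:.1f}x more efficient")
```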

Configuration Summary

SPARC Configuration:

SPARC T4-1 server
1 x SPARC T4 processor 2.85 GHz
128 GB memory
Oracle Solaris 11
Single 10-Gigabit Ethernet XAUI Adapter

Intel Configuration:

Sun Fire X4270 M2
1 x Intel Xeon X5690 3.46 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Sun Dual Port 10GbE PCIe 2.0 Networking Card with Intel 82599 10GbE Controller

Driver Systems Configuration:

2 x Sun Blade 6000 chassis each with
1 x Sun Blade 6000 Virtualized Ethernet Switched Network Express Module 10GbE (NEM)
10 x Sun Blade X6270 M2 server modules each with
2 x Intel Xeon X5680 3.33 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Dual 10-Gigabit Ethernet Fabric Expansion Module (FEM)

Benchmark Configuration:

Netperf 2.4.5 network benchmark, adapted to test the aggregate bandwidth of multiple streams.

Benchmark Description

The results here are derived from runs of the Netperf 2.4.5 benchmark. Netperf is a client/server benchmark that measures network performance and provides a number of independent tests, including the TCP streaming bandwidth tests used here.

Netperf is, however, a single-stream benchmark, and demonstrating peak network bandwidth over an encrypted 10 GbE line requires many streams.

The Netperf documentation provides an example of using the software to drive multiple streams. The example is not sufficient to develop this workload because it does not scale beyond a single driver node, which limits the processing power that can be applied and, in turn, how many full-bandwidth streams can be supported. We chose to run a single server process on the target system (containing either the SPARC T4 processor or the Intel Xeon processor) and to spawn one or more Netperf client processes on each system in a cluster of driver systems. The client processes are managed by the mpirun program of the Oracle Message Passing Toolkit.
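A minimal sketch of that orchestration, assuming hypothetical host names and counts (only the command construction is shown; an actual run would need mpirun and netperf installed on the cluster):

```python
# Build an mpirun command line that launches one netperf TCP_STREAM client
# per slot across the driver systems, all aimed at the target server.
# Host names, counts, and durations here are illustrative, not the values
# used in the published runs.
drivers = ["driver01", "driver02", "driver03"]
clients_per_driver = 4
target = "t4-target"
duration = 120          # seconds per streaming run
msg_size = 1_048_576    # message size used in the tests described below

cmd = (
    ["mpirun", "-np", str(len(drivers) * clients_per_driver),
     "--host", ",".join(drivers)]
    + ["netperf", "-H", target, "-t", "TCP_STREAM",
       "-l", str(duration), "--", "-m", str(msg_size)]
)
# subprocess.run(cmd) would then launch the aggregate workload.
```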

Tabular results include aggregate bandwidth and CPU utilization. The aggregate bandwidth is computed by dividing the total traffic of the client processes by the overall runtime. CPU utilization on the target system is the average of that reported by all of the Netperf client processes.
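The aggregation just described can be sketched as follows; the per-stream figures are illustrative placeholders, not measured values:

```python
# Hypothetical per-client results: (bytes transferred, reported target CPU %).
# Real runs would collect these from the Netperf client processes.
streams = [
    (1_250_000_000, 21.0),
    (1_180_000_000, 23.5),
    (1_310_000_000, 22.0),
]
runtime_sec = 10.0  # overall wall-clock runtime of the aggregate run

# Aggregate bandwidth: total traffic divided by the overall runtime,
# converted to megabits per second (1 Mb = 10**6 bits).
total_bits = sum(nbytes * 8 for nbytes, _ in streams)
aggregate_mbps = total_bits / runtime_sec / 1e6

# Target CPU utilization: the average of what the clients report.
cpu_util = sum(util for _, util in streams) / len(streams)

print(f"{aggregate_mbps:.0f} Mb/sec at {cpu_util:.1f}% CPU")
```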

IPsec is configured in the operating system of each participating server transparently to Netperf and applied to the dedicated network connecting the target system to the driver systems.

Key Points and Best Practices

  • Line speed is defined as data bandwidth within 10% of the theoretical maximum bit rate of the network line. For 10 GbE, bandwidth greater than 9,000 Mb/sec is considered line speed.

  • IPsec provides network security that is configured completely in the operating system and is transparent to the application.

  • Peak bandwidths under IPsec are achieved only in aggregate with multiple client network streams to the target server.

  • Oracle Solaris receiver fanout must be increased from the default to support the large number of streams at the quoted peak rates.

  • The ixgbe network driver, used on servers with Intel 82599 10GbE controllers (the driver systems and the Intel Xeon target system), was limited to a single receiver queue to maximize utilization of the extra fanout.

  • IPsec is configured to make a unique security association (SA) for each connection to avoid a bottleneck at large stream counts.

  • Jumbo frames are enabled (MTU of 9000) and network interrupt blanking (sometimes called interrupt coalescence) is disabled.

  • The TCP streaming bandwidth tests, which run continuously for minutes and multiple times to determine statistical significance, are configured to use message sizes of 1,048,576 bytes.

  • The IPsec configuration specifies that each SA is established using a preshared key and Internet Key Exchange (IKE).

  • IPsec encryption uses the Solaris Cryptographic Framework which applies the appropriate accelerated provider on both the SPARC T4 processor and the Intel Xeon processor.

  • There is no need to configure a specific authentication algorithm for IPsec: with the Encapsulated Security Payload (ESP) security protocol and AES-256-CCM as the encryption algorithm, the encapsulation is self-authenticating.
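As a hypothetical illustration of the SA and self-authentication points above, a Solaris 11 policy entry of roughly this shape could appear in /etc/inet/ipsecinit.conf. The addresses are placeholders, and the exact encryption-algorithm token should be verified with ipsecalgs(1M) and ipsecconf(1M) on the target release:

```
# Hypothetical policy entry: protect traffic between the target (laddr) and a
# driver system (raddr) with ESP and AES-256-CCM. No auth_algs entry is needed
# because CCM mode is self-authenticating; "sa unique" requests a unique SA
# per connection, as described above.
{laddr 192.0.2.1 raddr 192.0.2.2} ipsec {encr_algs aes-ccm(256) sa unique}
```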


Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

Comments:

I'm somewhat surprised by the comment about netperf being limited in how many drivers can be involved from the System Under Test (SUT). While the example in the manual:

http://www.netperf.org/svn/netperf2/tags/netperf-2.4.5/doc/netperf.html#Running-Concurrent-Netperf-Tests

does only target a single "other system" there is nothing to preclude scripting things to target multiple systems. That is something done by the "runemomniagg2.sh" script which can be found at http://www.netperf.org/svn/netperf2/trunk/doc/examples/runemomniagg2.sh

Also, there can be known issues with the accuracy of the mechanism netperf uses to measure CPU utilization on systems running Solaris - from http://www.netperf.org/svn/netperf2/trunk/src/netcpu_kstat10.c

/* this is now the fun part. we have the nanoseconds _allegedly_
spent in user, idle and kernel. We also have nanoseconds spent
servicing interrupts. Sadly, in the developer's finite wisdom,
the interrupt time accounting is in parallel with the other
accounting. this means that time accounted in user, kernel or
idle will also include time spent in interrupt. for netperf's
porpoises we do not really care about that for user and kernel,
but we certainly do care for idle. the $64B question becomes -
how to "correct" for this?

we could just subtract interrupt time from idle. that has the
virtue of simplicity and also "punishes" Sun for doing
something that seems to be so stupid. however, we probably
have to be "fair" even to the allegedly stupid so the other
mechanism, suggested by a Sun engineer is to subtract interrupt
time from each of user, kernel and idle in proportion to their
numbers. then we sum the corrected user, kernel and idle along
with the interrupt time and use that to calculate a new idle
percentage and thus a CPU util percentage.

that is what we will attempt to do here. raj 2005-01-28

of course, we also have to wonder what we should do if there is
more interrupt time than the sum of user, kernel and idle.
that is a theoretical possibility I suppose, but for the
time-being, one that we will blythly ignore, except perhaps for
a quick check. raj 2005-01-31
*/

So, unless confirmed by other means, I would take the CPU utilization figures for Solaris with a grain or three of salt. Further, one probably cannot rely on the error being systematic.

Posted by rick jones on September 30, 2011 at 12:50 PM PDT #

My first attempt, which included links in support of the discussion was marked as spam, so the short version. While the example in the manual only targets one "other" system, one can indeed target more than one other system with some simple scripting. That is what is done with the runemomniagg2 script in the netperf repository.

Also, there are non-trivial questions about the accuracy of the CPU utilization measurement mechanisms used by netperf to gather CPU utilization under Solaris. And they are not necessarily systematic errors, so one should take the CPU utilization numbers with several grains of salt. Those wanting to discuss the matter further are encouraged to do so in the netperf-talk mailing list.

Posted by rick jones on September 30, 2011 at 12:53 PM PDT #

Reasonable efforts were made using other means of measuring utilization (some as dependent on the kstat interface, some not as dependent) to confirm consistency with the trends reported by netperf.

Posted by participating engineer on October 04, 2011 at 08:12 AM PDT #

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
