Monday Oct 26, 2015

Virtualized Network Performance: SPARC T7-1

Oracle's SPARC T7-1 server using Oracle VM Server for SPARC demonstrates low network latency under virtualization. Network latency and bandwidth were measured using the Netperf benchmark.

  • TCP network latency between two Oracle VM Server for SPARC guests running on separate SPARC T7-1 servers each using SR-IOV is similar to that of two SPARC T7-1 servers without virtualization (native/bare metal).

  • TCP and UDP network latencies between two Oracle VM Server for SPARC guests running on separate SPARC T7-1 servers each using assigned I/O were significantly lower than with the other two I/O configurations (SR-IOV and paravirtual I/O).

  • TCP and UDP network latencies between two Oracle VM Server for SPARC guests running on separate SPARC T7-1 servers each using SR-IOV were significantly lower than when using paravirtual I/O.

Terminology notes:

  • VM – virtual machine
  • guest – encapsulated operating system instance, typically running in a VM
  • assigned I/O – network hardware driven directly and exclusively by guests
  • paravirtual I/O – network hardware driven by the host, and indirectly by guests via paravirtualized drivers
  • SR-IOV – single root I/O virtualization; virtualized network interfaces provided by the network hardware, driven directly by guests
  • LDom – logical domain (previous name for Oracle VM Server for SPARC)

Performance Landscape

The following tables show the results for the TCP and UDP Netperf latency and bandwidth tests (single stream). Netperf latency, often called the round-trip time, is measured in microseconds (usec); smaller is better. Bandwidth is measured in megabits per second (Mb/sec); bigger is better.

TCP

Networking Method     Netperf Latency (usec)      Bandwidth (Mb/sec)
                      MTU=1500    MTU=9000        MTU=1500    MTU=9000
Native/Bare Metal     58          58              9100        9900
assigned I/O          51          51              9400        9900
SR-IOV                58          59              9400        9900
paravirtual I/O       91          91              4800        9800


UDP

Networking Method     Netperf Latency (usec)      Bandwidth (Mb/sec)
                      MTU=1500    MTU=9000        MTU=1500    MTU=9000
Native/Bare Metal     57          57              9100        9900
assigned I/O          51          51              9400        9900
SR-IOV                66          63              9400        9900
paravirtual I/O       98          97              4800        9800

Specifically, the Netperf benchmark latency:
  • is the average request/response time, computed as the inverse of the throughput reported by the program,
  • is measured within the program from 20 sample runs of 30 seconds each,
  • uses single-in-flight (i.e., non-burst) 1-byte messages,
  • is measured between separate servers connected by 10 GbE,
  • for each test, uses servers connected back-to-back (no network switch) and configured identically: native or guest VM.
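
For reference, a minimal sketch of how such measurements can be driven with Netperf 2.6.0; the exact options used in this study are not published, so the host name, test length, and iteration counts below are illustrative assumptions:

    # On the remote server, start the Netperf server daemon
    netserver

    # Request/response (latency) test: single-in-flight 1-byte messages,
    # 30-second samples, forced to 20 iterations as one way to approximate
    # the 20 samples described above (the study used the omni
    # request/response test; TCP_RR is the classic equivalent)
    netperf -H remote-host -t TCP_RR -l 30 -i 20,20 -- -r 1,1

    # Stream (bandwidth) test with 1 MB messages
    netperf -H remote-host -t TCP_STREAM -l 30 -- -m 1048576

The round-trip latency in microseconds is then 1,000,000 divided by the transaction rate reported by the request/response test.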

Configuration Summary

System Under Test:

2 x SPARC T7-1 servers, each with
1 x SPARC M7 processor (4.13 GHz)
256 GB memory (16 x 16 GB)
2 x 600 GB 10K RPM SAS-2 HDD
10 GbE (on-board and PCIe network devices)
Oracle Solaris 11.3
Oracle VM Server for SPARC 3.2

Benchmark Description

The Netperf 2.6.0 benchmark was used to evaluate native and virtualized (LDoms) network performance. Netperf is a client/server benchmark that measures network performance and provides a number of independent tests; the omni Request/Response (aka ping-pong) test with the TCP or UDP protocol was used here to obtain the Netperf latency measurements, and the TCP stream test was used for bandwidth. Netperf was run between separate servers connected back-to-back (no network switch) by a 10 GbE network interconnect.

To measure the cost of virtualization, the servers were configured identically for each test: either native (without virtualization) or guest VM. In the virtualized configurations, each server used the same representative method of connecting the guest to the network hardware (assigned I/O, paravirtual I/O, or SR-IOV).

Key Points and Best Practices

  • Oracle VM Server for SPARC requires explicit partitioning of guests into Logical Domains of bound CPUs and memory, typically chosen to be local, and does not provide dynamic load balancing between guests on a host.

  • Oracle VM Server for SPARC guests (LDoms) were assigned 32 virtual CPUs (4 complete processor cores) and 64 GB of memory. The control domain served as the I/O domain (for paravirtualized I/O) and was assigned 4 cores and 64 GB of memory.

  • Each latency average reported was computed from the inverse of the reported throughput (similar to the transaction rate) of a Netperf Request/Response test run using 20 samples (aka iterations) of 30-second measurements of non-concurrent 1-byte messages.

  • To obtain a meaningful average latency from a Netperf Request/Response test, it is important that the transactions consist of single messages, which is Netperf's default. If, for instance, Netperf options for burst and TCP_NODELAY are turned on, multiple messages can overlap in the transactions and the reported transaction rate or throughput cannot be used to compute the latency.

  • All results were obtained with interrupt coalescence (aka interrupt throttling, interrupt blanking) turned on in the physical NIC and, if applicable, for the attachment driver in the guest. Interrupt coalescence turned on is also the default for all the platforms used here.

  • All the results were obtained with large receive offload (LRO) turned off in the physical NIC, and, if applicable, for the attachment driver in the guest, in order to reduce the network latency between the two guests.

  • The Netperf bandwidth test used 1 MB (1,048,576 byte) send and receive messages.

  • The paravirtual variation of the measurements refers to the use of a paravirtualized network driver in the guest instance. IP traffic consequently is routed across the guest, the virtualization subsystem in the host, a virtual network switch or bridge (depending upon the platform), and the network interface card.

  • The assigned I/O variation of the measurements refers to the use of the card's driver in the guest instance itself. This is made possible by assigning the device exclusively to the guest. Device assignment results in less (software) routing for IP traffic and consequently less overhead than using paravirtualized drivers, but virtualization can still impose significant overhead. Note also that NICs used in this way cannot be shared among guests and may preclude the use of certain other VM features, such as migration. The T7-1 system has four on-board 10 GbE devices, but all of them are connected to the same PCIe branch, making it impossible to configure them as assigned I/O devices. Using a PCIe 10 GbE NIC allows configuring it as an assigned I/O device.

  • In the context of Oracle VM Server for SPARC and these tests, assigned I/O refers to PCI endpoint device assignment, while paravirtualized I/O refers to virtual I/O using a virtual network device (vnet) in the guest connected to a virtual switch (vsw) through the I/O domain to the physical network device (NIC).
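
For reference, a minimal sketch of how these three I/O methods can be configured with the ldm command; the guest domain name (guest1), physical network device (net0), and PCIe/virtual-function paths are illustrative assumptions, not the configuration used in this study:

    # Paravirtual I/O: virtual switch in the I/O domain, vnet in the guest
    ldm add-vsw net-dev=net0 primary-vsw0 primary
    ldm add-vnet vnet0 primary-vsw0 guest1

    # SR-IOV: create a virtual function on the physical NIC, then assign it to the guest
    ldm create-vf /SYS/MB/NET0/IOVNET.PF0
    ldm add-io /SYS/MB/NET0/IOVNET.PF0.VF0 guest1

    # Assigned I/O: PCI endpoint device assignment of the whole NIC to the guest
    ldm add-io /SYS/MB/PCIE1 guest1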

See Also

Disclosure Statement

Copyright 2015, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 25 October 2015.

Friday Feb 08, 2013

Improved Oracle Solaris 10 1/13 Secure Copy Performance for High Latency Networks

With Oracle Solaris 10 1/13, the performance of secure copy (scp) is significantly improved for high-latency networks.

  • With a TCP receive window size of up to 1 MB enabled, Oracle Solaris 10 1/13 delivers up to 8 times faster transfers over the latency range of 50 to 200 msec compared to the previous Oracle Solaris 10 8/11 release.

  • The default TCP receive window size of 48 KB delivered similar performance in both Oracle Solaris 10 1/13 and Oracle Solaris 10 8/11.

  • In this study, settings above 1 MB for the TCP receive window size delivered similar performance to the 1 MB results.

  • The tuning of the TCP receive window has been available in Oracle Solaris for some time. This improved performance is available with Oracle Solaris 10 1/13 and Oracle Solaris 11.

Performance Landscape

[Figure: SPARC T4-4 scp performance (T4-4_SSH_SCP.png)]

[Figure: Sun Fire X4170 M2 scp performance (X4170M2_SSH_SCP.png)]

Configuration Summary

Test Systems:

SPARC T4-4 server
4 x SPARC T4 processor 3.0 GHz
1 TB memory
Oracle Solaris 10 1/13
Oracle Solaris 10 8/11

Sun Fire X4170 M2
2 x Intel Xeon X5675 3.06 GHz
48 GB memory
Oracle Solaris 10 1/13
Oracle Solaris 10 8/11

Driver System:

Sun Fire X4170 M2
2 x Intel Xeon X5675 3.06 GHz
48 GB memory
Oracle Solaris 10

Router / Programmable Delay System:

Sun Fire X4170 M2
2 x Intel Xeon X5675 3.06 GHz
48 GB memory
Oracle Solaris 10

Switch between the router and the two test systems:

Cisco Linksys SR2024C

Benchmark Description

This benchmark measures scp performance between two systems with variable router delays in the network between them. A 48 MB file was transferred while measuring the effects of varying the latency (network delay) and the TCP receive window size.
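
A minimal sketch of a single transfer measurement, assuming a 48 MB test file staged in /tmp and an illustrative target host name (testsys); the actual test scripts are not published:

    # Create a 48 MB test file in memory-backed /tmp on the source system
    mkfile 48m /tmp/testfile

    # Time the secure copy to /tmp on the target test system
    time scp /tmp/testfile user@testsys:/tmp/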

Key Points and Best Practices

  • The WAN emulator (aka hxbt) is used in the router to introduce delays. After setting the emulator, network function and characteristics were verified using Netperf latency and bandwidth tests between the driver and test systems.

  • Transfers were performed over a private, dedicated 1 GbE network.

  • Files were transferred to and from /tmp (i.e., in memory) on the test systems to minimize the effect of filesystem performance and variability on the measurements.

  • TCP receive windows larger than the default can be enabled using the system-wide parameter tcp_recv_hiwat (e.g., to enable 1024 KB windows using this method, use the command ndd -set /dev/tcp tcp_recv_hiwat 1048576). To make this change persistent, the command must be added to a system startup script (see the sketch after this list).

  • sshd on the target system must be restarted after increasing the enabled TCP receive buffer size before any benefit can be observed (e.g., restart with the command /usr/sbin/svcadm restart svc:/network/ssh:default).

  • Note that tcp_recv_hiwat is a system-wide variable that adjusts the entire TCP stack. Care, therefore, must be taken to make sure that changes do not adversely affect your environment.

  • Geographically distant servers can be affected by connection latencies of the kind presented here.
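
One way to make the tcp_recv_hiwat setting persistent across reboots is a legacy run-control script; this is a sketch under the assumption of an rc2.d script, and the file name S99tcptune is illustrative:

    #!/sbin/sh
    # Illustrative /etc/rc2.d/S99tcptune script: enable a 1 MB TCP receive
    # window at every boot (see the tcp_recv_hiwat discussion above)
    ndd -set /dev/tcp tcp_recv_hiwat 1048576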

See Also

Disclosure Statement

Copyright 2013, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 2/08/2013.

Thursday Sep 29, 2011

SPARC T4-1 Server Outperforms Intel (Westmere AES-NI) on IPsec Encryption Tests

Oracle's SPARC T4 processor has significantly greater performance than the Intel Xeon X5690 processor when both are using Oracle Solaris 11 secure IP networking (IPsec). The SPARC T4 processor using IPsec AES-256-CCM mode achieves line speed over a 10 GbE network.

  • On IPsec, the SPARC T4 processor is 23% faster than the 3.46 GHz Intel Xeon X5690 processor (with Intel AES-NI).

  • The SPARC T4 processor is only at 23% utilization when running at its maximum throughput, making it 3.6 times more efficient at secure networking than the 3.46 GHz Intel Xeon X5690 processor.

  • The 3.46 GHz Intel Xeon X5690 processor is nearly fully utilized at its maximum throughput leaving little CPU for application processing.

  • The SPARC T4 processor using IPsec AES-256-CCM mode achieves line speed over a 10 GbE network.

  • The SPARC T4 processor approaches line speed with fewer than one-quarter the number of IPsec streams required for the Intel Xeon X5690 processor to achieve its peak throughput. The SPARC T4 processor supports the additional streams with minimal extra CPU utilization.

IPsec provides general-purpose networking security that is transparent to applications. This is ideal for adding the capability to networking applications that do not have cryptography built in. IPsec provides for more than the Virtual Private Network (VPN) deployments where the technology is often first encountered.

Performance Landscape

Performance was measured using the AES-256-CCM cipher, in megabits per second (Mb/sec) aggregate, over sufficient numbers of TCP/IP streams to reach the line-rate threshold (SPARC T4 processor) or to drive peak throughput (Intel Xeon X5690).

Processor           GHz     AES Decrypt                           AES Encrypt
                            B/W (Mb/sec)  CPU Util  Streams       B/W (Mb/sec)  CPU Util  Streams
– Peak performance
SPARC T4            2.85    9,800         23%       96            9,800         20%       78
Intel Xeon X5690    3.46    8,000         83%                     4,700         81%
– Load at which SPARC T4 processor performance crosses 9000 Mb/sec
SPARC T4            2.85    9,300         19%       17            9,200         15%       17
Intel Xeon X5690    3.46    4,700         41%                     3,200         47%

Configuration Summary

SPARC Configuration:

SPARC T4-1 server
1 x SPARC T4 processor 2.85 GHz
128 GB memory
Oracle Solaris 11
Single 10-Gigabit Ethernet XAUI Adapter

Intel Configuration:

Sun Fire X4270 M2
1 x Intel Xeon X5690 3.46 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Sun Dual Port 10GbE PCIe 2.0 Networking Card with Intel 82599 10GbE Controller

Driver Systems Configuration:

2 x Sun Blade 6000 chassis each with
1 x Sun Blade 6000 Virtualized Ethernet Switched Network Express Module 10GbE (NEM)
10 x Sun Blade X6270 M2 server modules each with
2 x Intel Xeon X5680 3.33 GHz, Hyper-Threading and Turbo Boost active
48 GB memory
Oracle Solaris 11
Dual 10-Gigabit Ethernet Fabric Expansion Module (FEM)

Benchmark Configuration:

Netperf 2.4.5 network benchmark adapted for testing bandwidth of multiple streams in aggregate.

Benchmark Description

The results here are derived from runs of the Netperf 2.4.5 benchmark. Netperf is a client/server benchmark that measures network performance and provides a number of independent tests, including the TCP streaming bandwidth tests used here.

Netperf is, however, a single-network-stream benchmark, and demonstrating peak network bandwidth over a 10 GbE line under encryption requires many streams.

The Netperf documentation provides an example of using the software to drive multiple streams. The example is not sufficient to develop the workload because it does not scale beyond a single driver node, which limits the processing power that can be applied and, in turn, how many full-bandwidth streams can be supported. We chose to have a single server process on the target system (containing either the SPARC T4 processor or the Intel Xeon processor) and to spawn one or more Netperf client processes on each system in a cluster of driver systems. The client processes are managed by the mpirun program of the Oracle Message Passing Toolkit.
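
A sketch of this approach, assuming the Open MPI-style mpirun options of the Oracle Message Passing Toolkit; the host file, stream count, test length, and Netperf options are illustrative assumptions, not the exact invocation used:

    # A single Netperf server daemon runs on the target system
    netserver

    # From a head node, launch 96 Netperf TCP stream clients spread across the
    # driver systems listed in drivers.txt, all aimed at the target system
    mpirun -np 96 --hostfile drivers.txt \
        netperf -H target-host -t TCP_STREAM -l 120 -- -m 1048576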

Tabular results include aggregate bandwidth and CPU utilization. The aggregate bandwidth is computed by dividing the total traffic of the client processes by the overall runtime. CPU utilization on the target system is the average of that reported by all of the Netperf client processes.

IPsec is configured in the operating system of each participating server transparently to Netperf and applied to the dedicated network connecting the target system to the driver systems.

Key Points and Best Practices

  • Line speed is defined as data bandwidth within 10% of the theoretical maximum bit rate of the network line; for 10 GbE, bandwidth greater than 9000 Mb/sec is defined as line speed.

  • IPsec provides network security that is configured completely in the operating system and is transparent to the application.

  • Peak bandwidths under IPsec are achieved only in aggregate with multiple client network streams to the target server.

  • Oracle Solaris receiver fanout must be increased from the default to support the large numbers of streams at quoted peak rates.

  • The ixgbe network driver, used on servers with Intel 82599 10GbE controllers (the driver systems and the Intel Xeon target system), was limited to a single receiver queue to maximize utilization of the extra fanout.

  • IPsec is configured to make a unique security association (SA) for each connection to avoid a bottleneck over the large stream counts.

  • Jumbo frames are enabled (MTU of 9000) and network interrupt blanking (sometimes called interrupt coalescence) is disabled (a configuration sketch follows this list).

  • The TCP streaming bandwidth tests, which run continuously for minutes and multiple times to determine statistical significance, are configured to use message sizes of 1,048,576 bytes.

  • IPsec configuration defines that each SA is established through the use of a preshared key and Internet Key Exchange (IKE).

  • IPsec encryption uses the Solaris Cryptographic Framework which applies the appropriate accelerated provider on both the SPARC T4 processor and the Intel Xeon processor.

  • There is no need to configure a specific authentication algorithm for IPsec. With the Encapsulated Security Payload (ESP) security protocol and choosing AES-256-CCM for the encryption algorithm, the encapsulation is self-authenticating.
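
As a reference for the jumbo frames point above, a minimal sketch using the Oracle Solaris 11 dladm interface; the link name net0 is an illustrative assumption, and the driver-specific interrupt-blanking setting is not shown:

    # Enable jumbo frames (MTU 9000) on the 10 GbE link carrying the IPsec traffic
    dladm set-linkprop -p mtu=9000 net0

    # Verify the new MTU setting
    dladm show-linkprop -p mtu net0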

See Also

Disclosure Statement

Copyright 2011, Oracle and/or its affiliates. All rights reserved. Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Results as of 9/26/2011.

About

BestPerf is the source of Oracle performance expertise. In this blog, Oracle's Strategic Applications Engineering group explores Oracle's performance results and shares best practices learned from working on Enterprise-wide Applications.
