Monday Sep 15, 2008

SSH (& scp etc) gets faster on T2 processors

Great to see from Jan's recent blog entry that SunSSH has been enhanced to take advantage of the UltraSPARC T2 hardware cryptographic accelerators – see here for more details.

I will spend some time playing with this later this week and report more generally on the performance benefits I observe.

Friday Sep 05, 2008

SSE5 & AES?

Recent discussion about whether AMD's upcoming SSE5 instructions can be used to significantly accelerate (5X) AES can be seen here. Given that SSE5 doesn't seem to provide dedicated AES instructions, it's tricky to see how this can be readily achieved, especially given that for AES-CBC even special-purpose AES instructions only provide around 6X.

Any thoughts?



Wednesday Aug 13, 2008

T2/T2+ busstat

The busstat tool can be a useful performance tuning aid, allowing one to drill into the load an application is placing on the memory sub-system. One note of caution, however: on the T2/T2+ the bank_busy_stalls counters should not be used, as they return erroneous results that make it look like the application is causing bank busy stalls even when this is not the case. Future revs of busstat are aware of this, but in the interim, this is a performance counter to ignore when tuning your app.

Friday Jul 11, 2008

Elliptic Curve Cryptography (ECC) performance

On a recent rev of Nevada, I just ran some ECC (elliptic curve cryptography) microbenchmarks, comparing an UltraSPARC T2 using the HW crypto accelerators and a 3GHz Xeon:




These numbers are for ECDSA operations. The Xeon #s are from OpenSSL speed (optimized compilation), while the T2 #s are generated by interacting directly with the framework. These numbers are for binary curves using Galois field operations.

High performance software crypto on CMT processors

While cryptography is typically viewed as computationally intensive (and so less well suited to CMT processors), software implementations of common cryptographic algorithms can be readily optimized to excel on CMT processors. Current software implementations have been optimized for traditional processors, with multiple lookup tables sized to all fit in a processor's small level-1 caches. It is this use of multiple small tables that leads to the high computational overheads associated with most cryptographic implementations, because significant arithmetic is needed to form the table indices and to recombine the results from the various tables.

For example, consider the Kasumi algorithm, which is essential to 3G mobile telephony. In Kasumi, a block is 8 bytes, the key is 128 bits (although it is expanded to a 1024-bit key schedule before use), and processing consists of 8 rounds per block. While a variety of operations are performed per block, the most costly operation is termed FI and consists of the following (in C notation):

nine = (u16)(in>>7);
seven = (u16)(in&0x7F);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));

seven ^= (subkey>>9);
nine ^= (subkey&0x1FF);
nine = (u16)(S9[nine] ^ seven);
seven = (u16)(S7[seven] ^ (nine & 0x7F));
in = (u16)((seven<<9) + nine);
return( in );

where in and subkey are two-byte variables, S9 is a 512-element lookup table and S7 is a 128-element lookup table. This operation is performed three times per round, for a total of 24 times per block. Each FI operation requires 22 instructions (for SPARC), for a total of 576 FI-derived instructions per block. Given the abundance of logical and shift operations, it is apparent that superscalar processors will perform this function very well, with an Instructions Per Cycle (IPC) of 2.5 or more. In contrast, the Niagara single-strand IPC is around 0.65. Further, due to the compute intensive nature of the code, the single Niagara strand uses around two thirds of the processor core's issue resources. As a result, performance does not scale as additional Kasumi threads run on a core.

To overcome this problem, the implementation can be optimized to radically reduce the instruction count. A reduction in instruction count may be achieved by replacing large parts of the FI function with large lookup tables. In the original Kasumi code, the 16-bit elements are divided into two smaller elements, one of 7 bits and one of 9 bits. These smaller elements are processed independently and the results combined. While this ensures that the lookup tables are small, significant logical and arithmetic operations are required to split the 16-bit elements and later recombine the smaller 7-bit and 9-bit elements back into the 16-bit elements. Significant computational savings may be achieved by processing an entire 16-bit element at once, using large lookup tables, as shown below:


t0 = LT0[in];
t0 = t0 ^ subkey;
in = LT1[t0];


The new lookup tables (LT0 and LT1) are now much larger, each being composed of 65536 2-byte elements. Note that the lookup tables are constant, may be precomputed, and are independent of the keys. Using this approach, the FI function requires only five instructions, roughly a fourfold reduction from the original implementation. Further note that in both the optimized and the original code, the lookup table accesses are dependent and cannot be performed in parallel or prefetched in advance.
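
The equivalence of the two forms can be checked mechanically. The following is a minimal sketch, not the production code: the S7 and S9 contents are placeholder values of the right size rather than the real Kasumi S-boxes, and the helper names (fi_half, fi_init_tables, fi_original, fi_table, fi_count_mismatches) are made up for illustration. It builds LT0 and LT1 from the small tables and compares the table-driven FI against the original for every input and a sample of subkeys.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t u16;

/* Placeholder S-boxes: NOT the real Kasumi S7/S9 contents, just
 * tables of the right size so that the sketch is self-contained. */
static u16 S7[128], S9[512];

/* Merged tables, each indexed by a full 16-bit value (65536 x 2 bytes). */
static u16 LT0[65536], LT1[65536];

/* One half of FI: two dependent lookups, result packed as (seven<<9)+nine. */
static u16 fi_half(u16 nine, u16 seven)
{
    nine  = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    return (u16)((seven << 9) + nine);
}

/* Fill the placeholder S-boxes, then precompute LT0 and LT1 from them. */
void fi_init_tables(void)
{
    unsigned i;
    for (i = 0; i < 128; i++) S7[i] = (u16)((i * 45u + 7u) & 0x7F);
    for (i = 0; i < 512; i++) S9[i] = (u16)((i * 167u + 13u) & 0x1FF);
    for (i = 0; i < 65536; i++) {
        /* First half splits the input as nine = in>>7, seven = in&0x7F. */
        LT0[i] = fi_half((u16)(i >> 7), (u16)(i & 0x7F));
        /* Second half reads the packed form: nine = low 9 bits, seven = top 7. */
        LT1[i] = fi_half((u16)(i & 0x1FF), (u16)(i >> 9));
    }
}

/* FI as in the original listing: small tables plus split/recombine logic. */
u16 fi_original(u16 in, u16 subkey)
{
    u16 nine  = (u16)(in >> 7);
    u16 seven = (u16)(in & 0x7F);
    nine  = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    seven ^= (subkey >> 9);
    nine  ^= (subkey & 0x1FF);
    nine  = (u16)(S9[nine] ^ seven);
    seven = (u16)(S7[seven] ^ (nine & 0x7F));
    return (u16)((seven << 9) + nine);
}

/* FI using the two large tables: lookup, XOR with subkey, lookup. */
u16 fi_table(u16 in, u16 subkey)
{
    u16 t0 = LT0[in];
    t0 = (u16)(t0 ^ subkey);
    return LT1[t0];
}

/* Count disagreements over all inputs and a sample of subkeys. */
unsigned long fi_count_mismatches(void)
{
    unsigned long bad = 0;
    unsigned in, key;
    for (key = 0; key < 65536; key += 1021)
        for (in = 0; in < 65536; in++)
            if (fi_original((u16)in, (u16)key) != fi_table((u16)in, (u16)key))
                bad++;
    return bad;
}
```

The merge works because the intermediate value is packed as (seven<<9)+nine, so the two separate subkey XORs in the original collapse into a single 16-bit XOR between the two lookups. Call fi_init_tables() once before using fi_table().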


The lookup tables that once fitted in the L1 cache are now much larger and will largely reside in the L2 cache. The instruction count reduction has come at the expense of increased memory stalls, but here we are playing to the strengths of a CMT processor. At first sight, it would appear that the performance of the code will remain largely unchanged, having traded a reduced instruction count for increased memory stalls.

Nonetheless, the optimization is beneficial, most notably because MT (multithreading) performance is improved. In the initial implementation, a single strand is capable of consuming almost all of a processor core’s resources, so as additional VT/SMT strands are leveraged they rapidly start to deprive one another of resources, and aggregate core performance improves very little. In contrast, in the optimized version, the strands spend most of their time stalled waiting for accesses to the lookup tables to complete and consume a much smaller fraction of a processor core’s resources. As a result, as the number of strands is increased, performance scales almost linearly. Indeed, for Niagara, per-core Kasumi performance is around 8 times the performance of a single strand, and per-chip Kasumi performance is close to 64X single-strand performance. Moreover, single-core Kasumi performance is around 1.3X the performance of a single core of a 3GHz Xeon processor.
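
The scaling argument can be put into rough numbers. The sketch below uses a deliberately simplified model that is my own assumption, not a measurement: if each strand needs a fraction u of the core's issue slots, then n strands can deliver at most min(n, 1/u) times single-strand throughput. The function name core_speedup is hypothetical.

```c
#include <assert.h>

/* Simplified issue-bandwidth model (an assumption, not measured data):
 * 'strands' threads that each need 'issue_fraction' of the core's issue
 * slots can reach at most min(strands, 1/issue_fraction) times the
 * throughput of a single strand. */
double core_speedup(int strands, double issue_fraction)
{
    double cap = 1.0 / issue_fraction;
    return ((double)strands < cap) ? (double)strands : cap;
}
```

With the two-thirds issue utilization quoted for the original code, core_speedup(8, 2.0 / 3.0) is capped at 1.5, which is why adding strands helps so little. For eight strands to scale linearly under this model, each strand has to need no more than one eighth of the issue slots, which is what the table-driven version approaches by spending most of its time waiting on L2 accesses.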


Monday Jun 30, 2008

Crypto wiki

I've been gradually expanding the crypto wiki (which can be found here); adding additional info and some code examples. Please let me know what additional information would be useful to add, how the wiki could be improved, and even add your own thoughts....



Monday Jun 23, 2008

High-performance SHA-1

In my recent CommunityOne Microparallelism presentation, one of the case studies discusses how to convert high-ILP code for superscalar processors into TLP implementations for CMT processors. The case study is discussed with reference to the SPARC implementation of SHA-1, which I wrote several years ago. The code, tuned for sun4u processors, can actually be found in OpenSolaris here. The message expansion portion of the SHA-1 computation is performed in parallel with the compression function portion using the VIS instructions. The SIMD nature of the VIS instructions is not leveraged, merely the fact that they allow integer operations to be performed on the FP pipelines. As a result, the IPC on an UltraSPARC IV+ processor is increased from around 2 to almost 4, improving performance by over 1.7X.


On CMT processors, such as T2, this approach doesn't deliver optimal performance. However, given the low inter-thread synchronization costs, one can consider performing these two portions of the SHA-1 computation using two threads.
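
One way to picture the split is sketched below, under stated assumptions. It separates SHA-1 into a message expansion stage and a compression stage with no shared state other than the 80-word schedule W. The function names are hypothetical, the code is plain portable C rather than the tuned SPARC implementation, and the hand-off of W between two threads (for example through a small ring buffer) is not shown; both stages are simply called in sequence.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rol(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Stage 1: message expansion. Builds the 80-word schedule W from one
 * 64-byte block. It depends only on the message, never on the hash state. */
void sha1_expand(const uint8_t block[64], uint32_t W[80])
{
    int t;
    for (t = 0; t < 16; t++)
        W[t] = ((uint32_t)block[4 * t] << 24) | ((uint32_t)block[4 * t + 1] << 16) |
               ((uint32_t)block[4 * t + 2] << 8) | (uint32_t)block[4 * t + 3];
    for (t = 16; t < 80; t++)
        W[t] = rol(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1);
}

/* Stage 2: compression. Consumes a finished schedule and updates H. */
void sha1_compress(uint32_t H[5], const uint32_t W[80])
{
    uint32_t a = H[0], b = H[1], c = H[2], d = H[3], e = H[4], f, k, tmp;
    int t;
    for (t = 0; t < 80; t++) {
        if (t < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999u; }
        else if (t < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1u; }
        else if (t < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDCu; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6u; }
        tmp = rol(a, 5) + f + e + k + W[t];
        e = d; d = c; c = rol(b, 30); b = a; a = tmp;
    }
    H[0] += a; H[1] += b; H[2] += c; H[3] += d; H[4] += e;
}

/* Hash a short message (under 56 bytes, so it fits one padded block). */
void sha1_short(const uint8_t *msg, size_t len, uint32_t H[5])
{
    uint8_t block[64] = {0};
    uint32_t W[80];
    uint64_t bits = (uint64_t)len * 8;
    int i;

    memcpy(block, msg, len);
    block[len] = 0x80;                       /* padding marker */
    for (i = 0; i < 8; i++)                  /* big-endian bit length */
        block[63 - i] = (uint8_t)(bits >> (8 * i));

    H[0] = 0x67452301u; H[1] = 0xEFCDAB89u; H[2] = 0x98BADCFEu;
    H[3] = 0x10325476u; H[4] = 0xC3D2E1F0u;

    sha1_expand(block, W);   /* would run on the expansion thread */
    sha1_compress(H, W);     /* would run on the compression thread */
}
```

Because the expansion of one block does not depend on the running hash state, an expansion thread can work on block i+1 while the compression thread is still processing block i. Whether that pays off depends on how cheaply W can be passed between the two strands.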

Wednesday Jun 04, 2008

Crypto performance wiki

I've started a wiki to capture the more pertinent info on UltraSPARC crypto performance in a more organized form.



Thursday May 29, 2008

Using the UltraSPARC hardware cryptographic accelerators


A brief synopsis of how to leverage the UltraSPARC hardware cryptographic accelerators from your application.


Introduction


Sun's UltraSPARC T1, T2 and T2Plus processors support high-performance hardware cryptographic accelerators on chip. These accelerators can substantially reduce the normally significant overheads associated with cryptography and secure operation.

On the UltraSPARC T1, T2 and T2Plus processors, there is one cryptographic accelerator per core, such that an 8-core processor provides 8 accelerators. The algorithms supported by these accelerators vary with processor and are illustrated in the following table:


Algorithm                  UltraSPARC T1    UltraSPARC T2/T2Plus

Public-key algorithms
  RSA, DSA, DH             Yes              Yes
  ECC, ECDSA, ECDH         No               Yes

Symmetric algorithms
  RC4                      No               Yes
  DES, 3DES                No               Yes
  AES-{128,192,256}        No               Yes

Cryptographic hashes
  MD5                      No               Yes
  SHA-1                    No               Yes
  SHA-224/256              No               Yes


The public-key operations are performed by the accelerator's modular arithmetic unit (MAU), while symmetric cipher and cryptographic hash operations are performed by the accelerator's cipher and hash unit (CHU). The UltraSPARC T1 accelerators are composed of just a MAU, while the UltraSPARC T2/T2Plus accelerators have both a MAU and a CHU, which can operate in parallel. The accelerators operate at the core frequency (in parallel with the core) and are capable of delivering cryptographic performance that is typically an order of magnitude better than can be achieved on traditional processors in software, as is illustrated in the following table:


Algorithm      UltraSPARC T1 (1.2GHz)                      UltraSPARC T2/T2Plus (1.4GHz)
RSA-1024       20,000 sign operations/sec/chip (8-core)    37,000 sign operations/sec/chip (8-core)
AES-128-CBC    Not supported                               44Gb/s/chip (8-core)
SHA-1          Not supported                               32Gb/s/chip (8-core)


This article describes how to code your application such that it can leverage these hardware accelerators. Many important applications will already leverage the UltraSPARC hardware accelerators, either directly out-of-the-box or with minimal configuration. These include the Sun Java System Web Server, the Apache webserver, KSSL and IPsec, to name but a few. More details of how to configure these applications are provided in a Sun cryptographic blueprint [1].


Using the UltraSPARC hardware cryptographic accelerators

Access to the cryptographic accelerators is controlled by the Solaris Cryptographic Framework. For non-privileged applications, access is via the userland cryptographic framework (UCF), while for kernel modules (such as KSSL or IPsec) access is via the kernel cryptographic framework (KCF). This article focuses on the userland cryptographic framework.

The Userland Cryptographic Framework exposes a PKCS#11 [2] compliant API to non-privileged userland applications. Applications can interact directly with the UCF via the PKCS#11 interface, or indirectly via:

    • Java Cryptography Extension (JCE)

    • OpenSSL

    • Network Security Services (NSS)

The remainder of this article focuses on how to interact with the UCF directly and indirectly via JCE, OpenSSL and NSS.


Direct interaction with UCF

For PKCS#11 compliant applications, libpkcs11.so is the gateway to the UCF, and it's just a simple matter of linking against this library [located in /usr/lib]. Given the fairly widespread use of the PKCS#11 interface, especially with respect to traditional off-chip cryptographic accelerators (such as Sun's SCA6000 card), many applications already leverage PKCS#11. If an application doesn't already use the PKCS#11 interface, it is pretty straightforward to modify the application, with documents showing example implementations readily available [3].


Offload via OpenSSL

If the application uses OpenSSL for its cryptographic requirements (and many do), access to the accelerators can be achieved by using a version of OpenSSL that has been modified to support the PKCS#11 engine. A patched version of OpenSSL is supplied with Solaris 10 and is located in /usr/sfw/lib, allowing application compilation as follows:


cc -fast -I /usr/sfw/include -L /usr/sfw/lib -lcrypto aes_test.c -o aes_test.out


For operations that are to be offloaded, it is necessary to restrict use to the EVP_ functions and explicitly indicate the use of the PKCS#11 engine; something like the following works for bulk ciphers (the process for RSA is similar):


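ENGINE *e;
EVP_CIPHER_CTX ctx;

/* Load the available engines and look up the PKCS#11 engine */
ENGINE_load_builtin_engines();
e = ENGINE_by_id("pkcs11");

/* Make the engine the default for all cipher operations */
ENGINE_set_default_ciphers(e);

EVP_CIPHER_CTX_init(&ctx);
EVP_EncryptInit(&ctx, EVP_des_cbc(), key, iv);
EVP_EncryptUpdate(.....);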


PKCS#11 engine patches are available from OpenSSL.org for a number of different versions of OpenSSL, if the version of OpenSSL that ships with Solaris isn't suitable [4].


Offload via JCE

For applications that utilize the Java Cryptographic Extensions (JCE), the application should simply be configured to utilize the SunPKCS11-Solaris provider. Accordingly, in order for applications to use the hardware accelerators automatically, it is just necessary to ensure that sun.security.pkcs11.SunPKCS11 is configured as the first provider in the $JAVA_HOME/jre/lib/security/java.security file.


The SunPKCS11-Solaris provider can also be explicitly selected as follows:


String provider = "SunPKCS11-Solaris";

Cipher aescipher = Cipher.getInstance("AES/ECB/NoPadding", provider);


It should be noted that the SunPKCS11-Solaris provider currently only offloads a subset of the chaining modes supported by the hardware, so make sure that the chaining mode and padding mode are supported [5]. The modes supported by the hardware accelerators are illustrated in the following table:


Cipher      Supported chaining modes
AES         ECB, CBC, CTR
DES/3DES    ECB, CBC, CFB64



Offloading via NSS

In order for NSS to use the hardware cryptographic accelerators, the Solaris cryptographic framework should be added as a provider for NSS. This is achieved by modifying the appropriate NSS security databases. As an example, the following illustrates how Firefox can offload RSA operations to the hardware:


/usr/sfw/bin/modutil -dbdir /home/sprack/.mozilla/firefox/r5s548iw.default/ -add "Solaris Crypto Framework" -libfile /usr/lib/libpkcs11.so -mechanisms RSA

/usr/sfw/bin/modutil -dbdir /home/sprack/.mozilla/firefox/r5s548iw.default/ -enable "Solaris Crypto Framework"


The use of the -mechanisms option indicates that the Solaris Cryptographic Framework should be the default provider for RSA operations [6].


Observability

When operations are submitted to the cryptographic framework, the cryptographic framework will, as appropriate, route processing for these operations to the Niagara cryptographic provider (ncp) device driver for public-key operations, and the Niagara-2 cryptographic provider (n2cp) device driver for symmetric cipher and cryptographic hash operations. These device drivers then perform the actual offload to the hardware accelerators and return the results to the framework. The interaction between these drivers and the cryptographic framework is controlled via cryptoadm.

kstat can be used to provide insight into the cryptographic operations that ncp and n2cp are handling, as follows:


kstat -m ncp | less

kstat -m n2cp | less


Additionally, cputrack can be utilized to determine the activity of the hardware accelerators directly (use cputrack -h to determine which counters to track).


Concluding thoughts

Cryptographic processing overheads are finding their way into an ever wider array of applications as security becomes ever more important. By providing on-chip hardware cryptographic accelerators, the UltraSPARC processors can vastly reduce these overheads, and in many situations enable respectable performance even when operating securely.

Via the Cryptographic Framework, Solaris provides a simple way for applications to leverage the benefits of the UltraSPARC hardware accelerators, while continuing to ensure application portability.



References


[1] Using the cryptographic accelerators in the UltraSPARC T1 and T2 processors

[2] PKCS #11: Cryptographic Token Interface Standard

[3] The Solaris cryptographic framework

[4] Miscellaneous OpenSSL Contributions

[5] Sun PKCS#11 Provider's Supported Algorithms

[6] Configuring Solaris Cryptographic Framework and Sun Java System Web Server 7 on Systems With UltraSPARC T1 Processors



Wednesday May 21, 2008

MySQL and UltraSPARC T2 crypto

I've started looking at how to leverage the UltraSPARC T2 hardware cryptographic accelerators to improve MySQL performance and there are a couple of interesting opportunities:


  1. SSL is used to secure communication between a potentially remote MySQL client and the MySQL server. One option is to modify the appropriate SSL libraries to use the T2 hardware accelerators where appropriate, which is pretty straightforward. Another option that I'm currently investigating is trying to use the Solaris Kernel SSL proxy (KSSL). KSSL already uses the UltraSPARC T2 HW crypto accelerators, and so could be a very elegant solution to offloading MySQL SSL processing.

  2. A variety of operations are supported by MySQL to secure database information, such as aes_encrypt() and des_decrypt(). Support for DES and SHA1 is also provided. Again, it is fairly straightforward to modify this code to use the T2 hardware accelerators where appropriate.


More details/results to follow as I continue investigating.

Tuesday May 20, 2008

UltraSPARC T2plus (VF) crypto

Just been playing with crypto on a 2-way UltraSPARC T2plus (Victoria Falls) system. The system, with 16 cores running at 1.2GHz, was running Nevada 89, and my crypto microbenchmarks scaled very nicely. I was able to hit HW peak performance (~6.8GB/s/system with AES-128-CBC), with suitable object sizes and a bunch of threads. More details as time allows.



Wednesday May 14, 2008

Java and hardware cryptographic acceleration

I've just been experimenting with the Java Cryptography Extension (JCE) on the UltraSPARC T2 processor and it is important to remember which algorithms/modes/padding are supported for offload to the cryptographic hardware. While the UltraSPARC T2 processor supports most common chaining modes, offloads from JCE occur via the SunPKCS11-Solaris provider, and the algorithms/modes/padding it supports are somewhat more restrictive; they are listed here. If an unsupported mode is specified, the operation will not be offloaded to the HW, but will be performed in software.


If the SunPKCS11-Solaris provider is explicitly selected:


String provider = "SunPKCS11-Solaris";

Cipher aescipher = Cipher.getInstance("AES/ECB/NoPadding", provider);


then an exception is thrown when an unsupported mode is requested.

Thursday May 08, 2008

T2 HW crypto and Java

As stated in an earlier entry, when running on an UltraSPARC T2 processor, applications using the Java cryptographic extensions (JCE) should (when applicable) automatically leverage the on-chip cryptographic accelerators.

Following a recent conversation with a Java Guru, you should check the following, if you experience problems:

Java on Solaris automatically sets SunPKCS11-Solaris (which calls into
the Solaris Crypto Framework) as the default security provider, so you
need to do nothing.

This begins from some version of J2SE 5.0. You can go look at the
${java.home}/lib/security/java.security file. There should be a line
that looks like:

security.provider.1=sun.security.pkcs11.SunPKCS11 ${java.home}/lib/security/sunpkcs11-solaris.cfg




Interesting article on using AES from Java can be found here

Wednesday Apr 30, 2008

T2 HW crypto and SPECweb2005

I typically witter on about crypto performance at the microbenchmark level, but I was recently browsing the SPECweb05 results and I was impressed to see how the T2 performs, especially on the Banking workload, which is 100% HTTPS:


Processor                                          SPECweb2005_Banking
1 x T2 [1.4GHz]                                    70,000
2 x Quad-core Opteron Processor (2356) [2.3GHz]    50,856
2 x Quad-core Xeon Processor X5460 [3.2GHz]        51,840
4 x Quad-core Xeon Processor X7350 [3.0GHz]        71,104


Intel 2-chip http://www.spec.org/web2005/results/res2008q1/web2005-20080225-00104.txt
Intel 4-chip http://www.spec.org/web2005/results/res2007q4/web2005-20071203-00101.html

Opteron http://www.spec.org/web2005/results/res2008q2/web2005-20080409-00107.txt
T2 http://www.spec.org/web2005/results/res2008q2/web2005-20080408-00105.txt


Pretty Impressive! So a single-socket UltraSPARC T2 processor provides equivalent performance to 4-socket x64 systems containing Quad-core processors! On a per socket basis, T2 outperforms the competition by over 2.7X!
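
The per-socket comparison is simple arithmetic on the published scores in the table above. As a small worked sketch (the function names per_socket and t2_advantage are mine, purely for illustration):

```c
#include <assert.h>

/* SPECweb2005_Banking score divided by the number of processor sockets. */
double per_socket(double score, int sockets)
{
    return score / sockets;
}

/* Ratio of the single-socket T2 score (70,000) to a rival's per-socket score. */
double t2_advantage(double rival_score, int rival_sockets)
{
    return per_socket(70000.0, 1) / per_socket(rival_score, rival_sockets);
}
```

Using the scores from the table, t2_advantage(51840.0, 2) comes out at about 2.70 for the 2-socket Xeon X5460 system, t2_advantage(50856.0, 2) at about 2.75 for the 2-socket Opteron system, and t2_advantage(71104.0, 4) at about 3.94 for the 4-socket Xeon X7350 system. The "over 2.7X" figure corresponds to the smallest of these.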


Now, this performance leadership is not all down to the HW crypto support; I'm sure the on-chip NICs and the abundance of threads help somewhat too. However, the cryptographic overheads associated with HTTPS are pretty significant: RSA ops for session establishment, and then RC4 and MD5 operations (these are the algorithms used for SPECweb2005 anyway) to secure and authenticate the subsequent traffic. In fact, looking at the following figures:



Figure 1: Relative costs in an HTTPS transaction for different file sizes. Referenced from here






Figure 2: Typical breakdown of overheads for SPECweb2005 banking





it is apparent that a significant proportion of the total application-level overheads are associated with cryptographic processing. It's therefore not surprising that providing HW support to accelerate cryptographic processing provides a significant performance advantage to the UltraSPARC T2 processor on SPECweb05 banking.


It's nice to see that the good microbenchmark numbers actually translate into significant gains at an application level.

Monday Apr 14, 2008

Using Solaris softtoken keystore

The other day I was looking for a C code example illustrating how to leverage the softtoken keystore when directly interacting with the Solaris crypto framework. There's substantial documentation available but I couldn't find a basic example. So here's what I concocted:


1. Configure my softtoken keystore via the command line:


	pktool setpin keystore=pkcs11 
	pktool genkey label=test_key keytype=aes keylen=128 
	pktool list objtype=key 


where the first operation updates the passphrase required to access the keystore. If the keystore doesn't exist, the keystore is first created. The default passphrase is “changeme”. The second operation creates a 128-bit AES key and installs it in the keystore – the label associated with the key is “test_key”. The third operation displays the contents of the keystore, so it is possible to confirm that the key has been created correctly.


2. Use the AES key in the keystore from an application that is directly using the Solaris crypto framework:


#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <security/cryptoki.h>
#include <security/pkcs11.h>

int
main(void)
{
  CK_RV rv;
  CK_ULONG found_keys = 0;
  CK_MECHANISM mechanism;
  CK_OBJECT_HANDLE hKey, key_list[1];
  CK_SESSION_HANDLE hSession;
  CK_UTF8CHAR label[] = {"test_key"};

  /* IV for AES-CBC; zeroed here purely for illustration */
  unsigned char ivec[16] = {0};
  unsigned char userPIN[] = {"mykeystore"};

  mechanism.mechanism = CKM_AES_CBC;
  mechanism.pParameter = ivec;
  mechanism.ulParameterLen = sizeof (ivec);

  rv = C_Initialize(NULL);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_Initialize: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Use metaslot i.e. slot 0 */
  rv = C_OpenSession(0, CKF_SERIAL_SESSION | CKF_RW_SESSION,
                     NULL, NULL, &hSession);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_OpenSession: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Log in using the correct passphrase (length excludes the trailing NUL) */
  rv = C_Login(hSession, CKU_USER, userPIN, sizeof (userPIN) - 1);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_Login: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  /* Get the key object, where label is the label of the
   * key we wish to use
   */
  CK_ATTRIBUTE template[] = {
      {CKA_LABEL, label, sizeof (label) - 1}
  };

  rv = C_FindObjectsInit(hSession, template,
                         sizeof (template) / sizeof (CK_ATTRIBUTE));
  if (rv != CKR_OK) {
      fprintf(stderr, "C_FindObjectsInit: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  rv = C_FindObjects(hSession, key_list, 1, &found_keys);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_FindObjects: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  if (found_keys != 1) {
      fprintf(stderr, "C_FindObjects found %lu objects\n",
              (unsigned long)found_keys);
      exit(1);
  }

  /* Only read the handle once the search is known to have succeeded */
  hKey = key_list[0];

  /* End the search before starting another operation in the session */
  (void) C_FindObjectsFinal(hSession);

  /* Initialize the encryption operation in the session */
  rv = C_EncryptInit(hSession, &mechanism, hKey);
  if (rv != CKR_OK) {
      fprintf(stderr, "C_EncryptInit: rv = 0x%.8lX\n", (unsigned long)rv);
      exit(1);
  }

  .
  .
  .


In the above example, it is assumed that the passphrase is set to “mykeystore”.


The keys stored in the Solaris softtoken keystore are encrypted and they are also checked for integrity. The PBKDF2 function defined in PKCS#5 is used for generating the keys from the passphrase.


About

Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), that is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.
