Thursday Sep 13, 2007

Simple example of using RSA acceleration from OpenSSL

In the OpenSSL demos/sign subdirectory there is a simple demo program (sign.c) that signs and verifies a short message using RSA.

The modifications required to offload the RSA operations to the accelerator are fairly simple. At the start of main, the following instructs OpenSSL to use the PKCS11 engine:

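  ENGINE *e;

  /* requires #include <openssl/engine.h> */
  ENGINE_load_builtin_engines();
  e = ENGINE_by_id("pkcs11");      /* look up the PKCS#11 engine */
  if(!e) exit(1);                  /* engine not available */
  ENGINE_set_default_RSA(e);       /* route RSA operations through it */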

[For reference, the modified application can be found here]

It's also necessary to use the version of OpenSSL that ships with Solaris:

cc -fast -I /usr/sfw/include -L /usr/sfw/lib -lcrypto sign.c -o sign.out

You can check that the HW accelerators were used via kstat:

kstat -m ncp | grep rsa

If you check the counters before running the test:

 kstat -m ncp | grep rsa
	rsaprivate                      33003
	rsapublic                       5

and after running the test:

  kstat -m ncp | grep rsa
	rsaprivate                      33004
	rsapublic                       6

it is apparent that both the sign and verify operations were offloaded to the HW accelerators.

Basically, as long as you are using the EVP_ functions, rather than using the low-level OpenSSL functions directly, it is a simple matter to modify an application to use the accelerators.

Wednesday Sep 12, 2007

UltraSPARC T2 crypto provider device drivers

The UltraSPARC T2 hardware crypto features are exposed via 3 drivers in Solaris:

1) ncp - similar to the UltraSPARC T1; handles RSA, DSA, DH and ECC (more details can be found here)

2) n2cp - handles bulk ciphers and hashes [more details can be found here (the supported modes of operation are also detailed)]

3) n2rng - access to the HW random number generator (more details can be found here)

Tuesday Sep 11, 2007

Detailed T2 crypto info

Very detailed info on the UltraSPARC T2 cryptographic accelerators can be found here on the OpenSPARC website (the pertinent info can be found in chapter 21 of the doc).

Sunday Sep 02, 2007

Power5 crypto performance comparison

In a previous post, just to illustrate how traditional processors compare to the UltraSPARC T2, I posted OpenSSL results for Clovertown and Opteron processors. To broaden the comparison, here's the same data for the Power5 processor:

On a 1.9GHz P5, each core seems capable of:
 RSA-1024 (sign)   : 275 Ops/sec
 AES-128-cbc (8KB) : 80 MB/sec

I would expect the 4.6GHz Power6 to do better, but even if one (generously) assumes linear scaling, it's still a drop in the ocean compared to the per-core performance of the UltraSPARC T2:

RSA-1024 (sign)   :  4.6K Ops/sec
AES-128-cbc (8KB) :  640MB/sec
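
To make the "generous" linear-scaling assumption concrete, here is a minimal C sketch that scales the measured 1.9GHz Power5 per-core figures up to 4.6GHz and compares the result with the T2 per-core figures quoted above. The function names are my own, and linear scaling with clock frequency is only the optimistic assumption from the text, not a measurement.

```c
#include <assert.h>

/* Scale a measured rate linearly with clock frequency.
 * This is the optimistic assumption from the text, not a prediction. */
double scale_linear(double measured, double from_ghz, double to_ghz)
{
    assert(from_ghz > 0.0);
    return measured * (to_ghz / from_ghz);
}

/* Ratio of the accelerator figure to the (scaled) software figure. */
double advantage(double accel, double scaled)
{
    assert(scaled > 0.0);
    return accel / scaled;
}
```

With the numbers above, scale_linear(275, 1.9, 4.6) gives roughly 666 Ops/sec for RSA-1024 sign and scale_linear(80, 1.9, 4.6) gives roughly 194 MB/sec for AES-128-cbc, so the T2 per-core figures stay about 7x and 3x higher respectively.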

Monday Aug 27, 2007

VictoriaFalls & Crypto

Interesting to see the recent presentation on the UltraSPARC VictoriaFalls (VF) CMP at this year's Hot Chips. Unlike UltraSPARC T2, VF supports multichip SMP systems. The presentation itself can be found here.

The VF cores are leveraged from the UltraSPARC T2 processor and so will also support on-chip per-core cryptographic accelerators.

Thursday Aug 23, 2007

Using the UltraSPARC T2 crypto accelerators

Ease of use is central to ensuring widespread use of the UltraSPARC T2 cryptographic accelerators. With Solaris, we have tried to make the process of accessing the accelerators as seamless as possible.

Access to the UltraSPARC T2 accelerators from userland is controlled by the Solaris Cryptographic Framework (SCF) and there are a variety of simple routes via which a userland application can offload to the accelerators:

Direct offload: the SCF uses the PKCS#11 Cryptographic Token Interface (Cryptoki). To communicate directly with the SCF, an application should use the PKCS#11 API. For PKCS#11 compliant applications, it's then just a simple matter of linking with libpkcs11.so (located in /usr/lib). Given the fairly widespread use of the PKCS#11 interface, especially with respect to traditional off-chip crypto accelerators, many applications already use PKCS#11. If an application doesn't already use the PKCS#11 interface, it is pretty straightforward to modify the application. A number of good docs about the SCF and developing simple PKCS#11 compliant applications can be found here and here.

OpenSSL offload: if the application uses OpenSSL (and many do), access to the accelerators can be achieved by linking with the OpenSSL libraries supplied with Solaris 10 (which have the PKCS#11 engine built in). These are located in /usr/sfw/lib:

cc -fast -I /usr/sfw/include -L /usr/sfw/lib -lcrypto aes_test.c -o aes_test.out

Additionally, it is necessary to force the use of the pkcs11 engine; this procedure is documented here. Something akin to the following does the trick:

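ENGINE *e;
ENGINE_load_builtin_engines();
e = ENGINE_by_id("pkcs11");            /* look up the PKCS#11 engine */
if (!e) exit(1);                       /* engine not available */
ENGINE_set_default_ciphers(e);         /* route symmetric ciphers through it */
EVP_CIPHER_CTX_init (&ctx);            /* ctx is an EVP_CIPHER_CTX declared elsewhere */
EVP_EncryptInit (&ctx, EVP_des_cbc (), key, iv);
EVP_EncryptUpdate (.....);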

Java offload: for applications that use the Java Cryptographic Extensions (JCE), the application should simply be configured to use the SunPKCS11-Solaris provider in order to use the hardware accelerators on the T2 processor. Good Java security tips can be found here.

It's also possible to access the accelerators via NSS, as described here.

This isn't a definitive guide to accessing the accelerators. I plan to have more details available going forward.



Monday Aug 20, 2007

T2 Crypto -- accelerator details

The first UltraSPARC processor with on-chip cryptographic accelerators was the UltraSPARC T1 processor; each of the processor's eight cores has an associated crypto accelerator that is targeted at offloading/accelerating public-key cryptography. Basically, this accelerator, termed the modular arithmetic unit (MAU), performs modular exponentiation operations that lie at the heart of algorithms such as RSA and Diffie-Hellman.

With the UltraSPARC T2 processor, each core's crypto accelerator retains its MAU, but is also enhanced by the introduction of a cipher/hash unit that is focussed on accelerating symmetric ciphers (AES etc) and secure hashes (SHA etc).

On the T2, the two sub-units that constitute the accelerator can operate in parallel, such that each core's accelerator can be performing an RSA operation and an AES operation in parallel.

Communication with the cipher/hash unit is via a memory-based control word queue. To offload an operation to the accelerator, it is necessary to generate a control word that provides the accelerator with the information required to perform the operation, e.g. pointers to src, dst, keys, IVs. As a result, the accelerator is essentially stateless, which is extremely important in application spaces where there can be literally thousands of simultaneous connections (e.g. Secure Web, Secure VoIP). Additionally, given this lightweight interface, the overheads associated with offloading an operation to the accelerator can be very small, allowing even short-duration operations to be cost-effectively offloaded.
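
As a rough illustration of the idea (not the actual T2 control word format or queue layout, which is documented in the OpenSPARC material), a memory-based control word queue can be sketched in C as a small ring buffer. Every name here, the field list, and the queue depth are hypothetical. The point is only that each entry carries everything needed for the operation, so the consumer keeps no per-connection state.

```c
#include <assert.h>
#include <stddef.h>

#define CWQ_SLOTS 8   /* hypothetical queue depth */

/* Hypothetical control word: everything the unit needs travels with the request. */
struct control_word {
    int         op;    /* hypothetical opcode, e.g. cipher or hash selector */
    const void *src;   /* input buffer */
    void       *dst;   /* output buffer */
    const void *key;   /* key material */
    const void *iv;    /* initialisation vector */
    size_t      len;   /* bytes to process */
};

struct cw_queue {
    struct control_word slots[CWQ_SLOTS];
    unsigned head;     /* next slot the consumer reads */
    unsigned tail;     /* next slot the producer writes */
};

void cwq_init(struct cw_queue *q)
{
    assert(q != NULL);
    q->head = q->tail = 0;
}

/* Returns 0 on success, -1 if the queue is full.
 * One slot is kept empty to distinguish full from empty. */
int cwq_enqueue(struct cw_queue *q, const struct control_word *cw)
{
    unsigned next;
    assert(q != NULL && cw != NULL);
    next = (q->tail + 1) % CWQ_SLOTS;
    if (next == q->head)
        return -1;
    q->slots[q->tail] = *cw;
    q->tail = next;
    return 0;
}

/* Returns 0 on success, -1 if the queue is empty. */
int cwq_dequeue(struct cw_queue *q, struct control_word *out)
{
    assert(q != NULL && out != NULL);
    if (q->head == q->tail)
        return -1;
    *out = q->slots[q->head];
    q->head = (q->head + 1) % CWQ_SLOTS;
    return 0;
}
```

In this sketch the producer (software) only writes a self-describing entry and advances the tail, which is why the cost of submitting a request can stay small regardless of how many connections are in flight.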

It is possible to interact with the accelerator in a synchronous or asynchronous manner, such that, if desired, it is possible to perform other useful processing on the core while the crypto operation is being performed in parallel on the accelerator; this provides an additional level of parallelism that is not achieved when ISA customization is used to achieve crypto acceleration.

Tuesday Aug 14, 2007

The move to ECC

Given the T2's support for ECC, it was interesting to see the following article, reiterating the need for a timely move to ECC:



Monday Aug 13, 2007

Crypto performance

Some brief experimentation with crypto performance on more traditional processors:

These numbers were obtained running the OpenSSL (0.9.8e) speed test, first using a single core, then using all of the cores.

Clovertown (Dell PowerEdge 2900) (2.66GHz)


Single thread
=============
rsa-1024 (sign)           1200 Ops/sec
aes-128-cbc               135MB/s

Maximum Throughput for single-processor (4-cores)
=================================================

rsa-1024 (sign)           4750 Ops/sec
aes-128-cbc               525MB/s


[Many thanks to Chi-Chang for collecting this data]

Now, it is probably possible to improve on these numbers. Certainly, OpenSSL may not contain the optimal RSA or AES implementation for Clovertown. However, it does give a ballpark estimate for the crypto performance that can be delivered by traditional processors, even those with stellar single-thread performance.

Further, given this crypto processing is performed in SW, in order to achieve this level of performance, each and every cycle on each and every core is consumed performing the crypto processing, leaving no idle cycles to do anything meaningful with the data being generated.....
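
One way to see that all four cores are fully consumed is to compare the aggregate throughput with four times the single-thread figure. A small C sketch of that check follows; the function name is my own, and the inputs are the Clovertown numbers listed above.

```c
#include <assert.h>

/* Fraction of ideal linear scaling achieved when going from one core to
 * `cores` cores: 1.0 means every core contributes a full single-thread share. */
double scaling_efficiency(double aggregate, double single, int cores)
{
    assert(single > 0.0 && cores > 0);
    return aggregate / (single * cores);
}
```

For the figures above this gives roughly 0.99 for rsa-1024 sign (4750 against 4 x 1200) and 0.97 for aes-128-cbc (525 against 4 x 135), i.e. the aggregate numbers are what you get when every core is busy doing crypto.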

If we contrast this with the peak performance that can be delivered by the T2 crypto accelerators (see earlier post), it is apparent that there is a significant upside to supporting HW crypto accelerators, if you are interested in secure application performance. Further, when using the T2 accelerators, because the crypto processing is offloaded to the HW, the cores still have idle cycles, which can be used to process the data that is being produced!




Friday Aug 10, 2007

UltraSPARC T2 Crypto performance

Detailed breakdown of the peak performance we can expect from the UltraSPARC T2 cryptographic accelerators.


Bulk Cipher

Algorithm    Bytes/cycle/SPU    Gb/s/chip @ 1.4GHz
RC4          1                  83
DES          1                  83
3DES         0.33               27
AES-128      0.53               44
AES-192      0.44               36
AES-256      0.38               31

[N.B. common modes of operation supported]



Secure Hash

Algorithm    Bytes/cycle/SPU    Gb/s/chip @ 1.4GHz
MD5          0.5                41
SHA-1        0.4                32
SHA-256      0.5                41


Public-key

Algorithm    Ops/s/SPU    Ops/s/chip @ 1.4GHz
RSA-1024     4625         37000
RSA-2048     800          6400
ECCp-160     6540         52300
ECCb-163     11500        92100
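
For readers who want to relate the per-SPU and per-chip columns, here is a hedged C sketch of the conversion. It assumes eight SPUs per chip (one per core) and, for the bulk figures, that 1 Gb means 2^30 bits. The second assumption is my own reading of the numbers rather than something stated in the tables, and the per-SPU public-key figures are treated as operations per second. Because the per-SPU inputs are rounded, the results match the tables only approximately (SHA-1 comes out nearer 33 than 32).

```c
#include <assert.h>

#define T2_SPUS 8        /* assumed: one accelerator (SPU) per core, eight cores */
#define T2_HZ   1.4e9    /* clock frequency used in the tables */

/* Chip-level Gb/s from bytes/cycle/SPU, assuming 1 Gb = 2^30 bits. */
double chip_gbits_per_sec(double bytes_per_cycle_per_spu)
{
    assert(bytes_per_cycle_per_spu >= 0.0);
    return bytes_per_cycle_per_spu * 8.0 * T2_HZ * T2_SPUS / 1073741824.0;
}

/* Chip-level public-key rate from the per-SPU rate (operations per second). */
double chip_ops_per_sec(double ops_per_sec_per_spu)
{
    assert(ops_per_sec_per_spu >= 0.0);
    return ops_per_sec_per_spu * T2_SPUS;
}
```

For example, chip_gbits_per_sec(1.0) comes out a little above 83, in line with the RC4 and DES rows, and chip_ops_per_sec(4625) is 37000, in line with the RSA-1024 row.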


Wednesday Aug 08, 2007

Zero-cost security?

In today's environment, security is becoming ever more essential, whether we are talking about web servers, databases, file systems or networking. However, the high cost associated with security is problematic; if I have a system that is capable of performing X operations per second when running in a non-secure mode, when I flip that metaphorical switch and go secure, the throughput of operations that the system can sustain will fall drastically. 2X slowdowns are commonplace and 5X, or even 10X, slowdowns are not that uncommon.

As a result of this high cost, there is often significant reluctance to develop and deploy the comprehensive security strategies that are required in today's world, leading to the serious consequences that we read about all too frequently.

So what is typically done to remedy this situation?

If you look at the security overheads, the vast majority of the overhead is frequently attributable to the cryptographic operations that underpin the security protocols. However, general purpose processors are ill suited to performing cryptographic operations. As a result, we often try to offload the cryptographic processing to custom hardware that can perform the operations orders of magnitude faster than can be achieved on the processor.

Accordingly, accelerators should allow us to convert the significant security overheads into virtually negligible overheads. Essentially, accelerators should allow us to achieve zero cost security! (by which I mean that there should be a negligible performance impact associated with going secure).

Unfortunately, accelerators have largely failed to deliver on this.

This is basically a result of the way we have architected and deployed accelerators; we have a system, and then, almost as an afterthought, we add in the PCI-based accelerator card. With this architecture, the cost of offloading an operation to the accelerator can be very high, significantly limiting the type of cryptographic operation that can be cost-effectively offloaded; it's frequently more cost-effective to just perform the processing on the processor!

With the UltraSPARC T2 processor, we have moved the crypto accelerators on-chip and tightly coupled them with the processor cores. As a result, it has been possible to radically reduce the overheads associated with offloading an operation to the accelerators. In turn, this allows the T2 accelerators to cost-effectively handle a much broader range of cryptographic operations than traditional off-chip accelerators and enables the UltraSPARC T2 processor to deliver zero-cost security in a wide variety of application spaces.
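
The trade-off described above boils down to a simple inequality: offloading only pays off when the fixed cost of handing the request over, plus the time the accelerator takes, is less than the time the core would need in software. Here is a minimal C sketch of that rule. The function names are my own and any timings you feed it are hypothetical, since no offload latencies are given here.

```c
#include <assert.h>

/* Returns 1 if offloading is faster than doing the work in software.
 * All three arguments are times in the same unit (e.g. microseconds). */
int offload_pays_off(double sw_time, double offload_overhead, double accel_time)
{
    assert(sw_time >= 0.0 && offload_overhead >= 0.0 && accel_time >= 0.0);
    return offload_overhead + accel_time < sw_time;
}

/* Effective speedup seen by the application once overhead is included. */
double effective_speedup(double sw_time, double offload_overhead, double accel_time)
{
    assert(offload_overhead + accel_time > 0.0);
    return sw_time / (offload_overhead + accel_time);
}
```

With a made-up short operation that takes 10 units in software and 1 unit on the accelerator, an overhead of 50 units makes offload a loss, while an overhead of 0.5 units makes it a clear win. Shrinking the overhead term is what widens the range of operations worth offloading.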

Tuesday Aug 07, 2007

UltraSPARC T2 processor -- world class crypto performance

With today's launch of the UltraSPARC T2 processor, here's the link to my recent podcast on the world-class cryptographic performance that is delivered by the UltraSPARC T2 processor, thanks to its on-chip, tightly coupled hardware accelerators:


http://frsun.downloads.edgesuite.net/sun/08A01108/08A01108_01.mp3


About

Dr. Spracklen is a senior staff engineer in the Architecture Technology Group (Sun Microelectronics), that is focused on architecting and modeling next-generation SPARC processors. His current focus is hardware accelerators.
