Improving RC4 encrypt performance.

RC4 is one of the best-known and commonly used ciphers. It is a very small and simple algorithm which can be executed very fast. The RC4 encrypt routine (as was in OpenSolaris) is the following:

Inputs: in (input stream), out: (output stream), key: (RC4 key), len: length of the streams.

Version 1
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
arr [i] = arr[j];
arr[j] = ti;
out[ii] = in[ii] \^ arr[(arr[i] + arr[j]) & 0xff]; /\* Line 1 \*/
}

We went through an interesting exercise trying to optimize this loop. We analyzed the performance of this loop using er_kernel/ analyzer tools available in Sun Studio 12. Here is an output from analyzer which shows a snapshot from analyzer depicting the stalls (marked in green) from Line 1. Two stalls are observed during load operations, one before the arr[i] + arr[j] add in Line 1, and the other before the XOR operation in Line 1.

Looking through RC4 encrypt code available in OpenSSL (rc4_enc.c), we observed that they had the steps a little differently. Adopting that into OpenSolaris, the code would be:

Version 2
i = key->i;
j = key->j;
arr = key->arr;
for (ii = 0; ii < len; ++ii) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr [i] = tj;
out[ii] = in[ii] \^ arr[(ti + tj) & 0xff]; /\* Line 1 \*/
}

This version stores the contents of arr[i] and arr[j] in the initial loads, and re-uses them in Line 1. The compiler was not able to recognize and do the same automatically for us because arr is a pointer reference. Analyzing this code with er_kernel/ analyzer, we get the following output:

Now the stalls have moved to the XOR step. The XOR depends on arr[(ti + tj) & 0xff] to be computed, before the same may be executed. This leads to the stall.

We can re-arrange the loop in the following way in order to avoid the stall. This code is now available at OpenSolaris.org and in the latest version of Solaris Express Community Edition (Nevada) available .

Version 3
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;
arr_ij = arr[(ti + tj) & 0xff];
--len;

for (ii = 0; ii < len;) {
++i;
ti = arr[i];
j = j + ti;
tj = arr[j];
arr[j] = ti;
arr[i] = tj;

/\* save result from previous loop: \*/
out[ii] = in[ii] \^ arr_ij;

++ii;
arr_ij = arr[(ti + tj) & 0xff];
}
/\* save result from last loop: \*/
out[ii] = in[ii] \^ arr_ij;

This code spaces out the computation of arr_ij and the XOR operations, thereby avoiding the stall and allowing for more pipelining. Analyzing this code in er_kernel, we can observe that the stalls for the XORs are absent. There are still a couple of stalls, which are hard to avoid because of the dependency between the steps of RC4 encrypt. Here is the output.

The following graph shows the improvement that we get for RC4 encrypt from Version 1 to Version 3 on a Sun Fire X4150 server. It is interesting to observe that a 15% performance gain can be achieved by making the code more efficient to avoid stalls.

Comments:

Links to the snapshots are broken.

What is the effect of this change on the performance on sparc?

Posted by marc glisse on March 24, 2009 at 06:41 AM PDT #

There were errors from the permalink. Links should now be available.

It would not make a difference on CMT, since CMT has a crypto hardware accelerator. We are yet to measure the impact on other sparc systems, but we expect it to be of the same order.

Posted by Amitabha Banerjee on March 24, 2009 at 07:04 AM PDT #

Excellent work! Thanks for taking time to explain the code optimization techniques you used.

Posted by neel on March 25, 2009 at 12:02 PM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed
About

This blog discusses my work as a performance engineer at Sun Microsystems. It touches upon key topics of performance issues in operating systems and the Solaris Networking stack.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today