Optimizing Byte Swapping for Fun and Profit

Performance Improvements in OpenSolaris 2008.11

I recently pushed source changes to OpenSolaris build NV99 to improve byte swapping performance. These changes will be a part of OpenSolaris 2008.11.

To review, byte swapping reverses the order of bytes in a integer (whether 2-, 4-, or 8-byte lengths). This is necessary as x86 processors store the low order byte of an integer first ("little endian"), and SPARC processors store the high-order byte first ("big endian"). Furthermore, data on the network is usually transmitted in big endian order and needs to be translated on little endian machines. As to which "byte endianness" (aka "byte sex") is better, this has been a subject of heated "religious" debates (see Cary and Cohen in references).

Byte 1
Byte 2
Byte 3
Byte 4
Byte 5
Byte 6
Byte 7
Byte 8
Byte 8
Byte 7
Byte 6
Byte 5
Byte 4
Byte 3
Byte 2
Byte 1

Byte swapping
Bite swapping with Holly
Bite swapping
Don't confuse byte swapping with bite swapping

Source Code Changes

My changes are as to:

  • Add htonll(3SOCKET) and ntohll(3SOCKET) functions. These functions are identical and do nothing on SPARC systems. On x86 systems, they swap the byte order They are also implemented on Linux, IBM AIX, and \*BSD systems.
  • Optimize the ntohll()/htonll() functions with inline x86 assembly. Since byte swapping is used frequently in data transmission, the x86 instruction set has a BSWAP instruction to quickly swap the data. This optimization is to use this instruction inline in C programs for 64-bit byte swaps. For example:
    / amd64.il
    / uint64_t htonll(uint64_t hostlonglong);
    .inline htonll,4
    movq    %rdi, %rax      / copy parameter 1 to return value
    bswapq  %rax		/ byte swap return value
  • Add external assembly ntohll()/htonll() functions. This is for sloppy C source that didn't include header file byteswap.h or netinet/in.h (they lose inline performance, but still get gains from using assembly).
  • Do the same for 32-bit instruction set object as for 64-bit object
  • Do the same for the GNU compiler and assembler as the Sun C compiler and assembler
  • Optimize the /usr/include/sys/byteorder.h header file byte swap macros BSWAP_32, BSWAP_64, and related macros BE_\*, BE_IN\*, and BE_OUT\*.

Performance Benefits

What were the performance benefits? Refer to the chart below. On the upper left, you can see performance improvement with BSWAP_32 and BSWAP_64 on X86-64 bit class systems. The most dramatic was for BSWAP_32 running 32-bit object, but every category showed improvement.

byte swapping on Solaris performance chart
Legend: "AMD64 32b" is 32-bit binary running on AMD64. "EM64T 32b" is 32-bit binary running on Intel EM64T. "AMD64 64b" and "EM64T 64b" similarly are 64-bit binaries running on AMD64 and EM64T, respectively. Time is in microseconds using a microbenchmark (100 million function calls in a loop).

Next, refer to the bottom half of the chart. This shows X86-64 performance improvements with various byte swapping macros. This is from substituting inline assembly for the LE_\* and BE_\* byte swapping macros (LE for Little Endian and BE for Big Endian). Performance for the LE_IN32 macros were marginal or negative, so I left them unchanged (that is, they remain implemented as C << and >> shift operations). However, improvements for the LE_\*64 and the BE_\*64 macros showed consistent improvement and these are now implemented in inline assembly.

Even SPARC optimization was possible (see the top right chart). This was done by rewriting the BSWAP32 and BSWAP64 macros—not in assembly, but more-efficient C. Consider this BSWAP_64 macro definition:

#define BSWAP_64(x)     ((BSWAP_32(x) << 32) | BSWAP_32((x) >> 32))
The BSWAP_64 definition looks straightforward and innocent, but it's very inefficient. BSWAP_64 expands to BSWAP_32, then BSWAP_16, then BSWAP_8. The full macro expansion turns out to be this frightening monster, suitable only for Halloween \* :
#define	BSWAP_64(x)     ((((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) |
 (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) << 32) |
 ((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) |
 (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) >> 32)
I rewrote it to this a less elegant, but faster implementation:
#define	BSWAP_64(x)     (((uint64_t)(x) << 56) | \\
                        (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \\
                        (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \\
                        (((uint64_t)(x) << 8)  & 0xff00000000ULL) | \\
                        (((uint64_t)(x) >> 8)  & 0xff000000ULL) | \\
                        (((uint64_t)(x) >> 24) & 0xff0000ULL) | \\
                        (((uint64_t)(x) >> 40) & 0xff00ULL) | \\
                        ((uint64_t)(x)  >> 56))


<script type="text/javascript" src="http://platform.twitter.com/widgets.js"></script>
<script src="http://connect.facebook.net/en_US/all.js#xfbml=1"></script>

Will these improvements be used by ZFS for its adaptive endianness code?

Posted by Derek Morr on October 31, 2008 at 08:16 AM PDT #

It should in theory. In practice, I had difficulty observing the performance improvement. I used zons created on x86 and sparc, and exported/imported copied zfs pool images between x86/sparc. I saw performance improvement of 4-5% real time in reading filesystem metadata (find . -exec ls) on amd64, but none on Intel em64t. Little or no difference on sparc. This was images copied to swapfs (/tmp). Maybe it's not a big factor or I measured wrong. Please see details at this thread:


Posted by Daniel Anderson on October 31, 2008 at 09:07 AM PDT #

Post a Comment:
  • HTML Syntax: NOT allowed

Solaris cryptography and optimization.


« July 2015