
News, tips, partners, and perspectives for the Oracle Solaris operating system

Optimizing Byte Swapping for Fun and Profit

Performance Improvements in OpenSolaris 2008.11

I recently pushed source changes to OpenSolaris build NV99 to improve byte swapping performance.
These changes will be a part of OpenSolaris 2008.11.

To review, byte swapping reverses the order of bytes in an integer (whether 2-, 4-, or 8-byte lengths).
This is necessary as x86 processors store the low order byte of an integer first ("little endian"), and SPARC processors store the high-order byte first ("big endian").
Furthermore, data on the network is usually transmitted in big endian order and needs to be translated on little endian machines.
Which "byte endianness" (aka "byte sex") is better has long been a subject of heated "religious" debates
(see Cary and Cohen in the references).
















[Figure: byte swapping. An 8-byte integer's Byte 1 through Byte 8 are reversed to the order Byte 8 through Byte 1.]

[Photo: Bite swapping with Holly]

Don't confuse byte swapping with bite swapping.



Source Code Changes

My changes were to:


  • Add htonll(3SOCKET) and ntohll(3SOCKET) functions.
    These two functions are identical to each other: on big-endian SPARC systems they do nothing,
    while on x86 systems they swap the byte order.
    They are also implemented on Linux, IBM AIX, and *BSD systems.
  • Optimize the ntohll()/htonll() functions with inline x86 assembly.
    Since byte swapping is used frequently in data transmission, the x86 instruction set has a BSWAP instruction to quickly swap the bytes.
    This optimization uses that instruction inline in C programs for 64-bit byte swaps.
    For example:




    / amd64.il
    / uint64_t htonll(uint64_t hostlonglong);
    .inline htonll,4
            movq    %rdi, %rax      / copy parameter 1 to return value
            bswapq  %rax            / byte swap return value
    .end



  • Add external assembly ntohll()/htonll() functions.
    This is for sloppy C source that doesn't include the header file byteswap.h or netinet/in.h
    (such code loses the inline performance, but still gains from using assembly).

  • Do the same for 32-bit instruction set objects as for 64-bit objects.

  • Do the same for the GNU compiler and assembler as for the Sun C compiler and assembler.

  • Optimize the byte swap macros BSWAP_32 and BSWAP_64 in the /usr/include/sys/byteorder.h header file, along with the related BE_*, BE_IN*, and BE_OUT* macros.
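For the GNU toolchain, the same effect can be sketched with gcc's extended inline assembly. This is my illustration, assuming an x86-64 target; the function name my_htonll is hypothetical, and modern gcc also provides __builtin_bswap64(), which emits the same BSWAP instruction.

```c
#include <stdint.h>

/*
 * Sketch of a 64-bit host-to-network byte swap for gcc on x86-64,
 * analogous to the Sun amd64.il inline shown above.  The "+r"
 * constraint asks gcc to keep the value in a register and lets
 * BSWAP reverse its bytes in place.
 */
static inline uint64_t
my_htonll(uint64_t hostlonglong)
{
	uint64_t v = hostlonglong;

	__asm__("bswapq %0" : "+r" (v));	/* reverse all 8 bytes */
	return (v);
}
```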

Performance Benefits

What were the performance benefits?
Refer to the chart below.
On the upper left, you can see the performance improvement with BSWAP_32 and BSWAP_64 on 64-bit x86-class systems.
The most dramatic gain was for BSWAP_32 running as a 32-bit object, but every category showed improvement.



[Chart: byte swapping performance on Solaris]


Legend: "AMD64 32b" is 32-bit binary running on AMD64. "EM64T 32b" is 32-bit binary running on Intel EM64T. "AMD64 64b" and "EM64T 64b" similarly are 64-bit binaries running on AMD64 and EM64T, respectively.
Time is in microseconds using a microbenchmark (100 million function calls in a loop).

Next, refer to the bottom half of the chart.
This shows x86-64 performance improvements
with various byte swapping macros,
from substituting inline assembly for the LE_* and BE_* byte swapping macros (LE for Little Endian, BE for Big Endian). The gains for the LE_IN32 macros were marginal or negative, so I left them unchanged (that is, they remain implemented as C << and >> shift operations).
However, the LE_*64 and BE_*64 macros showed consistent improvement, and these are now implemented in inline assembly.

Even SPARC optimization was possible (see the top right chart).
This was done by rewriting the BSWAP32 and BSWAP64 macros—not in assembly, but more-efficient C.
Consider this BSWAP_64 macro definition:





#define BSWAP_64(x)     ((BSWAP_32(x) << 32) | BSWAP_32((x) >> 32))


The BSWAP_64 definition looks straightforward and innocent, but it's very inefficient.
BSWAP_64 expands to BSWAP_32, which in turn expands to BSWAP_16 and then BSWAP_8.
The full macro expansion turns out to be this frightening monster, suitable only for Halloween:




#define BSWAP_64(x) ((((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | \
    (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) << 32) | \
    ((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | \
    (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) >> 32)


I rewrote it as a less elegant, but faster, implementation:




#define BSWAP_64(x) (((uint64_t)(x) << 56) | \
    (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \
    (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \
    (((uint64_t)(x) << 8) & 0xff00000000ULL) | \
    (((uint64_t)(x) >> 8) & 0xff000000ULL) | \
    (((uint64_t)(x) >> 24) & 0xff0000ULL) | \
    (((uint64_t)(x) >> 40) & 0xff00ULL) | \
    ((uint64_t)(x) >> 56))


References
