Optimizing Byte Swapping for Fun and Profit
By Danx-Oracle on Oct 31, 2008
I recently pushed source changes to OpenSolaris build NV99 to improve byte swapping performance. These changes will be a part of OpenSolaris 2008.11.
To review, byte swapping reverses the order of bytes in a integer (whether 2-, 4-, or 8-byte lengths). This is necessary as x86 processors store the low order byte of an integer first ("little endian"), and SPARC processors store the high-order byte first ("big endian"). Furthermore, data on the network is usually transmitted in big endian order and needs to be translated on little endian machines. As to which "byte endianness" (aka "byte sex") is better, this has been a subject of heated "religious" debates (see Cary and Cohen in references).
|Don't confuse byte swapping with bite swapping|
Source Code Changes
My changes are as to:
- Add htonll(3SOCKET) and ntohll(3SOCKET) functions. These functions are identical and do nothing on SPARC systems. On x86 systems, they swap the byte order They are also implemented on Linux, IBM AIX, and \*BSD systems.
- Optimize the ntohll()/htonll() functions with inline x86 assembly.
Since byte swapping is used frequently in data transmission, the x86 instruction set has a BSWAP instruction to quickly swap the data.
This optimization is to use this instruction inline in C programs for 64-bit byte swaps.
/ amd64.il / uint64_t htonll(uint64_t hostlonglong); .inline htonll,4 movq %rdi, %rax / copy parameter 1 to return value bswapq %rax / byte swap return value .end
- Add external assembly ntohll()/htonll() functions. This is for sloppy C source that didn't include header file byteswap.h or netinet/in.h (they lose inline performance, but still get gains from using assembly).
- Do the same for 32-bit instruction set object as for 64-bit object
- Do the same for the GNU compiler and assembler as the Sun C compiler and assembler
- Optimize the /usr/include/sys/byteorder.h header file byte swap macros BSWAP_32, BSWAP_64, and related macros BE_\*, BE_IN\*, and BE_OUT\*.
What were the performance benefits? Refer to the chart below. On the upper left, you can see performance improvement with BSWAP_32 and BSWAP_64 on X86-64 bit class systems. The most dramatic was for BSWAP_32 running 32-bit object, but every category showed improvement.
Legend: "AMD64 32b" is 32-bit binary running on AMD64. "EM64T 32b" is 32-bit binary running on Intel EM64T. "AMD64 64b" and "EM64T 64b" similarly are 64-bit binaries running on AMD64 and EM64T, respectively. Time is in microseconds using a microbenchmark (100 million function calls in a loop).
Next, refer to the bottom half of the chart. This shows X86-64 performance improvements with various byte swapping macros. This is from substituting inline assembly for the LE_\* and BE_\* byte swapping macros (LE for Little Endian and BE for Big Endian). Performance for the LE_IN32 macros were marginal or negative, so I left them unchanged (that is, they remain implemented as C << and >> shift operations). However, improvements for the LE_\*64 and the BE_\*64 macros showed consistent improvement and these are now implemented in inline assembly.
Even SPARC optimization was possible (see the top right chart). This was done by rewriting the BSWAP32 and BSWAP64 macros—not in assembly, but more-efficient C. Consider this BSWAP_64 macro definition:
The BSWAP_64 definition looks straightforward and innocent, but it's very inefficient. BSWAP_64 expands to BSWAP_32, then BSWAP_16, then BSWAP_8. The full macro expansion turns out to be this frightening monster, suitable only for Halloween :
#define BSWAP_64(x) ((BSWAP_32(x) << 32) | BSWAP_32((x) >> 32))
I rewrote it to this a less elegant, but faster implementation:
#define BSWAP_64(x) ((((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) << 32) | ((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) | (((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) >> 32)
#define BSWAP_64(x) (((uint64_t)(x) << 56) | \\ (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \\ (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \\ (((uint64_t)(x) << 8) & 0xff00000000ULL) | \\ (((uint64_t)(x) >> 8) & 0xff000000ULL) | \\ (((uint64_t)(x) >> 24) & 0xff0000ULL) | \\ (((uint64_t)(x) >> 40) & 0xff00ULL) | \\ ((uint64_t)(x) >> 56))
- David Cary, DAV's Endian FAQ (1998-2002). The big-endian v. little-endian controversy, how to avoid stupid mistakes in hardware and software
- Danny Cohen, "Internet Engineering Note (IEN) 137: On Holy Wars and a Plea for Peace" (1980). A classic
- OpenSolaris PSARC 2008/474 07/24/2008 Add 64-bit htonll() and ntohll() byte order conversion function
- OpenSolaris Change Requests
- OpenSolaris byteorder(3SOCKET) man page with new htonll() and nothll() functions
- Selected OpenSolaris source files: $SRC/lib/libc/amd64/gen/byteorder.s, $SRC/lib/libc/i386/gen/byteorder64.c, $SRC/lib/libc/sparc/gen/byteorder.c, $SRC/lib/libc/sparcv9/gen/byteorder.c, $SRC/stand/lib/xdr/byteorder.c, $SRC/uts/common/sys/byteorder.h, $SRC/uts/intel/amd64/ml/amd64.il, $SRC/uts/intel/asm/byteorder.h, $SRC/uts/intel/ia32/ml/i86_subr.s, & $SRC/uts/intel/ia32/ml/ia32.il
- Recipes for pumpkin pie, pumpkin bread, and pumpkin muffin. My pumpkin pie is not like the bland junk you buy in the store!