I recently pushed source changes to OpenSolaris build NV99 to improve byte swapping performance.
These changes will be a part of OpenSolaris 2008.11.
To review, byte swapping reverses the order of the bytes in an integer (2, 4, or 8 bytes long).
This is necessary because x86 processors store the low-order byte of an integer first ("little endian"), while SPARC processors store the high-order byte first ("big endian").
Furthermore, data on the network is usually transmitted in big endian order and needs to be translated on little endian machines.
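To make the byte order concrete, here is a minimal C sketch. The helper names `swap32` and `host_is_little_endian` are hypothetical and chosen only for illustration; this is not the OpenSolaris code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reverse the four bytes of a 32-bit integer. */
static uint32_t swap32(uint32_t v)
{
    return (v << 24) |
           ((v << 8) & 0x00ff0000U) |
           ((v >> 8) & 0x0000ff00U) |
           (v >> 24);
}

/* Hypothetical helper: returns 1 if the low-order byte is stored first. */
static int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);  /* look at the byte at the lowest address */
    return first == 1;
}
```

For example, `swap32(0x11223344)` yields `0x44332211`, and swapping twice returns the original value.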
Which "byte endianness" (aka "byte sex") is better
has been the subject of heated "religious" debates
(see Cary and Cohen in references).
[Image: Bite swapping with Holly]
Don't confuse byte swapping with bite swapping.
My changes include the following assembly for htonll():
/ uint64_t htonll(uint64_t hostlonglong);
movq %rdi, %rax / copy parameter 1 to return value
bswapq %rax / byte swap return value
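For readers who do not read assembly, the same idea can be sketched in portable C. This is an illustrative stand-in, not the OpenSolaris source: the name `my_htonll` is hypothetical, and the byte order is detected at run time so that the sketch works on either kind of host.

```c
#include <stdint.h>

/* Hypothetical portable stand-in for htonll(); not the OpenSolaris source. */
static uint64_t my_htonll(uint64_t host)
{
    /* Probe the host byte order: word = 1 puts 1 in bytes[0] on little endian. */
    const union { uint16_t word; uint8_t bytes[2]; } probe = { 1 };
    uint64_t net = 0;
    int i;

    if (probe.bytes[0] == 0)
        return host;            /* big endian host: already network order */

    /* Little endian host: reverse all eight bytes. */
    for (i = 0; i < 8; i++) {
        net = (net << 8) | (host & 0xff);
        host >>= 8;
    }
    return net;
}
```

On either kind of host, the returned value has the high-order byte of the argument at the lowest memory address, which is what network order requires.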
What were the performance benefits?
Refer to the chart below.
The upper left shows the performance improvement for BSWAP_32 and BSWAP_64 on X86-64 class systems.
The most dramatic gain was for BSWAP_32 in 32-bit binaries, but every category improved.
Legend: "AMD64 32b" is a 32-bit binary running on AMD64. "EM64T 32b" is a 32-bit binary running on Intel EM64T. "AMD64 64b" and "EM64T 64b" are, similarly, 64-bit binaries running on AMD64 and EM64T, respectively.
Times are in microseconds, measured with a microbenchmark (100 million function calls in a loop).
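A microbenchmark of this kind could look roughly like the sketch below. It is not the harness used for the chart: the names `swap32_c` and `time_calls` are hypothetical, and it measures CPU time with the standard `clock()` function.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical function under test: a plain C 32-bit byte swap. */
static uint32_t swap32_c(uint32_t v)
{
    return (v << 24) | ((v << 8) & 0x00ff0000U) |
           ((v >> 8) & 0x0000ff00U) | (v >> 24);
}

/*
 * Call fn `iterations` times and return the elapsed CPU time in
 * microseconds. The volatile sink keeps the compiler from discarding
 * the calls as dead code.
 */
static double time_calls(uint32_t (*fn)(uint32_t), unsigned long iterations)
{
    volatile uint32_t sink = 0;
    unsigned long i;
    clock_t start, stop;

    start = clock();
    for (i = 0; i < iterations; i++)
        sink = fn((uint32_t)i);
    stop = clock();
    (void)sink;

    return (double)(stop - start) * 1e6 / CLOCKS_PER_SEC;
}
```

Calling `time_calls(swap32_c, 100000000UL)` would mirror the loop count mentioned above. The absolute numbers depend heavily on the compiler, optimization level, and machine.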
Next, refer to the bottom half of the chart.
This shows X86-64 performance improvements
with various byte swapping macros.
These gains come from substituting inline assembly for the LE_\* and BE_\* byte swapping macros (LE for Little Endian and BE for Big Endian). The gains for the LE_IN32 macros were marginal or negative, so I left them unchanged (that is, they remain implemented as C << and >> shift operations).
However, the LE_\*64 and BE_\*64 macros improved consistently, and these are now implemented in inline assembly.
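As a rough illustration of the inline assembly approach, here is a sketch of a BE_IN64-style load. It assumes that such a macro reads a 64-bit big endian value from a byte address, and it uses GCC-style extended asm, which may differ from the syntax in the actual OpenSolaris headers. The name `be_in64_sketch` is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: read 8 bytes stored in big endian order at p. */
static inline uint64_t be_in64_sketch(const unsigned char *p)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t v;

    memcpy(&v, p, sizeof (v));          /* load that is safe for unaligned p */
    __asm__("bswapq %0" : "+r" (v));    /* x86-64 is little endian, so swap */
    return v;
#else
    /* Portable fallback: assemble the value one byte at a time. */
    uint64_t v = 0;
    int i;

    for (i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
#endif
}
```

Given the bytes `01 02 03 04 05 06 07 08` in memory, both branches return `0x0102030405060708`.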
Even SPARC optimization was possible (see the top right chart).
This was done by rewriting the BSWAP_32 and BSWAP_64 macros, not in assembly but in more efficient C.
Consider this BSWAP_64 macro definition:
#define BSWAP_64(x) ((BSWAP_32(x) << 32) | BSWAP_32((x) >> 32))
Fully expanded, this becomes:
#define BSWAP_64(x) ((((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) |
(((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) << 32) |
((((((x) & 0xff) << 8) | ((x) >> 8) & 0xff) << 16) |
(((((x) >> 16) & 0xff) << 8) | (((x) >> 16) >> 8) & 0xff)) >> 32)
I rewrote it as follows:
#define BSWAP_64(x) (((uint64_t)(x) << 56) | \\
(((uint64_t)(x) << 40) & 0xff000000000000ULL) | \\
(((uint64_t)(x) << 24) & 0xff0000000000ULL) | \\
(((uint64_t)(x) << 8) & 0xff00000000ULL) | \\
(((uint64_t)(x) >> 8) & 0xff000000ULL) | \\
(((uint64_t)(x) >> 24) & 0xff0000ULL) | \\
(((uint64_t)(x) >> 40) & 0xff00ULL) | \\
((uint64_t)(x) >> 56))
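The nested and rewritten forms can be checked against each other with a small C sketch. The macro names below are hypothetical, and the nested 8-, 16-, and 32-bit helpers are assumptions inferred from the expansion shown above. Only the flat form is copied from the rewritten macro.

```c
#include <stdint.h>

/* Assumed nested helpers, inferred from the expansion shown above. */
#define NESTED_BSWAP_8(x)   ((x) & 0xff)
#define NESTED_BSWAP_16(x)  ((NESTED_BSWAP_8(x) << 8) | NESTED_BSWAP_8((x) >> 8))
#define NESTED_BSWAP_32(x)  ((NESTED_BSWAP_16(x) << 16) | NESTED_BSWAP_16((x) >> 16))
#define NESTED_BSWAP_64(x)  ((NESTED_BSWAP_32(x) << 32) | NESTED_BSWAP_32((x) >> 32))

/* Flat form: one shift and at most one mask per byte. */
#define FLAT_BSWAP_64(x)    (((uint64_t)(x) << 56) | \
        (((uint64_t)(x) << 40) & 0xff000000000000ULL) | \
        (((uint64_t)(x) << 24) & 0xff0000000000ULL) | \
        (((uint64_t)(x) << 8) & 0xff00000000ULL) | \
        (((uint64_t)(x) >> 8) & 0xff000000ULL) | \
        (((uint64_t)(x) >> 24) & 0xff0000ULL) | \
        (((uint64_t)(x) >> 40) & 0xff00ULL) | \
        ((uint64_t)(x) >> 56))

/* Wrappers that force a 64-bit argument, so the nested shifts are well defined. */
static uint64_t swap64_nested(uint64_t v) { return NESTED_BSWAP_64(v); }
static uint64_t swap64_flat(uint64_t v)   { return FLAT_BSWAP_64(v); }
```

Both wrappers map `0x0102030405060708` to `0x0807060504030201`. The flat form also casts its argument to `uint64_t` itself, so it does not rely on the caller passing a 64-bit value.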