The Phoronix Test Suite (PTS) [1] is a comprehensive testing and benchmarking platform for assessing the performance of Linux systems. While using it to compare Oracle Linux 7 (OL7) and Oracle Linux 8 (OL8), both running the same kernel version, we noticed substantial performance deltas of up to 30% between the two. We performed a detailed analysis to identify the reasons for these deltas. The results made us keenly aware of how much tool-chain versions and builds matter when making performance comparisons.
In this blog, we explain our observations on the performance deltas in three PTS workloads: PyBench, Redis, and Scikit-Learn, and categorize them under two different cases.
In the first case, we observed deltas due to different Python builds on PyBench [2]. PyBench is a collection of about 50 micro-benchmarks and provides a standardized way of testing Python performance on a system. Our results showed an average performance improvement of ~12% on OL8 compared to OL7. For individual micro-benchmarks, the improvements ranged from 0% to 35%.
Our initial analysis showed that although the default Python versions were the same (3.6), the builds were not. To determine how the builds differed, we extracted the build flags for each using the following command and found the -fno-semantic-interposition flag to be the difference:
python3.6 -c "import sysconfig; \ print('{}'.format('\n'.join(['{} = {}'.format(v, sysconfig.get_config_var(v)) \ for v in sorted(sysconfig.get_config_vars(), key=lambda s: s.lower())])))" > ref.python36.cnf
We later confirmed this by checking whether the -fno-semantic-interposition flag is present in each build, using the following command:
>>> import sysconfig
>>> '-fno-semantic-interposition' in (sysconfig.get_config_var('PY_CFLAGS') + sysconfig.get_config_var('PY_CFLAGS_NODIST'))
So why does this flag help? To support dynamic linking, compilers implement a mechanism called “interposition”, which makes function calls to libraries go through a “Procedure Linkage Table” (PLT) and allows a library loaded via the LD_PRELOAD environment variable to override a function. However, this indirection has a performance cost. Moreover, respecting the interposition semantics prevents the compiler from inlining those functions, which hurts performance further.
With the -fno-semantic-interposition flag, the compiler can optimize code in shared libraries more aggressively by ignoring potential interposition. For Python in particular, functions in libpython make many calls to other functions that are also provided by libpython. With this flag, those calls no longer go through the PLT indirection; as a result, they can be inlined and optimized with Link Time Optimization.
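To get a rough feel for this effect outside of PyBench, a call-bound loop such as the one below can be timed under each interpreter build. This is a minimal sketch of our own (the file name and loop sizes are arbitrary), not part of the PTS workload; interpreter-internal work of this kind is exactly what benefits when calls inside libpython can be inlined.

# toy_callbench.py (hypothetical file name) -- a rough, call-bound micro-benchmark.
# Run it with each interpreter build, e.g. python3.6 on OL7 and on OL8, and compare.
import timeit

def workload():
    # Call-heavy loop: most of the time is spent in libpython's function-call
    # machinery, the code path affected by -fno-semantic-interposition.
    def f(x):
        return x + 1
    total = 0
    for _ in range(10000):
        total = f(total)
    return total

if __name__ == "__main__":
    print(timeit.timeit(workload, number=1000))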
We should note that if your Python is built with -fno-semantic-interposition, it only affects libpython. All other libraries still respect LD_PRELOAD; for example, it is still possible to override glibc malloc/free. However, it may break your program if it relies on interposing libpython symbols. We should also note that Python 3.8 is already built with the -fno-semantic-interposition flag on both OL7 and OL8.
Provided your code is not affected by such interposition, make sure that your Python interpreter is built with the -fno-semantic-interposition flag.
In the second case, we observed deltas due to different glibc versions with Redis [3] and Scikit-Learn. Redis is an open-source, in-memory data structure store, used as a database, cache, and message broker. The PTS Redis workload [4] consists of five sub-tests, i.e., SET, GET, LPUSH, LPOP, and SADD, and has client and server components. Scikit-Learn is a Python module for machine learning, featuring various regression, clustering, and classification algorithms. The PTS Scikit-Learn workload [5] specifically tests the Gaussian and Sparse Random Projection transformers.
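For reference, the core of what the Scikit-Learn workload exercises can be reproduced in a few lines. This is a minimal sketch of our own, assuming scikit-learn and NumPy are installed; the input and projection sizes are arbitrary and not those used by PTS.

# Gaussian and sparse random projections, the transformers the PTS workload tests.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

X = np.random.rand(2000, 5000)  # arbitrary dense input

for transformer in (GaussianRandomProjection(n_components=300, random_state=0),
                    SparseRandomProjection(n_components=300, random_state=0)):
    Xt = transformer.fit_transform(X)
    print(type(transformer).__name__, Xt.shape)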
Our experiments showed that Redis and Scikit-Learn running on OL8 outperformed OL7 by 10% and 30% on average, respectively. An initial check of the two system configurations, using the following command, showed a difference in the glibc levels: OL8 uses version 2.28 compared to 2.17 on OL7.
$ ldd --version
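The same check can also be made from inside Python by asking glibc for its runtime version via ctypes (gnu_get_libc_version() is a standard glibc call; the output below is from our OL8 system):

>>> import ctypes
>>> libc = ctypes.CDLL('libc.so.6')
>>> libc.gnu_get_libc_version.restype = ctypes.c_char_p
>>> libc.gnu_get_libc_version()
b'2.28'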
To root cause this observation, we generated and examined perf profiles. The profiles showed that significant time was spent in glibc-related function calls in both cases; however, OL8 spent relatively less time in these calls. Specifically, we made the following observations:
- An improved malloc implementation helped SET, GET and SADD for Redis, resulting in significantly less time in malloc/free calls.
- An improved memmove implementation helped LPOP and LPUSH for Redis. In particular, the memmove variant used on OL7, memmove_ssse3_back, is replaced with memmove_avx_unaligned_erms on OL8.
- An improved libm provided a better implementation of ‘fused-multiply-add’ for Scikit-Learn. Specifically, the function __ieee754_log_avx used on OL7 is replaced with __ieee754_log_fma on OL8.
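To relate these hot functions back to the workload, the five Redis sub-tests boil down to the basic commands below. This is a minimal sketch of our own using the redis-py client against a local server on the default port; it is independent of how PTS itself drives the benchmark.

# Requires a running Redis server on localhost:6379 and the redis-py client
# (pip install redis). Comments map each command group to the profile findings above.
import redis

r = redis.Redis(host='localhost', port=6379)

r.set('key', 'value' * 100)      # SET, GET and SADD: the sub-tests where the
print(r.get('key'))              # malloc/free time showed up in the profiles.
r.sadd('myset', 'a', 'b', 'c')

for i in range(1000):            # LPUSH and LPOP: the sub-tests helped by the
    r.lpush('mylist', i)         # improved memmove implementation.
for i in range(1000):
    r.lpop('mylist')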
If your code depends on glibc functions, make sure you are using the most recent version to take advantage of these improvements.
In this blog, we discussed two scenarios in which we observed performance deltas between OL7 and OL8 for three different workloads. For PyBench, we identified the different Python builds as the root cause:
- Python built with the -fno-semantic-interposition flag on OL8 provides better performance.
For Redis and Scikit-Learn, we identified the different glibc versions as the root cause:
- Improved libc and libm implementations in glibc 2.28 on OL8 outperform those of glibc 2.17 on OL7.