
Linux Scheduler Scalability like a boss!

Scott Michael
Director, Software Development

Oracle Linux developer Subhra Mazumdar has been working on scalability improvements in the Linux scheduler. In this blog post, he talks about some of his latest work.

At Oracle, we run big database workloads that are very sensitive to how the OS chooses to schedule threads. Spending too much time in the scheduler selecting a CPU can translate to higher transaction latency (at a given throughput) or a lower maximum achievable transaction throughput in TPC-C workloads. In this article, we introduce our "Scheduler Scalability" project, which improves this latency and exposes a knob that lets us tune such workloads further.

The Linux scheduler searches for an idle CPU on which to enqueue a thread when it becomes runnable. It first tries to find a fully idle core using select_idle_core(). If that fails, it looks for any idle CPU via select_idle_cpu(). Both of these routines can end up scanning the entire last level cache (LLC) domain, which is expensive and hurts context switch intensive workloads like TPC-C, where threads wake up, run for short periods, and go back to sleep. This is a scalability bottleneck on big systems that have a large number of cores per LLC.
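
In pseudocode, the wakeup path looks roughly like the sketch below. This is a simplified illustration modeled on kernel/sched/fair.c, not the literal kernel source; fast paths are omitted and helper details are abbreviated.

/*
 * Simplified sketch of wakeup CPU selection within one LLC domain,
 * modeled on kernel/sched/fair.c. Details and fast paths are omitted.
 */
static int select_idle_sibling(struct task_struct *p, int target)
{
        struct sched_domain *sd;
        int cpu;

        if (idle_cpu(target))
                return target;          /* fast path: target is already idle */

        sd = rcu_dereference(per_cpu(sd_llc, target));
        if (!sd)
                return target;

        /* First, look for a fully idle core; may visit every core in the LLC. */
        cpu = select_idle_core(p, sd, target);
        if ((unsigned int)cpu < nr_cpumask_bits)
                return cpu;

        /* Otherwise, look for any idle CPU; again may scan the whole LLC. */
        cpu = select_idle_cpu(p, sd, target);
        if ((unsigned int)cpu < nr_cpumask_bits)
                return cpu;

        return target;
}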

For such workloads, it is desirable to have a constant bound on the search time while still achieving a good spread of threads. These are conflicting goals, and the right balance needs to be struck. Experimentation with constant upper and lower bounds on the number of CPUs searched in select_idle_cpu() reveals that different bounds work best on different architectures. For example, on an SMT2 Intel processor an upper bound of 4 and a lower bound of 2 work well, while on an SMT8 SPARC processor an upper bound of 16 and a lower bound of 8 work well. Expressed in cores, these are the same limits: an upper bound of 2 cores and a lower bound of 1 core (4 and 2 CPUs on SMT2, 16 and 8 CPUs on SMT8) works on both architectures. This makes sense because each core is its own scheduling domain, and it is usually a good idea to search beyond the current domain for idle CPUs, since a neighbouring domain may be loaded differently; scheduler load balancing works on a per-domain basis.
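
As a rough, hypothetical illustration of bounds expressed in cores (the constants and function below are assumptions for this sketch, not the actual patch), the number of CPUs to scan can be clamped so the same limits hold regardless of SMT width:

/*
 * Hypothetical sketch: clamp the number of CPUs scanned in
 * select_idle_cpu() between bounds expressed in cores, so the same
 * limits hold on SMT2 and SMT8 parts alike. nr_wanted stands for
 * whatever estimate the scheduler would otherwise use.
 */
#define SIS_MIN_CORES  1        /* assumed lower bound, in cores */
#define SIS_MAX_CORES  2        /* assumed upper bound, in cores */

static int sis_scan_limit(int smt_weight, int nr_wanted)
{
        int lower = SIS_MIN_CORES * smt_weight; /* 2 CPUs on SMT2, 8 on SMT8 */
        int upper = SIS_MAX_CORES * smt_weight; /* 4 CPUs on SMT2, 16 on SMT8 */

        if (nr_wanted < lower)
                return lower;
        if (nr_wanted > upper)
                return upper;
        return nr_wanted;
}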

While putting constant bounds on the search reduces search time, it can lead to localization of threads and uneven spreading. To solve this, the scheduler can keep a per-CPU variable that tracks the boundary of the search: if no idle CPU is found in one pass, the next search begins from that boundary, so idle CPUs elsewhere in the domain are still found quickly. Together, these changes improve the scalability of select_idle_cpu().
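
A hedged sketch of the idea follows; the per-CPU variable, helper name, and limit parameter are illustrative assumptions, not the exact patch. The scan starts where the previous one stopped and records where to resume when it gives up:

/*
 * Hypothetical sketch of a bounded, rotating scan for select_idle_cpu().
 * next_scan_start is an assumed per-CPU variable that remembers where the
 * previous bounded scan stopped, so repeated scans cover the whole LLC
 * domain over time instead of always revisiting the same CPUs.
 */
static DEFINE_PER_CPU(int, next_scan_start);

static int bounded_idle_cpu_scan(struct task_struct *p, struct sched_domain *sd,
                                 int target, int limit)
{
        int start = per_cpu(next_scan_start, target);
        int scanned = 0;
        int cpu;

        for_each_cpu_wrap(cpu, sched_domain_span(sd), start) {
                if (scanned++ >= limit) {
                        /* Bound reached: resume from here on the next search. */
                        per_cpu(next_scan_start, target) = cpu;
                        return -1;
                }
                if (cpumask_test_cpu(cpu, p->cpus_ptr) && idle_cpu(cpu))
                        return cpu;
        }

        return -1;
}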

Next, we focus on select_idle_core(), which searches for a fully idle core (i.e., a core on which all CPUs are idle). Any CPU in such a core is the best place to run, since the thread can use all of the core's hardware resources. While the search has a dynamic switch that turns it off when no idle core is present, it is still a bottleneck: in practice only a few cores are fully idle, and the search ends up scanning the entire LLC domain. It is challenging to come up with data structures that speed this up because the code path is very sensitive; experiments showed that touching too many cache lines during the search, or using atomic operations, wipes out any gains.
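
Roughly, the existing search works like the sketch below. The helper names are paraphrased from kernel/sched/fair.c and their exact signatures differ across kernel versions; the idle-cores hint is the dynamic switch mentioned above.

/*
 * Simplified sketch of the fully idle core search. The idle-cores hint
 * acts as a dynamic switch, but whenever it is set the loop below can
 * still end up visiting every core in the LLC domain.
 */
static int select_idle_core_sketch(struct task_struct *p, struct sched_domain *sd,
                                   int target)
{
        int core, cpu;

        if (!test_idle_cores(target))   /* dynamic switch: no idle core seen lately */
                return -1;

        for_each_cpu_wrap(core, sched_domain_span(sd), target) {
                bool idle = true;

                /* A core qualifies only if every SMT sibling on it is idle. */
                for_each_cpu(cpu, cpu_smt_mask(core)) {
                        if (!idle_cpu(cpu)) {
                                idle = false;
                                break;
                        }
                }
                if (idle)
                        return core;
        }

        /* Nothing found: clear the hint so future wakeups skip this search. */
        set_idle_cores(target, false);
        return -1;
}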

In practice, we found that simply disabling the idle core search improves Oracle Database TPC-C performance on Intel x86 systems, while it regresses other benchmarks such as hackbench on SPARC systems. This is a common problem in the scheduler: an optimization for one workload on one architecture can hurt another workload on a different architecture, or even on the same one. Linux works around this with scheduler features, which can block execution of code paths that are unsuitable for a workload and can be turned on or off on live systems via /sys/kernel/debug/sched_features. A new scheduler feature called SIS_CORE was introduced for this purpose, to disable the idle core search at run time. It can be used by Oracle Database instances meant for OLTP.
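
Scheduler features are declared with the SCHED_FEAT() macro in kernel/sched/features.h and tested in code with sched_feat(). A minimal sketch of how SIS_CORE would gate the idle core search, and how it can be flipped on a live system, looks like this (the surrounding code is abbreviated):

/* kernel/sched/features.h: SIS_CORE defaults to enabled. */
SCHED_FEAT(SIS_CORE, true)

/* kernel/sched/fair.c (abbreviated): skip the idle core search when disabled. */
if (sched_feat(SIS_CORE)) {
        cpu = select_idle_core(p, sd, target);
        if ((unsigned int)cpu < nr_cpumask_bits)
                return cpu;
}

/*
 * On a live system (with debugfs mounted):
 *   echo NO_SIS_CORE > /sys/kernel/debug/sched_features   # disable idle core search
 *   echo SIS_CORE > /sys/kernel/debug/sched_features      # re-enable it
 */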

Results

The following are the performance numbers for various benchmarks with SIS_CORE true (idle core search enabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline           %stdev  patch           %stdev
1       0.5816             8.94    0.5903 (-1.5%)  11.28
2       0.6428             10.64   0.5843 (9.1%)   4.93
4       1.0152             1.99    0.9965 (1.84%)  1.83
8       1.8128             1.4     1.7921 (1.14%)  1.76
16      3.1666             0.8     3.1345 (1.01%)  0.81
32      5.6084             0.83    5.5677 (0.73%)  0.8

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev
8       45.36           0.43    46.28 (2.01%)    0.29
16      87.81           0.82    89.67 (2.12%)    0.38
32      151.19          0.02    153.5 (1.53%)    0.41
48      190.2           0.21    194.79 (2.41%)   0.07
64      190.42          0.35    202.9 (6.55%)    1.66
128     323.86          0.28    343.56 (6.08%)   1.34

Oracle Database on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch            %stdev
20      1               0.9     1.0068 (0.68%)   0.27
40      1               0.8     1.0103 (1.03%)   1.24
60      1               0.34    1.0178 (1.78%)   0.49
80      1               0.53    1.0092 (0.92%)   1.5
100     1               0.79    1.0090 (0.9%)    0.88
120     1               0.06    1.0048 (0.48%)   0.72
140     1               0.22    1.0116 (1.16%)   0.05
160     1               0.57    1.0264 (2.64%)   0.67
180     1               0.81    1.0194 (1.94%)   0.91
200     1               0.44    1.028 (2.8%)     3.09
220     1               1.74    1.0229 (2.29%)   0.21

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline           %stdev  patch             %stdev
1       1.3085             6.65    1.2213 (6.66%)    10.32
2       1.4559             8.55    1.5048 (-3.36%)   4.72
4       2.6271             1.74    2.5532 (2.81%)    2.02
8       4.7089             3.01    4.5118 (4.19%)    2.74
16      8.7406             2.25    8.6801 (0.69%)    4.78
32      17.7835            1.01    16.759 (5.76%)    1.38
64      36.1901            0.65    34.6652 (4.21%)   1.24
128     72.6585            0.51    70.9762 (2.32%)   0.9

The following are the performance numbers for various benchmarks with SIS_CORE false (idle core search disabled).

Hackbench process on 2 socket, 44 core and 88 threads Intel x86 machine
(lower is better):
groups  baseline           %stdev  patch            %stdev
1       0.5816             8.94    0.5835 (-0.33%)  8.21
2       0.6428             10.64   0.5752 (10.52%)  4.05
4       1.0152             1.99    0.9946 (2.03%)   2.56
8       1.8128             1.4     1.7619 (2.81%)   1.88
16      3.1666             0.8     3.1275 (1.23%)   0.42
32      5.6084             0.83    5.5856 (0.41%)   0.89

Uperf pingpong on 2 socket, 44 core and 88 threads Intel x86 machine with
message size = 8k (higher is better):
threads baseline        %stdev  patch            %stdev
8       45.36           0.43    46.94 (3.48%)    0.2
16      87.81           0.82    91.75 (4.49%)    0.43
32      151.19          0.02    167.74 (10.95%)  1.29
48      190.2           0.21    200.57 (5.45%)   0.89
64      190.42          0.35    226.74 (19.07%)  1.79
128     323.86          0.28    348.12 (7.49%)   0.77

Oracle Database on 2 socket, 44 core and 88 threads Intel x86 machine
(normalized, higher is better):
users   baseline        %stdev  patch            %stdev
20      1               0.9     1.0056 (0.56%)   0.34
40      1               0.8     1.0173 (1.73%)   0.13
60      1               0.34    0.9995 (-0.05%)  0.85
80      1               0.53    1.0175 (1.75%)   1.56
100     1               0.79    1.0151 (1.51%)   1.31
120     1               0.06    1.0244 (2.44%)   0.5
140     1               0.22    1.034 (3.4%)     0.66
160     1               0.57    1.0362 (3.62%)   0.07
180     1               0.81    1.041 (4.1%)     0.8
200     1               0.44    1.0233 (2.33%)   1.4
220     1               1.74    1.0125 (1.25%)   1.41

Hackbench process on 2 socket, 16 core and 128 threads SPARC machine
(lower is better):
groups  baseline           %stdev  patch             %stdev
1       1.3085             6.65    1.2514 (4.36%)    11.1
2       1.4559             8.55    1.5433 (-6%)      3.05
4       2.6271             1.74    2.5626 (2.5%)     2.69
8       4.7089             3.01    4.5316 (3.77%)    2.95
16      8.7406             2.25    8.6585 (0.94%)    2.91
32      17.7835            1.01    17.175 (3.42%)    1.38
64      36.1901            0.65    35.5294 (1.83%)   1.02
128     72.6585            0.51    71.8821 (1.07%)   1.05
