Topology-Aware Scheduling

Continuing in my feature deep dives, let's talk about topology-aware scheduling. Some applications have serious resource needs. Not only do they need raw CPU cores, but they also beat the snot out of the local cache or burn up the I/O channels. These sorts of applications don't play well with others. In fact, they often don't play well with themselves. For these applications, how the threads/processes are distributed across the CPUs makes a huge difference. If, for example, all the threads/processes have their own core but are all sharing a socket, they might end up fighting over cache space or I/O bandwidth. Depending on the CPU architecture, the conflicts may be more subtle, such as only the processes on specific groups of cores colliding. The price for making a bad choice of how to assign these applications to cores is poor performance, in some cases doubling the time to completion.

It's not just the powerhouse apps that care about CPU topology, though. Most operating systems will schedule processes and threads to execute on available cores rather willy-nilly, with no sense of core affinity. Because an average OS does context switches at a rather high frequency, an application may find itself executing on a different CPU and core every time it gets the chance to run. If that application makes any use of the CPU cache, for example, its performance will suffer for it. The performance might not suffer much, but the difference is usually measurable.

For these reasons, we've added topology-aware scheduling to Sun Grid Engine 6.2 update 5. With topology-aware scheduling, the user who submits the job can specify how that job should be laid out across a machine's CPUs. Users are allowed to specify three different flavors of distribution strategy: linear, striding, or explicit. In linear distribution, the execution daemon will place the job's threads/processes on consecutive cores if possible. If it can't fit the entire job on a single socket, it will span the job across sockets. The striding strategy tells the execution daemon to place the job on every nth core, e.g. every 4th core or every other core. The explicit strategy lets the user decide exactly which cores will be assigned to the job. Note that the core binding is a request, not a requirement. If for some reason the execution daemon can't fulfill the request, the job will still be executed; it just won't be bound.

In addition to the three binding strategies, there are also three possible binding mechanisms. You can either allow Sun Grid Engine to do the binding automatically as part of the job execution, or you can have Sun Grid Engine add the binding parameters to the machines file for OpenMPI jobs, or you can have Sun Grid Engine just describe the intended binding in an environment variable with the expectation that the job will bind itself based on that information. When the job is bound by Sun Grid Engine during execution, the job will be tied to specific CPU cores using an OS-specific system call. On Linux, the bound processors may be shared with other processes. On Solaris, the bound processors are used exclusively for the job. In either case, the job will only be allowed to execute on the bound processors.

In order to allow users to tell what kinds of topologies are provided by the machines in the cluster, some new default complexes have been added that describe the socket/core/thread layouts of the machines. These new complexes can be used during job submission to request specific topologies, or they can be used with qhost to report what's available.

Let's look at a couple of examples (taken from the docs).

% qsub -binding linear:4 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64

This example will look for a machine with 8 cores and 2 sockets (i.e. dual-socket, quad-core) and try to bind to four consecutive cores. The execution daemon will try to put all four cores on the same socket, but if that's not possible, it will spread the job out over as many sockets as required (but as few as possible).

% qsub -binding striding:2:4 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64

This example will again look for a dual-socket, quad-core machine, but this time the job will occupy the third core on both sockets. (The first core is number 0.) If the third core on either socket is occupied, the job will not be bound.

% qsub -binding explicit:0,0:0,3:1,0:1,3 -l m_core=8 -l m_socket=2 -l arch=lx26-amd64

This last example will yet again look for a dual-socket, quad-core machine. This time the job will be bound to the first and fourth cores on both sockets. Again, if any of those core are already bound to another job, the job will not be bound.

It's clear that jobs that benefit from specific process placement with respect to CPU cores will perform much better in a 6.2u5 cluster, thanks to this new feature. Even for regular old run-of-the-mill jobs, though, submitting with -binding linear:1 should provide a small performance bump because it will keep them from being jostled around between context switches. In fact, I won't be surprised if 12 months from now I include adding that switch to the sge_request file in my top 10 list of best practices.


[Trackback] <p>I just ran across a great blog entry about SGE debuting topology&#45;aware scheduling.&amp;nbsp; Dan Templeton does a great job of describing the need for processor topology&#45;aware job scheduling within a server.&amp;nbsp; Many MPI jobs fit exact...

Posted by High Performance Computing Networking on January 23, 2010 at 12:40 AM PST #

Post a Comment:
  • HTML Syntax: NOT allowed



« April 2014