CMT and Micro-parallelism
By sprack on Aug 20, 2007
While parallelization normally focuses on loops with large trip counts for its performance gains, there are situations where this restricted focus can be problematic:
• Small tasks: for applications that are primarily composed of short-duration tasks separated by synchronization points (e.g., multi-phase operations), parallelization may be very challenging or even impossible.
• Single-threaded component: even when an application's hot loops have been successfully threaded, any remaining single-threaded components can rapidly become a bottleneck as the application is scaled to larger numbers of threads (courtesy of Amdahl's law). For example, if 10% of an application is single-threaded, the performance gain obtained from using eight threads is limited to 4.7X. If an additional 5% of the application can be threaded, the performance gain increases to almost 6X, clearly illustrating the importance of getting as close to complete parallelization as possible.
• Critical threads: the scalability of MT applications can be limited by the performance of certain critical threads. If the work undertaken by these critical threads cannot be further subdivided, performance may stop scaling when the critical threads become 100% busy.
• Critical sections: given that only one thread can occupy a critical section at once, if threads spend too long in a critical section, application scalability can suffer, as other threads stall waiting for access to the same section. While the time in the critical section may only account for a small portion of the total processing undertaken by a thread, minimizing the time spent in the critical section may result in significant performance benefits.
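The Amdahl's law figures quoted in the list above can be checked with a few lines of Python. This is only a sketch of the standard formula, speedup = 1 / (s + (1 - s) / n), where s is the single-threaded fraction and n the number of threads; the function name amdahl_speedup is made up for this illustration.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup predicted by Amdahl's law.

    serial_fraction: share of the run time that stays single-threaded (0..1).
    threads: number of threads applied to the remaining, parallel share.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# The two cases from the text: 10% and 5% single-threaded, eight threads.
for serial in (0.10, 0.05):
    print(f"{serial:.0%} serial, 8 threads: {amdahl_speedup(serial, 8):.2f}X")
# prints:
# 10% serial, 8 threads: 4.71X
# 5% serial, 8 threads: 5.93X
```

Even with only 5% of the work left single-threaded, eight threads deliver less than 6X, which is why the small serial pieces matter so much.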
In the above examples, the performance of MT applications is adversely impacted by small single-threaded sections. These single-threaded sections can be broadly divided into two categories: those that are intrinsically serial in nature, and those that are amenable to parallel processing if the associated threading overheads can be reduced.
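To see why the threading overhead decides which category a small section falls into, here is a deliberately simple cost model. It is an assumption made for illustration, not something taken from the measurements: a task of a given duration is split evenly across the threads, and a fixed overhead is paid for dispatch and synchronization. The function names and the numbers are hypothetical, in arbitrary time units.

```python
def threaded_speedup(work: float, threads: int, overhead: float) -> float:
    """Speedup of a task under a simple (assumed) cost model.

    work: single-threaded duration of the task, in arbitrary time units.
    threads: number of threads the task is split across evenly.
    overhead: fixed dispatch/synchronization cost, in the same units.
    """
    if work <= 0 or threads < 1 or overhead < 0:
        raise ValueError("work must be > 0, threads >= 1, overhead >= 0")
    return work / (work / threads + overhead)


def worth_threading(work: float, threads: int, overhead: float) -> bool:
    """True when the threaded version beats the single-threaded one."""
    return threaded_speedup(work, threads, overhead) > 1.0


# Hypothetical numbers only: the same 100-unit task with a cheap and an
# expensive synchronization cost.
print(worth_threading(100, 8, 5))    # cheap synchronization
print(worth_threading(100, 8, 200))  # synchronization costs more than the task
# prints:
# True
# False
```

Under this model, threading pays off only when the overhead is smaller than work * (1 - 1/threads), so the smaller the task, the cheaper the synchronization has to be.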
While the serial segments remain problematic on CMP systems, the low inter-thread communication overheads (resulting from the shared L2$) allow even very small tasks to be threaded cost-effectively. In essence, CMP systems allow 'micro-parallelization'. The potential benefits of micro-parallelization are widespread, and the technique can be used to address all of the performance problems discussed in the previous paragraphs.
To illustrate the benefits of micro-parallelization, consider Fig. (i). In this example, a simple copy operation (such as that performed by bcopy or memcpy) is divided amongst multiple worker threads using lock-free synchronization. Each worker thread performs a portion of the copy and, as the number of threads is increased, the work undertaken per thread decreases proportionally. Fig. (ii) illustrates performance as the number of threads is increased, for three different copy sizes. For the Niagara CMP system, performance scales almost linearly with the number of threads, even when each thread is only copying 64 elements. In contrast, for the traditional SMP system, acceptable scaling is obtained for the 8192-element copy, but synchronization overheads are significant and performance regressions are observed both at higher levels of threading and for the smaller copies. In those cases, the coherency overheads associated with synchronizing the master and worker threads quickly outweigh the performance benefits of the additional worker threads.
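The way the copy is divided amongst the workers can be sketched as follows. This is not the code behind the figures: the name parallel_copy is made up, the workers are plain Python threads, and the master simply waits with join() rather than using the lock-free synchronization described above. In CPython the global interpreter lock generally keeps such threads from running in parallel, so the sketch shows only the partitioning and should not be expected to reproduce the scaling.

```python
import threading


def parallel_copy(src: bytes, dst: bytearray, num_threads: int) -> None:
    """Copy src into dst, giving each worker one contiguous slice."""
    n = len(src)
    if len(dst) < n:
        raise ValueError("dst is smaller than src")
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")

    # memoryview slices refer to the underlying buffers without copying them.
    src_view = memoryview(src)
    dst_view = memoryview(dst)
    chunk = -(-n // num_threads)  # ceiling division: elements per worker

    def worker(start: int, end: int) -> None:
        # Each worker touches only its own [start, end) range, so the
        # workers never write to the same bytes.
        dst_view[start:end] = src_view[start:end]

    workers = []
    for i in range(num_threads):
        start = min(i * chunk, n)
        end = min(start + chunk, n)
        if start == end:
            continue  # more threads than elements: nothing left to hand out
        t = threading.Thread(target=worker, args=(start, end))
        t.start()
        workers.append(t)

    # Assumption: the master waits with join(); the original used
    # lock-free synchronization between master and workers instead.
    for t in workers:
        t.join()


src = bytes(range(256)) * 32      # an 8192-element source buffer
dst = bytearray(len(src))
parallel_copy(src, dst, 8)        # each of the 8 workers copies 1024 elements
print(bytes(dst) == src)
# prints:
# True
```

As the thread count goes up, the slice handed to each worker shrinks in proportion, which is exactly where the cost of synchronizing master and workers starts to dominate on a system with expensive inter-thread communication.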
From Fig. (ii), it is apparent that with the advent of CMP systems, many small, short-duration tasks that we traditionally viewed as single-threaded (e.g., small memcpy operations) can now be threaded cost-effectively. While the indiscriminate threading of these operations in MT applications is not beneficial, these micro-parallelization techniques can be used to alleviate scaling bottlenecks by improving the performance of these problem components.
Going forward, we need to start leveraging these 'micro-parallelization' techniques more aggressively and exploit the full potential of CMTs.
[Abstracted from the IJPP publication -- look here for more details]