One of the most interesting features of Intel's Core i7 and the Xeon 5500 processors (Nehalem) is the ability of the processor to go into a mode called "Turbo Boost". While most modern processors are capable of being power managed to run at various clock frequencies, "Turbo Boost" is different, as it allows the processor to autonomously run at a higher clock speed than would otherwise be available in the processor's maximum performance power management state (P0).
Here's how it works. When one or more of the cores on the processor is in the P0 (max perf) state, those cores may enter Turbo Boost mode, allowing them to run at the faster clock frequency. "How much faster" depends on how much power and thermal headroom is available, but in general, the more cores there are on the socket that are idle and power managed (via. C-states), the faster the remaining running core(s) will go. With the introduction of OpenSolaris support for Deep C-states which integrated in build 110, we're certainly seeing the effects...since the system is now readily taking advantage of the deeper C-states, turbo boost happens all the time. :)
But how does one observe this? It's certainly useful to know when Turbo Boost is happening, and how much of an "overclock" the processor is achieving. Fortunately, in build 110 Rafael Vanoni pushed some changes to PowerTOP that provides this observability:
This is a screenshot of PowerTOP running on a Xeon 5500 (Nehalem) based system. Notice in the P-states (Frequencies) column that the highest clock speed has (turbo) next to it. As turbo mode is entered, PowerTOP will track the average frequency of the system's processors over the sampling interval. You can actually watch that top end frequency fluctuate as utilization across the system changes.
Clearly, this observability is important for understanding system performance (but more importantly, for performance determinism). Very cool stuff...and yes, this will be present in the upcoming OpenSolaris release. Here's a video we did in which Rafael talks more about this...
In the past I've blogged about the work we've been doing over the last several years to optimize thread placement (that is, on which CPUs threads are scheduled to run) in the face of evolving system and processor architectures.
Indeed, the job of thread placement on modern systems has become quite interesting. Just about every modern processor on the market these days is (at least) multi-core, with many also presenting multiple hardware "threads", "strands", or "Hyper Threads" sharing instruction or floating point pipelines...and then there's shared caches, crypto accelerators, memory controllers... So there's a lot to consider when deciding where (on which logical CPUs) a given handful of threads should execute. Where possible we've tried to avoid having threads fight over shared system resources. If the load is light enough, and enough system resources exist that each thread can have it's own pipeline, cache (or even socket)...that's a pretty good strategy for mitigating potential resource contention.
All this good stuff is made possible by the kernel's Processor Group based CMT scheduling subsystem, which (at boot) enumerates all the "interesting" relationships that exist between the system's logical CPUs...which in turns allows the dispatcher to be smart about how it utilizes those CPUs to deliver great performance.
We (or at least I) didn't realize at the time, but all this work we were doing to make the dispatcher smarter about how it uses the CPUs, also turns out to be really useful for being smart about how you're \*not\* using the CPUs. This means that in addition to optimizing for performance, this same dispatcher awareness can be used to optimize for power efficiency.
As part of the Power Aware Dispatcher project, we extended the kernel's CMT scheduling subsystem to enumerate groups of logical CPUs representing active and Idle CPU Power Management Domains. On x86 systems, these domains are enumerated through ACPI. Being aware of these domains allows the dispatcher to place threads in ways that not only optimize performance for shared system resources, but also maximizes opportunities to power manage CPUs. For example, the dispatcher may try to coalesce light utilization on the system onto a smaller number of power domains (e.g. sockets), thus freeing up other CPU resources in the system to be power managed more deeply. On the Intel Xeon 5500 processor series based systems, this enables us to take better advantage of the processor's deep idle power management features, including deep C-states.
Also, consistent with our goals around the Tickless Kernel Architecture project, the Power Aware Dispatcher is an "event based" CPU power management architecture, which means that all CPU power state changes are driven entirely by utilization events triggered by the dispatcher as threads come and go from the CPUs. One clear benefit of this, is that when the system is idle, there no need to periodically wake up to check CPU utilization (which in itself is inefficient and wasteful). It also means that the kernel can be aggressive about adjusting resource power states (in near real-time) with respect to changes in utilization.
We really like thinking about Power Management as just another piece of Resource Management. By designing efficient resource utilization into the kernel subsystems that deal with power manageable hardware resources...we can be smart about how we utilize the system (for improved performance), and how we \*don't\* use the system (to leverage power management features). The power efficiency results we're seeing with PAD are impressive, and we're really looking forward to building on the PAD work we integrated into build 110 in the months ahead.
For the most part, and especially with respect to data center class server systems, driving
the performance component of the price : performance ratio has been our focus. But the economics of
the industry are shifting...even evolving, as systems initial purchase price represents a decreasingly
significant component of their "total cost of ownership" thanks to rising power and cooling costs. This
trend coupled with the realization that overall data center utilization remains low (15% or so), implies
the opportunities in this space are enormous.
Although performance remains key, at what cost should that performance be delivered? We \*must\* engineer
systems to deliver the performance that Sun / Solaris customers have come to expect while using no more resources than
is necessary to do so. Beyond performance, we must deliver efficiency. Therein lies the challenge of
Over the weekend my wife and were kicking around at the Mercado shopping center in Santa Clara. It's one of those shopping centers that has appeal for both of us...Micro Center for me, and TJ Maxx for her. :) After delighting in finding of a 2GB USB flash drive for $16, I was further delighted to see this month's Linux Journal in which an interview with Simon Phipps is featured. Fostering OpenSolaris awareness in the Linux community is a good thing, so it was nice to see a good amount of discussion there. I look forward to the day where critical mass is such that more OpenSolaris magazine articles (and perhaps dedicated magazines) begin to surface. It really was nice to see.
Looks like we've got some clock work ahead of us. Over the last year or so, i've been waking up at night, in a cold sweat thinking about how we have but one cyclic/thread firing on one CPU 100 times a second, that does accounting for all threads
over all CPUs in the system (ok, not really, but it's something we've been thinking/talking about). As time marches
on, we continue to see the logical CPU count (as seen via psrinfo(1M)) in systems grow (especially with the proliferation
of multi-core/multi-threaded processors)...so it's not surprising that the single threaded clock has (or eventually will be) a scaling issue. Implementing clock()'s responsibilities in a more distributed fashion will be an interesting, but important bit of work.
As part of the Tesla Project, we're going to be looking at providing a "scheduled" clock implementation. The clock cyclic currently fires 100 times a second somewhere in the system. From a power management perspective, it would be nice if the clock fired only when necessary (something is scheduled
to timeout, scheduled accounting is due, etc). This would allow the CPU on which the clock cyclic fires to potentially
remain quiescent much longer (on average), which in turn would mean that the CPU could remain longer (or go deeper) in a power saving state.
It might be that the scaling issue becomes less so if the clock doesn't always have to fire. Then again, this may be one of those "elephant in the living room" type issues...you can pretend that it isn't there only so long... :)