One of the most interesting features of Intel's Core i7 and Xeon 5500 (Nehalem) processors is a mode called "Turbo Boost". While most modern processors can be power managed to run at various clock frequencies, Turbo Boost is different: it allows the processor to autonomously run at a higher clock speed than would otherwise be available in the processor's maximum performance power management state (P0).
Here's how it works. When one or more of the cores on the processor is in the P0 (max perf) state, those cores may enter Turbo Boost mode, allowing them to run at a faster clock frequency. How much faster depends on how much power and thermal headroom is available, but in general, the more cores on the socket that are idle and power managed (via C-states), the faster the remaining running core(s) will go. With the OpenSolaris support for deep C-states that was integrated in build 110, we're certainly seeing the effects...since the system now readily takes advantage of the deeper C-states, Turbo Boost happens all the time. :)
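The relationship between idle cores and turbo headroom can be sketched as a simple model. Nehalem-era parts express turbo as a number of 133 MHz frequency "bins" that varies by SKU and by how many cores are active; the bin counts below are purely hypothetical, for illustration only.

```python
# Illustrative model of Turbo Boost: the fewer cores active in P0,
# the more power/thermal headroom, and the more 133 MHz "bins" of
# extra frequency the running cores can use. The bins_by_active
# mapping is hypothetical, not taken from any real SKU's datasheet.
BIN_MHZ = 133

def effective_max_mhz(base_mhz, active_cores,
                      bins_by_active={1: 2, 2: 1, 3: 1, 4: 1}):
    """Return the max clock the active cores may reach, given how
    many cores on the socket are currently active (not in a C-state)."""
    bins = bins_by_active.get(active_cores, 0)
    return base_mhz + bins * BIN_MHZ
```

With a single active core (and the rest deep in C-states) the model yields the largest overclock, which matches the behavior described above: more idle cores, faster remaining cores.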
But how does one observe this? It's certainly useful to know when Turbo Boost is happening, and how much of an "overclock" the processor is achieving. Fortunately, in build 110 Rafael Vanoni pushed some changes to PowerTOP that provide this observability:
This is a screenshot of PowerTOP running on a Xeon 5500 (Nehalem) based system. Notice in the P-states (Frequencies) column that the highest clock speed has (turbo) next to it. As turbo mode is entered, PowerTOP will track the average frequency of the system's processors over the sampling interval. You can actually watch that top end frequency fluctuate as utilization across the system changes.
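One common way to compute an average effective frequency like the one PowerTOP reports is from the ratio of the APERF and MPERF MSRs that Nehalem exposes per logical CPU: APERF counts cycles at the actual clock, MPERF at the rated clock. The sketch below is illustrative, not PowerTOP's actual code.

```python
def avg_effective_mhz(base_mhz, aperf_start, aperf_end,
                      mperf_start, mperf_end):
    """Average effective frequency over a sampling interval, derived
    from the delta of the APERF and MPERF counters. A ratio > 1.0
    means the CPU ran above its rated clock during the sample,
    i.e. Turbo Boost was active."""
    daperf = aperf_end - aperf_start
    dmperf = mperf_end - mperf_start
    return base_mhz * daperf / dmperf
```

For example, a 2666 MHz part whose APERF advanced 10% more than its MPERF over the interval averaged roughly 2933 MHz, which is the kind of fluctuating top-end number you'd watch change in the PowerTOP display as utilization shifts.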
Clearly, this observability is important for understanding system performance (but more importantly, for performance determinism). Very cool stuff...and yes, this will be present in the upcoming OpenSolaris release. Here's a video we did in which Rafael talks more about this...
In the past I've blogged about the work we've been doing over the last several years to optimize thread placement (that is, on which CPUs threads are scheduled to run) in the face of evolving system and processor architectures.
Indeed, the job of thread placement on modern systems has become quite interesting. Just about every modern processor on the market these days is (at least) multi-core, with many also presenting multiple hardware "threads", "strands", or "Hyper Threads" sharing instruction or floating point pipelines...and then there's shared caches, crypto accelerators, memory controllers... So there's a lot to consider when deciding where (on which logical CPUs) a given handful of threads should execute. Where possible we've tried to avoid having threads fight over shared system resources. If the load is light enough, and enough system resources exist that each thread can have its own pipeline, cache (or even socket)...that's a pretty good strategy for mitigating potential resource contention.
All this good stuff is made possible by the kernel's Processor Group based CMT scheduling subsystem, which (at boot) enumerates all the "interesting" relationships that exist between the system's logical CPUs...which in turn allows the dispatcher to be smart about how it utilizes those CPUs to deliver great performance.
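As a toy illustration of this kind of placement (emphatically not the actual dispatcher code), consider choosing a CPU for a new thread by minimizing sharing at the core level first, then at the socket level, so lightly loaded systems spread threads across pipelines and caches:

```python
from collections import Counter

def place_thread(cpus, busy):
    """Pick a logical CPU for a new thread: prefer an idle CPU whose
    core -- and then whose socket -- is shared with as few running
    threads as possible, so threads don't fight over pipelines and
    caches. `cpus` maps cpu_id -> (core_id, socket_id); `busy` is
    the set of cpu_ids already running a thread."""
    core_load = Counter(cpus[c][0] for c in busy)
    socket_load = Counter(cpus[c][1] for c in busy)
    idle = [c for c in cpus if c not in busy]
    # Fewest sharers on the core first, then on the socket;
    # break remaining ties by lowest cpu_id.
    return min(idle, key=lambda c: (core_load[cpus[c][0]],
                                    socket_load[cpus[c][1]], c))
```

On a hypothetical two-socket box with two threads per core, a second runnable thread lands on an entirely idle socket rather than on its sibling's shared pipeline.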
We (or at least I) didn't realize it at the time, but all this work we were doing to make the dispatcher smarter about how it uses the CPUs also turns out to be really useful for being smart about how you're \*not\* using the CPUs. This means that in addition to optimizing for performance, this same dispatcher awareness can be used to optimize for power efficiency.
As part of the Power Aware Dispatcher project, we extended the kernel's CMT scheduling subsystem to enumerate groups of logical CPUs representing active and idle CPU power management domains. On x86 systems, these domains are enumerated through ACPI. Being aware of these domains allows the dispatcher to place threads in ways that not only optimize performance for shared system resources, but also maximize opportunities to power manage CPUs. For example, the dispatcher may try to coalesce light utilization on the system onto a smaller number of power domains (e.g. sockets), thus freeing up other CPU resources in the system to be power managed more deeply. On Intel Xeon 5500 processor series based systems, this enables us to take better advantage of the processor's deep idle power management features, including deep C-states.
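The coalescing heuristic can be sketched as the mirror image of the performance-oriented placement above (again, an illustrative toy, not the real dispatcher): under light load, prefer an idle CPU in a power domain that is already active, so fully idle domains stay quiescent.

```python
def place_power_aware(cpus, busy):
    """Power-aware placement sketch: steer a new thread toward an
    idle CPU in a power domain (modeled here as a socket) that is
    already running work, leaving fully idle domains untouched so
    they can be power managed deeply. `cpus` maps cpu_id ->
    socket_id; `busy` is the set of running cpu_ids."""
    active_sockets = {cpus[c] for c in busy}
    idle = [c for c in cpus if c not in busy]
    # False sorts before True, so idle CPUs in already-active
    # sockets are preferred; ties break by lowest cpu_id.
    return min(idle, key=lambda c: (cpus[c] not in active_sockets, c))
```

With one thread running on socket 0, a second thread lands on socket 0 as well, leaving socket 1 entirely idle and eligible for deep C-states.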
Also, consistent with our goals around the Tickless Kernel Architecture project, the Power Aware Dispatcher is an "event based" CPU power management architecture, which means that all CPU power state changes are driven entirely by utilization events triggered by the dispatcher as threads come and go from the CPUs. One clear benefit of this is that when the system is idle, there's no need to periodically wake up to check CPU utilization (which in itself is inefficient and wasteful). It also means that the kernel can be aggressive about adjusting resource power states (in near real-time) with respect to changes in utilization.
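The event-driven idea can be shown with a minimal sketch (class and state names are invented for illustration): the dispatcher notifies the domain as threads arrive and depart, and the power state changes at the moment of the event rather than on a polling timer.

```python
class PowerDomain:
    """Event-driven CPU power management sketch: the dispatcher calls
    cpu_busy()/cpu_idle() as threads come and go from the domain's
    CPUs, and the domain drops into a deep C-state the moment its
    last CPU goes idle -- there is no periodic utilization check."""
    def __init__(self, ncpus):
        self.ncpus = ncpus
        self.running = 0
        self.state = "C0"

    def cpu_busy(self):
        self.running += 1
        self.state = "C0"          # wake immediately on demand

    def cpu_idle(self):
        self.running -= 1
        if self.running == 0:
            self.state = "C3"      # deep idle as soon as all CPUs quiesce
```

Because the state transitions ride on dispatcher events that must happen anyway, idle-state decisions track utilization in near real-time with zero added wakeups.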
We really like thinking about Power Management as just another piece of Resource Management. By designing efficient resource utilization into the kernel subsystems that deal with power manageable hardware resources...we can be smart about how we utilize the system (for improved performance), and how we \*don't\* use the system (to leverage power management features). The power efficiency results we're seeing with PAD are impressive, and we're really looking forward to building on the PAD work we integrated into build 110 in the months ahead.
For the most part, and especially with respect to data center class server systems, driving the performance component of the price : performance ratio has been our focus. But the economics of the industry are shifting...even evolving, as systems' initial purchase price represents a decreasingly significant component of their "total cost of ownership", thanks to rising power and cooling costs. This trend, coupled with the realization that overall data center utilization remains low (15% or so), implies that the opportunities in this space are enormous.

Although performance remains key, at what cost should that performance be delivered? We \*must\* engineer systems to deliver the performance that Sun / Solaris customers have come to expect while using no more resources than necessary to do so. Beyond performance, we must deliver efficiency. Therein lies the challenge.
Tonight I'll be presenting at the Silicon Valley OpenSolaris User Group meeting. I'll be giving an overview of the OpenSolaris dispatcher, scheduling classes, processor abstractions and management tools, and debugging (whew). Here is the slide deck I'll be using. The meeting will be at the Sun Santa Clara campus auditorium. Alan's blog has the details. Come heckle me if you like... :)
Over the weekend my wife and I were kicking around at the Mercado shopping center in Santa Clara. It's one of those shopping centers that has appeal for both of us...Micro Center for me, and TJ Maxx for her. :) After delighting in finding a 2GB USB flash drive for $16, I was further delighted to see this month's Linux Journal, in which an interview with Simon Phipps is featured. Fostering OpenSolaris awareness in the Linux community is a good thing, so it was nice to see a good amount of discussion there. I look forward to the day when critical mass is such that more OpenSolaris magazine articles (and perhaps dedicated magazines) begin to surface. It really was nice to see.
Ian Murdock was the speaker at this month's Silicon Valley OpenSolaris User Group meeting. I had heard from Alan last week that Ian would be speaking about the recently announced Project Indiana, and I wanted to go hear more. The first I had heard about it was from this Slashdot post, and from the flurry of ensuing discussion on the opensolaris-discuss mailing list. A colleague of mine distilled it particularly well when he said (paraphrasing) that the initial spectrum of reaction was such that some folks were realizing their greatest hopes...while others were realizing their greatest fears. :)
The "Making Solaris a better Linux than Linux" quote referenced in the Slashdot post seems to have elicited a wide range of responses from folks in the community. Some folks have expressed that they don't want to see Solaris "become a better Linux", out of concern that Solaris would lose some of its differentiating strengths (backward compatibility / stability being a frequently raised example). Others on the thread have pointed out examples of things in the Solaris environment that they feel represent barriers to adoption...which in turn has elicited more debate as to whether those barriers are really barriers, and then more debate still as to how best to deal with them. :)
At the SVOSUG meeting, Ian gave some background describing where he's coming from, why he decided to join Sun to advocate for OpenSolaris, and his vision for Project Indiana. The devil is in the details, and it's pretty clear there are many of them, but the motivation and idea behind Project Indiana (or at least my take on it) seems fairly simple: provide OpenSolaris with the features it needs to appeal to, and be welcoming of, Linux enthusiasts and/or folks who would otherwise reach for a Linux solution.
At the meeting, I said I felt that the goal shouldn't necessarily be to make Solaris a better Linux than Linux...but to make Solaris a better Solaris, such that it appeals to Linux enthusiasts more than Linux itself does. The difference is where you set your sights. I don't believe there's any shortage of opportunity. While OpenSolaris is superior in many ways, I believe it's deficient in others. I find myself carrying around a short mental list of things that (for me) are missing or deficient in OpenSolaris, and that I suspect could represent an adoption "show stopper" for someone else. My short list represents the feature gap that exists between where OpenSolaris is, and where (as a developer) I wish it would be.
I suspect that such a list would vary depending on whom you ask. For Project Indiana, I would imagine that characterizing what this list looks like from the perspective of a Linux enthusiast, as well as from that of someone who tried (and gave up on) OpenSolaris, would be a useful start.
Looks like we've got some clock work ahead of us. Over the last year or so, I've been waking up at night in a cold sweat thinking about how we have but one cyclic/thread firing on one CPU 100 times a second that does accounting for all threads over all CPUs in the system (ok, not really, but it's something we've been thinking/talking about). As time marches on, we continue to see the logical CPU count (as seen via psrinfo(1M)) in systems grow (especially with the proliferation of multi-core/multi-threaded processors)...so it's not surprising that the single threaded clock has (or eventually will have) a scaling issue. Implementing clock()'s responsibilities in a more distributed fashion will be an interesting, but important, bit of work.
As part of the Tesla Project, we're going to be looking at providing a "scheduled" clock implementation. The clock cyclic currently fires 100 times a second somewhere in the system. From a power management perspective, it would be nice if the clock fired only when necessary (something is scheduled to timeout, scheduled accounting is due, etc.). This would allow the CPU on which the clock cyclic fires to potentially remain quiescent much longer (on average), which in turn would mean that the CPU could remain longer (or go deeper) in a power saving state.
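The essence of a "scheduled" clock can be sketched in a few lines (an illustrative model, not the actual cyclic subsystem): rather than firing unconditionally every 10 ms, compute the next firing from the earliest pending piece of work, and don't fire at all when nothing is due.

```python
def next_clock_fire(now, callouts):
    """Tickless sketch: instead of firing every tick (100Hz / 10 ms),
    fire only when the earliest pending timeout is due. With nothing
    pending, the clock stays silent and its CPU can remain in a deep
    power saving state. `callouts` is a list of absolute expiration
    times in nanoseconds; returns the next time to fire, or None."""
    pending = [t for t in callouts if t > now]
    if not pending:
        return None     # nothing scheduled: don't wake up at all
    return min(pending)
```

Under this scheme, a CPU whose next callout is 500 ms away sleeps through the 49 tick interrupts a periodic clock would have taken, which is exactly the kind of quiescence the deep C-state work wants to exploit.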
It might be that the scaling issue becomes less so if the clock doesn't always have to fire. Then again, this may be one of those "elephant in the living room" type issues...you can pretend that it isn't there only so long... :)