performance analysis: profiling
By stp on Aug 23, 2008
Reading the forums you see some discussion about profiling, or specific numbers getting cited. I've posted a couple times about the basics, and how to use profiling to learn about what's going on in the system, but this is a little scattered. I thought I'd collect some of those details here, and give a quick round-up of what's available now, how you use it, and where this work is going. Remember this is very much a work in progress, so we really want feedback about what's useful, what's confusing, and what you'd like to see added!
If you've played with profiling or runtime-debugging facilities in other systems, what we've got in Darkstar should look pretty familiar. Basically, it collects data about what's happening in the system, and provides that data as a stream of reports. If you're writing a Service, you can register to provide data into this stream, and any developer can write any number of listeners to consume this stream of reports which then choose how to aggregate or represent the output. Real easy.
Recall that Darkstar is an event-driven system. Rather than starting threads or invoking long-running, monolithic tasks, the model is many short tasks that are in response to some event. In the case of Application code these tasks are always run in a transaction, though there is also support for non-transactional tasks. These tasks get run through the system schedulers, which decide how and when to run these task. For each task that gets run through a scheduler, a new Profile Report is generated. This represents a collection of details about what happened in the scope of that task, and makes it easy to correlate events and actions (e.g., timeout and the number of objects accessed in the data store).
If you want to start playing with all of this, where should you start? Probably the best place is in the
com.sun.sgs.profile package, where you'll find
ProfileListener. The former is the actual structure that is provided for each task run through the system. The latter is the interface that anyone can implement to consume these reports. I won't go through all the details in this entry, but hopefully the basics of the reporting structure are pretty clear (if not, we want to know about it!). In addition to basics about task type, length, delay, failure causes, etc. there are also more general-purpose components like counters and operations that let listeners track specific kinds of data-points, like bytes read by the data store, tasks scheduled, etc.
If you're impatient (well, ok, most of us fall into this category), and just want to see some data in action, skip these interfaces and go right to the
com.sun.sgs.impl.profile.listener package. Here you'll find a collection of
ProfileLIsteners that we've written to get folks started. For the most part, we've written these as we've found the need to watch for certain details or to answer specific questions; we're pretty sure these are useful debugging tools. You can include any number of these listeners in your running system, though of course you do have to be careful about the performance overhead. For the most part these are very light-weight, but YMMV.
So, how do you use these? There are two properties you need to set. The first defines the overall profiling system, and the second decides which listeners are actually used. The property that defines the entire profiling system is:
In the 0.9.6 stack all profiling is off by default, so you need to set this variable to
on. In the latest trunk (which I definitely recommend you grab if you're actually playing with performance analysis) a minimal amount of profiling is always running, but you can increase the amount of data being gathered by setting the property to
max (by default, it's running at
Once you have the profiler running, the next step is to define which listeners are included. You do this with a second property on startup:
In both the 0.9.6 and the current codebase, this variable is a colon-separated list of fully qualified classes that implement the
ProfileListener interface. So for instance, if you set
then the system will start up with the
SnapshotProfileListener running. This is a pretty simple class that collects data over some period of time. By default it reports on the last 10 seconds worth of tasks, detailing how many tasks ran versus how many succeeded, how long on average the scheduler queue was, and the number of threads being used by the schedulers. The output is displayed on port 43007 (by default), so once the system is running, just telnet to this port and you'll see an update every 10 seconds. This is not a particularly complicated or detailed listener, but it gives you a quick, high-level view of how much work the system is doing, and whether it's keeping up or falling behind.
Take a look at the other listeners in that package. They all have class javadocs that explain what they do and how they can be configured. Some output on sockets, some via loggers, and others to standard out. Collectively, they can give you some good insight into what's happening in the system. Obviously these are only a starting-point, so we'd love you to suggest other features, or write your own listeners to contribute to the community. Hopefully these are useful to you in understanding how you can push Darkstar and your Application code, and when things aren't working, why that is.
I haven't talked about the other end of this system, namely, how you get data into the profiling stream. If you're writing a Service and want to include some specific detail, Take a look at the
com.sun.sgs.profile.ProfileRegistrar interface. This is available in the registry that you get when your Service is constructed. Through this you can register to report different kinds of events and data. Note that we don't currently expose any of these interfaces to Application code yet, so you can't include details directly from your game logic, but that's on the list of features to add and (detect a theme here?) we'd definitely like to hear thoughts about how make this most useful.
This brings me to the final issue here: where we're headed. Like I said at the start of this entry, this is still a work in progress. We've got some basic features that have proven to be pretty useful in understanding application/system behavior, but we're really just getting started.
In addition to exposing interfaces to Application code, we also plan to add hooks to capture events from some of the Services that aren't reporting anything right now (especially the Session and Channel services). These will be like the details that the Data Service currently reports. You'll be able to see which calls were made (like creating a channel or sending a direct message), how many calls were made, how many bytes were sent/received, etc.
An effort that Jane is working on right now is making the whole system more dynamic. Right now you have to set the profiling level and the set of listeners on startup, and while you can set the level globally, you can't easily tune a specific set of data (e.g., turn off all reporting except from a specific Service). She's also working on exposing some of these controls through a more general JMX-driven management interface, which should make it much easier to run and manage a Darkstar stack and control how profiling happens. The ability to set levels is only the first piece of this work.
At the same time David and I have been working on a new piece of the system to track and manage conflict. While the work is more generally designed to drive the transaction and scheduling systems, one result is that
ProfileReports will now include details about what objects were accessed, and on failure, what the likely source of conflict was. This isn't in the trunk yet, but hopefully it will be soon. In the meantime, check out the contention-rev branch for the latest bits. Once we get this committed, I'll post more details about this project, and get a wiki setup to track progress and feedback.
I know this is a pretty quick, high-level overview of how profiling works, but hopefully it's useful. Remember, the goal of these features was to provide a light-weight, fairly simple set of interfaces for tracking the kinds of details that give you insight into what's happening. Like everything else in Darkstar, this is a layered system, so we've got a lot of the core collecting and reporting, but we're still building up to some of the higher-level tools and the right ways to aggregate across a cluster. We've got some good ideas about this work, but none of us has had the spare cycles to make any progress. If this sounds interesting to you please let me know, and I'll point you in the right direction.