Monday Dec 19, 2011

Flame on: Graphing hot paths through X server code

Last week, Brendan Gregg, who wrote the book on DTrace, published a new tool for visualizing hot sections of code by sampling the process stack every few milliseconds and then graphing which stacks were seen the most during the run: Flame Graphs. This looked neat, but my experience has been that I get the most understanding out of things like this by trying them out myself, so I did.

Fortunately, it's very easy to set up, as Brendan provided the tools as two standalone perl scripts you can download from the Flame Graph github repo. The next step is deciding what you want to run it on and capturing the data from a run of that.

It just so happened that a few days before the Flame Graph release, another mail had crossed my inbox that gave me something to look at. Several years ago, I added some Userland Statically Defined Tracing probes to the Xserver code base, creating an Xserver provider for DTrace. These were integrated first into the Solaris Xservers (both Xsun & Xorg) and then merged upstream into the X.Org community release for Xserver 1.4, where they're available on all platforms with DTrace support.
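For reference, the request probes in that provider take the request name, the major opcode, and some client details as arguments. Here's a sketch of roughly what the declarations look like, reconstructed from the call sites quoted below rather than copied from the real Xserver.d (which defines many more probes than these two):

provider Xserver {
    /* request name, major opcode, request length, client index, request buffer */
    probe request__start(char *, uint8_t, uint16_t, int, void *);
    /* request name, major opcode, sequence number, client index, result */
    probe request__done(char *, uint8_t, uint32_t, int, int);
};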

Earlier this year, Adam Jackson enabled the dtrace hooks in the Fedora Xorg server package, using the SystemTap static probe support for DTrace probe compatibility. He then noticed a performance drop, which isn't supposed to happen, as DTrace probes are simply noops when not being actively traced, and submitted a fix for it upstream.

This was due to two of the probes I'd defined: when the Xserver processes a request from a client, a request-start probe fires just before the request is handled, and a request-done probe fires right after. If you just want to see what requests a client is making you can trace either one, but if you want to measure the time taken to run a request, or determine whether something happens while a request is being processed, you need both probes. When they were first integrated, the code was simple:

#ifdef XSERVER_DTRACE
                XSERVER_REQUEST_START(GetRequestName(MAJOROP), MAJOROP,
                              ((xReq *)client->requestBuffer)->length,
                              client->index, client->requestBuffer);
#endif
                // skipping over input checking and auditing, to the main event,
                // the call to the request handling proc for the specified opcode:
                    result = (* client->requestVector[MAJOROP])(client);
#ifdef XSERVER_DTRACE
                XSERVER_REQUEST_DONE(GetRequestName(MAJOROP), MAJOROP,
                              client->sequence, client->index, result);
#endif

The compiler sees XSERVER_REQUEST_START and XSERVER_REQUEST_DONE as simple function calls, so it does whatever work is necessary to set up their arguments and then calls them. Later, during linking, the actual call instructions are replaced with noops and their addresses recorded, so that when a dtrace user enables a probe, the call can be activated on the spot. In this case that setup work isn't so bad: just a bunch of register accesses and memory loads of things that are going to be needed nearby anyway.

The one outlier is GetRequestName(MAJOROP), which looks like a function call but was really just a macro that used the opcode as the index into an array of strings, returning the string name for the opcode so that DTrace probes could see request names, especially for extensions, which don't have static opcode mappings. For that, the compiler would just load a register with the base address of the array and add the offset of the entry specified by MAJOROP.
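With both probes in place, timing requests is easy. A minimal sketch of the kind of D script that pairs them up, assuming the provider shape sketched above (run against the Xorg process with dtrace -p):

/* Record when each request starts... */
Xserver$target:::request-start
{
    self->start = timestamp;
}

/* ...and aggregate the elapsed time by request name when it finishes. */
Xserver$target:::request-done
/self->start/
{
    @latency[copyinstr(arg0)] = quantize(timestamp - self->start);
    self->start = 0;
}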

All was well and good for a bit, until a later project came along during the Xorg 1.5 development cycle to unify all the different lists of protocol object names in the Xserver, since different lists were in use by the DTrace probes, the security extensions, and the resource management system. That replaced the simple array-lookup macro with a function call. While the function doesn't do much more work, it does enough to be noticed, and thus a performance hit was taken in the hot path of request dispatching. Adam's patch to fix this simply uses is-enabled probes to make those function calls only when the probes are actually enabled. x11perf testing showed the win on an Athlon 64 3200+ test system running Solaris 11:

Before: 250000000 trep @   0.0001 msec ( 8500000.0/sec): X protocol NoOperation
After:  300000000 trep @   0.0001 msec (10400000.0/sec): X protocol NoOperation 
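Adam's fix follows the standard is-enabled pattern. A sketch of its shape (not his exact upstream diff), using the XSERVER_REQUEST_START_ENABLED() test macro that dtrace -h generates for is-enabled probes:

#ifdef XSERVER_DTRACE
                /* The _ENABLED() test compiles down to little more than a
                   load and a branch, so the argument setup, including the
                   request name lookup that is now a function call, is
                   skipped entirely unless someone is actually tracing.
                   The request-done probe gets the same treatment. */
                if (XSERVER_REQUEST_START_ENABLED())
                    XSERVER_REQUEST_START(GetRequestName(MAJOROP), MAJOROP,
                                  ((xReq *)client->requestBuffer)->length,
                                  client->index, client->requestBuffer);
#endif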

But numbers are boring, and this gave an excuse to try out Flame Graphs. To capture the data, I took advantage of synchronization features built into xinit and dtrace -c:

# xinit /usr/sbin/dtrace -x ustackframes=100 \
  -n 'profile-997 /execname == "Xorg"/ { @[ustack()] = count(); }' \
  -o out.user_stacks.noop -c "x11perf -noop" \
  -- -config xorg.conf.dummy

To explain this command, I'll start at the end. For xinit, everything after the double dash is a set of arguments to the Xserver it starts. In this case, Xorg is told to look in the normal config paths for xorg.conf.dummy, which it finds in /etc/X11/xorg.conf.dummy, a simple config file setting the driver to the Xvfb-like “dummy” driver, which just uses RAM as a frame buffer to take graphics driver considerations out of this test:
Section "Device"
	Identifier  "Card0"
	Driver      "dummy"
EndSection
Since I'm using a modern Xorg version, that's all the configuration needed; all the unspecified sections are autoconfigured.

xinit starts the Xserver, waits for the Xserver to signal that it's finished its startup, and then runs the first half of the command line as a client with DISPLAY set to the new Xserver. In this case it runs the dtrace command, which sets up the probes based on the examples in the Flame Graphs README, and then runs the command specified as the -c argument, the x11perf benchmark tool. When x11perf exits, dtrace stops the probes, generates its report, and exits itself, which in turn causes xinit to shut down the X server and exit.
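Once the run is over, the two perl scripts from the repo turn the captured stacks into the graph, following the Flame Graph README (the folded and output file names here are just my choices):

# fold the multi-line dtrace stacks into one line per unique stack
./stackcollapse.pl out.user_stacks.noop > out.folded.noop
# render the folded stacks as an interactive SVG
./flamegraph.pl out.folded.noop > xorg-noop.svg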

The resulting Flame Graphs are, in their native SVG interactive form:

Before

After


If your browser can't handle those as inline SVGs, or they're scrunched too small to read, try Before (SVG) or Before (PNG), and After (SVG) or After (PNG).

You can see in the first one a bar showing that a little over 10% of the time was spent in stacks involving LookupMajorName, which is completely gone in the second graph. Those who saw Adam's patch series come across the xorg-devel list last week may also notice the presence of XaceHook calls, which Adam optimized in another patch. Unfortunately, while I did build with that patch as well, we don't get its benefits here: the XC-Security extension is on by default and fills in the hooks, so the code can't just bypass them as it does when the hooks are empty.

I also took measurements of what Xorg did as gdm started up and a test user logged in, which produced the much larger flame graph you can see in SVG or PNG. As you can see, the recursive calls in the font catalogue scanning functions make for some really tall flame graphs. You can also see that, to no one's surprise, xf86SlowBCopy is slow, and a large portion of the time is spent “blitting” bits from one place to another. Some potential areas for improvement stand out. For instance, 5.7% of the time is spent rescanning the font path, because the Solaris gnome session startup scripts make xset fp calls to add the fonts for the current locale to the legacy font path for old clients that still use it. Another nearly 5% goes to handling the ListFonts and ListFontsWithInfo requests, which dtracing with the request-start probes attributed to the Java GUI for the Solaris Visual Panels gnome-panel applet.
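That attribution came from the kind of one-liner the request-start probe makes easy; a sketch, again assuming the provider shape described earlier, counting those requests by client index rather than the exact command I used:

# dtrace -n 'Xserver$target:::request-start
    /copyinstr(arg0) == "ListFonts" || copyinstr(arg0) == "ListFontsWithInfo"/
    { @[arg3] = count(); }' -p `pgrep -x Xorg`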

Because of the way the data is gathered, you can't tell from the graphs alone whether a wide bar is one really long call to a function (as it is for the main() bar in all of these) or millions of individual calls (as it was for the ProcNoOperation calls in the x11perf -noop trace). But they do give a quick and easy way to pick out the functions a program spends most of its time in, as a first pass for figuring out where to dig deeper for potential performance improvements.

Brendan has made these scripts easy to use, so I encourage you to try them out on some sample runs to get familiar with them; that way, when you really need them, you'll know what cases they're good for and how to capture the data and generate the graphs yourself. Trying really is the best method of learning.

