DTrace - IE-OSUG The scary world of Gnome System Monitor

Hi well for my sins I got co'opted into doing a Daily DTrace demo to the Irish OpenSolaris User Group's October Meeting group last night with Sean McGrath. Tim Foster claims he was babysitting, but we'll never know for sure, I think the Guinness lured him away ;)

So what to demo. Well we wanted to show folks who didn't want to deep dive on 50 slides of DTrace bootcamp, how to do something useful pretty quickly. So we fired up a desktop with nevada build 75 to take a look at something right away, and left the slides for another day. One place that I new already was a good place to start was the Gnome System Monitor. 

Whats the problem?

To be technical GSM sucks, it's gobbling up my cpu like a mad thing !! 

What next?

Well lets compare it to prstat (similar to top for dumping out process info on the fly to the terminal) first - himm prstat is consuming 0.1% CPU, so useful benchmark to compare against. GSM is eating 5 - 20% CPU :( Now as GSM is a gnome GUI app I'd expect some overhead say 1% perhaps, but not 5 - 20 %.

So run a quick dtrace oneliner to look at syscalls for gnome-system-mon and see what's hot:

$ dtrace -n 'syscall:::entry/ execname == "gnome-system-mon"/ \\
{@[probefunc] = count();}

Himm ... lots of open's, preads all looks a bit suspicious [I'm on the road on a Mac so can't dump output, you'll just have to trust me or fire up OpenSolaris yourself and have a go :)

DTraceToolkit 

OK - lets use the precanned scripts in the latest DTraceToolkit (cudos to Brendan Gregg) and dig in. Use of exesnoop, iosnoop, opensnoop and errinfo are my first ports of call normally.

Ran iosnoop and turned up zippo. So not hitting the disc, weird as I'm seeing a lot of open's, something must be in memory I guess.

Ran opensnoop:

$ opensnoop -ap `pgrep -f gnome-system-mon`

How I see a load of open's flying by on the /proc filesystem. This is a virtual file system analogous to what's on linux that contains process information on your currently running processes. Running the same thing on prstat and I only see one hit on /proc not hundreds in the space of a few seconds :(

I can use dtrace to aggregate the open calls and generally poke into them. The syscall man pages are your friend to find out what the args are that are being passed to dtrace.

$ dtrace -n 'syscall::open\*:entry/execname == "gnome-system-mon"/ \\
{@[copyinstr(arg0)] = count();} tick-10sec{exit(0);}

Using the tick probe allows me to get an accurate timed experiment and I can just change the execname to prstat to see what it's up to over the equivalent period.

Now for the sake of the demo I didn't go off and check the timings for the syscalls and the breakdown of on/off cpu, which I'd normally do to confirm that these large number of opens and preads was a bad thing, but I'll leave that up to the reader :) Just use the syscall entry and return probes and the on-cpu, off-cpu sched provider probes (the toolkit has plenty of examples).

Stacks

What I did next was to modify the above script to see where the open's where being made from in GSM and also in prstat.

$ dtrace -n 'syscall::open\*:entry/execname == "gnome-system-mon"/ \\
{@[copyinstr(arg0), ustack()] = count();} tick-10sec{exit(0);}

For GSM I could see that we where making lots of calls into libgtop for what looked like the entire contents of the /proc dir. For prstat we could see that it was not using libgtop, but doing a much more sensible dir scan to find out what had changed and checking that only.

So it's clear that on OpenSolaris libgtop is clearly doing some very suboptimal things when it comes to checking process stats, that would really need fixed.

Errorinfo 

Just for fun I thought well as GSM is doing such crazy things with /proc, lets fire up errorinfo and see if that will reveal any other strangeness. As soon as I do we see lots and lots of errors in opens, reads and so on for GSM.

Lets dig in a little more: 

$ dtrace -n 'syscall::open\*:return/execname == "gnome-system-mon" && \\
arg1 != 0/{@[fds[arg0].fi_pathname, arg1] = count();} tick-10sec{exit(0);}

Here we are checking the open return and only looking at the file it's dealing with if the open fails (can use errno or arg1 to get the return value of the syscall). The fds is a handy addition to dtrace to allow you to lookup file information given a file descriptor, it's base type is fileinfo_t as used in the IO provider. When we run this we can see it's hitting lots of /proc processes for processes owned by root, so we are failing on access permission. Not only that but it's doing it repeatedly on a given process id for different pieces of info. Oh well, I hope someone gives libgtop a little bit of TLC on OpenSolaris.

Why isn't this fixed? 

I know there have been some patches proposed to help, but it's one of those, well it works fine on Linux problems :( Hopefully the libgtop maintainer's can take another look to see if there are more optimal ways to query for the information taking some inspiration from prstat perhaps. Lets not maintain another OpenSolaris only patch please.

Hope you enjoyed the journey into dtrace land :)

 

 

 

Comments:

Thanks for doing the talk John & Sean - much appreciated! I've got the audio edited into shape, and added it to our podcast feed. http://www.opensolaris.org/os/project/ie-osug/podcast

Baby sitting went well, I'm pleased to say - not a peep out of Bananas all night!

Posted by Tim Foster on October 26, 2007 at 09:34 AM IST #

Post a Comment:
Comments are closed for this entry.
About

jmr

Search

Top Tags
Categories
Archives
« April 2014
MonTueWedThuFriSatSun
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
    
       
Today