Finding bugs in Python code, using DTrace

I have spent a lot of time in recent months helping the IPS project.  This week I integrated usability improvements, and tracked down two different performance problems.  The first was notable because I think it's the kind of thing that would have been hard to find without DTrace.   Here's a writeup I did, cribbed from the bug report:

While doing work on IPS I discovered a large number of system calls being made during the "creating plan" phase of a fresh installation of the 'redistributable' cluster of packages.

I used the following crude DTrace script to track down the problem:

#!/usr/sbin/dtrace -Fs

python$1:::function-entry
/copyinstr(arg1) == $$2/
{
        self->t = 1;
        calls++;
}

python$1:::function-return
/copyinstr(arg1) == $$2 && self->t/
{
        self->t = 0;
}

syscall:::entry
/self->t/
{
        @a[probefunc] = count();
}

END
{
        printf("calls: %d", calls);
}

No points for elegance but it works.

Successive guesses allowed me to narrow this down to the get_link_actions() routine. Here's an example of its system call impact during the following sequence:

# rm -fr /tmp/foo
# export PKG_IMAGE=/tmp/foo
# pkg image-create -a foo=http://ipkg.eng /tmp/foo
# pkg install -n redistributable
I ran the D script as follows:
$ dtrace -s /s4pyfunc.d `pgrep client` get_link_actions
And once the command was done, I got this output:
CPU     ID                    FUNCTION:NAME
  0      2                             :END calls: 880

  brk                                                              26
  stat64                                                       774400
  close                                                        775280
  fcntl                                                        775280
  fsat                                                         775280
  fstat64                                                      775280
  getdents64                                                  1553200
So here we can see that something is amiss: this routine is the source of some 5.4 million syscalls. Here's the code:
    def get_link_actions(self):
        """return a dictionary of hardlink action lists indexed by
        target"""
        if self.link_actions:        # (a)
            return self.link_actions

        d = {}
        for act in self.gen_installed_actions():
            if act.name == "hardlink":
                t = act.get_target_path()
                if t in d:
                    d[t].append(act)
                else:
                    d[t] = [act]

        self.link_actions = d
        return d

When Bart and I first looked at this, it seemed by inspection to be working as designed: the first time the routine is called, we populate a cache, and subsequently the cache is returned. At some point Bart went "Aha!" and we realized that the comparison at (a) wasn't working right: for a not-yet-installed image, the link_actions dictionary is exactly equal to {}, the empty dictionary, which evaluates to false in Python. So we always think that we haven't yet filled out the link_actions cache, and go off to fill it. That sends us into gen_installed_actions(), which is expensive, and which seems to get more and more expensive as we accumulate more manifests.
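The root cause is easy to reproduce in isolation. In Python, an empty dict (like an empty list, tuple, or string) is falsy, so a cache initialized to {} is indistinguishable from a cache that has never been populated:

```python
# An empty dict is falsy, so "if cache:" cannot distinguish
# "cache not yet populated" from "cache populated, but empty".
cache = {}
print(bool(cache))        # False: looks unpopulated
cache["a"] = 1
print(bool(cache))        # True: non-empty dicts are truthy

# The fix: use None as the "not yet populated" sentinel and
# test for it explicitly with an identity check.
cache = None
print(cache is None)      # True: definitely unpopulated
cache = {}
print(cache is None)      # False: populated, even though empty
```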

So, while there are exactly zero link actions we need to worry about, we rescan all manifests for every new manifest we evaluate. That's also where the numbers above come from: 880 calls, each doing a full scan, yields the 774,400 (880 squared) stat64 calls -- classic quadratic behavior.

This is a situation which will crop up most often with zones, since that's the place we're most likely to create new images from scratch.

The fix is very simple: set self.link_actions to None initially (in the object constructor), and change the comparison at (a) to test against None explicitly.
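A minimal sketch of the corrected pattern -- the class and the gen_installed_actions() stub below are stand-ins for illustration, not the real IPS code; only the caching logic is the point:

```python
class ImageStub:
    """Stand-in for the IPS image object; only the caching logic is real."""

    def __init__(self):
        # None means "cache not yet populated"; {} is a valid,
        # populated-but-empty cache.
        self.link_actions = None
        self.scans = 0          # counts expensive rescans, for illustration

    def gen_installed_actions(self):
        """Stand-in for the expensive full manifest scan."""
        self.scans += 1
        return []               # a fresh image has no actions at all

    def get_link_actions(self):
        """Return a dictionary of hardlink action lists indexed by target."""
        if self.link_actions is not None:   # the corrected test at (a)
            return self.link_actions

        d = {}
        for act in self.gen_installed_actions():
            if act.name == "hardlink":
                d.setdefault(act.get_target_path(), []).append(act)

        self.link_actions = d
        return d

img = ImageStub()
img.get_link_actions()
img.get_link_actions()
print(img.scans)    # 1: the second call hits the cache, even though it's empty
```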

Having fixed this, the impact of get_link_actions is:
CPU     ID                    FUNCTION:NAME
  0      2                             :END calls: 880

  stat64                                                            3
  close                                                             4
  fcntl                                                             4
  fsat                                                              4
  fstat64                                                           4
  getdents64                                                        8
Or, basically, nothing. In some informal testing I got these numbers with ptime(1):

Before:
real     1:26.730
user     1:10.988
sys        11.624

After:
real     1:12.932
user     1:02.475
sys         6.361

Or about a 16% improvement in real time (86.7s down to 72.9s), and nearly half the sys time. Not a huge win, but not too shabby either.

I used the same technique to find a similar bug in another part of the code, but alas, Danek had already found the same bug, so I can't claim it.