Surprisingly slow compile time

I had an e-mail which told the sorry tale of a new system which took longer to build a project than an older system of theoretically similar performance. The system showed low utilisation when doing the build, indicating that it was probably spending a lot of time waiting for something.

The first thing to look at was a profile of the build process using `collect -F on`, which produced the interesting result that the build was taking just over 2 minutes of user time, a few seconds of system time, and thousands of seconds of "Other Wait" time.

"Other wait" often means waiting for network, or disk, or just sleeping. The other thing to realise about profiling multiple processes is that all the times are cumulative, so all the processes that are waiting accumulate "other wait" time. Hence it will be a rather large number if multiple processes are doing it. So this confirmed and half explained the performance issue. The build was slow because it was waiting for something.
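To make the cumulative effect concrete, here is a small arithmetic sketch. The process count and per-process wait are made-up numbers for illustration, not figures from the actual build.

```python
def cumulative_wait(per_process_wait_seconds):
    """Sum wait time over all profiled processes, the way a
    multi-process profile reports a single "other wait" total."""
    return sum(per_process_wait_seconds)


# Hypothetical build: 8 processes, each blocked for 250 seconds.
waits = [250.0] * 8

reported_total = cumulative_wait(waits)  # what the profile would show
longest_single_wait = max(waits)         # what any one process experienced

print(f"reported other wait: {reported_total:.0f} s")
print(f"longest single wait: {longest_single_wait:.0f} s")
```

So a total in the thousands of seconds is consistent with a build that only takes a few minutes of wall-clock time, provided several processes are waiting at once.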

Sorting the profile by "other wait" indicated two places that the wait was coming from. One was waitpid, meaning that the time was due to a process waiting for another process; well, we knew that! The other was a door call. Tracing up the call stack eventually led into the C and C++ compilers, which were calling gethostbyname. The routine doing the calling was "generate_prefix", which is the routine responsible for generating a random prefix for function names; the IP address of the machine was used as one of the inputs for the generation of a prefix.

The performance problem was due to gethostbyname timing out. Common reasons for this are misconfigurations in the /etc/hosts and /etc/nsswitch.conf files. In this example, adding the host name to the hosts file cured the problem.
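A quick way to check whether a machine has this kind of problem is to time a lookup of its own host name. The following is a minimal sketch in Python rather than anything taken from the compiler. The helper name `time_lookup` is hypothetical, and it assumes that Python's `socket.gethostbyname` goes through the same system resolver configuration as the C routine.

```python
import socket
import time


def time_lookup(name, resolver=socket.gethostbyname):
    """Time one name lookup.

    Returns (elapsed_seconds, address), where address is None if the
    lookup failed. The resolver argument exists only so that the
    function can be exercised without touching the real resolver.
    """
    start = time.monotonic()
    try:
        address = resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        address = None
    elapsed = time.monotonic() - start
    return elapsed, address
```

Calling `time_lookup(socket.gethostname())` on the build machine should return almost immediately on a correctly configured system. An elapsed time of several seconds, or a failed lookup, would point towards the hosts or nsswitch configuration.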

Comments:

Yep. I've seen that before along with the fact that sometimes it requires both the "short" and "long" name in the /etc/hosts file.

How long did it take from start to finish to identify this issue? Hours? Days? Weeks?

Would be interesting to know how the old server was configured and if this was something that was already done (e.g. the host name in the /etc/hosts file).

Posted by Dave Pickens on October 13, 2009 at 08:26 AM PDT #

Hard to quantify, we had discussions over about 5 days, but then we were basically on non-overlapping timezones. Once I had the profile it was pretty easy to launch the performance analyzer and identify where the time was coming from. Took a bit more reading for me to believe it was that though ;)

Regards,

Darryl.

Posted by Darryl Gove on October 13, 2009 at 08:41 AM PDT #

There is an excellent description of debugging a similar problem in ssh using dtrace:
http://developers.sun.com/solaris/articles/dtrace_example.pdf

Posted by Eoin Lawless on October 13, 2009 at 10:21 PM PDT #

Wow, what a story!

Now please step back from the whole thing for a minute:

wouldn't you say it's insane that a compile process is basically, in a convoluted and perverse way, dependent on whether a system's naming configuration is correct?

Perhaps you will disagree, but to me, that's just plain wrong: a compile procedure should not depend on what a system is or isn't named, or how naming resolution is or isn't configured, at least not for general compiling.

And finally, the real question is: will this be fixed?

Posted by UX-admin on October 14, 2009 at 12:10 AM PDT #

If you want randomness don't roll your own. If it isn't required to be of crypto strength use things like rand48() or if you do need strong randomness read from /dev/urandom or /dev/random (long term cryptographic key strength).

Posted by Darren Moffat on October 14, 2009 at 03:38 AM PDT #

@Eoin: I thought I'd seen something along those lines somewhere. Thanks for digging out the URL, it's a good read.

@UX-admin: I agree. Roughly. We do need to generate prefixes that are unique, so we need to use some characteristics to generate them. I don't have a particular opinion what those are so long as they have the desired results. However, I don't like relying on an interface that could have a 10 second time-out. I've filed an RFE against this; we should improve it.

@Darren: Thanks for the suggestion. I'm not quite sure what the desirable characteristics for the number are. I suspect that we want something that is unique, but predictably random, rather than randomly random. I've not looked at the code in any detail.

Posted by Darryl Gove on October 14, 2009 at 04:31 AM PDT #

About

Darryl Gove is a senior engineer in the Solaris Studio team, working on optimising applications and benchmarks for current and future processors. He is also the author of the books:
Multicore Application Programming
Solaris Application Programming
The Developer's Edge
