Tuesday Sep 30, 2008

Build turd issues you may not see in a Flag Day or a heads up...

Sometimes when you do an incremental build, you run into some fluff that kills your build:

==== cpio archives build errors (DEBUG) ====

Failed to create generic kernel archive:	200550 blocks
cpiotranslate: kernel/misc/amd64/sysinit: no packaging info
cpiotranslate: kernel/misc/sysinit: no packaging info

And everyone is supposed to know how to handle these:

[th199096@aus-build-x86 mms]> ls -al proto/root_i386/kernel/misc/amd64/sysinit
-rwxr-xr-x   1 th199096 staff       4200 Sep 25 21:22 proto/root_i386/kernel/misc/amd64/sysinit
[th199096@aus-build-x86 mms]> rm proto/root_i386/kernel/misc/amd64/sysinit proto/root_i386/kernel/misc/sysinit
[th199096@aus-build-x86 mms]> `which nightly` -in nightly.env
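If several files get flagged, a small loop over the cpiotranslate complaints saves some typing. A sketch, assuming the errors landed in log/nightly.log and that this is an i386 proto area:

grep 'cpiotranslate: .*: no packaging info' log/nightly.log |
    sed -e 's/^cpiotranslate: //' -e 's/: no packaging info$//' |
    while read f; do
        rm -f proto/root_i386/"$f"
    done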

Hey, what do you know, Ken Erickson did have a Flag Day [for those who maintain private copies of bfu; Heads up for everyone else], but it still does not mention cleaning up the turd on your own.

After all, everyone knows how to deal with these turds.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily

Saturday Sep 27, 2008

Builds are too slow...

Okay, I've got a brand new Sun Fire X4150 Server and it is geeked out with processors and memory. When I installed SUNWonbld, it said that I should use 36 for dmake concurrency. So, let's set up our .make.machines file and let a build rip.
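On this box that is just a one-line entry in ~/.make.machines (the same entry I end up grepping for further down):

jhereg   max=36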

My nightly.env starts from usr/src/tools/env/developer.sh; here is how the two differ:

[th199096@jhereg spe-build]> diff nightly.env  $SRC/tools/env/developer.sh
41c41
< NIGHTLY_OPTIONS="-aFCDlmprn";  export NIGHTLY_OPTIONS
---
> NIGHTLY_OPTIONS="-aCDlmpr";           export NIGHTLY_OPTIONS
194d193
< export CW_NO_SHADOW=1
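Decoded, the two tweaks amount to the following (my reading of the nightly(1ONBLD) flags of that vintage, so treat the comments as interpretation rather than gospel):

NIGHTLY_OPTIONS="-aFCDlmprn";  export NIGHTLY_OPTIONS
# -D build DEBUG bits, -F skip the non-DEBUG pass, -n skip the bringover;
# -a, -C, -l, -m, -p, -r keep the archives, cstyle checks, lint, mail,
# packages, and ELF runtime checks from developer.sh
export CW_NO_SHADOW=1          # tell cw not to run the gcc shadow compile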

I cut out the $STAFFER and such. The main differences are that I am not doing the gcc shadow building and I am not doing a non-DEBUG build. This should blaze, but it doesn't:

==== Nightly distributed build started:   Fri Sep 26 21:09:37 CDT 2008 ====
==== Nightly distributed build completed: Fri Sep 26 22:17:58 CDT 2008 ====

==== Total build time ====

real    1:08:21
...
/opt/SUNWspro/bin/dmake
dmake: Sun Distributed Make 7.7 2005/10/13
number of concurrent jobs = 4

Okay, I just took the hit of moving to Studio 12, so maybe that accounts for a bit more time. And I think I have everything local, but perhaps I am hitting the network. But let's focus on dmake telling me it will be using 4 concurrent jobs. That is nowhere near 36.

[th199096@jhereg spe-build]> grep jhereg ~/.make.machines 
jhereg   max=36
jhereg.central.sun.com   max=36
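That is the same lookup nightly does internally (we will see the code shortly), and replayed by hand it should spit out 36:

grep `uname -n` ~/.make.machines | tail -1 | awk -F= '{print $2}'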

I invoke the build like this:

[th199096@jhereg spe-build]> printenv | grep DMAKE
DMAKE_MODE=parallel
DMAKE_MAX_JOBS=36
[th199096@jhereg spe-build]> env -i `which nightly` nightly.env
Time spent in user mode   (CPU seconds) : 10608.23s
Time spent in kernel mode (CPU seconds) : 6272.44s
Total time                              : 1:08:21.75s
CPU utilisation (percentage)            : 411.5%

I use env -i because someone told me that it makes sure I have just the right things in my environment. How can I tell that I'm getting the right number?

I can copy `which nightly` and hack it to just report the dmake concurrency.
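The hack itself is nothing fancy; something along these lines dropped in right after nightly's DMAKE setup (reconstructed from memory, so treat it as a sketch):

echo "Testing DMAKE, quick exit"
echo "number of concurrent jobs = $DMAKE_MAX_JOBS"
exit 0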

[th199096@jhereg spe-build]> env -i ./nightly.tst -i nightly.env 
Testing DMAKE, quick exit
number of concurrent jobs = 4

Okay, it is pretty clear that I am only getting 4, but why? Add some more debugging to the main DMAKE processing code:

hostname=`uname -n`
if [ ! -f $HOME/.make.machines ]; then
        echo "No $HOME/.make.machines found!"
        DMAKE_MAX_JOBS=4
else
        echo "Grepping for $HOST in $HOME/.make.machines"
        DMAKE_MAX_JOBS="`grep $hostname $HOME/.make.machines | \\
            tail -1 | awk -F= '{print $ 2;}'`"
        if [ "$DMAKE_MAX_JOBS" = "" ]; then
                echo "Nothing in that file!"
                DMAKE_MAX_JOBS=4
        fi
fi
DMAKE_MODE=parallel;
export DMAKE_MODE
export DMAKE_MAX_JOBS

And run it:

[th199096@jhereg spe-build]> env -i ./nightly.tst -i nightly.env
Grepping for jhereg in /.make.machines
Nothing in that file!
Testing DMAKE, quick exit
number of concurrent jobs = 4

Hey, why is it looking in /.make.machines and not in my homedir? [1]

[th199096@jhereg spe-build]> echo $HOME
/home/th199096
[th199096@jhereg spe-build]> more home.tst 
#!/bin/ksh -p
#

echo "My home is $HOME"
[th199096@jhereg spe-build]> env -i ./home.tst 
My home is 
[th199096@jhereg spe-build]> ./nightly.tst -i nightly.env 
Grepping for jhereg in /home/th199096/.make.machines
Testing DMAKE, quick exit
number of concurrent jobs = 36

Okay, env is hosing me.

[th199096@jhereg spe-build]> env -i HOME=/home/th199096 ./home.tst 
My home is /home/th199096

And crap, env spells it out for me:

OPTIONS
     The following options are supported:

     -i | -        Ignores the environment that  would  otherwise
                   be  inherited  from  the  current shell.  Res-
                   tricts the environment  for  utility  to  that
                   specified by the arguments.

So, another quick test:

[th199096@jhereg spe-build]> env ./home.tst
My home is /home/th199096

I know I was told to invoke my builds this way to speed them up - i.e., to grab the correct paths. I also know I've been battling this $HOME issue the whole time.
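If I really wanted a scrubbed environment, the fix would be to hand env -i the variables nightly actually needs, HOME being the one that bit me here (a sketch; what else you need depends on your nightly.env):

env -i HOME=$HOME `which nightly` nightly.env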

I wonder how long the build will take now?

[th199096@jhereg th199096]> zfs clone pool/builds/th199096/spe-gate@fresh pool/builds/th199096/spe-build2
[th199096@jhereg th199096]> ws spe-build2

Workspace                    : /builds/th199096/spe-build2
Workspace Parent             : ssh://aus1500-home//pool/ws/th199096/spe-gate
Proto area ($ROOT)           : /builds/th199096/spe-build2/proto/root_i386
Root of source ($SRC)        : /builds/th199096/spe-build2/usr/src
Root of test source ($TSRC)  : /builds/th199096/spe-build2/usr/ontest
Current directory ($PWD)     : /builds/th199096/spe-build2

[th199096@jhereg spe-build2]> cp ../spe-build/nightly.env  .
[th199096@jhereg spe-build2]> vi nightly.env 
[th199096@jhereg spe-build2]> rm ../spe-build/nightly.tst 
[th199096@jhereg spe-build2]> `which nightly` nightly.env 

Yeah, zfs clone is sweet for rapid testing of a baseline!
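For completeness, @fresh is just a ZFS snapshot of the gate dataset, which is what lets every clone start from the same baseline; creating it is a one-liner:

zfs snapshot pool/builds/th199096/spe-gate@fresh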

And we get such big savings... not!

[th199096@jhereg spe-build2]> `which nightly` nightly.env 
Time spent in user mode   (CPU seconds) : 10624.32s
Time spent in kernel mode (CPU seconds) : 7579.56s
Total time                              : 1:04:35.29s
CPU utilisation (percentage)            : 469.7%

The concurrency was correct:

/opt/SUNWspro/bin/dmake
dmake: Sun Distributed Make 7.7 2005/10/13
number of concurrent jobs = 36

All of the important tools are local:

[th199096@jhereg spe-build2]> df -h /opt/SUNWspro/bin/dmake /opt/onbld/bin/nightly /opt/onbld/bin/i386/cw /usr/java/bin/javac /usr/ccs/bin/as 
Filesystem             size   used  avail capacity  Mounted on
pool/tools             134G   9.6G    38G    21%    /pool/tools
pool/tools             134G   9.6G    38G    21%    /pool/tools
pool/tools             134G   9.6G    38G    21%    /pool/tools
/dev/dsk/c0t0d0s0       44G    11G    32G    27%    /
/dev/dsk/c0t0d0s0       44G    11G    32G    27%    /

Okay, the next thing will be to check whether there is a difference between working from a clone (which has to copy on write) and a fresh dataset.
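Setting that up is one more dataset, populated with a fresh pull of the gate rather than a clone (a sketch; how you populate it depends on your workspace tooling):

zfs create pool/builds/th199096/spe-build3
cd /builds/th199096/spe-build3
# pull the gate into place (bringover, hg clone, whatever the workspace uses),
# copy nightly.env over, point its paths at spe-build3, and kick off the build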

[th199096@jhereg spe-build3]> `which nightly` nightly.env 
Time spent in user mode   (CPU seconds) : 10634.44s
Time spent in kernel mode (CPU seconds) : 9678.18s
Total time                              : 1:08:42.11s
CPU utilisation (percentage)            : 492.7%

No. I'll have to think on this. The other option available is to reimage the system with all 3 disks in the pool:

[root@jhereg ~]> zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool        96.9G  39.1G      1      6  61.4K   448K
  c0t1d0    48.5G  19.5G      0      3  30.8K   224K
  c0t2d0    48.5G  19.5G      0      3  30.5K   224K
----------  -----  -----  -----  -----  -----  -----

Not sure how much one more spindle will reduce the build time. [2]

Okay, last test is to remove the following options:


And these yield the biggest savings to date [3]:

[th199096@jhereg spe-build4]> `which nightly` nightly.env 
Time spent in user mode   (CPU seconds) : 7993.05s
Time spent in kernel mode (CPU seconds) : 4818.01s
Total time                              : 45:44.46s
CPU utilisation (percentage)            : 466.7%

The hole we as developers tend to fall into is wanting to rebuild everything. We don't always need to rebuild the BFU archives if we are just changing a kernel module. At the BAT, I was rebuilding just the nfs or nfssrv modules and scp'ing them over (I might have hosed NFS, don't ya know). My "build" times were a matter of seconds. I spent more time moving the mouse and worrying about whether or not I had changed a header which needed to be installed in my proto area.
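That loop, roughly, for anyone who has not lived it (a sketch; the module and the test box name are made up, and bldenv comes with SUNWonbld):

bldenv -d nightly.env              # drop into a shell with $SRC, $ROOT, etc. set
cd $SRC/uts/intel/nfssrv           # just the module being hacked on
dmake install                      # build it and land it in the proto area
scp $ROOT/kernel/misc/amd64/nfssrv testbox:/tmp/   # testbox is hypothetical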

And in the end, before I can integrate my changes, I'll need to be lint and cstyle clean, I'll need to build non-DEBUG versions, and I'll need to build for sparc. And I'll need to retest then.

I started off with a moral about weighing the advice you are given against actual experience, but it turns out the increase in dmake concurrency didn't really help, now did it?

Notes

1. /.make.machines

Going back, I wondered why my test did not complain about not finding /.make.machines:

[root@jhereg scripts]> ls -la /.make.machines 
lrwxrwxrwx   1 root     other         27 Sep 26 12:32 /.make.machines -> opt/onbld/gk/.make.machines
[root@jhereg scripts]> more !$
more /.make.machines
elpaso max=20

So there is a default installed by SUNWonbld.

2. Broken disk?

Hey, wait, don't I really have four disks and not three?

[th199096@jhereg th199096]> iostat
   tty        sd0           sd1           sd2           sd3            cpu
 tin tout kps tps serv  kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0  113   0   0    0   66   2   40  304   5   28  303   5   27    3  3  0 93
# format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 
          /pci@0,0/pci8086,25f8@4/pci1000,3150@0/sd@0,0
       1. c0t1d0 
          /pci@0,0/pci8086,25f8@4/pci1000,3150@0/sd@1,0
       2. c0t2d0 
          /pci@0,0/pci8086,25f8@4/pci1000,3150@0/sd@2,0

I saw some message before the last jumpstart about taking some disk offline. And I've never really seen jhereg. It is in a lab in Austin.

Okaay, that missing disk is the DVD drive :->

[root@jhereg ~]> iostat -En
c1t0d0           Soft Errors: 0 Hard Errors: 11 Transport Errors: 6 
Vendor: TSSTcorp Product: CD/DVDW TS-T632A Revision: SR03 Serial No:  
Size: 0.00GB <0 bytes>
Media Error: 0 Device Not Ready: 11 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
c0t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: HITACHI  Product: H101473SCSUN72G  Revision: SA25 Serial No: 0827DAELAA 
Size: 73.41GB <73407865856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
c0t1d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: HITACHI  Product: H101473SCSUN72G  Revision: SA25 Serial No: 0827DAG92A 
Size: 73.41GB <73407865856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
c0t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: HITACHI  Product: H101473SCSUN72G  Revision: SA25 Serial No: 0827DA6AWA 
Size: 73.41GB <73407865856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 

3. Groan, I messed up the 4th build

I got my mail message for that fast lint build and it stated that build3 had finished: I had built the wrong directory! I had copied over the nightly.env, fixed the path, and then made an error. So I copied the file over again, except this time I forgot to change the path!

So the savings may have been false. Another build has been kicked off!

[th199096@jhereg spe-build4]> `which nightly` nightly.env
Time spent in user mode   (CPU seconds) : 7965.57s
Time spent in kernel mode (CPU seconds) : 4818.72s
Total time                              : 46:52.02s
CPU utilisation (percentage)            : 454.6%

So the savings were real.


Originally posted on Kool Aid Served Daily
Copyright (C) 2008, Kool Aid Served Daily