unexpected ksh93 behaviour with build-in commands send to background

While working on modifying the HA Containers agent to support the ipkg zone brand type from OpenSolaris, I stumbled over a difference in behavior of ksh88 vs ksh93 to be aware off, since it will at least impact the GDS based agents.

The difference affects commands that exist within user land as well as shell build-in, like sleep, and how they are then seen within the process table.

Example: Consider the following script called "/var/tmp/mytesting.ksh":


#!/bin/ksh

# send one sleep into background - it will run longer than the script itself
sleep 100 &

# another sleep, this time not in the background
sleep 20
echo "work done"
exit 0

If you then invoke the script on OpenSolaris, and while the second sleep runs invoke "ps -ef | grep mytest" in a different window, it will show something like:


    root  8185  8159   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh
    root  8248  8178   0 06:48:32 pts/5       0:00 grep mytesting
    root  8186  8185   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh

Note the two processes with the same name.

After the second sleep 20 did finish, you will see the script has terminated with printing "work done".

However, "ps -ef | grep mytest" will show:


    root  8262  8178   0 06:48:37 pts/5       0:00 grep mytesting
    root  8186     1   0 06:48:13 pts/4       0:00 /bin/ksh /var/tmp/mytesting.ksh

until the first sleep did also finish.

What is interesting is that the sleep put into background has the script name in the process table.

On a system where /bin/ksh is ksh88 based, you would see a process called "sleep 100", and "/bin/ksh /var/tmp/mytesting.ksh" just once.

If you on the other side create the following script called "/var/tmp/mytesting2.ksh":


#!/bin/ksh

# send one sleep into background - it will run longer than the script itself
/usr/bin/sleep 100 &

# another sleep, this time not in the background
/usr/bin/sleep 20
echo "work done"
exit 0

And do the above testing again, you will see:


# ps -ef | grep mytesting
    root  8276  8159   0 06:57:31 pts/4       0:00 /bin/ksh /var/tmp/mytesting2.ksh
    root  8292  8178   0 06:57:36 pts/5       0:00 grep mytesting

Ie. the script appears just once. And you can see:


# ps -ef | grep sleep
    root  8278  8276   0 06:57:31 pts/4       0:00 /usr/bin/sleep 20
    root  8296  8178   0 06:57:43 pts/5       0:00 grep sleep
    root  8277  8276   0 06:57:31 pts/4       0:00 /usr/bin/sleep 100

While the second sleep is still running and the script has not finished.

Once it has finished, you just see:


# ps -ef | grep sleep
    root  8306  8178   0 06:57:55 pts/5       0:00 grep sleep
    root  8277     1   0 06:57:31 pts/4       0:00 /usr/bin/sleep 100

and no longer the /var/tmp/mytesting2.ksh process.

This does make a difference in our logic within the GDS based agents, where we disable the pmf action script. Before doing that we invoke a sleep with the START_TIMEOUT length to assure that at least one process is within the tag until the action script is disabled.

And in our probe script, the wait_for_online mechanism does check if the start_command is still running. If it does, the probe returns 100 to indicate that the resource is not online yet.

So far many of our code invokes sleep instead of /usr/bin/sleep - and with the wait_for_online combination above, this will cause our start commands to always run into a timeout - although the script itself does terminate and everything worked fine. Manual testing will not show anything obvious as well.

The fix is to always invoke /usr/bin/sleep, not the shell build-in sleep.

Took me a while to understand it, and I write it here so others are not scratching their head as I did ;-)

Comments:

Post a Comment:
Comments are closed for this entry.
About

This Blog is about my work at Availability Engineering: Wine, Cluster and Song :-) The views expressed on this blog are my own and do not necessarily reflect the views of Sun and/or Oracle.

Search

Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today