By tf on Jan 19, 2009
While working on modifying the HA Containers agent to support the ipkg zone brand type from OpenSolaris, I stumbled over a difference in behavior of ksh88 vs ksh93 to be aware off, since it will at least impact the GDS based agents.
The difference affects commands that exist within user land as well as shell build-in, like sleep, and how they are then seen within the process table.
Example: Consider the following script called "
#!/bin/ksh # send one sleep into background - it will run longer than the script itself sleep 100 & # another sleep, this time not in the background sleep 20 echo "work done" exit 0
If you then invoke the script on OpenSolaris, and while the second sleep runs invoke "
ps -ef | grep mytest" in a different window, it will show something like:
root 8185 8159 0 06:48:13 pts/4 0:00 /bin/ksh /var/tmp/mytesting.ksh root 8248 8178 0 06:48:32 pts/5 0:00 grep mytesting root 8186 8185 0 06:48:13 pts/4 0:00 /bin/ksh /var/tmp/mytesting.ksh
Note the two processes with the same name.
After the second
sleep 20 did finish, you will see the script has terminated with printing "work done".
ps -ef | grep mytest" will show:
root 8262 8178 0 06:48:37 pts/5 0:00 grep mytesting root 8186 1 0 06:48:13 pts/4 0:00 /bin/ksh /var/tmp/mytesting.ksh
until the first sleep did also finish.
What is interesting is that the sleep put into background has the script name in the process table.
On a system where
/bin/ksh is ksh88 based, you would see a process called "
sleep 100", and "
/bin/ksh /var/tmp/mytesting.ksh" just once.
If you on the other side create the following script called "
#!/bin/ksh # send one sleep into background - it will run longer than the script itself /usr/bin/sleep 100 & # another sleep, this time not in the background /usr/bin/sleep 20 echo "work done" exit 0
And do the above testing again, you will see:
# ps -ef | grep mytesting root 8276 8159 0 06:57:31 pts/4 0:00 /bin/ksh /var/tmp/mytesting2.ksh root 8292 8178 0 06:57:36 pts/5 0:00 grep mytesting
Ie. the script appears just once. And you can see:
# ps -ef | grep sleep root 8278 8276 0 06:57:31 pts/4 0:00 /usr/bin/sleep 20 root 8296 8178 0 06:57:43 pts/5 0:00 grep sleep root 8277 8276 0 06:57:31 pts/4 0:00 /usr/bin/sleep 100
While the second sleep is still running and the script has not finished.
Once it has finished, you just see:
# ps -ef | grep sleep root 8306 8178 0 06:57:55 pts/5 0:00 grep sleep root 8277 1 0 06:57:31 pts/4 0:00 /usr/bin/sleep 100
and no longer the /var/tmp/mytesting2.ksh process.
This does make a difference in our logic within the GDS based agents, where we disable the pmf action script. Before doing that we invoke a sleep with the START_TIMEOUT length to assure that at least one process is within the tag until the action script is disabled.
And in our probe script, the wait_for_online mechanism does check if the start_command is still running. If it does, the probe returns 100 to indicate that the resource is not online yet.
So far many of our code invokes sleep instead of
/usr/bin/sleep - and with the wait_for_online combination above, this will cause our start commands to always run into a timeout - although the script itself does terminate and everything worked fine. Manual testing will not show anything obvious as well.
The fix is to always invoke
/usr/bin/sleep, not the shell build-in sleep.
Took me a while to understand it, and I write it here so others are not scratching their head as I did