
September 2007 Archives

September 5, 2007

Setting titles on Xterm windows

One of the frequently asked questions in our line of work is how to set the titles of xterm windows, especially when many xterm windows are open at once.

xterms without titles.GIF:

In the above picture, can you tell which user, machine and directory each session is in? There are only two options: either look at each window manually or remember it.

What if you could influence the titles of each xterm window? Good news is, we can.

The answer is available freely on the internet: http://www.faqs.org/docs/Linux-mini/Xterm-Title.html

If we add this to $HOME/.kshrc (assuming ksh is the shell being used), the title will show the information that is constant throughout the lifetime of the shell, namely the user and the host name.

echo "\033]0;${USER}@${HOSTNAME}\007"

This is how it will look after this is done.

xterms with titles.GIF:

PS: The hostnames are totally fictional and do not represent any real server names
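
The same idea can be wrapped in a small function. This is only a sketch: set_title, title_user and title_host are names made up for this example, and printf is used instead of echo because not every shell's echo interprets \033 and \007. The fallbacks to id and uname are an assumption for shells that do not set USER or HOSTNAME.

```shell
# set_title: write the xterm "set icon name and window title" sequence
# (ESC ] 0 ; text BEL) to standard output.
set_title() {
    printf '\033]0;%s\007' "$1"
}

# Fall back to id/uname when the shell does not provide USER or HOSTNAME.
title_user=${USER:-$(id -un)}
title_host=${HOSTNAME:-$(uname -n)}

set_title "${title_user}@${title_host}"
```

Calling set_title again with a different string (for example one that includes the current directory) replaces the title.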

A Better method - Thanks Laurent Schneider

Laurent showed a better way

The problem with echo is that it is run only once. That means, if you
are on usellx42 and do ssh usellx41, the header will change to
usellx41. But when you log off, the title will not be reset and will be
misleading (usellx41).


My approach is to set it in PS1, so that it will always be correct.


Ex in .kshrc
PS1=$(echo -e "\033]0;${USER}@${HOSTNAME} \007  ${USER}@${HOSTNAME}$ ")



Cheers,
Laurent
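
Building on Laurent's idea, here is a hedged variant that also puts the current directory in the title. It assumes a shell that expands parameters in PS1 each time the prompt is drawn (ksh and bash do this), and that HOSTNAME is set as in the examples above. The esc and bel variables are names introduced just for this sketch.

```shell
# Literal ESC and BEL characters, produced once with printf.
esc=$(printf '\033')
bel=$(printf '\007')

# The single-quoted parts keep ${USER}, ${HOSTNAME} and ${PWD} unexpanded
# here, so the shell re-evaluates them every time it prints the prompt.
PS1="${esc}]0;"'${USER}@${HOSTNAME}:${PWD}'"${bel}"'${USER}@${HOSTNAME}$ '
```

Because ${PWD} is evaluated at prompt time, the title follows every cd without any extra work.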


September 6, 2007

Do I need permission to make ssh keys work?

Preface


The topic of setting up SSH keys so that ssh/scp connections work across hosts without a password has been beaten to death in several Unix and technical forums already.

Google is your friend and the entire world knows that any website like http://pkeck.myweb.uga.edu/ssh/ (the very first search result) will tell you the nitty gritties of setting up SSH keys successfully using RSA and DSA standards.

The unexpected hitch

Sometimes, even the most straightforward configuration does not work. This is one such experience with a very simple SSH key setup, which was working from one host to another, but not vice versa. It was very frustrating to go over the seemingly simple setup steps, only to discover that they had indeed been done.

Then why in the name of Unix would it not work?!!

Granting excessive permissions can bring you down too

During the trying time of making SSH work, either I had given 777 (drwxrwxrwx) permission to the home directory (/home/applmgr) of the applmgr unix id, or someone else had opened it up for some writing/copying purpose.

I noticed that the host on which SSH key was working did not have 777 permissions on its home directory, rather it had 755 (drwxr-xr-x).

So maybe this was the missing link? Anyways, I made the directory permissions 755 on the machine where SSH key was not working and would always ask for the password.

After that, SSH session worked like a charm. What a silly, undocumented setup step!

Here's what had been done so far:

1.) sign-on or 'su' to the appropriate app ID
2.) type:  ssh-keygen -t dsa
3.) copy (scp or other) the .pub file to .ssh directory on target server(s)
4.) rename the file to be called "authorized_keys2" by doing this:
    ssh to the target server (will be prompted for id/pw)
    $ cat id_dsa.pub >/home/$USER/.ssh/authorized_keys2  
5.) log off

Here's what ELSE had to be done to make it actually work:

Now you see it:

usell041.corp.company.us:NoOracle> ls -ld .
drwxr-xr-x   31 applmgr  users        4096 Jul 13 16:31 .
usell041.corp.company.us:NoOracle> chmod 777 .

usell008:web_prod> ssh usell041

     This system is for authorized use only.  Unauthorized access by any
     means is forbidden.  All access and activity on this system is
     logged and logs are reviewed regularly.  Activity on this system
     carries no right of privacy.  Unauthorized access will be
     investigated and prosecuted to the full extent of the law.

applmgr@usell041's password:
#
# Notice that it is asking for password
#
usell008:web_prod>


Now you don't:

usell041.corp.company.us:NoOracle> chmod 755 .
usell041.corp.company.us:NoOracle>

usell008:web_prod> ssh usell041

     This system is for authorized use only.  Unauthorized access by any
     means is forbidden.  All access and activity on this system is
     logged and logs are reviewed regularly.  Activity on this system
     carries no right of privacy.  Unauthorized access will be
     investigated and prosecuted to the full extent of the law.

Warning: No xauth data; using fake authentication data for X11 forwarding.
usell041.corp.company.us:NoOracle>

#
# Notice that it did not ask for any password and went right into usell041
#
usell041.corp.company.us:NoOracle>
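
The behaviour above matches OpenSSH's StrictModes check, which is on by default and makes sshd ignore key files when the home directory or the files under .ssh are writable by group or others. A minimal sketch of the fix as a function follows. The name tighten_ssh_perms is made up for this example, and authorized_keys2 is used only because that is the file name from the steps above.

```shell
# tighten_ssh_perms: remove group/other write access from a home directory
# and lock down its .ssh directory and key file, if they exist.
tighten_ssh_perms() {
    home_dir=${1:-$HOME}

    # 777 becomes 755; already-tight modes such as 700 are left alone.
    chmod go-w "$home_dir" || return 1

    if [ -d "$home_dir/.ssh" ]; then
        chmod 700 "$home_dir/.ssh" || return 1
    fi
    if [ -f "$home_dir/.ssh/authorized_keys2" ]; then
        chmod 600 "$home_dir/.ssh/authorized_keys2" || return 1
    fi

    # Show the result so it can be compared with a host where keys work.
    ls -ld "$home_dir"
}

# Example (run as the account that owns the keys):
#   tighten_ssh_perms /home/applmgr
```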

September 14, 2007

Quick script for searching patch timings from the command line

Preface


Sometimes you may have to find the timings of a particular patch quickly. Can we write a quick script to find the timings on different tiers (admin tier, web tier, forms tier etc)? Would it not be advantageous to have such a script?

If you take the idea further, you can even set up SSH keys across the different tiers and execute this script remotely. Imagine the productivity you can achieve sitting remotely, especially when you may be asked to come up with the patch application timings for a bunch of 20 patches across all the tiers of a test instance.

In such a scenario, simple SSH key setup and a handy unix script would be wonderful. In this example, I give you the ability to extend such a script, which is built on really simple logic of pattern searching. It would be a good case study for people learning shell scripting.

Example of the find patch timing script

midserver1.corp.company.us:web_qa> ./find_time_taken_for_a_patch.sh 6133653
Filename: 6133653_NLS_u_merged.log
------------------------------
Started:  Fri Aug 31 2007 08:27:48
Ended:   Fri Aug 31 2007 08:33:56
Time Taken: elapsed: 0:6:8 (368 seconds)
Filename: u6133653.log
------------------------------
Started:  Fri Aug 31 2007 08:11:40
Ended:   Fri Aug 31 2007 08:20:22
Time Taken: elapsed: 0:8:42 (522 seconds)

The script assumes an auxiliary script called timediff.sh and two other awk scripts for finding the last start and end time of an adpatch session. It assumes that the patching logs are in $APPL_TOP/admin/$TWO_TASK/log.

At this point, it does not check whether basic environment variables like $APPL_TOP and $TWO_TASK are set, but that is easily done. It would make for good homework for the reader.
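
One possible answer to that homework is sketched below. It is not part of the original script: require_env is a hypothetical helper name, and the check simply treats an empty or unset variable as missing.

```shell
# require_env: return non-zero (and name the culprit on stderr) if any of
# the listed environment variables is unset or empty.
require_env() {
    for name in "$@"; do
        # Indirect lookup: read the value of the variable whose name is $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Error: $name is not set" >&2
            return 1
        fi
    done
    return 0
}

# Intended use near the top of find_time_taken_for_a_patch.sh:
#   require_env APPL_TOP TWO_TASK || exit 1
```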

Taking this concept to the next level

To do batch processing and save effort, this can be taken to the next level by writing a small wrapper that loops over a list of patches.

midserver1.corp.company.us:web_qa> more patches
6242856
6339534
5161680
5989593
5474883
5473858
5382500
5337777

midserver1.corp.company.us:web_qa> for patch in `cat patches`
do
  #
  # Assuming that SSH key has been setup between
  # myserver1 and myserver2
  #
  ssh myserver2 ". setdb QA;  
                 $HOME/find_time_taken_for_a_patch.sh $patch"

   $HOME/find_time_taken_for_a_patch.sh $patch
done

Listing of the shell/awk scripts

find_time_taken_for_a_patch.sh

midserver1.corp.company.us:web_qa> more find_time_taken_for_a_patch.sh
#!/bin/ksh

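# Require the patch number as the first argument
[ $# -lt 1 ] && echo "Usage: `basename $0` <patch#>" && exit 1

patch=$1
# adpatch logs live here; stop if the directory cannot be entered
cd "$APPL_TOP/admin/$TWO_TASK/log" || exit 1

for i in *${patch}*.log
do
  # Skip the unexpanded pattern when no log file matches
  [ -f "$i" ] || continue
  start_time=`awk -f $HOME/find_start_time.awk "$i"`
  end_time=`awk -f $HOME/find_end_time.awk "$i"`
  echo "Filename: $i\n------------------------------"
  echo "Started: $start_time"
  echo "Ended: $end_time"
  # The last field of each timestamp is the HH:MM:SS part
  stime=`echo $start_time | awk '{ print $NF}'`
  etime=`echo $end_time | awk '{ print $NF}'`
  echo "Time Taken: `$HOME/timediff.sh $stime $etime`"
done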

timediff.sh

#!/bin/ksh

# Both a start and an end time (HH:MM:SS) are required
if [ -z "$1" ] || [ -z "$2" ]; then exit 1; fi

time=$1
h1=$(expr "$time" : "\(..\):..:..")
m1=$(expr "$time" : "..:\(..\):..")
s1=$(expr "$time" : "..:..:\(..\)")

time=$2
h2=$(expr "$time" : "\(..\):..:..")
m2=$(expr "$time" : "..:\(..\):..")
s2=$(expr "$time" : "..:..:\(..\)")

#echo "from=$h1:$m1:$s1 to=$h2:$m2:$s2"

seconds=$(echo "$h2*3600+$m2*60+$s2-($h1*3600+$m1*60+$s1)" | bc)

if [ "$seconds" -lt 0 ] ; then
((seconds=seconds+86400))
fi

hh=$(expr $seconds / 3600)
mm="$(expr \( $seconds - $hh \* 3600 \) / 60)"
ss="$(expr $seconds - $hh \* 3600 - $mm \* 60)"
echo "elapsed: $hh:$mm:$ss ($seconds seconds)"
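
To make the arithmetic in timediff.sh concrete, here is a small sketch that does the same calculation with shell arithmetic instead of expr and bc. The function names to_seconds and elapsed_seconds are made up for this example, and it assumes both inputs are two-digit HH:MM:SS values as in the adpatch logs.

```shell
# to_seconds: convert HH:MM:SS into seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1          # split the argument on ":" into $1 $2 $3
    IFS=$old_ifs
    # Strip one leading zero so values like 08 are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

# elapsed_seconds: difference between two times, adding a day (86400 s)
# when the end time is past midnight, as timediff.sh does.
elapsed_seconds() {
    diff=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( diff + 86400 ))
    fi
    echo "$diff"
}

elapsed_seconds 08:27:48 08:33:56   # prints 368, as in the sample output
```

The times used here are the start and end of 6133653_NLS_u_merged.log from the sample run: 30836 - 30468 = 368 seconds, or 0:6:8.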

find_start_time.awk

################################################################
# this is the input pattern which will be used for finding the
# start time
################################################################
#  Start of AutoPatch session
# AutoPatch version: 11.5.0
# AutoPatch started at: Fri Aug 31 2007 07:21:10
################################################################
/Start of AutoPatch session/ { started=""; getline; getline; for (i=4;i<=NF;i++) started=started" "$i; };
END { print started }

find_end_time.awk

# Log and Info File sync point:
# Fri Aug 31 2007 07:50:38
# AutoPatch is exiting successfully.

/Log and Info File sync point:/ { getline; ended_time=$0; getline; if ($0 !~ /AutoPatch is exiting successfully/) ended_time=""; }
END { print ended_time }
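
To see what the two awk scripts pick up, the sketch below runs the same patterns inline against a tiny sample log. The sample is assembled only from the lines quoted in the comments above; a real adpatch log contains much more, and its exact layout may differ.

```shell
# Build a throwaway sample log from the lines quoted in the awk comments.
log=$(mktemp)
cat > "$log" <<'EOF'
Start of AutoPatch session
AutoPatch version: 11.5.0
AutoPatch started at: Fri Aug 31 2007 07:21:10
Log and Info File sync point:
Fri Aug 31 2007 07:50:38
AutoPatch is exiting successfully.
EOF

# Start time: skip two lines after the marker, keep fields 4..NF.
started=$(awk '/Start of AutoPatch session/ { s = ""; getline; getline
               for (i = 4; i <= NF; i++) s = s " " $i }
               END { print s }' "$log")

# End time: the line after the sync point, kept only if the session
# then reports a successful exit.
ended=$(awk '/Log and Info File sync point:/ { getline; e = $0; getline
             if ($0 !~ /AutoPatch is exiting successfully/) e = "" }
             END { print e }' "$log")

echo "Started:$started"
echo "Ended: $ended"
rm -f "$log"
```

Since both patterns overwrite their variable on every match, a log holding several adpatch sessions yields the last start time and the last successful end time.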


September 19, 2007

Reducing KGL latch contention while querying v$ internal views in 9.2.x RDBMS

Preface


Latch contention can be a hidden killer that slows down your system in a very subtle way: creating a new database connection can take much longer (up to 2 minutes) if there is medium to extreme contention for different KGL latches. While no amount of tracing would uncover it for you, this is one area where statspack analysis can help.

In one such customer production instance, there were inexplicable occasions when the instance would suddenly slow down. During these times, spawning a new SQL*Plus connection took a long time (sometimes up to 1-2 minutes) compared to the usual 2-3 seconds or less. How is one even supposed to handle such "freezes"?

Could we tell when this DB freeze was happening?

A very simple Unix shell script was written to check the time taken to log in to SQL*Plus:

check_response()
{
#
# get the latch free waits
#
sqlplus -s "/ as sysdba" > ${script}/apps/tempmsg.out << EOF
exit
EOF

time sqlplus -s system/${PASSWD}@${ORACLE_SID} << EOF
exit
EOF
echo "\n\n++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n\n" >> ${script}/apps/tempmsg.out
}

notify()
{
##################################
# the threshold is so many seconds
##################################
threshold=10

echo "--------------------------------------------------------------" >> ${script}/apps/tempmsg.out
echo "PLEASE KEEP THIS EMAIL FOR FUTURE REFERENCE....\n\n" >> ${script}/apps/tempmsg.out
echo "This is an attempt to de-mystify and automatically record the occurrence of so called system lockups/freezes.\n\n" >>  ${script}/apps/tempmsg.out
echo "THINGS YOU MIGHT HEAR ABOUT SOON:" >>  ${script}/apps/tempmsg.out
echo "++++++++++++++++++++++++++++++++" >>  ${script}/apps/tempmsg.out
echo " 1) The customer service reps complain of freezing or hanging or locking up or blue screens." >> ${script}/apps/tempmsg.out
echo " 2) Simple DB login for a new sessions might take ~ 1 minute as opposed to instantly.\n\n" >> ${script}/apps/tempmsg.out
echo "--------------------------------------------------------------" >> ${script}/apps/tempmsg.out

if [ $time_taken_in_seconds -gt $threshold ];
then
  ###################################################################
  # send out a message to all the people who want to be notified
  ###################################################################
fi
}
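
The listing above does not show how time_taken_in_seconds gets its value. One way to obtain it is sketched here, as an assumption rather than the original method. time_command is a hypothetical helper, date +%s is assumed to be supported (it is on Linux, but it is not in POSIX), and sleep 1 stands in for the sqlplus login.

```shell
# time_command: run a command quietly and print the whole seconds it took.
time_command() {
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    end=$(date +%s)
    echo $(( end - start ))
}

threshold=10

# Stand-in for the real login test, which would wrap the sqlplus call.
time_taken_in_seconds=$(time_command sleep 1)

if [ "$time_taken_in_seconds" -gt "$threshold" ]; then
    echo "DB login took ${time_taken_in_seconds}s (threshold ${threshold}s)"
fi
```

In ksh, the built-in SECONDS variable could be used in place of date +%s.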

The Statspack story

The statspack analysis told a very interesting story. It seemed that the latch contention and misses were quite high during these "freezing" times. In the statspack report, the largest number of sleeps and waiter sleeps were for caller kglic.

Now, kglic is the code which goes through the library cache and row cache to answer queries on various dictionary fixed views and tables. This is the function which returns data for the fixed views and tables that scan the sql area.

Therefore, it was highly possible that such queries could also be coming from monitoring tools used by DBAs and they are not restricted to the two views specifically mentioned in the bug by Joan. Any monitoring job which looks at v$open_cursor would also use the kglic iterator.

One of the contributors would be dbms_pipe: a quick test using dbms_pipe was done, in which one session loops putting messages in a pipe and a separate session reads messages off the same pipe.

The loop was run 100,000 times. What was found was that library cache latch gets in 9i are double that of 10g (approx. 400,000 in 10g vs 800,000+ in 9i). If anything, dbms_pipe performance seems to be more optimized in 10g.

Potential Workarounds

One could possibly reduce the frequency of queries on v$sql views and x$kglob etc. to see if that reduced the contention. While this was possible, certainly the sql monitoring scripts were not the only contributors to this kind of latch contention.

Resolving it through RDBMS code fix

After a lot of research, the following 9207 one-off fixes seemed to be addressing this dependency graph issue:

1) Patch 4451759   [ Base Bug(s): 4451759  ]

   This bug is fixed in 9208 as per bug 4635723.

2) Patch 5094515   [ Base Bug(s): 4450964 5094515 4339128  ]

  Bug 4450964  is fixed in 9208 as per bug 4509067.
  Bug 4339128  is fixed in 9208 as per bug 4482601.

These latch contention issues are also fixed in the 9.2.0.8 patchset. It is possible that not many customers remain on 9.2.0.7, now that Oracle Support has desupported it and supports it only on a limited basis. At the same time, I am aware that there are a lot of people who are still on 9.2.0.7. This article might be of interest to them.

Seeing the advantage after the RDBMS patches

To simulate this freeze issue, a bunch of sql scripts which were monitoring x$kglob, v$open_cursor and v$sql views were run continuously for about 2 hours across 20 threads simultaneously. This did simulate the hang situation for the database login. The same regression was done after the patches and statspack data (level 5) was collected.

Statspack top events, latch waiting and contention information before and after the RDBMS patches:

1) This diagram shows the comparison of the latch misses before and after the patches:

latch misses sources for DB - before and after.GIF:

2) This diagram shows the comparison of the sleeps before and after the patches:

latch sleep breakdown for DB - before and after.GIF:

Conclusion

The moral of the story is that over-querying the v$ and x$ views can sometimes be detrimental and cause login slowdowns that are hard to debug or analyze. More analysis is always possible through hang analyze commands, but that serves mainly as a confirmation point for checking the depth of the session dependency graphs and whether there is a genuine deadlock or race condition. Statspack analysis and simple sql scripts can also be used as a complementary tool for checking whether the latching condition has improved.

The 9.2.0.8 patchset delivers important latch contention fixes and hence should be applied as soon as possible.


About September 2007

This page contains all entries posted to Experiments from the Field..Based on True Stories in September 2007. They are listed from oldest to newest.


Creative Commons License
This weblog is licensed under a Creative Commons License.