Tuesday Dec 13, 2011

Solaris Tip: Resolving "statd: cannot talk to statd at <target_host>, RPC: Timed out(5)"

Symptom:

The system log shows repeated RPC timeout warnings such as the following:


Dec 13 09:23:23 gil08 last message repeated 1 time
Dec 13 09:29:14 gil08 statd[19858]: [ID 766906 daemon.warning] statd: cannot talk to statd at ssc23, RPC: Timed out(5)
Dec 13 09:35:05 gil08 last message repeated 1 time
Dec 13 09:40:56 gil08 statd[19858]: [ID 766906 daemon.warning] statd: cannot talk to statd at ssc23, RPC: Timed out(5)
..

These messages indicate a communication failure over RPC between the status daemons (statd) on the local and remote hosts.

Workaround/Solution:

If the target host is reachable, the following steps stop the system from generating these warnings: stop the network status monitor, remove the target host's entry from /var/statmon/sm.bak, and start the network status monitor again. Removing the target host entry from sm.bak keeps that machine from being aware that it may have to participate in locking recovery.

e.g.,


# ps -eaf | fgrep statd 
  daemon 14304 19622   0 09:47:16 ?           0:00 /usr/lib/nfs/statd
    root 14314 14297   0 09:48:03 pts/15      0:00 fgrep statd

# svcs -a | grep "nfs/status"
online          9:52:41 svc:/network/nfs/status:default

# svcadm -v disable nfs/status
svc:/network/nfs/status:default disabled.

# ls /var/statmon/sm.bak
ssc23

# rm /var/statmon/sm.bak/ssc23

# svcadm -v enable nfs/status
svc:/network/nfs/status:default enabled.
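
Before removing anything from sm.bak, it can help to confirm which remote hosts statd is failing to reach. The following is a minimal sketch, assuming the warnings use exactly the format shown in the log excerpt above. The function name statd_timeouts is hypothetical and not part of Solaris, and the sample input is copied from that excerpt. On a real system the input would typically come from /var/adm/messages.

```shell
#!/bin/sh
# Hypothetical helper: count "cannot talk to statd at <host>" warnings per host.
# Reads syslog-formatted text on stdin and prints "<host> <count>" per line.
# "last message repeated" lines are ignored, so the counts are a lower bound.
statd_timeouts() {
    awk '/statd: cannot talk to statd at / {
             for (i = 2; i < NF; i++)
                 if ($i == "at" && $(i - 1) == "statd") {
                     host = $(i + 1)
                     sub(/,$/, "", host)   # drop the trailing comma
                     count[host]++
                 }
         }
         END { for (h in count) print h, count[h] }' | sort
}

# Sample input copied from the log excerpt above.
sample='Dec 13 09:29:14 gil08 statd[19858]: [ID 766906 daemon.warning] statd: cannot talk to statd at ssc23, RPC: Timed out(5)
Dec 13 09:35:05 gil08 last message repeated 1 time
Dec 13 09:40:56 gil08 statd[19858]: [ID 766906 daemon.warning] statd: cannot talk to statd at ssc23, RPC: Timed out(5)'

printf '%s\n' "$sample" | statd_timeouts
```

Each host reported this way should correspond to a file name under /var/statmon/sm.bak.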

Friday Nov 18, 2011

Siebel Troubleshooting : An ODBC error occurred; SBL-GEN-03006: Error calling function: DICFindTable m_pReqTbl

Symptom:

A newly installed Siebel application server fails to start despite successful ODBC connectivity to the database. SRProc process logs ODBC error messages similar to the following:


Message: GEN-13,
 Additional Message: dict-ERR-1109: 
       Unable to read value from export file (Data length (32) > Column definition (3)).

Message: GEN-13,
 Additional Message: dict-ERR-1107: Unable to read row 0 from export file (UTLDataValRead pBuf, col 4 ).

GenericLog  GenericError  1     0002157..  11-11-18 13:28  Message: Generated SQL statement:,
 Additional Message: SQLFetch:
   SELECT RDOBJ.DOCK_ID, RDOBJ.RELATED_DOCK_ID, RDOBJ.SQL_STATEMENT, RDOBJ.CHECK_VISIBILITY,
          'N', RDOBJ.COMMENTS, RDOBJ.ACTIVE, RDOBJ.SEQUENCE, RDOBJ.VIS_STRENGTH,
          RDOBJ.REL_VIS_STRENGTH, RDOBJ.VIS_EVT_COLS
     FROM ORAPERF.S_DOCK_REL_DOBJ RDOBJ, ORAPERF.S_DOCK_OBJECT DOBJ
    WHERE RDOBJ.REPOSITORY_ID = (SELECT ROW_ID FROM ORAPERF.S_REPOSITORY WHERE NAME = ?)
      AND DOBJ.ROW_ID = RDOBJ.DOCK_ID
      AND (DOBJ.INACTIVE_FLG = 'N' OR DOBJ.INACTIVE_FLG IS NULL)
      AND (RDOBJ.INACTIVE_FLG = 'N' OR RDOBJ.INACTIVE_FLG IS NULL)

Message: Error: An ODBC error occurred,
 Additional Message: Function: DICGetRDObjects; ODBC operation: SQLFetch

Message: GEN-13,
 Additional Message: dict-ERR-1109: Unable to read value from export file (UTLCompressFRead (fseek)).

Message: GEN-13,
 Additional Message: dict-ERR-1107: Unable to read row 0 from export file (UTLDataValRead pBuf, col 0 ).

Message: GEN-10,
 Additional Message: Calling Function: DICLoadDObjectInfo; Called Function: Calling DICGetRDObjects

Message: GEN-10,
 Additional Message: Calling Function: DICLoadDict; Called Function: DICLoadDObjectInfo

GenericError
(srpdb.cpp (860) err=3006 sys=2) SBL-GEN-03006: Error calling function: DICFindTable m_pReqTbl
(srpsmech.cpp (74) err=3006 sys=0) SBL-GEN-03006: Error calling function: DICFindTable m_pReqTbl
(srpmtsrv.cpp (107) err=3006 sys=0) SBL-GEN-03006: Error calling function: DICFindTable m_pReqTbl
(smimtsrv.cpp (1203) err=3006 sys=0) SBL-GEN-03006: Error calling function: DICFindTable m_pReqTbl
SmiLayerLog Error       Terminate process due to unrecoverable error: 3006. (Main Thread)

An inconsistent or corrupted dictionary file "diccache.dat" is likely the cause.

Solution:

  • Stop the application server and manually kill any remaining Siebel application-specific processes

    e.g.,

    stop_server all
    
    pkill siebmtsh
    pkill siebproc
    ..
    
  • Remove the $SIEBEL_HOME/bin/diccache.dat file. It is regenerated during application server startup

  • Start the application server

    start_server all
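
A more cautious variant of the removal step is to move the dictionary cache aside rather than delete it, so that the suspect file can still be inspected later. This is a minimal sketch, assuming the file lives at $SIEBEL_HOME/bin/diccache.dat as described above. The function name set_aside_diccache and the .bad suffix are illustrative choices, not Siebel conventions.

```shell
#!/bin/sh
# Hypothetical helper: move diccache.dat aside instead of deleting it.
# $1 is the Siebel server root (the value of SIEBEL_HOME).
set_aside_diccache() {
    cache="$1/bin/diccache.dat"
    if [ -f "$cache" ]; then
        mv "$cache" "$cache.bad" && echo "moved $cache to $cache.bad"
    else
        echo "no dictionary cache at $cache" >&2
        return 1
    fi
}

# Intended use, with the application server stopped:
#   set_aside_diccache "$SIEBEL_HOME"
```

Run it only after the application server and any leftover Siebel processes have been stopped.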
    

Friday Mar 25, 2011

PeopleSoft: Fixing "msgget: No space left on device" Error on Solaris 10

When a large number of application server processes are configured, whether in a single PeopleSoft application server domain or cumulatively across multiple domains, the domain boot process is likely to fail with errors similar to the following:


Booting server processes ...
exec PSSAMSRV -A -- -C psappsrv.cfg -D CS90SPV -S PSSAMSRV :
Failed.
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:681: ERROR: Failure to create message queue
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:248: ERROR: System init function failed, Uunixerr = :
msgget: No space left on device

113954.ben15!tmboot.29708.1.-2: CMDTUX_CAT:825: ERROR: Process PSSAMSRV at ben15 failed with /T
tperrno (TPEOS - operating system error)

In this particular example, the PeopleSoft application server is running on a Solaris 10 system. Fortunately, the error message is clear in this case: the failure is related to message queues. During the domain boot process, there is a call to msgget() to create a message queue. If the call succeeds, it returns a non-negative integer that serves as the identifier for the newly created message queue. On failure, it returns -1 and sets errno to EACCES, EEXIST, ENOENT or ENOSPC, depending on the underlying cause.

From the above error messages it is evident that msgget() failed with errno set to ENOSPC (No space left on device). The msgget(2) man page has the following explanation for the ENOSPC error code on Solaris:

ERRORS
     The msgget() function will fail if:
     ...
     ...
     ENOSPC    A message queue identifier is to be created but
               the system-imposed limit on the maximum number of
               allowed message queue identifiers system wide
               would be exceeded. See NOTES.

NOTES
     ...
     ...

     The system-imposed limit on the number of message queue
     identifiers is maintained on a per-project basis using the
     project.max-msg-ids resource control.

This gives enough clues to suspect the configured limit on the number of message queue identifiers.

Prior to the release of Solaris 10, the /etc/system System V IPC tunable, msgsys:msginfo_msgmni, could be configured to control the maximum number of message queues that can be created. The default value on pre-Solaris 10 systems is 50.

To reduce administrative overhead, the majority of the System V IPC tunables were obsoleted in the Solaris 10 operating system, and equivalent resource controls were created for the remaining tunables. In Solaris 10 and later versions, System V IPC can be tuned on a per-project basis using these resource controls.

In Solaris 10, the resource control project.max-msg-ids replaced the old /etc/system tunable, msginfo_msgmni, and the default value was raised to 128.

Now back to the failure in the PeopleSoft environment. Let's first check the current value of project.max-msg-ids.

  1. Get the project ID.

    % id -p
    uid=222227(psft) gid=2294(dba) projid=3(default)

  2. Using the prctl utility, examine the project.max-msg-ids resource control for the project with ID 3.

    % prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        128       -   deny                                 -
            system          16.8M     max   deny                                 -

Alternatively, run the command ipcs -q to check the number of active message queues. Note that the project with ID 3 is configured to create a maximum of 128 (the default) message queues. When the boot failure occurs, the number of active message queues in the ipcs -q output is likely to be at or near the configured value of project.max-msg-ids.
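
The comparison between active queues and the configured limit can be scripted. The following is a minimal sketch, assuming Solaris-style ipcs -q output in which each queue appears on a line whose first field is "q". Both function names are hypothetical. Also note the assumption that ipcs -q lists all visible queues rather than only those of one project, so the count is an upper bound for any single project.

```shell
#!/bin/sh
# Hypothetical helper: count message queues in Solaris-style `ipcs -q` output
# read from stdin. Assumes each queue is listed on a line whose first field is "q".
count_queues() {
    awk '$1 == "q" { n++ } END { print n + 0 }'
}

# Hypothetical helper: report how many identifiers remain below a limit.
# $1 = queues in use, $2 = value of project.max-msg-ids
queue_headroom() {
    echo "$1 of $2 message queue identifiers in use, $(( $2 - $1 )) free"
}

# Intended use on the affected host (not run here):
#   queue_headroom "$(ipcs -q | count_queues)" 128
```

If the reported headroom is zero or close to it while the domain is only partially booted, the limit is the likely culprit.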

Since it appears that the configured PeopleSoft application server domains need more than 128 message queues to bring up all the application server processes, the solution is to increase the value of the resource control project.max-msg-ids to some value above 128. For the sake of simplicity, let's increase it to 256 (twice the default value). Again, the prctl utility can be used to set the new value for the resource control.

  1. Log in as the 'root' user

    % su
    Password:

  2. Increase the maximum value for the message queue identifiers to 256 using the prctl utility.

    # prctl -n project.max-msg-ids -r -v 256 -i project 3

  3. Verify the new maximum value for the message queue identifiers

    # prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        256       -   deny                                 -
            system          16.8M     max   deny                                 -

To make this change persistent, create a Solaris project and attach it to the OS user as shown below.

# projadd -p 100 -c "PeopleSoft App Server IPC Tuning" -K "project.max-msg-ids=(priv,256,deny)" psftappipc
# usermod -K project=psftappipc psft

In the above example, "psftappipc" is the name of the Solaris project and "psft" is the OS user who manages the PeopleSoft application server.

That's all there is to it. With the above change, the PeopleSoft application server domain(s) should at least boot up without the "Failure to create message queue .. msgget: No space left on device" errors.

[Original blog post is at:
http://technopark02.blogspot.com/2008/03/peoplesoft-fixing-msgget-no-space-left.html]

Sunday Nov 30, 2008

PeopleSoft on Solaris 10: Fixing the "msgget: No space left on device" Error

(Crossposting the 8+ month old blog entry from my other blog hosted on blogger. Source URL:
http://technopark02.blogspot.com/2008/03/peoplesoft-fixing-msgget-no-space-left.html
)

When a large number of application server processes are configured, whether in a single PeopleSoft domain or cumulatively across multiple domains, the PeopleSoft application server domain boot process is likely to fail with errors like:

Booting server processes ...
exec PSSAMSRV -A -- -C psappsrv.cfg -D CS90SPV -S PSSAMSRV :
        Failed.
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:681: ERROR: Failure to create message queue
113954.ben15!PSSAMSRV.29746.1.0: LIBTUX_CAT:248: ERROR: System init function failed, Uunixerr = : 
                   msgget: No space left on device
113954.ben15!tmboot.29708.1.-2: CMDTUX_CAT:825: ERROR: Process PSSAMSRV at ben15 failed with /T 
                   tperrno (TPEOS - operating system error)

In this particular example, PeopleSoft Enterprise is running on a Solaris 10 system. Fortunately, the error message is clear in this case: the failure is related to message queues. During the domain boot process, there is a call to msgget() to create a message queue. If the call succeeds, it returns a non-negative integer that serves as the identifier for the newly created message queue. On failure, it returns -1 and sets errno to EACCES, EEXIST, ENOENT or ENOSPC, depending on the underlying reason.

From the above error messages it is clear that msgget() failed with errno set to ENOSPC (No space left on device). The msgget(2) man page has the following explanation for the ENOSPC error code on Solaris:

ERRORS
     The msgget() function will fail if:
     ...
     ...
     ENOSPC    A message queue identifier is to  be  created  but
               the  system-imposed limit on the maximum number of
               allowed  message  queue  identifiers  system  wide
               would be exceeded. See NOTES.

NOTES
     ...
     ...

     The system-imposed limit on  the  number  of  message  queue
     identifiers  is  maintained on a per-project basis using the
     project.max-msg-ids resource control.

This gives enough clues to suspect the configured limit on the number of message queue identifiers.

Prior to the release of Solaris 10, the /etc/system System V IPC tunable, msgsys:msginfo_msgmni, was used to control the maximum number of message queues that can be created. The default value on pre-Solaris 10 systems is 50.

With the release of Solaris 10, the majority of the System V IPC tunables were obsoleted, and equivalent resource controls were created for the remaining tunables to reduce administrative overhead. On Solaris 10 and later versions, System V IPC can be tuned on a per-project basis using these resource controls.

On Solaris 10, the resource control project.max-msg-ids replaced the old /etc/system tunable, msginfo_msgmni, and the default value was raised to 128.

Now back to the failure in the PeopleSoft environment. Let's first check the current value configured for project.max-msg-ids.

  • Get the project ID.

    % id -p
    uid=222227(psft) gid=2294(dba) projid=3(default)

  • Examine the project.max-msg-ids resource control for the project with ID 3, using the prctl utility.

    % prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        128       -   deny                                 -
            system          16.8M     max   deny                                 -

Alternatively, run the command ipcs -q to check the number of active message queues. Note that the project with ID 3 is configured to create a maximum of 128 (the default) message queues. When the boot failure occurs, the number of active message queues in the ipcs -q output is likely to be at or near the configured value of project.max-msg-ids.

Since it appears that the configured PeopleSoft domain(s) need more than 128 message queues to bring up all the application server processes that constitute PeopleSoft Enterprise, the solution is to increase the value of the resource control project.max-msg-ids to some value beyond 128. For the sake of simplicity, let's increase it to 256 (twice the default value). Again, the prctl utility can be used to set the new value for the resource control.

  • Assume the privileges of the 'root' user

    % su
    Password:

  • Increase the maximum value for the message queue identifiers to 256 using the prctl utility.

    # prctl -n project.max-msg-ids -r -v 256 -i project 3

  • Verify the new maximum value for the message queue identifiers

    # prctl -n project.max-msg-ids -i project 3
    project: 3: default
    NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
    project.max-msg-ids
            privileged        256       -   deny                                 -
            system          16.8M     max   deny                                 -

With this change, PeopleSoft Enterprise should at least boot up without the "Failure to create message queue .. msgget: No space left on device" errors.

Before we conclude, note that the above solution does not persist across operating system reboots. To make it persistent, create a new project using the projadd command. The man page for projadd(1M) has an example showing the creation of a project.

Friday Nov 21, 2008

Oracle on Solaris 10 : Fixing the 'ORA-27102: out of memory' Error

(Crossposting the 2+ year old blog entry from my other blog hosted on blogger. Source URL:
http://technopark02.blogspot.com/2006/09/solaris-10oracle-fixing-ora-27102-out.html)

Symptom:

As part of a database tuning effort, you increase the SGA/PGA sizes, and Oracle greets you with an ORA-27102: out of memory error message, even though the system has enough free memory to serve the needs of Oracle.

SQL> startup
ORA-27102: out of memory
SVR4 Error: 22: Invalid argument

Diagnosis:

$ oerr ORA 27102
27102, 00000, "out of memory"
// *Cause: Out of memory
// *Action: Consult the trace file for details

Not so helpful. Let's look at the alert log for some clues.

% tail -2 alert.log
WARNING: EINVAL creating segment of size 0x000000028a006000
fix shm parameters in /etc/system or equivalent

Oracle is trying to create a shared memory segment of roughly 10G (the size depends on the SGA/PGA settings), but the operating system (Solaris in this example) responded with an invalid argument (EINVAL) error. There is a small hint about setting shm parameters in /etc/system.

Prior to Solaris 10, the shmsys:shminfo_shmmax parameter had to be set in /etc/system to the maximum size of a shared memory segment that can be created. 8M is the default value on Solaris 9 and prior versions, whereas one quarter of the physical memory is the default on Solaris 10 and later. On a Solaris 10 (or later) system, it can be verified as shown below:

% prtconf | grep Mem
Memory size: 32760 Megabytes

% id -p
uid=59008(oracle) gid=10001(dba) projid=3(default)

% prctl -n project.max-shm-memory -i project 3
project: 3: default
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-shm-memory
        privileged      7.84GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

Now it is clear that the system is using the default value of about 8G (7.84GB) in this scenario, whereas the application (Oracle) is trying to create a memory segment of about 10G, which is larger than that. Hence the failure.
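
The hexadecimal size in the alert log is easier to compare with the prctl output once it is converted to bytes and GiB. This is a minimal sketch using plain shell arithmetic and awk. The variable names are illustrative, and it assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Convert the segment size reported in the alert log to bytes and GiB.
# Assumes the shell supports 64-bit integer arithmetic.
seg_hex=0x000000028a006000
seg_bytes=$(( $seg_hex ))
seg_gib=$(awk -v b="$seg_bytes" 'BEGIN { printf "%.2f", b / (1024 * 1024 * 1024) }')

echo "segment: $seg_bytes bytes ($seg_gib GiB)"
# prints: segment: 10905214976 bytes (10.16 GiB)
```

The requested segment is therefore slightly more than 10 GiB, well above the 7.84GB limit shown by prctl. When choosing a new limit, it may be worth allowing some margin above the computed size.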

So, the solution is to configure the system with a value large enough for the shared segment being created, so Oracle succeeds in starting up the database instance.

On Solaris 9 and prior releases, it can be done by adding the following line to /etc/system, followed by a reboot for the system to pick up the new value.

set shmsys:shminfo_shmmax = 0x000000028a006000

However, the shminfo_shmmax parameter was obsoleted with the release of Solaris 10, and Sun doesn't recommend setting this parameter in /etc/system even though it works as expected.

On Solaris 10 and later, this value can be changed dynamically on a per-project basis with the help of the resource control facilities. This is how we do it on Solaris 10 and later:

% prctl -n project.max-shm-memory -r -v 10G -i project 3

% prctl -n project.max-shm-memory -i project 3
project: 3: default
NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT
project.max-shm-memory
        privileged      10.0GB      -   deny                                 -
        system          16.0EB    max   deny                                 -

Note that changes made with the prctl command on a running system are temporary, and will be lost when the system is rebooted. To make the changes permanent, create a project with projadd command and associate it with the user account as shown below:

% projadd -p 3  -c 'eBS benchmark' -U oracle -G dba  -K 'project.max-shm-memory=(privileged,10G,deny)' OASB
% usermod -K project=OASB oracle

Finally, make sure the project was created, using the projects -l or cat /etc/project commands.

% projects -l
...
...
OASB
        projid : 3
        comment: "eBS benchmark"
        users  : oracle
        groups : dba
        attribs: project.max-shm-memory=(privileged,10737418240,deny)

% cat /etc/project
...
...
OASB:3:eBS benchmark:oracle:dba:project.max-shm-memory=(privileged,10737418240,deny)
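
The byte value stored in /etc/project can be checked against the 10G that was passed to projadd. The following is a minimal sketch, assuming the colon-separated entry layout seen in the output above (name, project ID, comment, users, groups, attributes). The function name shm_limit_of is hypothetical, and the entry is copied from that output rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: print the project.max-shm-memory value (in bytes)
# from a single /etc/project entry passed as $1.
shm_limit_of() {
    printf '%s\n' "$1" | cut -d: -f6 |
        sed -n 's/.*project\.max-shm-memory=(privileged,\([0-9]*\),deny).*/\1/p'
}

# Entry copied from the /etc/project output above.
entry='OASB:3:eBS benchmark:oracle:dba:project.max-shm-memory=(privileged,10737418240,deny)'
limit=$(shm_limit_of "$entry")
expected=$(( 10 * 1024 * 1024 * 1024 ))   # 10G as given to projadd

if [ "$limit" -eq "$expected" ]; then
    echo "limit matches 10G: $limit bytes"
else
    echo "limit is $limit bytes, expected $expected" >&2
fi
```

On a live system, the entry could be obtained with something like grep '^OASB:' /etc/project.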

With these changes, Oracle should start the database normally.

SQL> startup
ORACLE instance started.

Total System Global Area 1.0905E+10 bytes
Fixed Size                  1316080 bytes
Variable Size            4429966096 bytes
Database Buffers         6442450944 bytes
Redo Buffers               31457280 bytes
Database mounted.
Database opened.

Related information:

  1. What's New in Solaris System Tuning in the Solaris 10 Release?
  2. Resource Controls (overview)
  3. System Setup Recommendations for Solaris 8 and Solaris 9
  4. Man page of prctl(1)
  5. Man page of projadd


Addendum : Oracle RAC settings

Anonymous Bob suggested the following settings for Oracle RAC in the form of a comment for the benefit of others who run into similar issue(s) when running Oracle RAC. I'm pasting the comment as is (Disclaimer: I have not verified these settings):

Thanks for a great explanation, I would like to add one comment that will help those with an Oracle RAC installation. Modifying the default project covers oracle processes great and is all that is needed for a single instance DB. In RAC however, the CRS process starts the DB and it is a root owned process and root does not use the default project. To fix ORA-27102 issue for RAC I added the following lines to an init script that runs before the init.crs script fires.

# Recommended Oracle RAC system params
ndd -set /dev/udp udp_xmit_hiwat 65536
ndd -set /dev/udp udp_recv_hiwat 65536

# For root processes like crsd
prctl -n project.max-shm-memory -r -v 8G -i project system
prctl -n project.max-shm-ids -r -v 512 -i project system

# For oracle processes like sqlplus
prctl -n project.max-shm-memory -r -v 8G -i project default
prctl -n project.max-shm-ids -r -v 512 -i project default

So simple yet it took me a week working with Oracle and SUN to come up with that answer...Hope that helps someone out.

Bob
# posted by Blogger Bob : 6:48 AM, April 25, 2008
