Friday May 15, 2015

Deleting stale VNICs

When there are stale VNICs in an Exalogic rack, the Exachk report shows a "Stale VNICs are present in the switches" warning. Looking deeper into that warning, the following risk is described:
VNICs in states other than "UP" can cause network outages. In a virtual rack, excessive number of unused vNICs can cause performance issues.

The steps to delete these stale VNICs can be confusing, so here are some simple steps to do it.

The syntax of the deletevnic command is:
# deletevnic connector vnic_id

OK, but how can I find the connector and vnic_id of each stale VNIC?
Well, they are listed in Exachk.
You can also check which VNICs are stale by running the following command:
# showvnics|grep -i WAIT-IOA
The vnic_id appears in the first column and the connector in the last column.

The output would be something like:
74 WAIT-IOA N 27ABA3F429048 rackcn02 0000 00:14:4F:FB:70:D7 13 0x8007 0A-ETH-2
30 WAIT-IOA N 5E6C3186A4361 rackcn04 0000 00:14:4F:FA:50:D4 13 0x8007 0A-ETH-3
51 WAIT-IOA N BDC7F7A61E5E0 rackcn01 0000 00:14:4F:FA:91:3F 13 0x8007 0A-ETH-3
16 WAIT-IOA N A584D5DA41538 rackcn01 0000 00:14:4F:F8:8D:84 13 0x8007 0A-ETH-4
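
The mapping from that output to deletevnic calls can be sketched as follows. This is only a dry-run helper under stated assumptions: it assumes the column layout shown above (vnic_id first, state second, connector last), and the function name stale_vnic_commands is made up for this example. It only prints the commands so they can be reviewed before being run on the switch.

```python
def stale_vnic_commands(showvnics_output, state="WAIT-IOA"):
    """Build one 'deletevnic connector vnic_id' command per VNIC in the given state.

    Assumes the layout of the sample output: vnic_id in the first column,
    state in the second column and connector in the last column.
    """
    commands = []
    for line in showvnics_output.splitlines():
        fields = line.split()
        # Skip blank lines, headers and VNICs that are not in the stale state.
        if len(fields) < 3 or fields[1].upper() != state:
            continue
        vnic_id, connector = fields[0], fields[-1]
        commands.append(f"deletevnic {connector} {vnic_id}")
    return commands


# The sample lines from the showvnics output above.
sample = """\
74 WAIT-IOA N 27ABA3F429048 rackcn02 0000 00:14:4F:FB:70:D7 13 0x8007 0A-ETH-2
30 WAIT-IOA N 5E6C3186A4361 rackcn04 0000 00:14:4F:FA:50:D4 13 0x8007 0A-ETH-3
51 WAIT-IOA N BDC7F7A61E5E0 rackcn01 0000 00:14:4F:FA:91:3F 13 0x8007 0A-ETH-3
16 WAIT-IOA N A584D5DA41538 rackcn01 0000 00:14:4F:F8:8D:84 13 0x8007 0A-ETH-4
"""

# Dry run: print the commands instead of executing anything.
for command in stale_vnic_commands(sample):
    print(command)
```

For the first sample line this prints "deletevnic 0A-ETH-2 74", which matches the syntax shown earlier.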

However, since VNICs are often associated with compute nodes, some people may worry that deleting a VNIC would affect the networking of the compute node.
That is a perfectly valid concern, but because the vnic_id is unique, you can safely use the deletevnic command for each one of the stale VNICs.

Tuesday Mar 17, 2015

MOS Note 1963189.1 has been Improved with a Workaround

About 2 months ago, I created the MOS Note 1963189.1 - "OEM (Enterprise Manager) Reporting IB Switch Ports As Being Disconnected On Exalogic Physical Compute Nodes Although No Issues With IB Links/Ports".

The note originally described a situation where some false-positive messages are thrown by OEM 12c, incorrectly reporting IB ports as disconnected. Those messages are harmless and can be safely ignored.

But now, an easy-to-implement workaround to avoid these messages has been included in this note as well.

Monday Mar 09, 2015

PSU, Patching and Classpath Problems on Weblogic Server

When applying a new PSU on a Weblogic Server, there are some facts that you must keep in mind:
- First of all, never install a PSU over another one. As stated in MOS Note 1573509.2: “Each PSU will conflict with any prior PSU in the series. To install a subsequent PSU, any existing PSU must first be uninstalled.”
- And what happens if I am unable to uninstall a PSU? That issue is addressed by MOS Note 1349322.1.
- It can also happen that when attempting to apply a patch, a conflict message like the following is thrown: "Patch A is mutually exclusive and cannot coexist with patch(es): B", and when trying to remove patch B, Smart Update fails with a message "Patch not installed: B". Such situations are described in MOS Note 1970064.1.
- Avoid having different PSU levels across your Weblogic installations, even if you have multiple domains or a cluster spread across different physical machines.

And why could the above facts be related to the Classpath?
As can be seen on MOS Note 1509703.1, there could be situations where, after applying a WebLogic Server 10.3.6 PSU, a managed server fails to start when the classpath is provided and started from the admin console. A critical BEA-000362 message is thrown:
<BEA-000362> <Server failed. Reason: [Management:141266]Parsing Failure in config.xml: failed to find method MethodName{methodName='setCacheInAppDirectory', paramTypes=[boolean]} on class>
This happens because in the PSUs for WLS 10.3.6, every time an application deployment is done, a <cache-in-app-directory> element is added to the config.xml file for that application. To parse this new element, the classes from the PSU must be loaded rather than the original classes for application deployment. So specifying WL_HOME/server/lib/weblogic_sp.jar;WL_HOME/server/lib/weblogic.jar in the classpath of Server-start may cause the problem. There is no need to set these in the classpath of Server-start since they will come in from the system classpath. The weblogic_patch.jar must precede weblogic_sp.jar and weblogic.jar -- this ensures that the classes in the patch are loaded rather than the unpatched classes.
The already mentioned MOS Note 1509703.1 contains an additional procedure for deployments after applying a PSU.
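
As a rough illustration of the ordering rule above, here is a small sketch that checks whether weblogic_patch.jar comes before weblogic_sp.jar and weblogic.jar in a classpath string. The function name and the PATCH_DIR placeholder are made up for this example; it is not an Oracle tool, just a way to eyeball a Server-start classpath.

```python
def patch_jar_order_ok(classpath, separator=";"):
    """Return False if weblogic_sp.jar or weblogic.jar appears before weblogic_patch.jar.

    A classpath that lists neither of the two base jars is considered fine,
    since there is nothing that could shadow the patched classes.
    """
    patch_seen = False
    for entry in classpath.split(separator):
        # Keep only the file name, whatever the path separator style is.
        name = entry.strip().replace("\\", "/").rsplit("/", 1)[-1]
        if name == "weblogic_patch.jar":
            patch_seen = True
        elif name in ("weblogic_sp.jar", "weblogic.jar") and not patch_seen:
            return False
    return True


# The problematic Server-start classpath described above.
bad = "WL_HOME/server/lib/weblogic_sp.jar;WL_HOME/server/lib/weblogic.jar"
# PATCH_DIR is a placeholder for wherever weblogic_patch.jar lives on your system.
good = "PATCH_DIR/weblogic_patch.jar;" + bad

print(patch_jar_order_ok(bad))
print(patch_jar_order_ok(good))
```

On Linux the separator would be ":" instead of ";", so pass it explicitly when checking a classpath taken from a Unix start script.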

Same jars/classes included multiple times on a Classpath
Also, note that the same jars/classes sometimes appear multiple times on a classpath (mainly because the command lines, like any other component of the WLS architecture, get modified over time as new versions of Oracle products and customer applications arrive). The JVM searches for classes in the order specified, which is correct in general terms. However, the actual behavior depends on the implementation of the classloader, and there are some potential problems, for example:
- When loading classes within a web framework the deployed jar/war/ear/sar files may be checked before the official classpath.
- And what would happen if two different versions of the same jar are loaded?
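
To spot such repeated entries, something like the following sketch can help. It only compares jar file names, which is an assumption made for this example (two jars with the same name in different directories are treated as the same library), and the sample paths are hypothetical.

```python
from collections import defaultdict


def duplicate_jars(classpath, separator=":"):
    """Map each jar file name that appears more than once to its entries, in classpath order."""
    entries_by_name = defaultdict(list)
    for entry in classpath.split(separator):
        if not entry:
            continue
        # Compare by file name only; the directory part is ignored.
        name = entry.replace("\\", "/").rsplit("/", 1)[-1]
        entries_by_name[name].append(entry)
    return {name: entries for name, entries in entries_by_name.items() if len(entries) > 1}


# Hypothetical classpath with coherence.jar listed twice.
classpath = "/u01/lib/coherence.jar:/u01/app/lib/foo.jar:/u01/app/WEB-INF/lib/coherence.jar"

for name, entries in duplicate_jars(classpath).items():
    # With a plain ordered search, the first entry listed is the one that gets used.
    print(name, "->", entries)
```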

Tuesday Dec 23, 2014

The vServers were migrated to another Compute Node... Oh My!

Recently I was working on 2 similar SRs, both related to this behavior of Exalogic.

When a compute node is rebooted for some reason, the vServers running on it are automatically moved to other compute nodes. Why does this happen? It occurs if the HA flag is enabled on a vServer, which can be verified by looking at the vm.cfg file of that VM.

After the rebooted compute node is back up and running, you will probably want the migrated vServers back where they were before the outage (that is, on the previously failed compute node). Unfortunately, "Live Migration", the ability to migrate running vServers, is not yet supported in Exalogic. By design, migration is not exposed at the vDC level: a vDC cloud user does not care where a vServer runs, it simply runs in the cloud. So, you need to use "Cold Migration" instead.

Basically, cold migration can be executed manually by an admin with "root" privileges on the OVS node:
- stop the vServer; it will be detached from the OVS node
- start the vServer on the OVS node where you want it to run, making sure you do not switch server pools

Wednesday Sep 17, 2014

Working with SSL Certificates in OTD

A somewhat common issue in Oracle Traffic Director (OTD) is getting SSL certificate warnings similar to the one below:
SSL server certificate Admin-Server-Cert is expired.

This typically happens if the Admin SSL CA Cert has expired. To prevent this, the CA/SSL certificates should be renewed before their expiry dates by extending their validity, which can be from 1 to 10 years. There are 2 approaches:
1. To artificially set the Admin-Server host clock
2. To create a new Admin server to replace the old one (but the old configured SSL keys may be lost)

However, at that point it may also happen that you get a certificate for one year and would like it for ten years. And even when the command below runs successfully, the expiry dates are not changed:
./bin/tadm renew-admin-certs --user= --port= --validity=120

The problem is that, without the latest patch, the Admin Node(s) certificate is valid for only 1 year and requires renewal each year. So, to avoid renewing the Admin Node(s) certificate every year, you need to apply the patch MLR#2 (Apr 2014) for OTD version or later. After the patch, the startup banner will show a proper new date, and renewing the Admin Server certificates will also renew the Admin Node(s) certificates for the same number of years.

For further information, please take a look at the following MOS notes:
- Oracle Traffic Director OTD Cannot Communicate Between Admin Server & Administration Node (Doc ID 1561339.1)
- Oracle Traffic Director Admin Server and Admin Node Certificate Validity (Doc ID 1603520.1)
- How to Renew Admin Server SSL Certificate for Oracle Traffic Director? (Doc ID 1549253.1)
- Available Versions, Patches, and Updates for Download for Oracle Traffic Director (OTD) (Doc ID 1676256.1)

Tuesday Aug 12, 2014

A "ZFS Storage Appliance is not reachable" message is thrown by Exachk and the ZFS checks are skipped. WAD?

Sometimes something like the following can be seen in the "Skipped Nodes" section:

Host Name                 Reason
myexalogic01sn01-mgm1     ZFS Storage Appliance is not reachable
myexalogic01sn01-mgm2     ZFS Storage Appliance is not reachable

Also, a message like the following can be seen when executing Exachk:
Could not find infiniband gateway switch names from env or configuration file.
Please enter the first gateway infiniband switch name :
Could not find storage node names from env or configuration file.
Please enter the first storage server :

This happens because Exachk looks for the storage nodes based on the standard naming convention, the "<rack-name>sn0x" format.

To solve this, make sure there is an o_storage.out file in the directory where Exachk is installed. If the file is missing, create a blank one.

The o_storage.out file must contain the correct storage node hostnames in the format they have in the hosts file. This format should typically be something like "<rack-name>sn0x-mgmt". For example, an o_storage.out should look quite simply as below:

This way it is ensured that the o_storage.out file has valid ZFS Storage Appliance hostnames.

If the switch checks are skipped, then a similar procedure should be performed with the o_switches.out file.

Thursday Jun 26, 2014

Data source in suspended state: BEA-001156 error because maximum number of sessions was exceeded in the database (ORA-00018)

Recently, I worked a Service Request where a data source was in suspended state. A BEA-001156 error message could be seen in the log files, and the stack trace (obviously shortened in this example) contained something like the following:
<BEA-001156> <Stack trace associated with message 001129 follows:
java.sql.SQLException: ORA-00018: maximum number of sessions exceeded
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoauthenticate.processError(
at oracle.jdbc.driver.T4CTTIfun.receive(
at oracle.jdbc.driver.T4CTTIfun.doRPC(
at oracle.jdbc.driver.T4CTTIoauthenticate.doOSESSKEY(
at oracle.jdbc.driver.T4CConnection.logon(
at oracle.jdbc.driver.PhysicalConnection.(
at oracle.jdbc.driver.T4CConnection.(
at oracle.jdbc.driver.T4CDriverExtension.getConnection(
at oracle.jdbc.driver.OracleDriver.connect(
at oracle.jdbc.pool.OracleDataSource.getPhysicalConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getPooledConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getXAConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getXAConnection(

Looking at the error message at the top, it is clearly a session handling problem at database level. Note that, depending on how your application is designed/programmed, recursive sessions can be created and sometimes it can be hard to track all of them, even more so in periods of high load.

When this type of issue occurs, the most common solution is to increase the SESSIONS parameter in the init.ora configuration file.

It is usually recommended to preserve 50% of the SESSIONS value for recursive sessions.
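
As a quick worked example of that rule of thumb: if half of SESSIONS is reserved for recursive sessions, only the other half is left for user sessions, so SESSIONS has to be about twice the expected peak. The sketch below does that arithmetic; the function name and the peak of 300 user sessions are made-up values for illustration, not a sizing recommendation.

```python
import math


def sessions_with_recursive_headroom(peak_user_sessions, recursive_share=0.5):
    """Return a SESSIONS value that keeps `recursive_share` of it free for recursive sessions."""
    if not 0 <= recursive_share < 1:
        raise ValueError("recursive_share must be in the range [0, 1)")
    # User sessions may only consume the remaining (1 - recursive_share) fraction.
    return math.ceil(peak_user_sessions / (1 - recursive_share))


# Hypothetical peak of 300 concurrent user sessions.
print(sessions_with_recursive_headroom(300))  # 600
```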

Wednesday Dec 18, 2013

"Cannot allocate memory" message when accessing a Compute Node through SSH, despite ILOM shows available memory

We recently worked an issue where accessing the server through ssh threw a "Cannot allocate memory" message even though ILOM showed available memory.

This happened due to a known bug with a package, related to the ypserv utility:

The problem is that even though there is enough free memory, it is too fragmented to allocate two contiguous pages.

The command below can be used:
# echo 3 > /proc/sys/vm/drop_caches
This may allow memory to become defragmented enough for further fork() system calls to succeed; otherwise, it may be necessary to reboot the system.
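
One way to get a feel for the fragmentation before and after dropping the caches is to look at /proc/buddyinfo, which lists, per memory zone, how many free blocks exist for each order (order 0 is a single page, order 1 is two contiguous pages, and so on). The sketch below is a hypothetical helper, not part of any Oracle procedure; it parses text in that format and counts the free blocks that are at least two pages long. The sample numbers are invented.

```python
def free_blocks_by_zone(buddyinfo_text, min_order=1):
    """Count free blocks of order >= min_order for each (node, zone) in /proc/buddyinfo text."""
    result = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Expected shape: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        counts = [int(value) for value in parts[4:]]
        result[(node, zone)] = sum(counts[min_order:])
    return result


# Invented sample: the Normal zone has plenty of single pages but no larger blocks.
sample = """\
Node 0, zone    DMA32     10      5      2      0      0      0      0      0      0      0      0
Node 0, zone   Normal   4096      0      0      0      0      0      0      0      0      0      0
"""

for (node, zone), blocks in free_blocks_by_zone(sample).items():
    print(f"node {node} zone {zone}: {blocks} free blocks of 2+ contiguous pages")
```

On a live system the text would come from reading /proc/buddyinfo instead of the sample string. A zone showing zero here has free memory only as isolated single pages, which is the situation described above.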

Wednesday Nov 13, 2013

Unable to change NIS password

Over the last few weeks I worked a Service Request where a Linux user was unable to change user passwords in NIS. As a matter of fact, NIS is used by several people in their Exalogic environments.

[root@computenode ~]# passwd user01
Changing password for user user01.
passwd: Authentication token manipulation error
[root@computenode ~]#
In this case, "user01" is an example username.

This issue may occur on all NIS nodes, including the master server.
This error typically corresponds to typos or missing keywords in configuration files from the /etc/pam.d directory.
On the Service Request that I worked, the file system-auth-ac had no nis keyword:

# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth required
auth sufficient nullok try_first_pass
auth requisite uid >= 500 quiet
auth required

account required
account sufficient uid < 500 quiet
account required

password requisite try_first_pass retry=3
password sufficient md5 shadow remember=5 nullok try_first_pass use_authtok
password required

session optional revoke
session required
session [success=1 default=ignore] service in crond quiet use_uid
session required

This kind of issue has also been described in a case where the "" line had a typo ("sufficent" instead of correct "sufficient") on the su file.

To solve this kind of issue, any typos must (obviously) be fixed first.
For this NIS case, it is necessary to make sure that the nis keyword is added:
password sufficient md5 shadow nis remember=5 nullok try_first_pass use_authtok
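
A check like this is easy to script when several nodes have to be compared. The sketch below is a minimal example under one assumption: that the "password sufficient" lines are the ones that need the nis keyword, as in the fixed line above. The function name is made up, and the sample lines are the ones from this post.

```python
def password_lines_missing_nis(pam_text):
    """Return the 'password sufficient' lines that do not carry the nis keyword."""
    missing = []
    for line in pam_text.splitlines():
        tokens = line.split()
        # Skip blank lines and comments.
        if len(tokens) < 2 or tokens[0].startswith("#"):
            continue
        if tokens[0] == "password" and tokens[1] == "sufficient" and "nis" not in tokens[2:]:
            missing.append(line.strip())
    return missing


broken = "password sufficient md5 shadow remember=5 nullok try_first_pass use_authtok"
fixed = "password sufficient md5 shadow nis remember=5 nullok try_first_pass use_authtok"

print(password_lines_missing_nis(broken))  # the broken line is reported
print(password_lines_missing_nis(fixed))   # []
```

Running it against the contents of system-auth-ac on each node gives a quick view of which ones still need the change.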

Note that these settings should be consistent across all the NIS nodes (master server and clients).

Monday Jul 29, 2013

Coherence on Exalogic: dealing with the multiple network interfaces

Recently, we worked an incident where error messages like the following were being thrown when starting the Coherence servers after an upgrade of EECS:
Oracle Coherence GE (thread=Thread-3, member=n/a): Loaded Reporter configuration from "jar:file:/u01/app/fmw_product/wlserver_103/coherence_3.7/lib/coherence.jar!/reports/report-group.xml"
Exception in thread "Thread-3" java.lang.IllegalArgumentException: unresolvable localhost at
Caused by: java.rmi.server.ExportException: Listen failed on port: 8877; nested exception is: Address already in use ...
Caused by: is not a local address

It is a well-known fact that Exalogic has several network interfaces (bond/eth 0,1,2, etc). The logic that Coherence uses to decide which interface to connect to, which specifically supports machines with multiple network interfaces and includes enhancements that allow the localaddress to be specified as a netmask to make configuration across larger clusters easier, makes it important (even more than in previous releases of Coherence) to make sure that the tangosol.coherence.localhost parameter is specified appropriately. From that IP address (or properly mapped host address), the desired network interface can easily be found, and the Coherence cluster will then work fine on it.
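
To illustrate the idea of picking the right address on a multi-homed node, here is a small sketch that selects the local address belonging to the subnet the cluster should use and prints the matching JVM argument. The interface names, addresses and subnet are all hypothetical values for this example; only the tangosol.coherence.localhost property name comes from the text above.

```python
import ipaddress


def pick_coherence_localhost(local_addresses, cluster_subnet):
    """Return (interface, address) for the first local address inside cluster_subnet."""
    network = ipaddress.ip_network(cluster_subnet)
    for interface, address in local_addresses.items():
        if ipaddress.ip_address(address) in network:
            return interface, address
    raise LookupError(f"no local address found in {cluster_subnet}")


# Hypothetical interfaces of a compute node.
interfaces = {
    "eth0": "10.0.0.5",
    "bond0": "192.168.10.5",
    "bond1": "10.1.0.5",
}

interface, address = pick_coherence_localhost(interfaces, "192.168.10.0/24")
print(f"{interface}: -Dtangosol.coherence.localhost={address}")
```

The resulting argument would go into the start script of each Coherence server, with the address of that particular node.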

Thursday Mar 28, 2013

ClassCastException thrown when running Coherence with Exabus IMB enabled

Today I worked a Service Request about a Coherence issue on an Exalogic platform. It is a very interesting issue (at least for me).

An exception message like the following is thrown when running Coherence with IMB on a WLS server:

The cause of this problem is that Coherence runs into a classloading issue when:
- using to enforce the child-first classloading
- coherence.jar is both in the system classpath and application classpath
- and Exabus IMB is enabled

In newer versions of WLS (12c), coherence.jar is in system classpath, so by default Coherence classes will be loaded from the system classpath. For situations where is required child first class loading semantics, and should be specified over configuration inside weblogic.xml to change the classloading order.

To solve this, add the following into weblogic.xml:

Friday Jan 18, 2013

Two New Exalogic ZFS Storage Appliance MOS Notes

This week I closed 2 Service Requests related to the ZFS Storage Appliance and created My Oracle Support (MOS) notes from both of them. They were not complicated issues, and both SRs were closed in less than one week, but these procedures were still not formally documented on MOS. Information about the new documents can be seen below.

MOS Doc Id. 1519858.1 - Will The Restart Of The NIS Service On The ZFS Storage Appliance Affect The Mounted Filesystems?

In this case, it was necessary to restart the NIS service for a particular reason. So, if the NIS service needs to be restarted on the ZFS Storage Appliance for any reason, will the mounted filesystems be affected during the restart?

The default cluster configuration type of the ZFS storage appliance is active-passive and the storage nodes are supposed to be mirrored, so the restart of NIS should not be causing any issues; it can be done.

Note that restart of NIS should be done on the active storage head. Restarting the NIS itself will not cause any ZFS failover from Active to Passive.

In general terms, even in the event of a storage node failure, the appliance will automatically fail over to the other storage node. Under that condition, an initial degradation in performance can be expected because all of the cached data on the failed node is gone, but this effect decreases as the new active storage node begins caching data in its own SSDs.

MOS Doc Id. 1520223.1 - Exalogic Storage Nodes Hostnames Are Displayed Incorrectly

This was not the first time I saw something like this, so I decided to create a note because it is clearly a problem that may affect more than one Exalogic user.

The Exalogic storage node hostnames displayed on the BUI were different than the ones displayed when accessing the node through SSH or ILOM.

This happens because, for some reason, the hostname is misconfigured on the ZFS Storage Appliance.

To solve this problem, it is necessary to set the system name and location accordingly on the Storage Appliance nodes BUI:
1. Log in to the ZFS Storage Appliance BUI
2. Go to the "Configuration" tab, and select the "Services" subtab
3. Under the "Systems Settings" section, click on "System Identity"
4. Set the system name and location accordingly

Wednesday Dec 12, 2012

Managed servers getting down regularly by Node Manager. WAD?

Recently I have been working on a service request where several instances were running, and several technologies were being used, including SOA, BAM, BPEL and others.

At first glance, this may seem to be a Node Manager problem. But in this situation, the problem was actually at the JMS Persistent Store level. Node Manager can automatically restart Managed Servers that have the "failed" health state, or that have shut down unexpectedly due to a system crash or reboot. As a matter of fact, from the provided log files it was clear that the instance was becoming unhealthy because of a Persistent Store problem.

So finally, the problem here was not with Node Manager as it was working as designed, and the restart was being caused by the Persistent Store. After this Persistent Store problem was fixed, everything went fine.

This particular issue that I worked was on an Exalogic machine, but note that this may happen on any hardware running Weblogic.

Friday May 18, 2012

CONNECTION_REFUSED messages on load balancing in Weblogic with OHS or Apache

In the last few months I have had to work on some issues related to load balancing. It is very important to understand how the layers interact with each other and where specific settings must be made.

Some people get upset by the fact that OHS/Apache load balances even to servers that are shut down, which may cause transactions to be lost.

This document provides very good tips about how many Production critical issues can be resolved just by setting the appropriate values for some parameters.

Personally, I think that the DynamicServerList parameter (which is in fact the first one mentioned on the document linked above) is particularly important to understand. As can be seen at this documentation from Oracle:
In a clustered environment, a plug-in may dispatch requests to an unavailable WebLogic Server instance because the DynamicServerList is not current in all plug-in processes.
DynamicServerList=ON works with a single Apache server (httpd daemon process), but for more than one server, such as StartServers=5, the dynamic server list will not be updated across all httpd instances until they have all tried to contact a WebLogic Server instance. This is because they are separate processes. This delay in updating the dynamic server list could allow an Apache httpd process to contact a server that another httpd process has marked as dead. Only after such an attempt will the server list be updated within the proxy. One possible solution if this is undesirable is to set the DynamicServerList to OFF.
In a non-clustered environment, a plug-in may lose the stickiness of a session created after restarting WebLogic Server instances, because some plug-in processes do not have the new JVMID of those restarted servers, and treat them as unknown JVMIDs.
To avoid these issues, upgrade to Apache 2.0.x and configure Apache to use the multi-threaded and single-process model, mpm_worker_module.

Also, this Oracle documentation provides important information about "Failover, Cookies, and HTTP Sessions", and "Tuning to Reduce Connection_Refused Errors".

As can be seen at this Apache document, the MaxRequestsPerChild directive sets the limit on the number of requests that an individual child server will handle during its life.

Note that mod_proxy and related modules implement a proxy/gateway for Apache HTTP Server, supporting a number of popular protocols as well as several different load balancing algorithms. Third-party modules can add support for additional protocols and load balancing algorithms.

On Oracle Forums I also found a very interesting thread:
The error which you are getting is a common which can be fixed by increasing the "AcceptBackLog" value by 25% until error disappears from weblogic console (Path: Servers => => Configuration tab=> Tuning sub-tab.) and setting the value to ON for "KeepAlive" in the httpd.conf which should take care of your issue.
Topic: Tuning Connection Backlog Buffering
Search for "KeepAliveEnabled":
Also here is a link which would be helpful to understand some common issue which occurs when using a plug-in and there are solutions:
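
To make the "increase by 25% until the error disappears" advice from that thread concrete, the sketch below lists the values you would try in successive rounds. The starting value of 300 is just an assumed example, and the function name is made up; each value would be set in the console and observed under load before moving to the next one.

```python
import math


def backlog_steps(start, rounds, growth=0.25):
    """Return the starting value followed by `rounds` successive increases of `growth`."""
    values = [start]
    for _ in range(rounds):
        # Round up so every step is a whole number of connections.
        values.append(math.ceil(values[-1] * (1 + growth)))
    return values


# Assumed starting value of 300, three tuning rounds.
print(backlog_steps(300, 3))  # [300, 375, 469, 587]
```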

May transactions be affected because of this?
Certainly yes, but it depends on how your application is developed. A good practice would be to create a bunch of transactions and track them to check if some are missed or not. This Transaction and redelivery in JMS article may be helpful.

Wednesday Apr 18, 2012

REP-0178 - "Reports Server cannot establish connection" Error Message

During this last week I saw an interesting thread in Oracle Forums about this error message, and wanted to share the findings that I posted as an answer in the forum thread:
REP-0178: Reports Server [server_name] cannot establish connection

This problem may occur when the wrong rwclient is picked. Perhaps the environment is not set appropriately before rwclient is called, or the rwclient.bat/ in /bin is found and used. That one is only a template which, at install time, was used to create the valid rwclient.bat/ in /config/reports/bin (in fact, when the instance was actually configured).

So you can try using the appropriate rwclient.bat/, as it calls rwclient.exe/rwclient after setting the environment. Either put the directory /config/reports/bin before /bin in the PATH, or specify the full path to rwclient.bat/
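
The effect of the PATH order can be shown with a small sketch that mimics the lookup: the first directory on the PATH that contains the script wins. Everything here is hypothetical demo data, including the directory names template_bin and instance_bin and the script name rwclient.sh; the real lookup done by the shell also checks that the file is executable, which this sketch skips.

```python
import os
import tempfile


def first_match_on_path(command, path_value, pathsep=os.pathsep):
    """Return the first file named `command` found in the directories of path_value."""
    for directory in path_value.split(pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, command)
        if os.path.isfile(candidate):
            return candidate
    return None


with tempfile.TemporaryDirectory() as root:
    # Two demo directories that both contain a script with the same name.
    template_bin = os.path.join(root, "template_bin")
    instance_bin = os.path.join(root, "instance_bin")
    for directory in (template_bin, instance_bin):
        os.makedirs(directory)
        open(os.path.join(directory, "rwclient.sh"), "w").close()

    wrong_order = os.pathsep.join([template_bin, instance_bin])
    right_order = os.pathsep.join([instance_bin, template_bin])
    picked_wrong = first_match_on_path("rwclient.sh", wrong_order)
    picked_right = first_match_on_path("rwclient.sh", right_order)

print(os.path.basename(os.path.dirname(picked_wrong)))  # template_bin
print(os.path.basename(os.path.dirname(picked_right)))  # instance_bin
```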

Another possibility is that the services were started as root and therefore some log files have been created as the root user. In that case, the regular Oracle user (which is the owner of the installation) no longer has write access to these log files:

If that's the case, then try to change the owner of the following log files to the ORACLE user:
And then restart the report server and run the command to generate the reports again.

Principal Technical Support Engineer in the Engineered (Systems) Enterprise Support Team - EEST.
Former member of the Coherence and Java Technologies Support Teams.

