Friday May 15, 2015

Deleting stale VNICs

When there are stale VNICs in an Exalogic rack, the Exachk report shows a "Stale VNICs are present in the switches" warning. Looking deeper into that warning, the following risk is described:
VNICs in states other than "UP" can cause network outages. In a virtual rack, excessive number of unused vNICs can cause performance issues.

The steps to delete these stale VNICs can be confusing, so I am posting some simple steps here.

The syntax of the deletevnic command is:
# deletevnic connector vnic_id

OK, but how do you find the connector and vnic_id of each stale VNIC?
They are listed in the Exachk report.
You can also check which VNICs are stale by running the following command:
# showvnics|grep -i WAIT-IOA
The vnic_id is in the first column and the connector in the last column.

The output would be something like:
74 WAIT-IOA N 27ABA53803F429048 rackcn02 0000 00:14:4F:FB:70:D7 13 0x8007 0A-ETH-2
30 WAIT-IOA N 5E6C31804046A4361 rackcn04 0000 00:14:4F:FA:50:D4 13 0x8007 0A-ETH-3
51 WAIT-IOA N BDC7F7A200761E5E0 rackcn01 0000 00:14:4F:FA:91:3F 13 0x8007 0A-ETH-3
16 WAIT-IOA N A584D51705DA41538 rackcn01 0000 00:14:4F:F8:8D:84 13 0x8007 0A-ETH-4

However, since VNICs are often associated with compute nodes, some people worry that deleting a VNIC would affect the networking of the compute node.
The answer to that perfectly valid concern is that each vnic_id is unique, so deletevnic only removes the VNIC you name, and you can safely run it for each one of the stale VNICs.
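
To avoid copying IDs by hand, the lookup above can be turned into ready-to-review commands. This is only a sketch: it assumes the column layout shown in the sample output (vnic_id first, state second, connector last) and that awk is available where you run it. The function name is my own, and it only prints the commands, it does not execute them.

```shell
# Sketch: turn "showvnics" output into deletevnic commands for review.
# Assumes vnic_id is the first column, the state the second, and the
# connector the last one, as in the sample output in this post.
stale_vnic_cmds() {
  awk 'toupper($2) == "WAIT-IOA" { print "deletevnic " $NF " " $1 }'
}

# Real use (hypothetical, review the output before running anything):
#   showvnics | stale_vnic_cmds
# Demo with two of the sample rows from this post:
printf '%s\n' \
  '74 WAIT-IOA N 27ABA53803F429048 rackcn02 0000 00:14:4F:FB:70:D7 13 0x8007 0A-ETH-2' \
  '30 WAIT-IOA N 5E6C31804046A4361 rackcn04 0000 00:14:4F:FA:50:D4 13 0x8007 0A-ETH-3' |
  stale_vnic_cmds
```

For the two sample rows this prints "deletevnic 0A-ETH-2 74" and "deletevnic 0A-ETH-3 30", which matches the "deletevnic connector vnic_id" syntax.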

Tuesday Mar 17, 2015

MOS Note 1963189.1 has been Improved with a Workaround

About 2 months ago, I created the MOS Note 1963189.1 - "OEM (Enterprise Manager) Reporting IB Switch Ports As Being Disconnected On Exalogic Physical Compute Nodes Although No Issues With IB Links/Ports".

It originally described a situation where OEM 12c throws false-positive messages, incorrectly reporting IB ports as disconnected. Those messages are harmless and can be safely ignored.

The note now also includes an easy-to-implement workaround to avoid these messages.

Monday Mar 09, 2015

PSU, Patching and Classpath Problems on Weblogic Server

When applying a new PSU on a WebLogic Server, there are some facts that you must keep in mind:
- First of all, never install a PSU over another one. As stated in MOS Note 1573509.2: “Each PSU will conflict with any prior PSU in the series. To install a subsequent PSU, any existing PSU must first be uninstalled.”
- And what happens if you are unable to uninstall a PSU? That issue is addressed by MOS Note 1349322.1.
- It can also happen that applying a patch fails with a conflict message such as "Patch A is mutually exclusive and cannot coexist with patch(es): B", and that when you then try to remove patch B, Smart Update fails with the message "Patch not installed: B". Such situations are described in MOS Note 1970064.1.
- Avoid running different PSU levels within the same WebLogic environment, even if you have multiple domains or clusters spread across different physical machines.

And why could the above facts be related to the Classpath?
As can be seen on MOS Note 1509703.1, there could be situations where, after applying a WebLogic Server 10.3.6 PSU, a managed server fails to start when the classpath is provided and started from the admin console. A critical BEA-000362 message is thrown:
<BEA-000362> <Server failed. Reason: [Management:141266]Parsing Failure in config.xml: failed to find method MethodName{methodName='setCacheInAppDirectory', paramTypes=[boolean]} on class>
This happens because, with the PSUs for WLS 10.3.6, every application deployment adds a <cache-in-app-directory> element to the config.xml file for that application. To parse this new element, the PSU classes must be loaded rather than the original classes for application deployment. So specifying WL_HOME/server/lib/weblogic_sp.jar;WL_HOME/server/lib/weblogic.jar in the Server Start classpath may cause the problem. There is no need to set these jars in the Server Start classpath, since they come in from the system classpath. In any case, weblogic_patch.jar must precede weblogic_sp.jar and weblogic.jar; this ensures that the classes in the patch are loaded rather than the unpatched classes.
The already mentioned MOS Note 1509703.1 contains an additional procedure for deployments after applying a PSU.
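
The ordering rule is easy to verify on a classpath string before restarting anything. The helper below is a rough sketch, not something taken from the MOS note: the function names and the example paths are made up for illustration, and entries are assumed to use forward slashes.

```shell
# Print the 1-based position of the first classpath entry whose file name
# equals $2, or nothing if it is absent. $1 is the classpath and $3 the
# separator (defaults to ":"; pass ";" for Windows-style classpaths).
cp_index() {
  printf '%s\n' "$1" | tr "${3:-:}" '\n' | awk -v jar="$2" '
    { n = split($0, parts, "/"); if (parts[n] == jar) { print NR; exit } }'
}

# Succeed only if weblogic_patch.jar is present and comes before any
# weblogic_sp.jar / weblogic.jar entry on the classpath.
patch_precedes() {
  p=$(cp_index "$1" weblogic_patch.jar "$2")
  [ -n "$p" ] || return 1
  for jar in weblogic_sp.jar weblogic.jar; do
    i=$(cp_index "$1" "$jar" "$2")
    if [ -n "$i" ] && [ "$i" -lt "$p" ]; then return 1; fi
  done
  return 0
}

# Hypothetical classpath, for illustration only.
cp='/opt/wl/patches/weblogic_patch.jar:/opt/wl/server/lib/weblogic_sp.jar:/opt/wl/server/lib/weblogic.jar'
if patch_precedes "$cp"; then
  echo "order ok"
else
  echo "weblogic_patch.jar is missing or comes too late"
fi
```

A failure here does not prove that the BEA-000362 error will occur, it only flags a classpath that breaks the ordering described above.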

Same jars/classes included multiple times on a Classpath
Also, note that the same jars/classes sometimes appear multiple times on a classpath. This happens mainly because command lines, like any other component of the WLS architecture, get modified over time as new versions of Oracle products and customer applications are introduced. The JVM searches the entries in the order specified, so in general terms the first occurrence is the one that gets used, although the exact behavior depends on the classloader implementation. There are some potential problems, for example:
- When loading classes within a web framework, the deployed jar/war/ear/sar files may be checked before the official classpath.
- And what would happen if two different versions of the same jar are on the classpath?
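
A quick way to spot repeated jars is to compare file names across the classpath entries. This is a small sketch using standard tools only. The function name is my own, and it flags names that appear in more than one entry, which is exactly the case where two different versions of a jar could be hiding.

```shell
# List file names that occur more than once on a classpath.
# $1 is the classpath, $2 the separator (defaults to ":").
cp_duplicates() {
  printf '%s\n' "$1" | tr "${2:-:}" '\n' | sed 's|.*/||' | sort | uniq -d
}

# Hypothetical classpath, for illustration only.
cp_duplicates '/opt/app/lib/foo.jar:/opt/wl/lib/bar.jar:/opt/old/lib/foo.jar'
```

For the example above this prints "foo.jar". Whether the two copies are really different versions still has to be checked by hand, for instance by comparing their manifests.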

Tuesday Dec 23, 2014

The vServers were migrated to another Compute Node... Oh My!

Recently I was working on two similar SRs, both related to this behavior of Exalogic.

When a compute node is rebooted for some reason, the vServers running on it are automatically moved to other compute nodes. Why does this happen? It occurs if the HA flag is enabled on a vServer, which can be verified by looking at the vm.cfg file of that VM.

Once the rebooted compute node is back up and running, you will probably want the migrated vServers to be located where they were before the outage (that is, on the previously failed compute node). Unfortunately, "Live Migration", the ability to migrate running vServers, is not yet supported in Exalogic. By design, migration is not exposed at the vDC level: a vDC cloud user does not care where a vServer runs, it runs in the cloud. So, you need to use "Cold Migration" instead.

Basically, cold migration can be performed manually by an administrator with "root" privileges on the OVS node:
- Stop the vServer; it will be detached from its OVS node.
- Start the vServer on the OVS node where you want it to run, making sure you do not switch server pools.

Wednesday Dec 17, 2014

Checking the Oracle JDBC Driver Version on a Weblogic Server

The JDBC driver is typically located in the WL_HOME/server/lib directory of the installation. The file is ojdbc7.jar or ojdbc6.jar (for newer versions of WLS), or ojdbc14.jar (for older versions of WLS).

- One way to check the JDBC driver version is to open the ojdbc jar file, go into the META-INF folder, and open the "MANIFEST.MF" file. The version can be seen next to "Specification-Version".

- Another way is to run the command below from the location mentioned above:
   java -jar ojdbc6.jar -getversion

Note that you must use the JDBC JAR file intended for the version of the JDK that you are running. For example, "ojdbc5.jar" is intended for use with JDK 1.5. So if you run JDK 1.5 with the JDBC driver JAR file "ojdbc6.jar" then a "java.lang.UnsupportedClassVersionError: Bad version number in .class file" error message will be thrown when performing this check.
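
The manifest check can also be done from the command line without unpacking the jar by hand. The sketch below assumes that unzip is installed. It prints the Specification-Version line mentioned above and also Implementation-Version, a standard manifest attribute that may or may not be set in your jar. The function name and the sample values are illustrative only.

```shell
# Print the version lines from manifest text supplied on stdin.
# Manifests often use CRLF line endings, so carriage returns are removed.
manifest_versions() {
  tr -d '\r' | grep -E '^(Specification|Implementation)-Version:'
}

# Real use (assuming unzip is available):
#   unzip -p "$WL_HOME/server/lib/ojdbc6.jar" META-INF/MANIFEST.MF | manifest_versions
# Demo with made-up manifest content:
printf 'Manifest-Version: 1.0\r\nSpecification-Version: 4.0\r\nImplementation-Version: 11.2.0.3.0\r\n' |
  manifest_versions
```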

Wednesday Sep 17, 2014

Working with SSL Certificates in OTD

An issue in Oracle Traffic Director (OTD) that has become somewhat common is getting SSL certificate warnings similar to the one below:
SSL server certificate Admin-Server-Cert is expired.

This typically happens if the Admin SSL CA Cert has expired. To prevent this, the CA/SSL certificates should be renewed before their expiry dates by extending their validity, which can be from 1 to 10 years. There are 2 approaches:
1. Artificially set the Admin-Server host clock
2. Create a new Admin server to replace the old one (but this may lose the old configured SSL keys)

However, at that point it may also happen that you get a certificate valid for one year when you wanted ten. And even when the command below runs successfully, the expiry dates are not changed:
./bin/tadm renew-admin-certs --user= --port= --validity=120

The problem is that, without the latest patch, the Admin Node(s) certificate is valid for only 1 year and requires renewal each year. So, to avoid renewing the Admin Node(s) certificate every year, you need to apply the patch MLR#2 (Apr 2014) for OTD version or later. After the patch, the startup banner will show a proper new date, and renewing the Admin Server certificates will also renew the Admin Node(s) certificates for the same number of years.

For further information, please take a look at the following MOS notes:
- Oracle Traffic Director OTD Cannot Communicate Between Admin Server & Administration Node (Doc ID 1561339.1)
- Oracle Traffic Director Admin Server and Admin Node Certificate Validity (Doc ID 1603520.1)
- How to Renew Admin Server SSL Certificate for Oracle Traffic Director? (Doc ID 1549253.1)
- Available Versions, Patches, and Updates for Download for Oracle Traffic Director (OTD) (Doc ID 1676256.1)

Wednesday Aug 20, 2014

How to set a Static Route on a Storage Node

To set up a host route to an IP address, here are the procedures for the CLI and the BUI. You need to know the destination, mask, interface and network. Note that the values shown here are just examples.

- Log in to the CLI and run the commands below:
configuration net routing create
set family=IPv4
set destination=
set mask=32
set gateway=
set interface=igb0

- Log in to the web UI (BUI) of the ZFSSA NAS head
- Click Configuration -> Network -> Routing -> (+)
- In the popup window that is displayed, enter the values accordingly.

Either of the two procedures above should get your desired route in place.

Tuesday Aug 12, 2014

A "ZFS Storage Appliance is not reachable" message is thrown by Exachk and the ZFS checks are skipped. WAD?

Sometimes something like the following can be seen in the "Skipped Nodes" section:

Host Name               Reason
myexalogic01sn01-mgm1   ZFS Storage Appliance is not reachable
myexalogic01sn01-mgm2   ZFS Storage Appliance is not reachable

Also, a message like the following can be seen when executing Exachk:
Could not find infiniband gateway switch names from env or configuration file.
Please enter the first gateway infiniband switch name :
Could not find storage node names from env or configuration file.
Please enter the first storage server :

This happens because Exachk relies on the standard "<rack-name>sn0x" naming convention to find the storage nodes.

To solve this, make sure there is an o_storage.out file in the directory where Exachk is installed. If the file is missing, create a blank one.

The o_storage.out file must contain the correct storage node hostnames, in the same form they have in the hosts file. This form should typically be something like "<rack-name>sn0x-mgmt".

This ensures that the o_storage.out file has valid ZFS Storage Appliance hostnames.
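
If the hostnames follow the "<rack-name>sn0x-mgmt" pattern, they can be pulled straight out of the hosts file. The sketch below is only a starting point: the function name is my own, and writing one hostname per line into o_storage.out is an assumption on my side, so review the result before running Exachk again.

```shell
# Print every hostname in a hosts-format file ($1) that ends in
# "sn0<digit>-mgmt", skipping comment lines. Column 1 is the IP address,
# so the scan starts at column 2.
storage_mgmt_hosts() {
  awk '$1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i ~ /sn0[0-9]-mgmt$/) print $i }' "$1" | sort -u
}

# Hypothetical use, run from the Exachk directory:
#   storage_mgmt_hosts /etc/hosts > o_storage.out
```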

If the switch checks are skipped, then a similar procedure should be performed with the o_switches.out file.

Thursday Jun 26, 2014

Data source in suspended state: BEA-001156 error because maximum number of sessions was exceeded in the database (ORA-00018)

Recently, I worked on a Service Request where a data source was in suspended state. A BEA-001156 error message could be seen in the log files, and the stack trace (obviously shortened in this example) contained something like the following:
<BEA-001156> <Stack trace associated with message 001129 follows:
java.sql.SQLException: ORA-00018: maximum number of sessions exceeded
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoer.processError(
at oracle.jdbc.driver.T4CTTIoauthenticate.processError(
at oracle.jdbc.driver.T4CTTIfun.receive(
at oracle.jdbc.driver.T4CTTIfun.doRPC(
at oracle.jdbc.driver.T4CTTIoauthenticate.doOSESSKEY(
at oracle.jdbc.driver.T4CConnection.logon(
at oracle.jdbc.driver.PhysicalConnection.(
at oracle.jdbc.driver.T4CConnection.(
at oracle.jdbc.driver.T4CDriverExtension.getConnection(
at oracle.jdbc.driver.OracleDriver.connect(
at oracle.jdbc.pool.OracleDataSource.getPhysicalConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getPooledConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getXAConnection(
at oracle.jdbc.xa.client.OracleXADataSource.getXAConnection(

Looking at the error message at the top, it is clearly a session handling problem at the database level. Note that, depending on how your application is designed/programmed, recursive sessions can be created, and sometimes it can be hard to track all of them, even more so in periods of high load.

When this type of issue occurs, the most common solution is to increase the SESSIONS parameter in the init.ora configuration file.

It is usually recommended to preserve 50% of the SESSIONS value for recursive sessions.
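
As a back-of-the-envelope illustration of that 50% rule: if the application side can open up to N sessions at peak, SESSIONS would need to be at least 2 x N. The sketch below uses hypothetical inputs (maximum pool capacity per server and number of managed servers) purely to show the arithmetic. It is not a sizing formula from Oracle.

```shell
# sessions_needed POOL_MAX SERVERS
# Peak application sessions = POOL_MAX * SERVERS; doubling that keeps
# half of the SESSIONS value free for recursive sessions.
sessions_needed() {
  echo $(( $1 * $2 * 2 ))
}

# Example: 4 managed servers, each with a pool of up to 50 connections.
sessions_needed 50 4   # prints 400
```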

Wednesday Jun 11, 2014

Setting MTU on Exalogic

For many reasons, a system administrator may want to change the MTU settings of a server. But in a system like Exalogic which contains lots of interconnected nodes and other various components, it's important to understand how this applies to the different networks.

For example, when bringing up bonding of InfiniBand an error like the following may be thrown:
Bringing up interface bond1: SIOCSIFMTU: Invalid argument
Both the ifcfg-ib0 and ifcfg-ib1 scripts (from the /etc/sysconfig/network-scripts/ directory) have MTU set to 65500, which is a valid MTU value only if all IPoIB slaves operate in connected mode and are configured with the same value. So connected mode must be enabled in both network scripts, and the network must then be restarted.
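
To see which mode an IPoIB interface is really using, the Linux IPoIB driver exposes it in sysfs as /sys/class/net/<interface>/mode, with the value "datagram" or "connected". The helper below is a small sketch around that file, and the function name is my own. As for the setting itself, network scripts on Oracle Linux / RHEL style systems conventionally use CONNECTED_MODE=yes in the ifcfg file, but treat that as an assumption and check it against your release.

```shell
# Print the IPoIB mode of interface $1 ("datagram" or "connected"),
# or "unknown" if the sysfs file is not there.
# $2 optionally overrides the sysfs root (defaults to /sys).
ipoib_mode() {
  f="${2:-/sys}/class/net/$1/mode"
  if [ -r "$f" ]; then cat "$f"; else echo unknown; fi
}

for ifc in ib0 ib1; do
  echo "$ifc: $(ipoib_mode "$ifc")"
done
```

If both slaves report "connected", the large MTU is acceptable. In datagram mode the kernel rejects it with the SIOCSIFMTU error shown above.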

By the way, an error of the form “SIOCSIFMTU: Invalid argument” indicates that the requested MTU was rejected by the kernel. Typically this is because it exceeds the maximum value supported by the interface hardware. In that case you must either reduce the MTU to a value that is supported or obtain more capable hardware. This problem has been seen when trying to modify the MTU with the ifconfig command, as in the example below:
[root@elxxcnxx ~]# ifconfig ib1 mtu 65520
SIOCSIFMTU: Invalid argument

It is worth stressing that in most cases the nodes must be rebooted after the MTU size has been changed. Although in some circumstances the change may work without a reboot, that is not how it is typically documented.

Now, to reduce memory consumption and improve performance for network traffic received on IPoIB related interfaces, it is recommended to reduce the MTU value in the interface configuration files for IPoIB related bonds from 65520 to 64000. The change needs to be made under the /etc/sysconfig/network-scripts directory, in the interface configuration files for bonds over IPoIB related slave devices, for example /etc/sysconfig/network-scripts/ifcfg-bond1. However, keep in mind that the numeric portion of the interface filenames that correspond to IPoIB interfaces is expected to vary across compute nodes and vServers, so it cannot be relied upon to identify which interface files are for bonds over IPoIB rather than EoIB related slave interfaces.
To set these MTU values to the recommended settings, there are very useful instructions and a script in MOS Note 1624434.1, which applies to both physical and virtual configurations of Exalogic.

Regarding EoIB related interfaces, the maximum appropriate MTU value is 1500. If for some reason a vServer has been created with a higher value (set in the /etc/sysconfig/network-scripts/ifcfg-bond0 file), it must be fixed. An error like the following could be thrown in this circumstance:
[root@vServer ~]# service network restart
Bringing up interface bond0:  SIOCSIFMTU: Invalid argument

Also an error like the one below can be seen on the /var/log/messages file of the vServer:
kernel: T5074835532 [mlx4_vnic] eth1:vnic_change_mtu:360: failed: new_mtu 64000 > 2026
The MOS Note 1611657.1 is very useful for this purpose.
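
Before changing anything, it helps to see what every bond is currently set to. The sketch below just lists the MTU configured in each ifcfg-bond* file. It makes no attempt to tell IPoIB bonds from EoIB bonds, since the file names cannot be relied upon for that, so you still have to match each bond to its slaves yourself. The function name is my own.

```shell
# Print "<file> <mtu>" for each ifcfg-bond* file that sets MTU.
# $1 is the directory to scan (defaults to the usual network-scripts path).
bond_mtus() {
  for f in "${1:-/etc/sysconfig/network-scripts}"/ifcfg-bond*; do
    [ -f "$f" ] || continue
    # Take the last MTU= line and drop any surrounding quotes.
    mtu=$(sed -n 's/^MTU=//p' "$f" | tr -d "\"'" | tail -n 1)
    if [ -n "$mtu" ]; then echo "${f##*/} $mtu"; fi
  done
}

bond_mtus
```

Compare the result with the values discussed above: 64000 for bonds over IPoIB slaves and at most 1500 for bonds over EoIB slaves.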

Friday May 30, 2014

Running Mixed Physical and Virtual Exalogic Elastic Cloud Software Versions in an Exalogic Rack is now Supported

Although it was not supported in older versions, as of EECS 2.0.6 an Exalogic rack can be configured in mixed mode, half virtual and half physical Linux:
  • Flexibility to have physical and virtual environments on same rack. For example, production on physical and test/dev on virtual.
  • Exalogic Control manages the virtual compute nodes on the rack. Physical compute nodes are managed manually (including PKeys).
  • Option to change full physical to hybrid and hybrid to full virtual rack.
  • User has an option to choose either the top or bottom nodes for physical or virtual deployment.
For further information about how the compute nodes can be split up on the rack (into bottom or top half) to run either Oracle Virtual Server (OVS "hypervisor") or Oracle Linux, please take a look at MOS Note 1536945.1.

Note: Solaris is not yet supported in the mixed configuration.

Monday Apr 28, 2014

Exalogic eBook: "The Logical Choice for Running Business Applications"

Oracle is pleased to announce the Oracle Exalogic eBook – an interactive asset packed with key product information, customer references, links to useful assets and much more.

I have been reading it, and I recommend that everyone read it as well. It is really amazing to see how an IT architecture can be improved with Exalogic.

To read or download the eBook, please go to this link.

Thursday Apr 10, 2014

Working with SCAN on Exalogic

Lately I have seen more Exalogic customers using SCAN, so I decided to write a brief article about it. Oracle has plenty of documentation on SCAN, but none of it is specific to Exalogic.

Single Client Access Name (SCAN) is an Oracle RAC feature that provides a single name for clients to access any Oracle Database running in a cluster. Some of its benefits and advantages include:
- Client’s connect information does not need to change if you add or remove nodes or databases in the cluster
- Fast Connection Failover (FCF)
- Runtime Connection Load-Balancing (RCLB)
- Can be implemented with MultiDataSource or GridLink
- It can also be used with Oracle JDBC 11g Thin Driver (this is clearly explained on MOS Note 1290193.1)

In the particular case of Exalogic, a typical architecture, widely used by customers, is having it connected to an Exadata machine (which hosts the database) through InfiniBand. Obviously the SCAN feature can be used within this Engineered Systems architecture. As a matter of fact, GridLink is part of the Exalogic-specific enhancements of Weblogic.
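
From the client side, using SCAN mostly means putting the SCAN name where a single host name used to be in the JDBC URL. The sketch below builds a thin-driver URL in the short "//host:port/service" form. The function name, the SCAN name and the service name are made up for illustration, and a GridLink data source may well use the long (DESCRIPTION=...) form instead.

```shell
# scan_jdbc_url SCAN_NAME SERVICE_NAME [PORT]
# Build a thin-driver JDBC URL that points at the SCAN name.
# The port defaults to 1521.
scan_jdbc_url() {
  printf 'jdbc:oracle:thin:@//%s:%s/%s\n' "$1" "${3:-1521}" "$2"
}

# Hypothetical SCAN and service names.
scan_jdbc_url exadata-scan.example.com myservice
```

This prints jdbc:oracle:thin:@//exadata-scan.example.com:1521/myservice. The URL stays the same when nodes are added to or removed from the cluster, which is the first benefit listed above.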

Some facts to keep in mind when using it:
- SCAN feature is supported on JDBC version and above
- As in any situation where a database connection is involved, be aware that firewalls may cause network adapter or timeout issues, which must be resolved before the connection can be established
- If using VIP hosts, instead of cluster configuration having the short host name for the VIP hosts, you should set LOCAL_LISTENER to the fully qualified domain name (e.g.

Wednesday Mar 26, 2014

Independent Clusters

Some days ago, I was asked whether it is possible to have an Oracle Service Bus (OSB) domain with 2 independent clusters. The answer is no. In a domain you can have one and only one OSB cluster. An OSB cluster can coexist with other clusters, such as WLS-only clusters, a SOA Suite cluster, etc., but not with another OSB cluster. As a matter of fact, in this case the observed behavior was that all services were assigned to the first OSB cluster created, and the newer one was not responding to URL calls. Works as designed.

However, is it possible to have 2 independent clusters in a WLS domain? In this case, the answer is yes. Below you can check which types of objects can be clustered and which cannot.

The following types of objects can be clustered in a WebLogic Server deployment:
  • Servlets
  • JSPs
  • EJBs
  • Remote Method Invocation (RMI) objects
  • Java Messaging Service (JMS) destinations
The following APIs and internal services cannot be clustered in WebLogic Server:
  • File services including file shares
  • Time service

Wednesday Dec 18, 2013

"Cannot allocate memory" message when accessing a Compute Node through SSH, despite ILOM shows available memory

We recently worked on an issue where accessing a server through ssh threw a "Cannot allocate memory" message, even though ILOM showed available memory.

This happened due to a known bug in a package related to the ypserv utility.

The problem is that even though there is enough free memory, it is too fragmented to allocate two contiguous pages.

The command below can be used:
# echo 3 > /proc/sys/vm/drop_caches
This may allow memory to become defragmented enough for further fork() system calls to succeed; otherwise it may be necessary to reboot the system.
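
To check whether fragmentation is really the problem, Linux publishes its free-block counts in /proc/buddyinfo. Each row is a memory zone, and the columns after the zone name are the numbers of free blocks of order 0 (a single page), order 1 (two contiguous pages), and so on. The sketch below adds up everything from order 1 upwards per zone. The function name is my own.

```shell
# From /proc/buddyinfo-style input on stdin, print "<zone> <count>" where
# count is the number of free blocks of two or more contiguous pages.
# Fields: "Node", "<n>,", "zone", "<name>", then one count per order.
contiguous_free_blocks() {
  awk '$1 == "Node" { n = 0; for (i = 6; i <= NF; i++) n += $i; print $4, n }'
}

if [ -r /proc/buddyinfo ]; then
  contiguous_free_blocks < /proc/buddyinfo
fi
```

A zone that shows 0 (or close to it) has free memory only as isolated single pages. Running the check before and after dropping the caches shows whether the command actually helped.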

Principal Technical Support Engineer in the Engineered (Systems) Enterprise Support Team - EEST.
Former member of the Coherence and Java Technologies Support Teams.

