Tuesday Mar 17, 2015

MOS Note 1963189.1 has been Improved with a Workaround

About 2 months ago, I created the MOS Note 1963189.1 - "OEM (Enterprise Manager) Reporting IB Switch Ports As Being Disconnected On Exalogic Physical Compute Nodes Although No Issues With IB Links/Ports".

The note originally described a situation where OEM 12c throws false-positive messages that incorrectly report IB ports as disconnected. Those messages are harmless and can be safely ignored.

Now an easy-to-implement workaround that avoids these messages has been added to the note as well.

Tuesday Dec 23, 2014

The vServers were migrated to another Compute Node... Oh My!

Recently I worked on two similar SRs, both related to this Exalogic behavior.

When a compute node is rebooted for some reason, the vServers running on it are automatically moved to other compute nodes. Why does this happen? It occurs if the HA flag is enabled on a vServer, which can be verified by looking at the vm.cfg file of that VM.
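As a minimal sketch of that check, the snippet below scans the text of a vm.cfg file for an HA setting. The key name `OVM_high_availability` and the sample content are assumptions made for illustration; confirm the actual key in your own vm.cfg before relying on it.

```python
import re

def ha_enabled(vm_cfg_text, key="OVM_high_availability"):
    """Return True/False if the key is present, or None if it is absent.

    The default key name is an assumption, not taken from the post.
    """
    # vm.cfg uses simple "name = value" lines; look for the assumed HA key.
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=\s*(\S+)", re.MULTILINE)
    match = pattern.search(vm_cfg_text)
    if match is None:
        return None
    return match.group(1).strip("'\"").lower() == "true"

# Hypothetical vm.cfg fragment
sample = "OVM_simple_name = 'vserver01'\nOVM_high_availability = True\n"
print(ha_enabled(sample))
```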

After the rebooted compute node is back up and running, you will probably want the migrated vServers to be moved back to where they were before the outage (that is, on the previously failed compute node). Unfortunately, "Live Migration", the ability to migrate running vServers, is not yet supported in Exalogic. By design, migration is not exposed at the vDC level: a vDC cloud user does not care where a vServer runs, because it runs in the cloud. So, you need to use "Cold Migration" instead.

Basically, cold migration can be performed manually by an administrator with "root" privileges on the OVS node:
- stop the vServer; it will be detached from its OVS node
- start the vServer on the OVS node where you want it to run, making sure you do not switch server pools

Wednesday Aug 20, 2014

How to set a Static Route on a Storage Node

To set up a host route to an IP address, here are the procedures for the CLI and the BUI. You need to know the destination, mask, gateway and interface. Note that the values used here are just examples.

- Log in to the CLI and run the commands below:
configuration net routing create
set family=IPv4
set destination=
set mask=32
set gateway=
set interface=igb0
commit

- Log in to the web UI (BUI) of the ZFSSA NAS head
- Click Configuration -> Network -> Routing -> (+)
- In the popup window that is displayed, enter the values accordingly

Either of the two procedures above should put your desired route in place.
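The CLI steps above can be sketched as a small helper that validates the addresses and prints the command sequence. The function name and the example addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and the closing `commit` is an assumption about how the appliance CLI saves the new route.

```python
import ipaddress

def host_route_commands(destination, gateway, interface):
    """Build the CLI command list for an IPv4 host route (illustrative only)."""
    # ip_address raises ValueError on malformed input, so typos fail early.
    dest = ipaddress.ip_address(destination)
    gw = ipaddress.ip_address(gateway)
    if dest.version != 4 or gw.version != 4:
        raise ValueError("this sketch only handles IPv4 routes")
    return [
        "configuration net routing create",
        "set family=IPv4",
        f"set destination={dest}",
        "set mask=32",  # a 32-bit mask makes this a route to a single host
        f"set gateway={gw}",
        f"set interface={interface}",
        "commit",  # assumed to be needed to save the new route
    ]

# Hypothetical values; replace them with your own destination and gateway.
for command in host_route_commands("192.0.2.10", "192.0.2.1", "igb0"):
    print(command)
```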

Tuesday Aug 12, 2014

A "ZFS Storage Appliance is not reachable" message is thrown by Exachk and the ZFS checks are skipped. WAD?

Sometimes something like the following appears in the "Skipped Nodes" section:

Host Name                Reason
myexalogic01sn01-mgm1    ZFS Storage Appliance is not reachable
myexalogic01sn01-mgm2    ZFS Storage Appliance is not reachable

Also, a message like the following can be seen when executing Exachk:
Could not find infiniband gateway switch names from env or configuration file.
Please enter the first gateway infiniband switch name :
Could not find storage node names from env or configuration file.
Please enter the first storage server :

This happens because Exachk relies on the standard "<rack-name>sn0x" naming convention to find the storage nodes.

To solve this, make sure there is an o_storage.out file in the directory where Exachk is installed. If the file is missing, create a blank one.

The o_storage.out file must contain the correct storage node hostnames, in the same format they have in the hosts file. This format should typically be something like "<rack-name>sn0x-mgmt". For example, an o_storage.out file should look quite simply as below:

This way it is ensured that the o_storage.out file has valid ZFS Storage Appliance hostnames.

If the switch checks are skipped, then a similar procedure should be performed with the o_switches.out file.
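As a rough sketch of how the hostnames could be collected, the snippet below pulls names ending in "sn0x-mgmt" out of hosts-file text. The sample hosts entries, the regular expression and the one-hostname-per-line output layout are all assumptions made for illustration; check them against your own rack before writing the result to o_storage.out.

```python
import re

def storage_hostnames(hosts_text, pattern=r"sn0\d-mgmt$"):
    """Return hostnames from hosts-file text that match the naming pattern."""
    names = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # The first field is the IP address; the rest are hostnames and aliases.
        for name in line.split()[1:]:
            if re.search(pattern, name) and name not in names:
                names.append(name)
    return names

# Hypothetical /etc/hosts content
sample_hosts = """
192.0.2.21  myrack01sn01-mgmt.example.com myrack01sn01-mgmt
192.0.2.22  myrack01sn02-mgmt.example.com myrack01sn02-mgmt
192.0.2.11  myrack01cn01.example.com myrack01cn01
"""
print("\n".join(storage_hostnames(sample_hosts)))
```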

Wednesday Jun 11, 2014

Setting MTU on Exalogic

For many reasons, a system administrator may want to change the MTU settings of a server. But in a system like Exalogic, which contains many interconnected nodes and various other components, it is important to understand how this applies to the different networks.

For example, when bringing up an InfiniBand bond, an error like the following may be thrown:
Bringing up interface bond1: SIOCSIFMTU: Invalid argument
In this case both scripts, ifcfg-ib0 and ifcfg-ib1 (from the /etc/sysconfig/network-scripts/ directory), have MTU set to 65500, which is a valid MTU value only if all IPoIB slaves operate in connected mode and are configured with the same value. So the line below must be added to both network scripts, and the network must then be restarted:

By the way, an error of the form “SIOCSIFMTU: Invalid argument” indicates that the requested MTU was rejected by the kernel. Typically this is because it exceeds the maximum value supported by the interface hardware. In that case you must either reduce the MTU to a supported value or obtain more capable hardware. This problem has been seen when trying to modify the MTU with the ifconfig command, as in the example below:
[root@elxxcnxx ~]# ifconfig ib1 mtu 65520
SIOCSIFMTU: Invalid argument

It is important to stress that in most cases the nodes must be rebooted after the MTU size has been changed. Although it may work without a reboot in some circumstances, that is not how it is typically documented.

Now, in order to reduce memory consumption and improve performance for network traffic received on IPoIB related interfaces, it is recommended to reduce the MTU value in the interface configuration files for IPoIB related bonds from 65520 to 64000. The change needs to be made under the /etc/sysconfig/network-scripts directory and applies to the interface configuration files for bonds over IPoIB related slave devices, for example /etc/sysconfig/network-scripts/ifcfg-bond1. However, keep in mind that the numeric portion of the interface filenames that correspond to IPoIB interfaces is expected to vary across compute nodes and vServers, so it cannot be relied upon to identify which interface files are for bonds over IPoIB rather than EoIB related slave interfaces.
To set these MTU values to the recommended settings, there are very useful instructions and a script in MOS Note 1624434.1, which applies to both physical and virtual configurations of Exalogic.

Regarding the recommended MTU value for EoIB related interfaces, its maximum appropriate value is 1500. If for some reason a vServer has been created with a higher value (set on the /etc/sysconfig/network-scripts/ifcfg-bond0 file), then it must be fixed. An error like the following could be thrown under this circumstance:
[root@vServer ~]# service network restart
Bringing up interface bond0:  SIOCSIFMTU: Invalid argument

Also an error like the one below can be seen on the /var/log/messages file of the vServer:
kernel: T5074835532 [mlx4_vnic] eth1:vnic_change_mtu:360: failed: new_mtu 64000 > 2026
The MOS Note 1611657.1 is very useful for this purpose.
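The two recommendations above (64000 for bonds over IPoIB, at most 1500 for EoIB related interfaces) can be sketched as a simple check on the text of an ifcfg file. This is only an illustration and not the script from the MOS notes; the function names are hypothetical, and the caller has to say which kind of bond the file describes because, as noted above, the filename alone does not tell.

```python
IPOIB_RECOMMENDED_MTU = 64000  # recommended value for bonds over IPoIB
EOIB_MAX_MTU = 1500            # maximum appropriate value for EoIB

def parse_ifcfg(text):
    """Parse KEY=VALUE lines from an ifcfg-style file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"')
    return settings

def check_mtu(settings, kind):
    """Return 'ok' or a short message; kind is 'ipoib' or 'eoib'."""
    if "MTU" not in settings:
        return "no MTU set"
    mtu = int(settings["MTU"])
    if kind == "ipoib":
        if mtu == IPOIB_RECOMMENDED_MTU:
            return "ok"
        return f"MTU {mtu}: recommended value is {IPOIB_RECOMMENDED_MTU}"
    if kind == "eoib":
        if mtu <= EOIB_MAX_MTU:
            return "ok"
        return f"MTU {mtu}: exceeds maximum of {EOIB_MAX_MTU}"
    raise ValueError("kind must be 'ipoib' or 'eoib'")

# Hypothetical ifcfg-bond1 content for a bond over IPoIB
print(check_mtu(parse_ifcfg("DEVICE=bond1\nMTU=65520\n"), "ipoib"))
```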

Friday May 30, 2014

Running Mixed Physical and Virtual Exalogic Elastic Cloud Software Versions in an Exalogic Rack is now Supported

Although it was not supported in older versions, as of EECS 2.0.6 an Exalogic rack can be configured in mixed mode, half virtual and half physical Linux:
  • Flexibility to have physical and virtual environments on same rack. For example, production on physical and test/dev on virtual.
  • Exalogic Control manages the virtual compute nodes on the rack. Physical compute nodes are managed manually (including PKeys).
  • Option to change full physical to hybrid and hybrid to full virtual rack.
  • User has an option to choose either the top or bottom nodes for physical or virtual deployment.
For further information about how the compute nodes can be split up on the rack (into bottom or top half) to run either Oracle Virtual Server (OVS "hypervisor") or Oracle Linux, please take a look at MOS Note 1536945.1.

Note: Solaris is not yet supported in the mixed configuration.

Monday Apr 28, 2014

Exalogic eBook: "The Logical Choice for Running Business Applications"

Oracle is pleased to announce the Oracle Exalogic eBook – an interactive asset packed with key product information, customer references, links to useful assets and much more.

I have been reading it, and I recommend that everyone read it as well. It is really amazing to see how an IT architecture can be improved with Exalogic.

To read or download the eBook, or for further information about it, please go to this link.

Thursday Apr 10, 2014

Working with SCAN on Exalogic

Lately I have seen more Exalogic customers using SCAN, so I decided to write a brief article about it. Oracle already provides plenty of documentation on SCAN, but not specifically related to Exalogic.

Single Client Access Name (SCAN) is an Oracle RAC feature, supported by the Oracle JDBC driver, that provides a single name for clients to access any Oracle Database running in a cluster. Some of its benefits and advantages include:
- Client’s connect information does not need to change if you add or remove nodes or databases in the cluster
- Fast Connection Failover (FCF)
- Runtime Connection Load-Balancing (RCLB)
- Can be implemented with MultiDataSource or GridLink
- It can also be used with Oracle JDBC 11g Thin Driver (this is clearly explained on MOS Note 1290193.1)

In the particular case of Exalogic, a typical architecture, widely used by customers, is having it connected to an Exadata machine (which hosts the database) through InfiniBand. Obviously the SCAN feature can be used within this Engineered Systems architecture. As a matter of fact, GridLink is part of the Exalogic-specific enhancements of Weblogic.
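To illustrate why the client connect information does not change when nodes are added or removed, here is a minimal sketch that builds a thin-driver JDBC URL referencing only the SCAN name. The host name, service name and port 1521 are hypothetical placeholders, and the exact URL form to use with MultiDataSource or GridLink should be taken from the WebLogic and JDBC documentation.

```python
def scan_jdbc_url(scan_name, service_name, port=1521):
    """Build a thin-driver JDBC URL that points at the SCAN name only."""
    # Only the SCAN name appears here, never the individual cluster nodes.
    return (
        "jdbc:oracle:thin:@(DESCRIPTION="
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={scan_name})(PORT={port}))"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )

# Hypothetical SCAN and service names
print(scan_jdbc_url("myrack-scan.example.com", "myservice"))
```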

Some facts to keep in mind when using it:
- SCAN feature is supported on JDBC version and above
- As in any situation where a database connection is involved, be aware that firewalls may cause network adapter or timeout issues, which must be resolved before the connection can be established
- If using VIP hosts, instead of having the short host name for the VIP hosts in the cluster configuration, you should set LOCAL_LISTENER to the fully qualified domain name (e.g. node-VIP.example.com):

Wednesday Dec 18, 2013

"Cannot allocate memory" message when accessing a Compute Node through SSH, despite ILOM shows available memory

We recently worked on an issue where accessing the server through SSH threw a "Cannot allocate memory" message, although ILOM showed available memory.

This happened due to a known bug in a package, related to the ypserv utility:

The problem is that even though there is enough free memory, it is too fragmented to allocate two contiguous pages.

As a workaround, the command below can be used:
# echo 3 > /proc/sys/vm/drop_caches
This may allow memory to become defragmented enough for further fork() system calls to succeed; otherwise it may be necessary to reboot the system.
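One way to get a feel for this kind of fragmentation is to look at /proc/buddyinfo, which lists, per memory zone, how many free blocks exist for each block size (column i counts blocks of 2**i contiguous pages). The sketch below parses that format from a text sample; the sample numbers are made up, and the helper names are hypothetical.

```python
def free_blocks_by_order(buddyinfo_text):
    """Map (node, zone) to the list of free block counts per order."""
    zones = {}
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Expected layout: Node 0, zone Normal <count order 0> <count order 1> ...
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = parts[1].rstrip(",")
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones

def contiguous_blocks_available(counts, order=1):
    """Count free blocks of at least 2**order contiguous pages."""
    return sum(counts[order:])

# Made-up sample: plenty of single free pages in Normal, nothing larger
sample = (
    "Node 0, zone    DMA32   120   64   32   8   4   2   1   0   0   0   0\n"
    "Node 0, zone   Normal  5120    0    0   0   0   0   0   0   0   0   0\n"
)
for key, counts in free_blocks_by_order(sample).items():
    print(key, "blocks of 2+ pages:", contiguous_blocks_available(counts))
```

On a live system the same functions can be fed the contents of /proc/buddyinfo, read with a plain `open("/proc/buddyinfo").read()`.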

Wednesday Nov 13, 2013

Unable to change NIS password

Over the last few weeks I worked on a Service Request where a Linux user was unable to change NIS user passwords. As a matter of fact, NIS is used by many people in their Exalogic environments.

[root@computenode ~]# passwd user01
Changing password for user user01.
passwd: Authentication token manipulation error
[root@computenode ~]#
In this case, "user01" is an example username.

This issue may occur on all NIS nodes, including the master server.
This error typically corresponds to typos or missing keywords in configuration files in the /etc/pam.d directory.
In the Service Request that I worked on, the system-auth-ac file had no nis keyword:

# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth required pam_env.so
auth sufficient pam_unix.so nullok try_first_pass
auth requisite pam_succeed_if.so uid >= 500 quiet
auth required pam_deny.so

account required pam_unix.so
account sufficient pam_succeed_if.so uid < 500 quiet
account required pam_permit.so

password requisite pam_cracklib.so try_first_pass retry=3
password sufficient pam_unix.so md5 shadow remember=5 nullok try_first_pass use_authtok
password required pam_deny.so

session optional pam_keyinit.so revoke
session required pam_limits.so
session [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session required pam_unix.so

This kind of issue has also been described in a case where the "pam_rootok.so" line had a typo ("sufficent" instead of the correct "sufficient") in the su file.

To solve this kind of issue, any typos must first (obviously) be fixed.
For this NIS case, it is necessary to make sure that the nis keyword is added:
password sufficient pam_unix.so md5 shadow nis remember=5 nullok try_first_pass use_authtok

Note that these settings should be consistent across all the NIS nodes (master server and clients).
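A quick way to spot the missing keyword on each node is sketched below: it scans PAM configuration text for "password" lines that use pam_unix.so without the nis option. The function name is hypothetical and the check only covers this one keyword, not PAM syntax in general.

```python
def password_lines_missing_nis(pam_text):
    """Return 'password ... pam_unix.so' lines that lack the nis option."""
    missing = []
    for line in pam_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "password" or "pam_unix.so" not in fields:
            continue
        # Everything after the module name is an option of that module.
        options = fields[fields.index("pam_unix.so") + 1:]
        if "nis" not in options:
            missing.append(line.strip())
    return missing

before = "password sufficient pam_unix.so md5 shadow remember=5 nullok try_first_pass use_authtok"
after = "password sufficient pam_unix.so md5 shadow nis remember=5 nullok try_first_pass use_authtok"
print(password_lines_missing_nis(before))
print(password_lines_missing_nis(after))
```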

Tuesday Oct 22, 2013

Minimum percentage of free physical memory that Linux requires for optimal performance

Recently, we have been getting questions about the percentage of free physical memory that the OS requires for optimal performance, mainly for physical compute nodes.

Under normal conditions you may see that, on nodes without any application running, the OS takes (for example) between 24 and 25 GB of memory.
Linux reports free memory in a different way, though, and most of those 25 GB (in the example) are still available for user processes.
For example: Mem: 99191652k total, 23785732k used, 75405920k free, 173320k buffers

The MOS Doc Id. 233753.1 - "Analyzing Data Provided by '/proc/meminfo'" - explains it (section 4 - "Final Remarks"):
Free Memory and Used Memory
Estimating the resource usage, especially the memory consumption of processes is by far more complicated than it looks like at a first glance. The philosophy is an unused resource is a wasted resource. The kernel therefore will use as much RAM as it can to cache information from your local and remote filesystems/disks. This builds up over time as reads and writes are done on the system trying to keep the data stored in RAM as relevant as possible to the processes that have been running on your system. If there is free RAM available, more caching will be performed and thus more memory 'consumed'. However this doesn't really count as resource usage, since this cached memory is available in case some other process needs it. The cache is reclaimed, not at the time of process exit (you might start up another process soon that needs the same data), but upon demand.

That said, focusing more specifically on the percentage question: apart from the memory that the OS takes, how much free memory must be available on every node, as a minimum, so that they operate normally?
The answer is: as a rule of thumb, 80% memory utilization is a good threshold; anything higher than that should be investigated and remedied.
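The rule of thumb can be sketched as a small calculation on the figures from the example line above. The sketch assumes that the reported "used" value includes buffers and cache and subtracts them, since that memory is reclaimed on demand; the cached figure is not shown in the example line, so it defaults to 0 here. The function names are hypothetical.

```python
THRESHOLD_PERCENT = 80.0  # rule-of-thumb threshold quoted above

def utilization_percent(total_kb, used_kb, buffers_kb=0, cached_kb=0):
    """Memory utilization in percent, not counting buffers and cache."""
    # Assumption: used_kb includes buffers and page cache, which the kernel
    # gives back on demand, so they are subtracted before taking the ratio.
    effective_used = used_kb - buffers_kb - cached_kb
    return 100.0 * effective_used / total_kb

def needs_investigation(percent, threshold=THRESHOLD_PERCENT):
    return percent > threshold

# Figures from the example "Mem:" line
pct = utilization_percent(99191652, 23785732, buffers_kb=173320)
print(f"{pct:.1f}% used, investigate: {needs_investigation(pct)}")
```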

Wednesday Sep 25, 2013

Adding NTP Servers to Exalogic Switches

Sometimes Exalogic machines are misconfigured so that the compute nodes have more NTP server addresses specified than the switches. Here are a few steps to add the missing ones to the switches.

To add another NTP server to the InfiniBand/Gateway switches:
- Log in to the switch ILOM BUI (preferably with MSIE)
- Select the Configuration tab
- Select the Clock subtab
- Enter the NTP server IP addresses appropriately

To configure two NTP servers on the Cisco Ethernet Switch, please take a look at http://docs.oracle.com/cd/E18476_01/doc.220/e18478/spreadsheet.htm#BIIGDJEA - "5.4.1 Configuring the Cisco Ethernet Switch".

Note that, besides adding servers, you can also modify these configurations at your convenience.
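To find out which addresses are missing on a switch, a comparison along these lines may help. It is only a sketch: it reads "server" lines from ntp.conf-style text taken from a compute node and compares them with a list of addresses noted down from the switch Clock subtab. The addresses and function names are hypothetical.

```python
def ntp_servers(ntp_conf_text):
    """Return the addresses from 'server' lines of ntp.conf-style text."""
    servers = []
    for line in ntp_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2 and fields[0] == "server":
            servers.append(fields[1])
    return servers

def missing_on_switch(node_conf_text, switch_servers):
    """Servers configured on the compute node but not on the switch."""
    return [s for s in ntp_servers(node_conf_text) if s not in switch_servers]

# Hypothetical compute node ntp.conf and switch configuration
node_conf = "server 192.0.2.101 prefer\nserver 192.0.2.102\n# server 192.0.2.103\n"
print(missing_on_switch(node_conf, ["192.0.2.101"]))
```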

Wednesday Aug 21, 2013

Resource allocation on vServers

A few facts to keep in mind.

The number of vCPUs of an existing vServer cannot be modified. Currently, vDC resource management (changing vCPU, memory, etc.) after a vServer has been created is not supported in Exalogic virtual environments. However, a custom vServer type with the required memory can be created and used instead of the default VM types.

Note that server pool management is handled by Exalogic; manual management of it is not available. Also, keep in mind that there is no way to control which physical host a given vServer is assigned to. The scheduler algorithms are sophisticated and fine-grained, and there is no reason to assume that the scheduler will not be maximally efficient. EMOC looks at the resources allocated to the vServer that is about to be created (CPUs/memory), then looks at what is available on each of the compute nodes and decides where to place the vServer. System administrators can try to use Distribution Groups to spread the vServers they have to run across different Oracle VM Servers.

If you need to increase the root file system and swap space of an Exalogic guest virtual machine, MOS Note 1575790.1 is useful for that purpose.

Thursday Jun 20, 2013

Exachk 2.2.2 released - Now Includes Support for Exalogic Solaris Environments

This new Exachk 2.2.2 release includes several new improvements and features.

The following are additional checks as part of this new 2.2.2 release for Solaris:
Compute Nodes
- Hardware and Firmware Profile
- Software Profile
- NTP Synchronization
- DNS Setup
- Correct Slot Installation of IB Card for Solaris
- Subnet Manager
- Root Partition Usage Limit for Solaris
- Lockd Configuration for Solaris Compute Node
- ib_ipoib Module for Solaris
- ib_sdp Module for Solaris
- IP Configuration - net0 and bond0
- Recent Reboot Info for Solaris
- Probe Based IPMP for Solaris
- Swap Space for Solaris
- Free Physical Memory for Solaris
- MTU for Solaris
- IPMP Configuration for Solaris
- Fault Management Log for Solaris
- BIOS Settings
- NFS Mount Point - Version for Solaris
- Hostname Consistency with DNS on the Physical Compute Node
- NFS Mount Point - Attribute Caching for Solaris
- NFS Mount Point - Rsize Wsize for Solaris
- NIS domain (YPBind) for Solaris

Also, the following checks have been enhanced as part of this new 2.2.2 release:
Compute Nodes
- Connectivity To OVMM
- MTU Value for Infiniband Interface
- Hostname Matches the DNS on the Physical Compute Node
- Non-sequential Even-numbered Gateway Instance
- NTP Configuration for Switch Nodes Matches Physical Compute Nodes
- NTP Configuration for Switch Nodes Matches Oracle VM Servers
- Hostname Matches the DNS on Oracle VM Server
- Hostname Matching with DNS on Switches
- NTP Configuration for ZFS Matches Oracle VM Servers
- NTP Configuration for ZFS Matches Physical Compute Nodes
Multiple Components
- MTU for InfiniBand Link in Control vServers

Exachk is available via MOS Doc Id. 1449226.1

Friday Jan 18, 2013

Two New Exalogic ZFS Storage Appliance MOS Notes

This week I closed two Service Requests related to the ZFS Storage Appliance and created My Oracle Support (MOS) notes from both of them. Although they were not complicated issues and both SRs were closed in less than one week, these procedures were not yet formally documented on MOS. Information about the new documents can be found below.

MOS Doc Id. 1519858.1 - Will The Restart Of The NIS Service On The ZFS Storage Appliance Affect The Mounted Filesystems?

In this case, it was necessary to restart the NIS service for a particular reason. So, if the NIS service needs to be restarted on the ZFS Storage Appliance for any reason, will the mounted filesystems be affected during the restart?

The default cluster configuration type of the ZFS Storage Appliance is active-passive and the storage nodes are supposed to be mirrored, so restarting NIS should not cause any issues; it can be done.

Note that the NIS restart should be done on the active storage head. Restarting NIS by itself will not cause any ZFS failover from active to passive.

In general terms, even in the event of a storage node failure, the appliance will automatically fail over to the other storage node. Under that condition, an initial degradation in performance can be expected because all of the cached data on the failed node is gone, but this effect decreases as the new active storage node begins caching data in its own SSDs.

MOS Doc Id. 1520223.1 - Exalogic Storage Nodes Hostnames Are Displayed Incorrectly

This was not the first time I saw something like this, so I decided to create a note, because it is clearly a problem that may affect more than one Exalogic user.

The Exalogic storage node hostnames displayed on the BUI were different from the ones displayed when accessing the node through SSH or ILOM.

This happens when, for some reason, the hostname is misconfigured on the ZFS Storage Appliance.

To solve this problem, it is necessary to set the system name and location accordingly on the Storage Appliance nodes BUI:
1. Log in to the ZFS Storage Appliance BUI
2. Go to the "Configuration" tab, and select the "Services" subtab
3. Under the "Systems Settings" section, click on "System Identity"
4. Set the system name and location accordingly


Principal Technical Support Engineer in the Engineered (Systems) Enterprise Support Team - EEST.
Former member of the Coherence and Java Technologies Support Teams.

