Tuesday Mar 31, 2015

Locality Group Observability on Solaris

Modern multi-socket servers exhibit NUMA characteristics that may hurt application performance if ignored. On a NUMA (Non-Uniform Memory Access) system, all memory is shared among processors. Each processor has access to its own memory -- local memory -- as well as memory that is local to another processor -- remote memory. However, memory access time (latency) depends on the location of the memory relative to the processor: a processor can access its local memory faster than remote memory, and these varying latencies play a big role in application performance.

Solaris organizes the hardware resources -- CPU, memory and I/O devices -- into one or more logical groups based on their proximity to each other, in such a way that all the hardware resources in a group are considered local to that group. These groups are referred to as locality groups or NUMA nodes. In other words, a locality group (lgroup) is an abstraction that tells what hardware resources are near each other on a NUMA system. Each locality group has at least one processor and possibly some associated memory and/or I/O devices. To minimize the impact of NUMA characteristics, Solaris considers the lgroup-based physical topology when mapping threads and data to CPUs and memory.

Note that even though Solaris attempts to provide good performance out of the box, some applications may still suffer the impact of NUMA, either due to misconfiguration of the hardware/software or for some other reason. Engineered systems such as Oracle SuperCluster go to great lengths in setting up customer environments to minimize the impact of NUMA, so that applications perform as expected in a predictable manner. Still, application developers and system/application administrators need to take the NUMA factor into account while developing for and managing applications on large systems. Solaris-provided tools and APIs can be used to observe, diagnose, control and even correct issues related to locality and latency. The rest of this post is about the tools that can be used to examine the locality of cores, memory and I/O devices.

Sample outputs are collected from a SPARC T4-4 server.

Locality Group Hierarchy

lgrpinfo prints information about the lgroup hierarchy and its contents. It is useful in understanding the context in which the OS is trying to optimize applications for locality, and also in figuring out which CPUs are closer, how much memory is near them, and the relative latencies between the CPUs and different memory blocks.

eg.,

# lgrpinfo -a

lgroup 0 (root):
        Children: 1-4
        CPUs: 0-255
        Memory: installed 1024G, allocated 75G, free 948G
        Lgroup resources: 1-4 (CPU); 1-4 (memory)
        Latency: 18
lgroup 1 (leaf):
        Children: none, Parent: 0
        CPUs: 0-63
        Memory: installed 256G, allocated 18G, free 238G
        Lgroup resources: 1 (CPU); 1 (memory)
        Load: 0.0227
        Latency: 12
lgroup 2 (leaf):
        Children: none, Parent: 0
        CPUs: 64-127
        Memory: installed 256G, allocated 15G, free 241G
        Lgroup resources: 2 (CPU); 2 (memory)
        Load: 0.000153
        Latency: 12
lgroup 3 (leaf):
        Children: none, Parent: 0
        CPUs: 128-191
        Memory: installed 256G, allocated 20G, free 236G
        Lgroup resources: 3 (CPU); 3 (memory)
        Load: 0.016
        Latency: 12
lgroup 4 (leaf):
        Children: none, Parent: 0
        CPUs: 192-255
        Memory: installed 256G, allocated 23G, free 233G
        Lgroup resources: 4 (CPU); 4 (memory)
        Load: 0.00824
        Latency: 12

Lgroup latencies:

------------------
  |  0  1  2  3  4
------------------
0 | 18 18 18 18 18
1 | 18 12 18 18 18
2 | 18 18 12 18 18
3 | 18 18 18 12 18
4 | 18 18 18 18 12
------------------

CPU Locality

The lgrpinfo output shown above already presents CPU locality in a clear manner. Here is another way to retrieve the association between CPU IDs and lgroups.

# echo ::lgrp -p | mdb -k

   LGRPID  PSRSETID      LOAD      #CPU      CPUS
        1         0     17873        64      0-63
        2         0     17755        64      64-127
        3         0      2256        64      128-191
        4         0     18173        64      192-255

Memory Locality

The lgrpinfo utility shows the total memory that belongs to each of the locality groups. However, it doesn't show exactly which memory blocks belong to which locality groups. One of mdb's debugger commands (dcmds) helps retrieve this information.

1. List memory blocks

# ldm list-devices -a memory

MEMORY
     PA                   SIZE            BOUND
     0xa00000             32M             _sys_
     0x2a00000            96M             _sys_
     0x8a00000            374M            _sys_
     0x20000000           1048064M        primary


2. Print the physical memory layout of the system

# echo ::syslayout | mdb -k

         STARTPA            ENDPA  SIZE  MG MN    STL    ETL
        20000000        200000000  7.5g   0  0      4     40
       200000000        400000000    8g   1  1    800    840
       400000000        600000000    8g   2  2   1000   1040
       600000000        800000000    8g   3  3   1800   1840
       800000000        a00000000    8g   0  0     40     80
       a00000000        c00000000    8g   1  1    840    880
       c00000000        e00000000    8g   2  2   1040   1080
       e00000000       1000000000    8g   3  3   1840   1880
      1000000000       1200000000    8g   0  0     80     c0
      1200000000       1400000000    8g   1  1    880    8c0
      1400000000       1600000000    8g   2  2   1080   10c0
      1600000000       1800000000    8g   3  3   1880   18c0
	...
	...

The values under the MN (memory node) column can be treated as lgroup numbers after adding 1. For example, a value of 0 under MN translates to lgroup 1, a value of 1 translates to lgroup 2, and so on. Better yet, the ::mnode debugger command lists the mapping of mnodes to lgroups as shown below.

# echo ::mnode | mdb -k

           MNODE ID LGRP ASLEEP UTOTAL  UFREE UCACHE KTOTAL  KFREE KCACHE
     2075ad80000  0    1      -   249g   237g   114m   5.7g   714m      -
     2075ad802c0  1    2      -   240g   236g   288m    15g   4.8g      -
     2075ad80580  2    3      -   246g   234g   619m   9.6g   951m      -
     2075ad80840  3    4      -   247g   231g    24m     9g   897m      -

Unrelated notes:

  • Main memory on the T4-4 is interleaved across all memory banks with an 8 GB interleave size -- meaning the first 8 GB chunk (excluding the _sys_ blocks) is placed in lgroup 1, closest to processor #1; the second 8 GB chunk in lgroup 2, closest to processor #2; the third in lgroup 3, closest to processor #3; the fourth in lgroup 4, closest to processor #4; and the fifth 8 GB chunk again in lgroup 1, and so on. Memory is not interleaved on T5 and M6 systems (confirm by running the ::syslayout dcmd). Conceptually, memory interleaving is similar to disk striping.

  • Keep in mind that debugger commands (dcmds) are not committed interfaces -- there is no guarantee that they will continue to work on future versions of Solaris, and some of these dcmds may not work on some existing versions of Solaris.

I/O Device Locality

The -d option of the lgrpinfo utility accepts the path to an I/O device and returns the lgroup IDs closest to that device. Each I/O device on the system can be connected to one or more NUMA nodes, so it is not uncommon to see more than one lgroup ID returned by lgrpinfo.

eg.,

# lgrpinfo -d /dev/dsk/c1t0d0
lgroup ID : 1

# dladm show-phys | grep 10000
net4              Ethernet             up         10000  full      ixgbe0

# lgrpinfo -d /dev/ixgbe0
lgroup ID : 1

# dladm show-phys | grep ibp0
net12             Infiniband           up         32000  unknown   ibp0

# lgrpinfo -d /dev/ibp0
lgroup IDs : 1-4

NUMA IO Groups

Debugger command ::numaio_group shows information about all NUMA I/O Groups.

# dladm show-phys | grep up
net0              Ethernet             up         1000   full      igb0
net12             Ethernet             up         10     full      usbecm2
net4              Ethernet             up         10000  full      ixgbe0

# echo ::numaio_group | mdb -k
            ADDR GROUP_NAME                     CONSTRAINT
    10050e1eba48 net4                  lgrp : 1
    10050e1ebbb0 net0                  lgrp : 1
    10050e1ebd18 usbecm2               lgrp : 1
    10050e1ebe80 scsi_hba_ngrp_mpt_sas1  lgrp : 4
    10050e1ebef8 scsi_hba_ngrp_mpt_sas0  lgrp : 1

Relying on prtconf is another way to find the NUMA IO locality for an IO device.

eg.,

# dladm show-phys | grep up | grep ixgbe
net4              Ethernet             up         10000  full      ixgbe0

== Find the device path for the network interface ==
# grep ixgbe /etc/path_to_inst | grep " 0 "
"/pci@400/pci@1/pci@0/pci@4/network@0" 0 "ixgbe"

== Find NUMA IO Lgroups ==
# prtconf -v /devices/pci@400/pci@1/pci@0/pci@4/network@0
	...
    Hardware properties:
	...
        name='numaio-lgrps' type=int items=1
            value=00000001
	...

Process, Thread Locality

  • The -H option of the prstat command shows the home lgroup of active user processes and threads.

  • The -H option of the ps command shows the home lgroup of all user processes and threads. The -h option can be used to list only the processes that are homed to a specified locality group.
            [Related] Solaris assigns a thread to an lgroup when the thread is created. That lgroup is called the thread's home lgroup. Solaris runs the thread on the CPUs in the thread's home lgroup and allocates memory from that lgroup whenever possible.

  • The plgrp tool shows the placement of threads among locality groups. The same tool can be used to set the home locality group and lgroup affinities for one or more processes, threads, or LWPs.

  • The -L option of the pmap command shows the lgroup that contains the physical memory backing some virtual memory.
            [Related] Breakdown of Oracle SGA into Solaris Locality Groups

  • Memory placement among lgroups can possibly be influenced using pmadvise while the application is running, or by calling madvise(3C) during development; both provide advice to the kernel's virtual memory manager. The OS uses this hint to determine how to allocate memory for the specified range. This mechanism is beneficial when administrators and developers understand the target application's data access patterns (a small madvise() sketch appears after the examples below).

    It is not possible to specify memory placement locality for OSM and ISM segments using the pmadvise command or the madvise() call (DISM is an exception).

Examples:

# prstat -H

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:51:13 0.1%    2 java/108
  3659 oracle   1428M 1413M sleep    38    0  68:39:28 0.0%    4 oracle/1
  1814 oracle    155M  110M sleep    59    0  70:45:17 0.0%    4 gipcd.bin/9
     8 root        0K    0K sleep    60    -  70:52:21 0.0%    0 vmtasks/257
  3765 root      447M  413M sleep    59    0  29:24:20 0.0%    3 crsd.bin/43
  3949 oracle    505M  456M sleep    59    0   0:59:42 0.0%    2 java/124
 10825 oracle   1097M 1074M sleep    59    0  18:13:27 0.0%    3 oracle/1
  3941 root      210M  184M sleep    59    0  20:03:37 0.0%    4 orarootagent.bi/14
  3743 root      119M   98M sleep   110    -  24:53:29 0.0%    1 osysmond.bin/13
  3324 oracle    266M  225M sleep   110    -  19:52:31 0.0%    4 ocssd.bin/34
  1585 oracle    122M   91M sleep    59    0  18:06:34 0.0%    3 evmd.bin/10
  3918 oracle    168M  144M sleep    58    0  14:35:31 0.0%    1 oraagent.bin/28
  3427 root      112M   80M sleep    59    0  12:34:28 0.0%    4 octssd.bin/12
  3635 oracle   1425M 1406M sleep   101    -  13:55:31 0.0%    4 oracle/1
  1951 root      183M  161M sleep    59    0   9:26:51 0.0%    4 orarootagent.bi/21
Total: 251 processes, 2414 lwps, load averages: 1.37, 1.46, 1.47

== Locality group 2 is the home lgroup of the java process with pid 1865 == 

# plgrp 1865

     PID/LWPID    HOME
    1865/1        2
    1865/2        2
	...
	...
    1865/22       4
    1865/23       4
	...
	...
    1865/41       1
    1865/42       1
	...
	...
    1865/60       3
    1865/61       3
	...
	...

# plgrp 1865 | awk '{print $2}' | grep 2 | wc -l
      30

# plgrp 1865 | awk '{print $2}' | grep 1 | wc -l
      25

# plgrp 1865 | awk '{print $2}' | grep 3 | wc -l
      25

# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
      28


== Let's reset the home lgroup of the java process id 1865 to 4 ==


# plgrp -H 4 1865
     PID/LWPID    HOME
    1865/1        2 => 4
    1865/2        2 => 4
    1865/3        2 => 4
    1865/4        2 => 4
	...
	...
    1865/184      1 => 4
    1865/188      4 => 4

# plgrp 1865 | awk '{print $2}' | egrep "1|2|3" | wc -l
       0

# plgrp 1865 | awk '{print $2}' | grep 4 | wc -l
     108

# prstat -H -p 1865

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU LGRP PROCESS/NLWP
  1865 root      420M  414M sleep    59    0 447:57:30 0.1%    4 java/108

== List the home lgroup of all processes ==

# ps -aeH
  PID LGRP TTY         TIME CMD
    0    0 ?           0:11 sched
    5    0 ?           4:47 zpool-rp
    1    4 ?          21:04 init
    8    0 ?        4253:54 vmtasks
   75    4 ?           0:13 ipmgmtd
   11    3 ?           3:09 svc.star
   13    4 ?           2:45 svc.conf
 3322    1 ?         301:51 cssdagen
	...
11155    3 ?           0:52 oracle
13091    4 ?           0:00 sshd
13124    3 pts/5       0:00 bash
24703    4 pts/8       0:00 bash
12812    2 pts/3       0:00 bash
	...

== Find out the lgroups which shared memory segments are allocated from ==

# pmap -Ls 24513 | egrep "Lgrp|256M|2G"

         Address       Bytes Pgsz Mode   Lgrp Mapped File
0000000400000000   33554432K   2G rwxs-    1   [ osm shmid=0x78000047 ]
0000000C00000000     262144K 256M rwxs-    3   [ osm shmid=0x78000048 ]
0000000C10000000     524288K 256M rwxs-    2   [ osm shmid=0x78000048 ]
0000000C30000000     262144K 256M rwxs-    3   [ osm shmid=0x78000048 ]
0000000C40000000     524288K 256M rwxs-    1   [ osm shmid=0x78000048 ]
0000000C60000000     262144K 256M rwxs-    2   [ osm shmid=0x78000048 ]

== Apply MADV_ACCESS_LWP policy advice to a segment at a specific address ==

# pmap -Ls 1865 | grep anon

00000007DAC00000      20480K   4M rw---    4   [ anon ]
00000007DC000000       4096K    - rw---    -   [ anon ]
00000007DFC00000      90112K   4M rw---    4   [ anon ]
00000007F5400000     110592K   4M rw---    4   [ anon ]

# pmadvise -o 7F5400000=access_lwp 1865

# pmap -Ls 1865 | grep anon
00000007DAC00000      20480K   4M rw---    4   [ anon ]
00000007DC000000       4096K    - rw---    -   [ anon ]
00000007DFC00000      90112K   4M rw---    4   [ anon ]
00000007F5400000      73728K   4M rw---    4   [ anon ]
00000007F9C00000      28672K    - rw---    -   [ anon ]
00000007FB800000       8192K   4M rw---    4   [ anon ]
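
== Apply the same advice programmatically during development ==

For applications under development, similar advice can be applied with madvise(3C). The following is a minimal sketch, not a drop-in recipe -- the mapping size is arbitrary and error handling is kept to a bare minimum. It advises the kernel that the next thread (LWP) to touch the range will access it heavily (MADV_ACCESS_LWP), so the kernel tries to allocate that memory close to the thread's home lgroup.

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
        size_t len = 64 * 1024 * 1024;  /* 64 MB of anonymous memory */
        caddr_t addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANON, -1, 0);

        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        /* hint: the next LWP touching this range will access it heavily */
        if (madvise(addr, len, MADV_ACCESS_LWP) != 0)
                perror("madvise");

        /* ... the worker thread that touches the memory first is then
               favored for local allocation whenever possible ... */

        return (0);
}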

SEE ALSO:

  • Man pages of lgrpinfo(1), plgrp(1), pmap(1), prstat(1M), ps(1), pmadvise(1), madvise(3C), madv.so.1(1), mdb(1)
  • Web search keywords: NUMA, cc-NUMA, locality group, lgroup, lgrp, Memory Placement Optimization, MPO

Credit: various internal and external sources

Tuesday Dec 23, 2014

Solaris Studio : C/C++ Dynamic Analysis

First, a reminder - Oracle Solaris Studio 12.4 is now generally available. Check the Solaris Studio 12.4 Data Sheet before downloading the software from Oracle Technology Network.

Dynamic Memory Usage Analysis

The Code Analyzer tool in the Oracle Solaris Studio compiler suite can analyze static data, dynamic memory access data, and code coverage data collected from binaries that were compiled with the C/C++ compilers in Solaris Studio 12.3 or later. Code Analyzer is supported on Solaris and Oracle Enterprise Linux.

Refer to the static code analysis blog entry for a quick summary of the steps involved in performing static analysis. The focus of this blog entry is the dynamic portion of the analysis. In this context, dynamic analysis is the evaluation of an application at runtime for memory-related errors. The main objective is to find and debug memory management errors -- robustness and security assurance are nice side effects, however limited their extent may be.

Code Analyzer relies on another primary Solaris Studio tool, discover, to find runtime errors that are often caused by memory mismanagement. discover looks for potential errors such as accessing outside the bounds of the stack or an array, unallocated memory reads and writes, NULL pointer dereferences, memory leaks and double frees. The full list of memory management issues analyzed by Code Analyzer/discover is at: Dynamic Memory Access Issues

discover performs the dynamic analysis by instrumenting the code so that it can keep track of memory operations while the binary is running. During runtime, discover monitors the application's use of memory by interposing on standard memory allocation calls such as malloc(), calloc(), memalign(), valloc() and free(). Fatal memory access errors are detected and reported at the instant the incident occurs, so it is easy to correlate the failure with the actual source. This behavior makes it somewhat easier to detect and fix memory management problems in large applications. However, the effectiveness of this kind of analysis depends heavily on the flow of control and data during the execution of the target code -- hence it is important to test the application with a variety of test inputs that maximize code coverage.

High-level steps in using Code Analyzer for Dynamic Analysis

Given the enhancements and incremental improvements in analytical tools, Solaris Studio 12.4 is recommended for this exercise.

  1. Build the application with debug flags

    The -g (C) or -g0 (C++) option generates debug information, which enables Code Analyzer to display source code and line number information for errors and warnings.

    • Linux users: specify the -xannotate option on the compile/link line in addition to -g and other options
  2. Instrument the binary with discover

    % discover -a -H <filename>.%p.html -o <instrumented_binary> <original_binary>

    where:

    • -a : write the error data to binary-name.analyze/dynamic directory for use by Code Analyzer
    • -H : write the analysis report to <filename>.<pid>.html when the instrumented binary is executed. %p expands to the process ID of the application. If you prefer the analysis report in a plain text file, use -w <filename>.%p.txt instead
    • -o : write the instrumented binary to <instrumented_binary>

    Check Command-Line Options page for the full list of discover supported options.

  3. Run the instrumented binary

    .. to collect the dynamic memory access data.

    % ./<instrumented_binary> <args>

  4. Finally examine the analysis report for errors and warnings

Example

The following example demonstrates the above steps using the Solaris Studio 12.4 C compiler and the discover command-line tool. The same code was used to demonstrate the static analysis steps as well.
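
Since the original example output is not reproduced in this text, here is a minimal sketch of the workflow under assumptions: the source file name (leaky.c), the intentionally buggy code and the instrumented binary name are made up for illustration, while the compiler and discover options are the ones described above.

% cat leaky.c
#include <stdlib.h>
#include <string.h>

int
main(void)
{
        char *buf = malloc(16);
        strcpy(buf, "0123456789abcdef");    /* writes 17 bytes into a 16-byte buffer */
        return (0);                         /* buf is never freed: memory leak */
}

% cc -g -o leaky leaky.c                              <-- step 1: build with debug info
% discover -a -H leaky.%p.html -o leaky.disc leaky    <-- step 2: instrument the binary
% ./leaky.disc                                        <-- step 3: run to collect the data

Step 4 is then a matter of opening the generated leaky.<pid>.html report (and the data written under the .analyze directory) to examine the reported errors and warnings.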

A few things to be aware of:

  • If the target application uses the LD_PRELOAD environment variable to preload one or more functions that the discover tool needs to interpose on for dynamic analysis, the resulting analysis may not be accurate.
  • If the target application uses runtime auditing via the LD_AUDIT environment variable, that auditing will conflict with the discover tool's own use of auditing and may result in undefined behavior.

Reference & Recommended Reading:

  1. Oracle Solaris Studio 12.4 : Code Analyzer User's Guide
  2. Oracle Solaris Studio 12.4 : Discover and Uncover User's Guide

Friday Nov 28, 2014

Solaris Studio 12.4 : C/C++ Static Code Analysis

First things first -- Oracle Solaris Studio 12.4 is now generally available. One of the key features of this release is the support for the latest industry standards including C++11, C11 and OpenMP 4.0. Check the Solaris Studio 12.4 Data Sheet before downloading the software from Oracle Technology Network.

Static Code Analysis

The Code Analyzer tool in the Oracle Solaris Studio compiler suite can analyze static data, dynamic memory access data, and code coverage data collected from binaries that were compiled with the C/C++ compilers in Solaris Studio 12.3 or later. Code Analyzer is supported on Solaris and Oracle Enterprise Linux.

The primary focus of this blog entry is static code analysis.

Static code analysis is the process of detecting common programming errors in code during compilation. The static code checking component in Code Analyzer looks for potential errors such as accessing outside the bounds of an array, out-of-scope variable use, NULL pointer dereferences, infinite loops, uninitialized variables, memory leaks and double frees. The following webpage in the Solaris Studio 12.4: Code Analyzer User's Guide has the complete list of errors with examples.

    Static Code Issues analyzed by Code Analyzer

High-level steps in using Code Analyzer for Static Code analysis

Given the enhancements and incremental improvements in analysis tools, Solaris Studio 12.4 is recommended for this exercise.

  1. Collect static data

    Compile [all source] and link with the -xprevise=yes option.

    • When using Solaris Studio 12.3 compilers, compile with the -xanalyze=code option.
    • Linux users: specify the -xannotate option on the compile/link line in addition to -xprevise=yes|-xanalyze=code.

    During compilation, the C/C++ compiler extracts static errors automatically and writes the error information to a sub-directory of the <binary-name>.analyze directory.

  2. Analyze the static data

    Two options are available to analyze and display the errors in a report format: the Code Analyzer GUI and the codean command-line tool.

Example

The following example demonstrates the above steps using the Solaris Studio 12.4 C compiler and the codean command-line tool.
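
Since the original example output is not reproduced in this text, here is a minimal sketch of the data-collection step. The source file name and the uninitialized-variable bug are made up for illustration, and the codean invocation shown is a bare-bones assumption -- consult the Code Analyzer User's Guide for the exact reporting options.

% cat stale.c
#include <stdio.h>

int
main(void)
{
        int n;                  /* never initialized */
        printf("%d\n", n);      /* potential use of an uninitialized variable */
        return (0);
}

% cc -g -xprevise=yes -o stale stale.c    <-- the compiler writes static error data
                                              under the stale.analyze directory
% codean stale                            <-- display the collected static issues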

A few things to be aware of:

  • Compilers may not be able to detect all of the static errors in the target code, especially if the errors are complex.
  • Some errors depend on data that is available only at runtime -- perform dynamic analysis as well.
  • Some errors are ambiguous or might not be actual errors -- expect a few false positives.

Reference & Recommended Reading:
    Oracle Solaris Studio 12.4 Code Analyzer User's Guide

Saturday May 10, 2014

Solaris 11.2 Highlights [Part 2] in 4 Minutes or Less

Part 1: Solaris 11.2 Highlights in 6 Minutes or Less

Highlights contd.,

Package related ..

Minimal Set of System Packages

For the past few years, one of the hot topics has been the bare minimum [set of packages] needed to run applications. There were a number of blog posts and a few technical articles about creating minimal Solaris configurations. Finally, users/customers who wish to have their OS installed with the minimal set of system packages required for running most applications can just install the solaris-minimal-server package and not worry about anything else, such as removing unwanted packages.

# pkg install pkg:/group/system/solaris-minimal-server

Oracle Database Pre-requisite Package

Through Solaris 11.1, it was up to users to check the package dependencies and make sure those were installed before attempting to install Oracle Database software, especially when using the graphical installer. Solaris 11.2 frees users from the burden of checking and installing individual [required] packages by providing a brand new package called oracle-rdbms-server-12cR1-preinstall. Users just need to install this package for a smoother database software installation later.

# pkg install pkg:/group/prerequisite/oracle/oracle-rdbms-server-12cR1-preinstall

Mirroring a Package Repository

11.2 provides the ability to create local IPS package repositories and keep them in sync with the IPS package repositories hosted publicly by Oracle Corporation. The key to achieving this is the SMF service svc:/application/pkg/mirror. The following webpage lists the essential steps at a high level.

How to Automatically Copy a Repository From the Internet

Another enhancement is the cloning of a package repository using the --clone option of the pkgrecv command.
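
A rough sketch of what a clone operation might look like follows. The repository URI and the destination path are placeholders, and the exact option combination (for instance, whether a publisher must be specified for an empty destination) should be verified against the pkgrecv(1) man page.

# pkgrepo create /export/repo/solaris               <-- create an empty local repository
# pkgrecv -s http://pkg.oracle.com/solaris/release/ -d /export/repo/solaris --clone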

Observability related ..

Network traffic diagnostics:

A brand new command, ipstat(1M), reports IP traffic statistics.

# ipstat -?
Usage:	ipstat [-cmnrt] [-a address[,address...]] [-A address[,address...]]
[-d d|u] [-i interface[,interface...]] [-l nlines] [-p protocol[,protocol...]]
[-s key | -S key] [-u R|K|M|G|T|P] [-x opt[=val][,opt[=val]...]]

# ipstat -uM 5

SOURCE                     DEST                       PROTO    INT        BYTES
etc5mdbadm01.us.oracle.com etc2m-appadm01.us.oracle.c TCP      net8       76.3M
etc2m-appadm01.us.oracle.c etc5mdbadm01.us.oracle.com TCP      net8        0.6M
dns1.us.oracle.com         etc2m-appadm01.us.oracle.c UDP      net8        0.0M
169.254.182.76             169.254.182.77             UDP      net20       0.0M
...

Total: bytes in: 76.3M bytes out:  0.6M

Another new command, tcpstat(1M), reports TCP and UDP traffic statistics.

# tcpstat -?
Usage:	tcpstat [-cmnrt] [-a address[,...]] [-A address[,...]] [-d d|u] [-i pid[,...]] 
[-l nlines] [-p port[,...]] [-P port[,...]] [-s key | -S key] [-x opt[=val][,...]] 
[-z zonename[,...]] [interval [count]]

# tcpstat 5

ZONE         PID PROTO  SADDR             SPORT DADDR             DPORT   BYTES
global      1267 TCP    etc5mdbadm01.us.  42972 etc2m-appadm01.u     22   84.3M
global      1267 TCP    etc2m-appadm01.u     22 etc5mdbadm01.us.  42972   48.0K
global      1089 UDP    169.254.182.76      161 169.254.182.77    33436  137.0 
global      1089 UDP    169.254.182.77    33436 169.254.182.76      161   44.0 
...
...

Total: bytes in: 84.3M bytes out: 48.4K

# tcpstat -i 43982 5		<-- TCP stats for a given pid

ZONE         PID PROTO  SADDR             SPORT DADDR             DPORT   BYTES
global     43982 TCP    etc2m-appadm01.u  43524 etc5mdbadm02.us.     22   73.7M
global     43982 TCP    etc5mdbadm02.us.     22 etc2m-appadm01.u  43524   41.9K

Total: bytes in: 42.1K bytes out: 73.7M

Up until 11.1, it was not so straightforward to figure out which process created a network endpoint -- one had to rely on a combination of commands such as netstat, pfiles or lsof, along with the proc filesystem (/proc), to extract that information. Solaris 11.2 attempts to make this easy by enhancing the existing netstat(1M) tool. The enhanced netstat(1M) shows which user and PID created and control a network endpoint; -u is the magic flag.

#  netstat -aun			<-- notice the -u flag in netstat command; and User, Pid, Command columns in the output

UDP: IPv4
   Local Address        Remote Address      User    Pid      Command       State
-------------------- -------------------- -------- ------ -------------- ----------
      *.*                                 root        162 in.mpathd      Unbound
      *.*                                 netadm      765 nwamd          Unbound
      *.55388                             root        805 picld          Idle
	...
	...

TCP: IPv4
   Local Address        Remote Address      User     Pid     Command     Swind  Send-Q  Rwind  Recv-Q    State
-------------------- -------------------- -------- ------ ------------- ------- ------ ------- ------ -----------
10.129.101.1.22      10.129.158.100.38096 root       1267 sshd           128872      0  128872      0 ESTABLISHED
192.168.28.2.49540   192.168.28.1.3260    root          0       2094176      0 1177974      0 ESTABLISHED
127.0.0.1.49118            *.*            root       2943 nmz                 0      0 1048576      0 LISTEN
127.0.0.1.1008             *.*            pkg5srv   16012 httpd.worker        0      0 1048576      0 LISTEN
	...

[x86 only] Memory Access Locality Characterization and Analysis

Solaris 11.2 introduced another brand new tool, numatop(1M), that helps in characterizing the NUMA behavior of processes and threads on systems with Intel Westmere, Sandy Bridge and Ivy Bridge processors. If not installed by default, install the numatop package as shown below.

# pkg install pkg:/diagnostic/numatop

Performance related ..

This is a grey area - so, just be informed that there are some ZFS and Oracle database related performance enhancements.

Starting with 11.2, ZFS synchronous write transactions are committed in parallel, which should help improve the I/O throughput.

Database startup time has been greatly improved in Solaris 11 releases -- it's been further improved in 11.2. Customers with databases that use hundreds of Gigabytes or Terabyte(s) of memory will notice the improvement to the database startup times. Other changes to asynchronous I/O, inter-process communication using event ports etc., help improve the performance of the recent releases of Oracle database such as 12c.

Miscellaneous ..

Java 8

Java 7 is still the default in Solaris 11.2 release, but Java 8 can be installed from the IPS package repository.

eg.,

# pkg install pkg:/developer/java/jdk-8		<-- Java Development Kit
# pkg install pkg:/runtime/java/jre-8		<-- Java Runtime

Bootable USB Media

Solaris 11.2 introduces support for booting SPARC systems from USB media. Use the Solaris Distribution Constructor (requires the distribution-constructor package) to create USB bootable media, or copy a bootable/installation image to the USB media using the usbcopy(1M) or dd(1M) commands.
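
For instance, copying a downloaded image to a flash drive might look roughly like the following. This is a hedged sketch: the image file name and the raw device path are placeholders, and dd overwrites the target device, so double-check the device name before running it.

# usbcopy sol-11_2-text-sparc.usb    <-- prompts for the target USB device
	-OR-
# dd if=sol-11_2-text-sparc.usb of=/dev/rdsk/c2t0d0s2 bs=1024k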

Oracle Hardware Management Pack

Oracle Hardware Management Pack is a set of tools integrated into the Solaris OS distribution that show the existing hardware configuration, help configure hardware RAID volumes, update server firmware, configure the ILOM service processor, enable monitoring of the hardware using existing tools, and so on. Look for the pkg:/system/management/hmp/hmp-* packages.

A few other interesting packages:

Parallel implementation of bzip2 : compress/pbzip2
NVM Express (nvme) utility : system/storage/nvme-utilities
Utility to administer cluster of servers : terminal/cssh

Tuesday Apr 29, 2014

Solaris 11.2 Highlights [Part 1] in 6 Minutes or Less

This is not the complete list, of course. Just a few hand-picked ones.

First things first, Solaris 11.2 beta is out.

URLs: Download | What's New in Solaris 11.2 | Information Library (documentation)

Highlights:

Zones related ..

Kernel Zones

Kernel Zones bring the ability to run a non-global/local zone with a different kernel version from the global zone; a kernel zone can be patched or updated independently without the need to reboot the global zone. In other words, kernel zones are independent and isolated environments with a full kernel and user environment.

Creating a Kernel Zone:

  1. If not available, install the kernel zone brand
    # pkg install brand/brand-solaris-kz
    
  2. Create and install a kernel zone using the existing zonecfg and zoneadm commands. The only difference compared to creating a non-kernel zone (the zones we have been creating for the past 10 years) is the template: by default, the SYSdefault template is used. To create a kernel zone, use the SYSsolaris-kz template instead.

    # zonecfg -z <zone-name> create -t SYSsolaris-kz
    # zoneadm -z <zone-name> install
    # .. continue with the rest of the steps to complete zone configuration ..
    

Kernel Zones can be used in combination with logical domains (Oracle VM for SPARC), but cannot be used in combination with other virtualization solutions, such as Oracle VM VirtualBox, that do not support nested virtualization.

Live Zone Re-configuration

This release (11.2) added support for the dynamic re-configuration of local zones. Now the following configuration changes do not require a zone reboot.

  • Resource controls and pools
  • Network configuration
  • Adding or removing file systems
  • Adding or removing virtual and physical devices

Read-Only Global Zones

Recent releases of Solaris already support Immutable Non-Global Zones. Solaris 11.2 extends immutable zone support to Global Zones. Immutable zones have a read-only zone root.

Make a Global Zone Read-Only/Immutable by:

# zonecfg -z global set file-mac-profile=fixed-configuration

Installing Packages across multiple Non-Global Zones from the Global Zone

  • The -r option of pkg can be used to install, update or uninstall software packages in all non-global zones from the global zone.
  • Use the -Z option along with -r to exclude a zone from the package operation. Similarly, use -z along with -r to apply the intended package operation only to a specific zone (see the sketch below).
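
A hedged sketch of those operations (the package name and the zone names are placeholders):

# pkg update -r                       <-- update packages in all non-global zones as well
# pkg install -r -Z zoneA web/wget    <-- recurse into all zones except zoneA
# pkg install -r -z zoneB web/wget    <-- apply the operation only to zoneB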

Multiple Boot Environments for Solaris 10 Zones

Multiple BE support has been extended to Solaris 10 Zones in this release. This feature is useful when performing operations such as patching within a Solaris 10 environment running on a Solaris 11 system.

CMT Aware Zones and Resource Pool Configuration

It is now possible to allocate CMT-based resources -- vCPUs, cores and sockets -- using the existing zonecfg and poolcfg commands. This is useful from a performance and/or licensing point of view, as it provides flexibility and control for managing licensing boundaries or dedicating hardware resources solely to a zone.
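
A hedged sketch of what a core-level allocation might look like with zonecfg. The zone name and core IDs are placeholders, and the cores/sockets property names should be verified against the zonecfg(1M) man page of Solaris 11.2.

# zonecfg -z appzone
zonecfg:appzone> add dedicated-cpu
zonecfg:appzone:dedicated-cpu> set cores=0-3    <-- dedicate whole cores to the zone
zonecfg:appzone:dedicated-cpu> end
zonecfg:appzone> commit
zonecfg:appzone> exit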

Cloud related ..

Centralized Cloud Management with OpenStack

Solaris 11.2 is the first release to incorporate a complete OpenStack distribution. OpenStack allows managing and sharing compute, network and storage resources in the data center through a centralized web portal. In other words, administrators can now set up an enterprise-ready private cloud Infrastructure-as-a-Service (IaaS) environment with ease.

Check this quick How-To article out at Oracle Technology Network -- Getting Started with OpenStack on Oracle Solaris 11.2

Cloning and Disaster Recovery with Unified Archives

Unified Archives is a new native archive type that enables quick cloning for rapid application deployment in the cloud, as well as fast and reliable disaster recovery. Both bare-metal and virtual environments are supported. Check the archiveadm(1M) man page for details.

eg.,
Create a clone archive of a system
# archiveadm create ./clone.uar

Create bootable media
# archiveadm create-media ./archive.uar				/* USB image */
# archiveadm create-media -f iso <other options> ./bootarch.uar	/* ISO image */

Create a full system recovery archive
# archiveadm create --recovery ./recovery.uar

Extract information from a Unified Archive
# archiveadm info somearchive.uar

To be continued .. Stay tuned.

Monday Mar 31, 2014

[Solaris] ZFS Pool History, Writing to System Log, Persistent TCP/IP Tuning, ..

.. with plenty of examples and little comments aside.

[1] Check existing DNS client configuration

Solaris 11 and later:

% svccfg -s network/dns/client listprop config
config                      application        
config/value_authorization astring     solaris.smf.value.name-service.dns.client
config/options             astring     "ndots:2 timeout:3 retrans:3 retry:1"
config/search              astring     "sfbay.sun.com" "us.oracle.com" "oraclecorp.com" "oracle.com" "sun.com"
config/nameserver          net_address xxx.xx.xxx.xx xxx.xx.xxx.xx xxx.xx.xxx.xx

Solaris 10 and prior:

Check the contents of /etc/resolv.conf

% cat /etc/resolv.conf
search  sfbay.sun.com us.oracle.com oraclecorp.com oracle.com sun.com
options ndots:2 timeout:3 retrans:3 retry:1
nameserver      xxx.xx.xxx.xx
nameserver      xxx.xx.xxx.xx
nameserver      xxx.xx.xxx.xx

Note that the /etc/resolv.conf file exists on Solaris 11.x releases too, as of today.

[2] Logical domains: finding out the hostname of control domain

Use virtinfo(1M) command.

root@ppst58-cn1-app:~# virtinfo -a
Domain role: LDoms guest I/O service root
Domain name: n1d2
Domain UUID: 02ea1fbe-80f9-e0cf-ecd1-934cf9bbeffa
Control domain: ppst58-01
Chassis serial#: AK00083297

The above output shows that the n1d2 domain is a guest domain, which is also an I/O domain, a service domain and a root I/O domain. The control domain is running on host ppst58-01.

Output from control domain:

root@ppst58-01:~# ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  NORM  UPTIME
primary          active     -n-cv-  UART    64    130304M  0.1%  0.1%  243d 2h 
n1d1             active     -n----  5001    448   916992M  0.2%  0.2%  3d 15h 26m
n1d2             active     -n--v-  5002    512   1T       0.0%  0.0%  3d 15h 29m

root@ppst58-01:~# virtinfo -a
Domain role: LDoms control I/O service root
Domain name: primary
Domain UUID: 19337210-285a-6ea4-df8f-9dc65714e3ea
Control domain: ppst58-01
Chassis serial#: AK00083297

[3] Administering NFS configuration

Solaris 11 and later:

Use the sharectl(1M) command. Solaris 11.x releases include the sharectl administrative tool to configure and manage file-sharing protocols such as NFS, SMB and autofs.

eg.,
Display all property values of NFS:

# sharectl get nfs
servers=1024
lockd_listen_backlog=32
lockd_servers=1024
grace_period=90
server_versmin=2
server_versmax=4
client_versmin=2
client_versmax=4
server_delegation=on
nfsmapid_domain=
max_connections=-1
listen_backlog=32
..
..

# sharectl status
autofs  online client
nfs     disabled

eg.,
Modifying the nfs v4 grace period from the default 90s to 30s:

# sharectl get -p grace_period nfs
grace_period=90
# sharectl set -p grace_period=30 nfs
# sharectl get -p grace_period nfs
grace_period=30

Solaris 10 and prior:

Edit the /etc/default/nfs file, and restart the relevant NFS service(s).
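
For example, the NFS v4 grace period change shown above would be made on Solaris 10 roughly as follows (a hedged sketch; verify the parameter name against the nfs(4) man page of the target release):

# grep GRACE_PERIOD /etc/default/nfs
#GRACE_PERIOD=90
# vi /etc/default/nfs                       <-- uncomment the line and set GRACE_PERIOD=30
# svcadm restart svc:/network/nfs/server    <-- restart the NFS server service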

[4] Examining ZFS Storage Pool command history

Solaris 10 8/07 and later releases log successful zfs and zpool commands that modify the underlying pool state. All those executed commands can be examined by running the zpool history command. Because this command shows the actual commands as they were executed, the 'history' feature is really useful in troubleshooting an error scenario that resulted from executing some zfs command.

# zpool list
NAME       SIZE  ALLOC  FREE  CAP  DEDUP   HEALTH  ALTROOT
rpool      416G   152G  264G  36%  1.00x   ONLINE  -
zs3actact  848G  17.4G  831G   2%  1.00x   ONLINE  -

# zpool history -l zs3actact
History for 'zs3actact':
2014-03-19.22:02:32 zpool create -f zs3actact c0t600144F0AC6B9D2900005328B7570001d0 [user root on etc25-appadm05:global]
2014-03-19.22:03:12 zfs create zs3actact/iscsivol1 [user root on etc25-appadm05:global]
2014-03-19.22:03:33 zfs set recordsize=128k zs3actact/iscsivol1 [user root on etc25-appadm05:global]

Note that this log is enabled by default, and cannot be disabled.

[5] Modifying TCP/IP configuration parameters

Using ndd(1M) is the old way of tuning TCP/IP parameters, and it is still supported as of today (in Solaris 11.x releases). However, using the ipadm(1M) command is the recommended way to modify or retrieve TCP/IP protocol properties on Solaris 11.x and later releases.

# ipadm show-prop -p max_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   max_buf               rw   1048576      --           1048576      128000-1073741824

# ipadm set-prop -p max_buf=2097152 tcp

# ipadm show-prop -p max_buf tcp
PROTO PROPERTY              PERM CURRENT      PERSISTENT   DEFAULT      POSSIBLE
tcp   max_buf               rw   2097152      2097152      1048576      128000-1073741824

ndd style (still valid):

# ndd -get /dev/tcp tcp_max_buf
1048576

# ndd -set /dev/tcp tcp_max_buf 2097152

# ndd -get /dev/tcp tcp_max_buf
2097152

One of the advantages of using ipadm over ndd is that the configured/tuned non-default values persist across reboots. In the case of ndd, we have to re-apply those values either manually or by creating a run control script (/etc/rc*.d/S*) to make sure that the intended values are set automatically when the system reboots.
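
A minimal sketch of such a run control script (the script name and the second tunable are arbitrary examples):

# cat /etc/rc2.d/S99ndd
#!/sbin/sh
# re-apply non-default TCP/IP tunables after every reboot
/usr/sbin/ndd -set /dev/tcp tcp_max_buf 2097152
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 1048576

# chmod 744 /etc/rc2.d/S99ndd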

[6] Writing to system log from a shell script

Use the logger(1) command as shown in the following example.

eg.,

# logger -p local0.warning Big Brother is watching you

# dmesg | tail -1
Mar 30 18:42:14 etc27zadm01 root: [ID 702911 local0.warning] Big Brother is watching you

Check the syslog.conf(4) man page for the list of available system facilities and the severity levels of the condition being logged.

BONUS:

[*] Forceful NFS unmount on Linux

Try the lazy unmount option (-l) on systems running Linux kernel 2.4.11 or later to forcefully unmount a filesystem that keeps throwing Device or resource busy and/or device is busy error(s).

eg.,

# umount -f /bkp
umount2: Device or resource busy
umount: /bkp: device is busy
umount2: Device or resource busy
umount: /bkp: device is busy

# umount -l /bkp
#

Wednesday Mar 26, 2014

Software Availability : Solaris Studio 12.4 Beta & ORAchk

First off, these are two unrelated software products.

Solaris Studio 12.4 Beta

Nearly two-and-a-half years after the release of Solaris Studio 12.3, Oracle is gearing up for the next major release, 12.4. In addition to the compiler and library optimizations to support the latest and greatest SPARC & Intel x64 hardware such as SPARC T5, M5, M6, Fujitsu's M10, and Intel's Ivy Bridge and Haswell line of servers, support for the C++ 2011 language standard is one of the highlights of this forthcoming release. The complete list of features and enhancements in release 12.4 is documented in the What's New page.

Those who feel compelled to give the updated/enhanced compilers and tools a try can get started right away by downloading the beta bits from the following location. This software is available for Solaris 10 & 11 running on SPARC and x86 hardware, and for Linux 5 & 6 running on x86/x64 hardware. Anyone can download this software for free.

     Oracle Solaris Studio 12.4 Beta Download

Don't forget to check the Release Notes out for the installation instructions, known issues, limitations and workarounds, features that were removed in this release and so on.

Here's a pointer to the documentation (preview): Oracle Solaris Studio 12.4 Information Library

Finally, should you run into any issue(s) or if you have questions about anything related, feel free to use the Solaris Studio Community Forum.




ORAchk 2.2.4 (formerly known as EXAchk)

ORAchk, the Oracle Configuration Audit Tool, enhances the EXAchk tool's functionality and replaces the existing and popular RACcheck tool. In addition to the top issues reported by users/customers, ORAchk proactively scans for known problems within Oracle Database, Sun systems (especially engineered systems) and Oracle E-Business Suite Financials.

While checking, ORAchk covers a wide range of areas such as OS kernel settings, database installations (single instance and RAC), performance, backup and recovery, storage settings, and so on.

ORAchk-generated reports (mostly high level) show the system health risks with the ability to drill down into specific problems, and offer recommendations specific to the environment and product configuration. Those who do not like sending this data back to Oracle should be happy to know that there is no phone-home feature in this release.

Note that ORAchk is available only to Oracle Premier Support customers -- meaning only those customers with appropriate support contracts can use this tool. So, if you are an Oracle customer with the ability to access the Oracle Support website, check the following pages out for additional information.

     ORAchk - Oracle Configuration Audit Tool
     ORAchk user's guide

Feel free to use the community forum to ask any related questions.

Friday Feb 28, 2014

[Solaris] Changing hostname, Parallel Compression, pNFS, Upgrading SRUs and Clearing Faults

[1] Solaris 11+ : changing hostname

Starting with Solaris 11, a system's identity (nodename) is configured through the config/nodename property of the svc:/system/identity:node SMF service. Solaris 10 and prior versions keep this information in the /etc/nodename configuration file.

The following example demonstrates the commands to change the hostname from "ihcm-db-01" to "ehcm-db-01".

eg.,
# hostname
ihcm-db-01

# svccfg -s system/identity:node listprop config
config                       application        
config/enable_mapping       boolean     true
config/ignore_dhcp_hostname boolean     false
config/nodename             astring     ihcm-db-01
config/loopback             astring     ihcm-db-01
#

# svccfg -s system/identity:node setprop config/nodename="ehcm-db-01"

# svccfg -s system/identity:node refresh  -OR- 
	# svcadm refresh svc:/system/identity:node
# svcadm restart system/identity:node

# svccfg -s system/identity:node listprop config
config                       application        
config/enable_mapping       boolean     true
config/ignore_dhcp_hostname boolean     false
config/nodename             astring     ehcm-db-01
config/loopback             astring     ehcm-db-01

# hostname
ehcm-db-01

[2] Parallel Compression

This topic is not Solaris specific, but it certainly helps Solaris users who are frustrated with the single-threaded implementation of officially supported compression tools such as compress, gzip and zip.

pigz (pig-zee) is a parallel implementation of gzip that is well suited to the latest multi-processor, multi-core machines. By default, pigz breaks up the input into chunks of 128 KB and compresses each chunk in parallel with the help of lightweight threads. The number of compression threads is set by default to the number of online processors. The chunk size and the number of threads are configurable.
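
For instance, the thread count and the chunk size can be set explicitly on the command line. In the sketch below, -p sets the number of compression threads and -b the block size in KiB; the file name is a placeholder.

$ pigz -p 16 -b 256 somefile.tar    <-- 16 threads, 256 KiB chunks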

Compressed files can be restored to their original form using the -d option of the pigz or gzip tools. As per the man page, decompression is not parallelized out of the box, but it may still show some improvement compared to the existing old tools.

The following example demonstrates the advantage of using pigz over gzip in compressing and decompressing a large file.

eg.,

Original file, and the target hardware.

$ ls -lh PT8.53.04.tar 
-rw-r--r--   1 psft     dba         4.8G Feb 28 14:03 PT8.53.04.tar

$ psrinfo -pv
The physical processor has 8 cores and 64 virtual processors (0-63)
  The core has 8 virtual processors (0-7)
	...
  The core has 8 virtual processors (56-63)
    SPARC-T5 (chipid 0, clock 3600 MHz)

gzip compression.

$ time gzip --fast PT8.53.04.tar 

real    3m40.125s
user    3m27.105s
sys     0m13.008s

$ ls -lh PT8.53*
-rw-r--r--   1 psft     dba         3.1G Feb 28 14:03 PT8.53.04.tar.gz

/* the following prstat, vmstat outputs show that gzip is compressing the 
	tar file using a single thread - hence low CPU utilization. */

$ prstat -p 42510

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP      
 42510 psft     2616K 2200K cpu16    10    0   0:01:00 1.5% gzip/1

$ prstat -m -p 42510

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
 42510 psft      95 4.6 0.0 0.0 0.0 0.0 0.0 0.0   0  35  7K   0 gzip/1

$ vmstat 2

 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 776242104 917016008 0 7 0 0 0  0  0  0  0 52 52 3286 2606 2178  2  0 98
 1 0 0 776242104 916987888 0 14 0 0 0 0  0  0  0  0  0 3851 3359 2978  2  1 97
 0 0 0 776242104 916962440 0 0 0 0 0  0  0  0  0  0  0 3184 1687 2023  1  0 98
 0 0 0 775971768 916930720 0 0 0 0 0  0  0  0  0 39 37 3392 1819 2210  2  0 98
 0 0 0 775971768 916898016 0 0 0 0 0  0  0  0  0  0  0 3452 1861 2106  2  0 98

pigz compression.

$ time ./pigz PT8.53.04.tar 

real    0m25.111s	<== wall clock time is 25s compared to gzip's 3m 27s
user    17m18.398s
sys     0m37.718s

/* the following prstat, vmstat outputs show that pigz is compressing the 
        tar file using many threads - hence busy system with high CPU utilization. */

$ prstat -p 49734

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP      
49734 psft       59M   58M sleep    11    0   0:12:58  38% pigz/66

$ vmstat 2

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 778097840 919076008 6 113 0 0 0 0 0  0  0 40 36 39330 45797 74148 61 4 35
 0 0 0 777956280 918841720 0 1 0 0 0  0  0  0  0  0  0 38752 43292 71411 64 4 32
 0 0 0 777490336 918334176 0 3 0 0 0  0  0  0  0 17 15 46553 53350 86840 60 4 35
 1 0 0 777274072 918141936 0 1 0 0 0  0  0  0  0 39 34 16122 20202 28319 88 4 9
 1 0 0 777138800 917917376 0 0 0 0 0  0  0  0  0  3  3 46597 51005 86673 56 5 39

$ ls -lh PT8.53.04.tar.gz 
-rw-r--r--   1 psft     dba         3.0G Feb 28 14:03 PT8.53.04.tar.gz

$ gunzip PT8.53.04.tar.gz 	<== shows that the pigz compressed file is 
                                         compatible with gzip/gunzip

$ ls -lh PT8.53*
-rw-r--r--   1 psft     dba         4.8G Feb 28 14:03 PT8.53.04.tar

Decompression.

$ time ./pigz -d PT8.53.04.tar.gz 

real    0m18.068s
user    0m22.437s
sys     0m12.857s

$ time gzip -d PT8.53.04.tar.gz 

real    0m52.806s <== compare gzip's 52s decompression time with pigz's 18s
user    0m42.068s
sys     0m10.736s

$ ls -lh PT8.53.04.tar 
-rw-r--r--   1 psft     dba         4.8G Feb 28 14:03 PT8.53.04.tar

Of course, other tools such as Parallel BZIP2 (PBZIP2), a parallel implementation of the bzip2 tool, are worth a try too. The idea here is to highlight the fact that there are better tools out there to get the job done quickly compared to the existing/old tools that are bundled with the operating system distribution.


[3] Solaris 11+ : Upgrading SRU

Assuming the package repository is already set up to do network updates on a Solaris 11+ system, the following commands are helpful in upgrading an SRU.

  • List all available SRUs in the repository.

    # pkg list -af entire
  • Upgrade to the latest and greatest.

    # pkg update

    To find out what changes will be made to the system, try a dry run of the system update.

    # pkg update -nv
  • Upgrade to a specific SRU.

    # pkg update entire@<FMRI>

    Find the Fault Managed Resource Identifier (FMRI) string by running pkg list -af entire command.

Note that it is not so easy to downgrade an SRU to a lower version, as it may break the system. Should there be a need to downgrade or switch between different SRUs, relying on Boot Environments (BEs) might be a good idea. Check the Creating and Administering Oracle Solaris 11 Boot Environments document for details.
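
For example, a boot environment can be captured before an SRU update and booted back into should anything go wrong (a sketch; the BE name is arbitrary):

# beadm create pre-sru-update      <-- snapshot the current BE before updating
# pkg update                       <-- apply the SRU (pkg may create a new BE as well)
# beadm list                       <-- verify the list of BEs and the active one
# beadm activate pre-sru-update    <-- fall back to the old BE if needed
# init 6                           <-- reboot to make the activated BE effective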


[4] Parallel NFS (pNFS)

Just a quick note -- RFC 5661, Network File System (NFS) Version 4.1, introduced a new feature called "Parallel NFS" or pNFS, which allows NFS clients to access storage devices containing file data directly. When file data for a single NFS v4 server is stored on multiple and/or higher-throughput storage devices, using pNFS can result in a significant improvement in file access performance. However, Parallel NFS is an optional feature in NFS v4.1. Though a prototype was made available a few years ago when OpenSolaris was still alive, as of today Solaris has no support for pNFS. Stay tuned for any updates from the Oracle Solaris teams.

Here is an interesting write-up from one of our colleagues at Oracle|Sun (dated 2007) -- NFSv4.1's pNFS for Solaris.

(Credit to Rob Schneider and Tom Gould for initiating this topic)


[5] SPARC hardware : Check for and clear faults from ILOM

There are a couple of ways to check for faults using the ILOM command-line interface.

By running:

  1. show faulty command from ILOM command prompt, or
  2. fmadm faulty command from within the ILOM faultmgmt shell

Once found, use the clear_fault_action property with the set command to clear the fault for a FRU.

The following example checks for faulty FRUs from the ILOM faultmgmt shell, then clears the fault.

eg.,

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm faulty

------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2014-02-26/16:17:11 18c62051-c81d-c569-a4e6-e418db2f84b4 PCIEX-8000-SQ  Critical
        ...
        ...
Suspect 1 of 1
   Fault class  : fault.io.pciex.rc.generic-ue
   Certainty    : 100%
   Affects      : hc:///chassis=0/motherboard=0/cpuboard=1/chip=2/hostbridge=4
   Status       : faulted

   FRU
      Status            : faulty
      Location          : /SYS/PM1
      Manufacturer      : Oracle Corporation
      Name              : TLA,PM,T5-4,T5-8
        ...

Description : A fault has been diagnosed by the Host Operating System.

Response    : The service required LED on the chassis and on the affected
              FRU may be illuminated.

        ...

faultmgmtsp> exit

-> set /SYS/PM1 clear_fault_action=True
Are you sure you want to clear /SYS/PM1 (y/n)? y
Set 'clear_fault_action' to 'True'

Note that this procedure clears the fault from the SP but not from the host.

Tuesday Feb 25, 2014

AIX customers: Run for the Hills ..

.. or keep your cool and embrace Solaris.

When Oracle acquired Sun, IBM tried to capitalize on the situation just like every other competitor Sun had -- doubts were raised about Oracle's ability to turn Sun's hardware business around, and Solaris customers were advised to flee SPARC. Fast forward four years: Oracle appears to have successfully dispelled those doubts with a proven long-term commitment to the Solaris/SPARC business, consistent investment and delivery on established roadmaps. Besides, Oracle has been innovating in the server space with engineered systems that are pre-integrated to reduce the cost and complexity of IT infrastructures while increasing productivity and performance.

On the other hand, judging by the recent turn of events at IBM -- selling off critical server technologies, a declining data center business, employee furloughs, layoffs and so on -- it appears that Big Blue has its own struggles to deal with. In any case, irrespective of what is happening at IBM, AIX customers who are contemplating a migration to a modern operating platform that is reliable, secure, cloud-ready and offers a rich set of features to virtualize, consolidate, diagnose, debug and, most importantly, scale and perform, have an attractive alternative: Oracle Solaris. Act before it is too late.

Unfortunately, migrating large deployments from one platform to another is not as easy as migrating desktop users from one operating system to another. So, Oracle put together a set of documents to make the AIX-to-Solaris transition as smooth as possible for existing AIX customers. Access the AIX-to-Solaris migration pages at:

     http://www.oracle.com/aixtosolaris
     Modernizing IBM AIX/Power to Oracle Solaris/SPARC (Oracle Technology Network)

The above pages have pointers to white papers such as IBM AIX to Oracle Solaris Technology Mapping Guide (for system admins, power users), Simplify the Migration of Oracle Database and Oracle Applications from AIX to Oracle Solaris (for DBAs, application specific admins) and IBM AIX Technologies Compared to Oracle Solaris 11 along with hands-on labs, training, blogs and other useful resources. Check those out, and use the contact information available in those pages to speak or chat with relevant Oracle team(s) who can help get started with the migration process. Good luck.

Friday Jan 31, 2014

Solaris Tips : Automounted NFS, ZFS metaslabs, utility to manage F40 cards, powertop, ..

[1] Mounting NFS on Solaris 10 and later

With a relevant entry in /etc/vfstab, the general expectation is that Solaris automatically mounts the NFS shares upon a system reboot. However, users may find that NFS shares are not being auto-mounted on some systems running the latest update of Solaris 10 or 11. One reason for this behavior could be the use of the Secure By Default network profile, which was introduced in Solaris 10 11/06. When this networking profile is in use, numerous services, including the NFS client service, are disabled. For the automounting of NFS shares, the NFS client service must be running.

The fix is to enable NFS client service along with its dependencies.

# svcs -a | grep nfs\/client
disabled       Jan_17   svc:/network/nfs/client:default

# svcadm  enable -r svc:/network/nfs/client

# svcs -a | grep nfs\/client
online         Jan_20   svc:/network/nfs/client:default

On a similar note, if you want all default services to be enabled as they were in previous Solaris releases, run the following command as a privileged user. Then use svcadm(1M) to disable unwanted services.

# netservices open

To switch back to the secure by default profile, run:

# netservices limited

[2] Utility to manage Sun Flash Accelerator F40 PCIe card(s) .. ddcli

The Sun Flash Accelerator F40 PCIe card has two sets of firmware -- NAND flash controller firmware and SAS controller firmware (host PCIe to SAS controller). Both firmware sets are updated as a single F40 firmware package using the ddcli utility. This utility can be used to locate and display information about the cards in the system, format the cards, monitor their health and extract smart logs (to assist Oracle support in debugging and resolution) for a selected F40 card.

If the ddcli utility is not available on systems where the F40 PCIe cards are installed, install patch "16005846: F40 (AURA 2) SW1.1 Release fw (08.05.01.00) and cli utility update", or a later version if available. This patch can be downloaded from support.oracle.com.

Note that the ddcli utility can be used to service and monitor the health of Sun Flash Accelerator F80 PCIe cards too. Install "Patch 17860600: SW1.0 for Sun Flash Accelerator F80" to get access to the F80 card software package.

[3] Permission denied error when changing a password

An attempt to change the password for a local user 'XYZ' fails with Permission denied error.

# passwd XYZ
New Password: ********
Re-enter new Password: ********
Permission denied

# grep passwd /etc/nsswitch.conf
passwd: files ldap

Users have the flexibility to store and access password information in multiple repositories such as files, nis or ldap. Per the passwd(1) man page, when a user has a password stored in one of the name services as well as a local files entry, the passwd command tries to update both. It is possible to have different passwords in the name service and in the local files entry. Use passwd -r to update a specific password repository.

Hence the fix in this case is to use the -r option to bypass the nsswitch.conf lookup sequence and update the password information in the local files, /etc/passwd and /etc/shadow.

# passwd -r files XYZ
New Password: ********
Re-enter new Password: ********
passwd: password successfully changed for XYZ

[4] Microstate statistics for any process

ptime -m shows the full set of microstate accounting statistics accumulated over the lifetime of a given process. prstat -m also reports microstate process accounting information, but the displayed statistics cover only the time since the previous display, refreshed every interval seconds.
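
ptime can also run a command to completion and report its microstate breakdown; for instance (the command here is just an arbitrary illustration):

# ptime -m ls -lR /var/tmp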

# prstat -p 39235

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP      
 39235 psft     3585M 3320M sleep    59    0   2:23:11 0.0% java/257

# prstat -mp 39235

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
 39235 psft     0.0 0.0 0.0 0.0 0.0  87  13 0.0   0   0   1   0 java/257


# ptime -mp 39235

real 428:31:25.902644700
user  2:06:32.283801209
sys     16:37.056999418
trap        2.250539737
tflt        0.000000000
dflt        2.018347218
kflt        0.000000000
lock 96013:52:37.184929717
slp  14349:50:02.286168683
lat      3:11.510473038
stop        0.002468763

In the above example, the java process with pid 39235 spent most of its time waiting to acquire user-space locks (ref: 'lock' field). It also spent a lot of time simply sleeping, waiting for work (ref: 'slp' field). User CPU time is the next largest component (ref: 'user' field). The process spent a little time in system space (ref: 'sys' field) and waiting for a CPU (ref: 'lat' field), and a negligible amount of time processing system traps (ref: 'trap' field) and servicing data page faults (ref: 'dflt' field).
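
To see how those lock and sleep times are distributed across the 257 threads in the process, the per-LWP microstates can be examined with prstat's -L option, combined with -m and -p as used above:

# prstat -mLp 39235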

[5] ZFS : metaslab utilization

ZFS divides the space on each device (virtual or physical) into a number of smaller, manageable regions called metaslabs. Each metaslab is associated with a space map that holds information about the free space in that region by keeping track of space allocations and deallocations.

The following sample outputs show that a virtual device, u01, made up of two mirrored physical disks, has 139 metaslabs. The number of segments and the free space in each metaslab are also shown in those outputs.

# zpool list u01
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
u01   1.09T   133G  979G  11%  1.00x  ONLINE  -

# zpool status u01
  pool: u01
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        u01                        ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t5000CCA01D1DD4A4d0  ONLINE       0     0     0
            c0t5000CCA01D1DCE88d0  ONLINE       0     0     0

errors: No known data errors

# zdb -m u01

Metaslabs:
        vdev          0   ms_array         27
        metaslabs   139   offset                spacemap          free      
        ---------------   -------------------   ---------------   -------------
        metaslab      0   offset            0   spacemap     30   free    4.65M
        metaslab      1   offset    200000000   spacemap     32   free     698K
        metaslab      2   offset    400000000   spacemap     33   free    1.25M
        metaslab      3   offset    600000000   spacemap     35   free     588K
	..
	..
        metaslab     62   offset   7c00000000   spacemap      0   free       8G
        metaslab     63   offset   7e00000000   spacemap     45   free    8.00G
        metaslab     64   offset   8000000000   spacemap      0   free       8G
	...
	...
        metaslab    136   offset  11000000000   spacemap      0   free       8G
        metaslab    137   offset  11200000000   spacemap      0   free       8G
        metaslab    138   offset  11400000000   spacemap      0   free       8G

# zdb -mm u01   

Metaslabs:
        vdev          0   ms_array         27
        metaslabs   139   offset                spacemap          free      
        ---------------   -------------------   ---------------   -------------
        metaslab      0   offset            0   spacemap     30   free    4.65M
                          segments       1136   maxsize    103K   freepct    0%
        metaslab      1   offset    200000000   spacemap     32   free     698K
                          segments         64   maxsize    118K   freepct    0%
        metaslab      2   offset    400000000   spacemap     33   free    1.25M
                          segments        113   maxsize    104K   freepct    0%
        metaslab      3   offset    600000000   spacemap     35   free     588K
                          segments        109   maxsize   28.5K   freepct    0%
	...
	...

What is the purpose of this topic? Just to introduce the ZFS debugger, zdb (check the zdb(1M) man page), to power users who would like to dig a little deeper to find answers to tough questions, such as whether a ZFS filesystem is fragmented. For instance, the segments, maxsize and freepct columns in the zdb -mm output above give a rough idea of how chopped up the remaining free space in each metaslab is.

Keywords: ZFS zdb metaslab "space map"

[6] Roles can not login directly error on Solaris 11 and later

The root account in Solaris 11 is a role. A role is just like any other user account, with the exception that a role cannot be used to log in directly; authorized users assume a role with su instead. Here is an example that shows the failure when a direct login as root is attempted.

login: root
Password: ********
Roles can not login directly

In this example, logging in as a normal user and then using su to assume the root role would succeed (provided the root role has been assigned to that user). This additional step prevents malevolent users from acting with no accountability. Check Bart's blog post SPOTD: The Guide Book to Solaris Role-Based Access Control for some relevant information.
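
For instance, a hypothetical session for a user 'giri' who has been assigned the root role might look like this:

login: giri
Password: ********
$ su - root
Password: ********
#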

If security is not a primary concern and logging in directly as the root user is desirable, simply change the root role into a normal user account.

# rolemod -K type=normal root
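
If the change needs to be reverted later, the account can be turned back into a role by setting the type back to role (per user_attr(4), "normal" and "role" are the valid type values):

# rolemod -K type=role root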

This change does not affect the users who currently have the root role assigned; they retain the root role. Users with root access can su to root or log in to the system as the root user. To remove the root role assignment from other local users, set the role list to an empty string using the usermod command, as shown in the following example.


/* assign root role to user 'giri' */
# usermod -R root giri

# roles giri
root

/* remove the role from user 'giri' */
# usermod -R "" giri
#

Keywords: RBAC, roles

[7] Large volume sizes (> 2 TB), and maximum size of UFS filesystem

As per the Solaris System Administration Guide, the maximum size of a UFS filesystem is ~16 TB.

To create a UFS file system greater than 2 TB, use an EFI disk label. The EFI label provides support for physical disks and virtual disk volumes that are larger than 2 TB. Refer to the disk management section in the Solaris System Administration Guide to find out the advantages and limitations of EFI labels.
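
A rough sketch of the steps, assuming a whole, otherwise unused disk (the device name below is a placeholder, and the format menu prompts vary by release): label the disk with an EFI label from format's expert mode, then build the file system on the desired slice. Per the Solaris admin documentation, newfs creates a multiterabyte-capable UFS automatically on devices larger than 1 TB, and its -T option requests the multiterabyte format explicitly for smaller devices expected to grow.

# format -e
     /* select the disk, choose "label", then pick the EFI label when prompted */

# newfs -T /dev/rdsk/c0t5000CCA012345678d0s0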

Note that ZFS applies an EFI label to whole disks when creating a ZFS storage pool (zpool). And users in general need not be too concerned about the maximum size of a ZFS filesystem, as it is orders of magnitude larger than the maximum size supported by UFS.

[8] powertop to observe the CPU power management

Although powertop was ported to Solaris and has been available as an add-on package from unofficial sources for the past few years, recent releases of Solaris bundle the tool with the core distribution. powertop can be used to monitor the effectiveness of the CPU power management features on systems running Solaris. It also displays the clock frequency at which the CPU is operating, along with the top events that are causing the CPU to wake up and consume more energy.

Be aware that when CPU power management is enabled with the elastic policy in effect (the default on Solaris 11 and later), the CPUs on the system are subject to throttling under certain conditions, either to conserve power or to reduce the amount of heat generated by the chip. In other words, based on the load on the system, the frequency of a microprocessor can be adjusted automatically on the fly. This is referred to as CPU dynamic voltage and frequency scaling (DVFS). Watching the output of powertop is one way to monitor the frequency levels of the processors on a busy system in order to minimize performance related surprises. Set the power management policy to performance if letting the CPUs run at full speed all the time is desired; the performance policy effectively disables CPU power management.
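
On platforms where CPU power management is supported, the cpu_info kstats also report the current and supported clock frequencies, which is handy for a quick scripted check. A sketch (instance 0 is picked arbitrarily, and the exact statistic names may differ between Solaris releases):

# kstat -m cpu_info -i 0 -s current_clock_Hz
# kstat -m cpu_info -i 0 -s supported_frequencies_Hz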

Power management settings can be controlled from the Service Processor's (SP) Integrated Lights Out Manager (ILOM) command line interface or browser user interface.
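
For instance, on many SPARC T-series servers the power management policy is exposed under the /SP/powermgmt target in the ILOM CLI; a sketch along these lines (the exact property names and accepted values depend on the ILOM version, so verify against the platform documentation):

-> show /SP/powermgmt
-> set /SP/powermgmt policy=Performance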

The following sample was gathered from an idle SPARC T5-8 server where CPU power management was disabled.

                                                    Solaris PowerTOP version 1.3

Idle Power States       Avg     Residency               Frequency Levels
C0 (cpu running)                (0.1%)                   500 Mhz        0.0%
C1                      4.7ms   (99.9%)                  800 Mhz        0.0%
                                                         933 Mhz        0.0%
                                                        1067 Mhz        0.0%
                                                        1200 Mhz        0.0%
                                                          ..
                                                          ..
                                                        3200 Mhz        0.0%
                                                        3333 Mhz        0.0%
                                                        3467 Mhz        0.0%
                                                        3600 Mhz      100.0%

Wakeups-from-idle per second: 109818.7  interval: 5.0s
no power usage estimate available

Top causes for wakeups:
94.4% (103630.7)               sched :  <xcalls> unix`dtrace_sync_func
 3.1% (3352.8)              OPMNPing :  <xcalls> unix`setsoftint_tl1
 1.1% (1155.6)                 sched :  <xcalls> unix`setsoftint_tl1
 0.4% (401.2)               <kernel> :  genunix`pm_timer
 0.3% (317.0)                  sched :  <xcalls> 
 0.2% (251.8)               <kernel> :  genunix`lwp_timer_timeout
 0.2% (204.4)                  sched :  <xcalls> unix`null_xcall
 0.1% (100.2)               <kernel> :  genunix`clock
 0.1% ( 65.6)               <kernel> :  genunix`cv_wakeup
 0.0% ( 50.2)               <kernel> :  SDC`sysdc_update
 0.0% ( 46.8)            <interrupt> :  mcxnex#0 
 0.0% ( 39.6)                   opmn :  <xcalls> unix`setsoftint_tl1
 0.0% ( 36.6)                   opmn :  <xcalls> 
 0.0% ( 36.4)                   opmn :  <xcalls> unix`vtag_flushrange_group_tl1
 0.0% ( 21.6)            <interrupt> :  ixgbe#0
	...
	...

Suggestion: enable CPU power management using poweradm(1m)

Q - Quit R - Refresh (CPU PM is disabled)