Friday Jan 31, 2014

Solaris Tips : Automounted NFS, ZFS metaslabs, utility to manage F40 cards, powertop, ..

[1] Mounting NFS on Solaris 10 and later

With a relevant entry in /etc/vfstab, the general expectation is that Solaris automatically mounts NFS shares upon a system reboot. However, users may find that NFS shares are not being auto-mounted on some systems running the latest update of Solaris 10 or 11. One reason for this behavior could be the use of the Secure By Default network profile, which was introduced in Solaris 10 11/06. When this networking profile is in use, numerous services, including the NFS client service, are disabled. Automounting of NFS shares requires the NFS client service to be running.
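
For reference, a typical NFS entry in /etc/vfstab looks something like the following (the server name, share, mount point, and mount options here are hypothetical; see vfstab(4) for the field definitions):

#device to mount        device to fsck   mount point   FS type   fsck pass   mount at boot   mount options
nfshost:/export/share   -                /mnt/share    nfs       -           yes             rw,bg,soft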

The fix is to enable the NFS client service along with its dependencies.

# svcs -a | grep nfs\/client
disabled       Jan_17   svc:/network/nfs/client:default

# svcadm  enable -r svc:/network/nfs/client

# svcs -a | grep nfs\/client
online         Jan_20   svc:/network/nfs/client:default

On a similar note, if you want all default services to be enabled as they were in previous Solaris releases, run the following command as a privileged user. Then use svcadm(1M) to disable unwanted services.

# netservices open

To switch back to the secure by default profile, run:

# netservices limited

[2] Utility to manage Sun Flash Accelerator F40 PCIe card(s) .. ddcli

The Sun Flash Accelerator F40 PCIe Card has two sets of firmware: NAND flash controller firmware and SAS controller firmware (host PCIe to SAS controller). Both firmware sets are updated as a single F40 firmware package using the ddcli utility. This utility can be used to locate and display information about the F40 cards in the system, format the cards, monitor their health, and extract SMART logs (to assist Oracle support in debugging and resolution) for a selected card.

If the ddcli utility is not available on systems where the F40 PCIe cards are installed, install patch "16005846: F40 (AURA 2) SW1.1 Release fw (08.05.01.00) and cli utility update" or a later version, if available. This patch can be downloaded from support.oracle.com.

Note that the ddcli utility can be used to service and monitor the health of Sun Flash Accelerator F80 PCIe cards as well. Install patch "Patch 17860600: SW1.0 for Sun Flash Accelerator F80" to get access to the F80 card software package.

[3] Permission denied error when changing a password

An attempt to change the password for a local user 'XYZ' fails with a Permission denied error.

# passwd XYZ
New Password: ********
Re-enter new Password: ********
Permission denied

# grep passwd /etc/nsswitch.conf
passwd: files ldap

Password information can be stored in, and retrieved from, multiple repositories such as files, nis, and ldap. Per the passwd(1) man page, when a user has a password stored in one of the name services as well as in a local files entry, the passwd command tries to update both. It is possible to have different passwords in the name service and in the local files entry. Use passwd -r to update a specific password repository.
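
To see which entry the name service switch actually returns for an account, getent(1M) can be used as a quick check, since it honors the nsswitch.conf lookup order (the output below is hypothetical, for the example user 'XYZ'):

# getent passwd XYZ
XYZ:x:1001:10::/export/home/XYZ:/bin/bash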

Hence the fix in this case is to use the -r option to ignore the nsswitch.conf lookup order and update the password information only in the local /etc files, /etc/passwd and /etc/shadow.

# passwd -r files XYZ
New Password: ********
Re-enter new Password: ********
passwd: password successfully changed for XYZ

[4] Microstate statistics for any process

ptime -m shows the full set of microstate accounting statistics accumulated over the lifetime of a given process. prstat -m also reports the microstate process accounting information, but the displayed statistics are those accumulated over each sampling interval, that is, since the previous display.

# prstat -p 39235

   PID USERNAME  SIZE   RSS STATE   PRI NICE      TIME  CPU PROCESS/NLWP      
 39235 psft     3585M 3320M sleep    59    0   2:23:11 0.0% java/257

# prstat -mp 39235

   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
 39235 psft     0.0 0.0 0.0 0.0 0.0  87  13 0.0   0   0   1   0 java/257


# ptime -mp 39235

real 428:31:25.902644700
user  2:06:32.283801209
sys     16:37.056999418
trap        2.250539737
tflt        0.000000000
dflt        2.018347218
kflt        0.000000000
lock 96013:52:37.184929717
slp  14349:50:02.286168683
lat      3:11.510473038
stop        0.002468763

In the above example, the java process with pid 39235 spent most of its time sleeping while waiting to acquire locks in user space (ref: 'lock' field). It also spent a lot of time simply sleeping, waiting for work (ref: 'slp' field). User CPU time is the next largest component (ref: 'user' field). The process spent a little time in system space (ref: 'sys' field) and waiting for CPU (ref: 'lat' field), and a negligible amount of time processing system traps (ref: 'trap' field) and servicing data page faults (ref: 'dflt' field).
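
For a heavily threaded process such as this 257-LWP JVM, per-thread microstates are often more revealing than the process-wide aggregate. prstat supports this with the -L option; for example (a sketch, with a 5-second sampling interval):

# prstat -mLp 39235 5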

[5] ZFS : metaslab utilization

ZFS divides the space on each device (virtual or physical) into a number of smaller, manageable regions called metaslabs. Each metaslab is associated with a space map that holds information about the free space in that region by keeping track of space allocations and deallocations.

The following sample outputs show that a ZFS pool, u01, whose single mirrored virtual device is made up of two physical disks, has 139 metaslabs. The number of segments and the free/available space in each metaslab are also shown in those outputs.

# zpool list u01
NAME   SIZE  ALLOC  FREE  CAP  DEDUP  HEALTH  ALTROOT
u01   1.09T   133G  979G  11%  1.00x  ONLINE  -

# zpool status u01
  pool: u01
 state: ONLINE
  scan: none requested
config:

        NAME                       STATE     READ WRITE CKSUM
        u01                        ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t5000CCA01D1DD4A4d0  ONLINE       0     0     0
            c0t5000CCA01D1DCE88d0  ONLINE       0     0     0

errors: No known data errors

# zdb -m u01

Metaslabs:
        vdev          0   ms_array         27
        metaslabs   139   offset                spacemap          free      
        ---------------   -------------------   ---------------   -------------
        metaslab      0   offset            0   spacemap     30   free    4.65M
        metaslab      1   offset    200000000   spacemap     32   free     698K
        metaslab      2   offset    400000000   spacemap     33   free    1.25M
        metaslab      3   offset    600000000   spacemap     35   free     588K
	..
	..
        metaslab     62   offset   7c00000000   spacemap      0   free       8G
        metaslab     63   offset   7e00000000   spacemap     45   free    8.00G
        metaslab     64   offset   8000000000   spacemap      0   free       8G
	...
	...
        metaslab    136   offset  11000000000   spacemap      0   free       8G
        metaslab    137   offset  11200000000   spacemap      0   free       8G
        metaslab    138   offset  11400000000   spacemap      0   free       8G

# zdb -mm u01   

Metaslabs:
        vdev          0   ms_array         27
        metaslabs   139   offset                spacemap          free      
        ---------------   -------------------   ---------------   -------------
        metaslab      0   offset            0   spacemap     30   free    4.65M
                          segments       1136   maxsize    103K   freepct    0%
        metaslab      1   offset    200000000   spacemap     32   free     698K
                          segments         64   maxsize    118K   freepct    0%
        metaslab      2   offset    400000000   spacemap     33   free    1.25M
                          segments        113   maxsize    104K   freepct    0%
        metaslab      3   offset    600000000   spacemap     35   free     588K
                          segments        109   maxsize   28.5K   freepct    0%
	...
	...

What is the purpose of this topic? Simply to introduce the ZFS debugger, zdb (see the zdb(1M) man page), to power users who would like to dig a little deeper to answer tough questions such as whether a ZFS filesystem is fragmented.
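
As a rough example of such digging, the per-metaslab freepct values reported by zdb -mm can be summarized with a quick one-liner (a sketch that assumes the output format shown above; field positions may differ between Solaris releases):

# zdb -mm u01 | nawk '/freepct/ { sub(/%/, "", $6); sum += $6; n++ }
      END { if (n > 0) printf("%d metaslabs, average freepct: %.1f%%\n", n, sum/n) }'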

Keywords: ZFS zdb metaslab "space map"

[6] Roles can not login directly error on Solaris 11 and later

The root account in Solaris 11 is a role. A role is just like any other user account, with the exception that a role cannot be used to log in directly. Here is an example that shows the failure when attempting to log in directly as root.

login: root
Password: ********
Roles can not login directly

In this example, logging in as a normal user who has been assigned the root role and then using su to assume the root role would succeed. This additional step prevents malevolent users from getting away with no accountability. Check Bart's blog post SPOTD: The Guide Book to Solaris Role-Based Access Control for some relevant information.

If security is not a primary concern, and if connecting directly as the root user is desirable, simply change the root role into a normal user.

# rolemod -K type=normal root

This change does not affect users who are currently assigned the root role; they retain it. Other users who have root access can su to root or log in to the system as the root user. To remove the root role assignment from other local users, set the role to an empty string using the usermod command as shown in the following example.


/* assign root role to user 'giri' */
# usermod -R root giri

# roles giri
root

/* remove the role from user 'giri' */
# usermod -R "" giri
#

Keywords: RBAC, roles

[7] Large volume sizes (> 2 TB), and maximum size of UFS filesystem

As per the Solaris System Administration Guide, the maximum size of a UFS filesystem is ~16 TB.

To create a UFS file system greater than 2 TB, use an EFI disk label. The EFI label provides support for physical disks and virtual disk volumes that are greater than 2 TB in size. Refer to the disk management section in the Solaris System Administration Guide for the advantages and limitations of EFI labels.
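
A minimal sketch of putting this into practice follows (the device name is hypothetical; format -e offers the EFI label choice when labeling the disk, and newfs -T lays out a UFS file system that can eventually grow beyond a terabyte):

/* label the disk with an EFI label (interactive) */
# format -e c0t5000CCA012345678d0

/* create and mount the UFS file system */
# newfs -T /dev/rdsk/c0t5000CCA012345678d0s0
# mount /dev/dsk/c0t5000CCA012345678d0s0 /u02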

Note that ZFS labels disks with an EFI label when creating a ZFS storage pool (zpool). In general, users need not be too concerned about the maximum size of a ZFS filesystem, as it is orders of magnitude larger than the maximum size supported by UFS.

[8] powertop to observe the CPU power management

Although powertop was ported to Solaris and has been available as an add-on package from unofficial sources for the past few years, recent releases of Solaris bundle this tool with the core distribution. powertop can be used to monitor the effectiveness of CPU power management features on systems running Solaris. It also displays the clock frequency at which the CPU is operating, along with the top events that are causing the CPU to wake up and use more energy.

Be aware that when CPU power management is enabled with the elastic policy in effect (the default on Solaris 11 and later), the CPUs on the system are susceptible to throttling under certain conditions, either to conserve power or to reduce the amount of heat generated by the chip. In other words, based on the load on the system, the frequency of a microprocessor can be adjusted automatically on the fly. This is referred to as CPU dynamic voltage and frequency scaling (DVFS). Watching the output of powertop is one way to keep an eye on the frequency levels of the processors on a busy system and minimize performance-related surprises. If letting the CPUs run at full speed all the time is desired, set the power management policy to performance; the performance policy effectively disables CPU power management.

Power management settings can be controlled from the Service Processor's (SP) Integrated Lights Out Manager (ILOM) command line interface or browser user interface.
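
For example, from the ILOM CLI on a SPARC T-series server, something along these lines sets the policy (a sketch only; the exact property name and accepted values depend on the platform and ILOM firmware version):

-> set /SP/powermgmt policy=Performance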

The following sample is gathered from an idle SPARC T5-8 server where the CPU power management was disabled.

                                                    Solaris PowerTOP version 1.3

Idle Power States       Avg     Residency             	Frequency Levels
C0 (cpu running)                (0.1%)                	 500 Mhz        0.0%
C1                      4.7ms   (99.9%)               	 800 Mhz        0.0%
                                                      	 933 Mhz        0.0%
                                                      	1067 Mhz        0.0%
                                                      	1200 Mhz        0.0%
							  ..
							  ..
                                                	3200 Mhz        0.0%
                                                	3333 Mhz        0.0%
                                                	3467 Mhz        0.0%
                                                	3600 Mhz      100.0%

Wakeups-from-idle per second: 109818.7  interval: 5.0s
no power usage estimate available

Top causes for wakeups:
94.4% (103630.7)               sched :  <xcalls> unix`dtrace_sync_func
 3.1% (3352.8)              OPMNPing :  <xcalls> unix`setsoftint_tl1
 1.1% (1155.6)                 sched :  <xcalls> unix`setsoftint_tl1
 0.4% (401.2)               <kernel> :  genunix`pm_timer
 0.3% (317.0)                  sched :  <xcalls> 
 0.2% (251.8)               <kernel> :  genunix`lwp_timer_timeout
 0.2% (204.4)                  sched :  <xcalls> unix`null_xcall
 0.1% (100.2)               <kernel> :  genunix`clock
 0.1% ( 65.6)               <kernel> :  genunix`cv_wakeup
 0.0% ( 50.2)               <kernel> :  SDC`sysdc_update
 0.0% ( 46.8)            <interrupt> :  mcxnex#0 
 0.0% ( 39.6)                   opmn :  <xcalls> unix`setsoftint_tl1
 0.0% ( 36.6)                   opmn :  <xcalls> 
 0.0% ( 36.4)                   opmn :  <xcalls> unix`vtag_flushrange_group_tl1
 0.0% ( 21.6)            <interrupt> :  ixgbe#0
	...
	...

Suggestion: enable CPU power management using poweradm(1m)

Q - Quit R - Refresh (CPU PM is disabled)

Saturday Aug 31, 2013

[Oracle Database] Unreliable AWR reports on T5 & Redo logs on F40 PCIe Cards

(1) AWR report shows bogus wait events and times on SPARC T5 servers

Here is a sample from one of the Oracle 11g R2 databases running on a SPARC T5 server with Solaris 11.1 SRU 7.5.

Top 5 Timed Foreground Events

Event                        Waits    Time(s)      Avg wait (ms)  % DB time    Wait Class
latch: cache buffers chains  278,727  812,447,335  2914850        13307324.15  Concurrency
library cache: mutex X       212,595  449,966,330  2116542        7370136.56   Concurrency
buffer busy waits            219,844  349,975,251  1591925        5732352.01   Concurrency
latch: In memory undo latch  25,468   37,496,800   1472310        614171.59    Concurrency
latch free                   2,602    24,998,583   9607449        409459.46    Other

Reason:
Unknown. There is a pending bug 17214885 - Implausible top foreground wait times reported in AWR report.

Tentative workaround:
Disable power management as shown below.

# poweradm set administrative-authority=none

# svcadm disable power
# svcadm enable power

Verify the setting by running poweradm list.

Also disable NUMA I/O object binding by setting the following parameter in /etc/system (requires a system reboot).

set numaio_bind_objects=0

Oracle Solaris 11 added support for the NUMA I/O architecture. Here is a brief explanation of NUMA I/O from the Solaris 11: What's New web page.

Non-Uniform Memory Access (NUMA) I/O : Many modern systems are based on a NUMA architecture, where each CPU or set of CPUs is associated with its own physical memory and I/O devices. For best I/O performance, the processing associated with a device should be performed close to that device, and the memory used by that device for DMA (Direct Memory Access) and PIO (Programmed I/O) should be allocated close to that device as well. Oracle Solaris 11 adds support for this architecture by placing operating system resources (kernel threads, interrupts, and memory) on physical resources according to criteria such as the physical topology of the machine, specific high-level affinity requirements of I/O frameworks, actual load on the machine, and currently defined resource control and power management policies.

Do not forget to roll back these changes after applying the fix for database bug 17214885, when it becomes available.

(2) Redo logs on F40 PCIe cards (non-volatile flash storage)

Per the F40 PCIe card user's guide, the Sun Flash Accelerator F40 PCIe Card is designed to provide the best performance for data transfers that are multiples of 8K in size and that use addresses that are 8K aligned. To achieve optimal performance, the size of the read/write data should be an integer multiple of this block size, and the data transferred should be block aligned. I/O operations that are not block aligned and that do not use sizes that are a multiple of the block size may suffer performance degradation, especially write operations.

Oracle redo log files default to a block size that is equal to the physical sector size of the disk, typically 512 bytes. In a normally functioning environment, redo log I/O consists almost entirely of writes. The Oracle database supports a maximum block size of 4K for redo logs. Hence, to achieve optimal performance for redo write operations on F40 PCIe cards, tune the environment as shown below.

  1. Configure the following init parameters
    _disk_sector_size_override=TRUE
    _simulate_disk_sectorsize=4096
    
  2. Create redo log files with 4K block size
    eg.,
    SQL> ALTER DATABASE ADD LOGFILE '/REDO/redo.dbf' size 20G blocksize 4096;
    
  3. [Solaris only] Append the following line to /kernel/drv/sd.conf (requires a reboot)
    sd-config-list="ATA     3E128-TS2-550B01","disksort:false,\
                 cache-nonvolatile:true, physical-block-size:4096";
    
  4. [Solaris only][F20] To enable maximum throughput from the MPT driver, append the following line to /kernel/drv/mpt.conf and reboot the system.
    mpt_doneq_thread_n_prop=8;
    

This tip is applicable to all kinds of flash storage that Oracle sells or has sold, including the F20/F40 PCIe cards and the F5100 storage array. The sd-config-list entry in sd.conf may need some adjustment to reflect the correct vendor id and product id.
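
To confirm that newly created redo log groups indeed use a 4K block size, the BLOCKSIZE column of v$log can be queried (a quick check; the column is available in recent Oracle 11g R2 releases and later):

SQL> SELECT group#, blocksize, bytes/1024/1024 AS size_mb FROM v$log;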
